How to Scrape in Another Language and Location

How to Scrape in Another Language or Location

In the world of data, context is vital for business growth. If you want to uncover insights and accomplish new goals, you need to ensure you’re collecting and analyzing the correct datasets.

Let’s say you’re a business that wants to enter new markets. To do so, you will need to collect localized data to help your teams understand new audiences, their needs, buying journeys, etc. However, collecting localized data through web scraping requires a series of tools like IP rotation systems to mimic the behavior of a local searcher. For example, sending your requests through French-based IPs to collect data from a French site – this is what we call geotargeting.

A common use case for geotargeting is scraping websites that display different results based on your IP location, like search engines and eCommerce sites. So, in this tutorial, we’ll show you how to use ScraperAPI’s localized IP pools to extract local data from NewChic, a marketplace serving over 20 countries in 10+ languages, using Python.

Quick Code Snippet

For those with experience, here’s the final code snippet. You can change the country_code parameter to a specific country using any supported country codes from our documentation.

For this project to work, we’ll use three main tools:

  • Requests – pip install requests
  • Pandas – pip install pandas
  • bs4 – pip install bs4
import requests
from bs4 import BeautifulSoup
import pandas as pd

product_data = []

payload = {
   'api_key': 'YOUR_API_KEY',
   'country_code': 'us',
   'url': 'https://www.newchic.com/kitchen-tools-and-gadgets-c-5037/'
}

response = requests.get('https://api.scraperapi.com', params=payload)
soup = BeautifulSoup(response.content, 'html.parser')

products = soup.select('div.mb-lg-32')
for product in products:
   product_name = product.find('a', class_='product-item-name-js').text
   product_data.append({
       'product': product_name
   })

df = pd.DataFrame(product_data)
df.to_json('test.json', indent=2)

Keep reading to learn how we built it!

1. Understanding Multi-Language Websites

The first thing we need to do is get to know our target site.

Although all websites are built using pretty much the same technologies (HTML, CSS, and JavaScript), the truth is that each site is a unique puzzle that’ll require some research before being able to solve it.

For example, some websites can add all the page’s content dynamically, which would require rendering JavaScript to be able to access the page, while others could be routing traffic based on their IP location to subdomains.

Checking For Dynamic Content

Before we start figuring out how to extract the content, let’s look at the source code and make sure we’ll be able to access its content from the HTML file.

How to Scrape in Another Language and Location

For this exercise, we’ll want to extract the name of all products within the page, so let’s copy the name of the third product and look for it in the source code.

How to Scrape in Another Language and Location

Perfect! If the name is in the source code, that means we’ll be able to get it from the HTML file.

Showing Different Languages on the Page

The second thing to learn is how the URL structure of the pages changes with languages. Currently, we’re on the US version of the site, and this is the URL shown:

https://www.newchic.com/kitchen-tools-and-gadgets-c-5037/?country=223&NA=1

When clicking on the country menu, we have the following options:

How to Scrape in Another Language and Location

Let’s change shipping to Italy and see how the URL reacts.

https://www.newchic.com/kitchen-tools-and-gadgets-c-5037/?country=105&WE=1

The new URL has different country and WE parameters but doesn’t change the language of the products on the page.

How to Scrape in Another Language and Location

Let’s put US shipping back and set the language to Italian, and see what happens:

How to Scrape in Another Language and Location

Now that did the trick. The products’ names are now in Italian, but we’re also in an IT subdomain:

https://it.newchic.com/kitchen-tools-and-gadgets-c-5037/?country=223&NA=1

Overall, this experiment tells us two things:

  • The parameters at the end of the URL determines the shipping information
  • The subdomain changes the language of the page

With these two pieces of information, we can now go ahead and start coding a beat!

2. Testing CSS Selectors

It sounds logical that multi-language websites use the same HTML and CSS structure across all their site versions. Sadly, that’s not always the case, so testing our CSS selectors on each subdomain will make our lives that much easier.

Let’s begin by inspecting the US site version to spot where in the structure are the product names.

How to Scrape in Another Language and Location

As you can see in the image above, the product name is inside a <a> tag with the class="lg text-ellipsis d-block text-hover-underline text-secondary-hover font-small-12 text-grey product-item-name-js".

After checking several products, we can confirm they all share the same class. For practicality, though, let’s try and use the product-item-name-js part of the class and see if we can pick all product names through the browser’s console.

How to Scrape in Another Language and Location

It returns a node list with 60 items, which is the exact number of products per page. If we expand it and hover over one item, we’ll see it highlights the name of the product on the page.

How to Scrape in Another Language and Location

Now, let’s do the same thing on the Italian and French versions.

How to Scrape in Another Language and Location

Both times worked. The same class is used across all subdomains, and we can now use it to target the products’ names.

3. Sending Our Requests Through ScraperAPI

Remember we said we’d be using ScraperAPI to access its pool of localized IPs? Well, there’s one more reason for it.

NewChic uses heavy anti-scraping mechanisms that can easily spot any bots trying to scrape their site. In fact, without ScraperAPI, we got a 403 forbidden response every time, no matter the HTTP headers we were sending.

However, ScraperAPI uses machine learning and years of statistical analysis to smartly pick the right combination of IPs and headers and rotate them as necessary, giving us access to the site without hardcoding every system necessary to bypass their anti-scraping mechanisms.

To start using ScraperAPI, create a free account to retrieve your API key – plus 5,000 API credits.

From there, let’s create a new file (we named ours newchic_scraper.py), import Requests, and create a payload to send along your get() request:

import requests

payload = {
   'api_key': 'YOUR_API_KEY',
   'url': 'https://www.newchic.com/kitchen-tools-and-gadgets-c-5037'
}

response = requests.get('https://api.scraperapi.com', params=payload)
print(response)

If you run your code, it should return a 200 successful response.

4. Building Our Parser

Taking a closer look at the page, you can see that every product is like a card:

How to Scrape in Another Language and Location

And each card is represented by a <div> with class="product-item product-item-js mx-lg-15 position-relative product-item-wider product-item-show-color mb-lg-32" in the HTML structure:

How to Scrape in Another Language and Location

To extract the information, we can parse the HTML response using BeautifulSoup and create a list with all these <div> tags – we’ll be using the shortened version of the class as no other element uses the mb-lg-32 bit of it.

soup = BeautifulSoup(response.content, 'html.parser')

products = soup.select('div.mb-lg-32')

Note: Remember to add from bs4 import BeautifulSoup at the top of the file.

With all the elements inside products, we can now loop through each product and extract the products’ name using the same CSS selector from before:

for product in products:
   
   product_name = product.find('a', class_='product-item-name-js').text

Before running the script, let’s give it a little order to see better what happens.

5. Sending the Scraped Data to a JSON File

Before the payload, create an empty array where we’ll store each product’s name:

product_data = []

Now, we can use the .append() method on our array to store the names from within our for loop:

product_data.append({
       'product': product_name
   })

After the entire for loop runs, all information will be formatted inside product_data.

Finally, let’s import Pandas at the top of the file to create a data frame we can easily export as a JSON file using the .to_json() method:

df = pd.DataFrame(product_data)
df.to_json('us-products.json', indent=2)

6. Running Our Code Without Geotargeting

So far, your script should look like this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

product_data = []

payload = {
   'api_key': 'YOUR_API_CODE',
   'url': 'https://www.newchic.com/kitchen-tools-and-gadgets-c-5037/'
}

response = requests.get('https://api.scraperapi.com', params=payload)
soup = BeautifulSoup(response.content, 'html.parser')

products = soup.select('div.mb-lg-32')
for product in products:
   product_name = product.find('a', class_='product-item-name-js').text
   product_data.append({
       'product': product_name
   })

df = pd.DataFrame(product_data)
df.to_json('us-products.json', indent=2)

If we run it, it’ll create a JSON file, with 60 product names, inside the project’s directory:

How to Scrape in Another Language and Location

Note: The last item is number 59 because the first item is numbered as 0.

7. Enabling Geotargeting

To use ScraperAPI’s geotargeting, the only thing we have to do is adding the country_code parameter to our payload and set it to our country of choice – you can find all codes inside the documentation.

Let’s start by setting the parameter to it and changing the name of our JSON output to italy-products.json:

payload = {
   'api_key': 'YOUR_API_KEY',
   'country_code': 'it',
   'url': 'https://www.newchic.com/kitchen-tools-and-gadgets-c-5037/'
}

Here’s the result after running the script:

How to Scrape in Another Language and Location

One thing you might have noticed is that we’re sending the request targeting the US version of the site, but because we’re telling ScraperAPI to use an Italy-based IP, NewChic is redirecting our script to the Italian subdomain.

The same happens if, for example, we set country_code=fr:

How to Scrape in Another Language and Location

Just like that, we can use scraped localized data in any language and from any location without getting blocked by anti-scraping mechanisms.

Wrapping Up

You might wonder why we would use geotargeting if we could just target the corresponding subdomain. After all, if we send our request to https://it.newchic.com/kitchen-tools-and-gadgets-c-5037, we would get the same list of products with Italian names, right? Well… kinda.

As we said before, every website is different, so there are little nuances you need to keep in mind when collecting localized data.

For example, when using a different IP location (e.g., es) to send a request to the main version (which is in English), the server recognizes we’re in another country and routes us to the corresponding subdomain.

However, if we do the same but target a localized subdomain, that’s not what happens. No matter if we send a request with country_code=fr to the https://it.newchic.com/ version, we’ll always get data in Italian.

Although for NewChic itself doesn’t matter too much, in other scenarios, you could use this interaction to send a US request to the Italian subdomain of your target page and, more likely than not, get different results.

That’s because eCommerce sites want to sell you their products. So an Italian site will show you the products it can ship to the US.

A good example of this is Amazon. We can use ScraperAPI to send an IT request to Amazon’s Canada TLD (amazon.ca), and it’ll show us the best results from its Canadian catalog that has available shipping to Italy.

If you understand how sites work and the context of the data you need, geotargeting with ScraperAPI will be a powerful tool to get accurate results every time.

Until next time, happy scraping!

About the author

Leonardo Rodriguez

Leonardo Rodriguez

Leo is a technical content writer based in Italy with experience in Python and Node.js. He’s currently working at saas.group as a content marketing manager and is the lead writer for ScraperAPI. Contact him on LinkedIn.

Table of Contents

Related Articles

Talk to an expert and learn how to build a scalable scraping solution.