
How to Scrape Geo-Restricted Data Without Getting Banned

While the internet is often considered free and open to all, many websites still place geographical restrictions on their content.

Sometimes, the changes are subtle, such as automatically switching languages. In other cases, entirely different content is served to people from different countries or regions (Netflix is a well-known example).

At the extreme, some websites are entirely inaccessible unless your IP address is from a specific country. While all of these restrictions serve a legitimate purpose, they also make web scraping significantly more difficult.

There are several ways to bypass geographical restrictions while scraping, such as using proxies within your own scraper or using pre-built solutions that take care of the hassle.

Using Proxies to Bypass Geo-restrictions

Residential proxies are how most scrapers bypass geo-restrictions. They give you an IP address from a device that's physically located in a country of your choice, so when the proxy relays requests from your machine to a website, the site thinks the request genuinely originates from within that country.

While there are numerous other proxy types, residential proxies are generally regarded as your best bet for most web scraping tasks, especially those that involve geographical restrictions. Datacenter proxies, while fast and cheap, have a limited range of locations and are more easily detected.

ISP proxies would work perfectly fine as they are as legitimate as residential proxies and as fast as datacenter proxies, but they’re one of the most expensive options available. Additionally, the pool of IPs is usually quite limited.
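
As a rough illustration, here's how a residential proxy is typically plugged into a Python scraper using the requests library (the same library used later in this article). The proxy host, port, and credentials below are placeholders that your provider would supply:

import requests

# Placeholder residential proxy endpoint and credentials from your provider
proxy = "http://username:password@us.residential-proxy.example.com:8000"
proxies = {"http": proxy, "https": proxy}

# httpbin echoes back the IP address it sees, which should be the proxy's US IP
resp = requests.get("https://httpbin.org/ip", proxies=proxies)
print(resp.text)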

While purchasing proxies directly from a provider and integrating them into your scraper is definitely efficient, it has one caveat: you still need to build the scraper itself, and any scraping project of meaningful scope requires extensive programming knowledge.

The scraper will also need constant updates, as minor changes to a site's layout or underlying code can cause it to break completely or return incorrect results.

Then there’s the headache of data parsing and storage, both of which are complicated topics on their own.

So, while buying proxies from a provider can be a good solution for some, it’s usually reserved for those who can build a scraper on their own.

Using ScraperAPI

ScraperAPI manages the entire scraping pipeline, from proxies to data delivery, for its users. There’s no need to build something from the ground up: you can start scraping as soon as you get a plan and write some basic code.

We’ll be using Python to send requests to the ScraperAPI endpoint to retrieve data from websites.

Preparation

First, you’ll need an IDE such as PyCharm or Visual Studio Code to run your code. Then you should register for an account with ScraperAPI.

Note: ScraperAPI's free trial is enough to test out geotargeting as it provides access to all of the premium features. Once that expires, however, you'll need one of the paid plans unless US/EU geotargeting is enough for your use case.

Once you have everything set up, we'll be using the requests library to send HTTP requests to the ScraperAPI endpoint. Since requests is a third-party library, we'll need to install it first:

pip install requests

That's the only library you'll need since ScraperAPI does all the heavy lifting for you. All you need to do is write some code and gather the URLs you want to scrape.

Sending a simple request

It’s often best to start simple and increase code complexity as you go. We’ll start by sending a GET request to a website that restricts EU users to verify what happens if we do not use residential proxies or ScraperAPI:

import requests
resp = requests.get('https://www.chicagotribune.com/')
print(resp.text)

We'll be using the requests library throughout, so we have to import it first. Sending a GET request is simple: call requests.get() and pass in the URL as a string (single or double quotes) as the argument.

Then we simply pass resp.text (the attribute that holds the response body as a string) to the print() function.

You'll receive an error message; attempting to reach The Chicago Tribune from an EU IP address returns the same error every time:

Screenshot of an error message

The same happens in reverse: if you were to use a US IP address with an EU-locked website, you'd get a similar response. The messages differ slightly from site to site, but the end result is the same.

import requests
resp = requests.get('https://www.rte.ie/player/')
print(resp.text)

RTE restricts access to EU users, so with a US IP address, you get:

Screenshot of RTE's geo-restriction message

So, using either ScraperAPI or residential proxies will be necessary to access some websites. Let’s start by sending a request through ScraperAPI:

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://httpbin.org/ip'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

As always, start by importing the necessary library (requests in our case). Then, define a dictionary object that has two key:value pairs – the API key (required for authentication) and the URL, which is the website you want to scrape.

We then create a response object that stores the reply retrieved through ScraperAPI. This time, requests.get() takes two arguments: the first is always the ScraperAPI endpoint, and the second is the payload dictionary, passed as params.

For now, we simply print the response. Running the code should retrieve the origin IP address, as seen by httpbin.org, and print it to standard output.
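
If everything is set up correctly, the output is simply the small JSON body that httpbin returns, along these lines (the IP address here is purely illustrative):

{"origin": "203.0.113.42"}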

Selecting a geographical location

We’ll now switch to scraping websites that show data based on location, such as displaying different prices, currencies, or content in general.

Let's start by adding a country code to our ScraperAPI request to visit The Chicago Tribune and see if we get a response.

All you need to do is add one more key:value pair to your payload dictionary: country_code as the key and a two-letter ISO 3166-1 country code as the value.

import requests
payload = {'api_key': 'YOUR-API-KEY-HERE', 'url': 'https://www.chicagotribune.com/', 'country_code': 'us'}
resp = requests.get('https://api.scraperapi.com', params=payload)
print(resp.text)

You should get a large HTML response showcasing lots of data. Our screenshot is truncated for demonstration purposes:

Screenshot of large HTML response

Parsing data with BeautifulSoup

We’ll start by installing BeautifulSoup:

pip install beautifulsoup4

Now we’ll need to make some modifications to our code:

  • We'll put the response text (the full HTML document) into a BeautifulSoup object that will be used for parsing.
  • Then, a list comprehension will collect all the article titles.
  • For the output, we'll loop over the de-duplicated titles and print each one on a new line.

import requests
from bs4 import BeautifulSoup

payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)

# Parse the returned HTML and collect the text of every article title link
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]

# Remove duplicates and sort alphabetically, then print one title per line
unique_titles = sorted(set(titles))
for t in unique_titles:
    print(t)

Note that we also create a unique_titles object by turning the list into a set and then sorting it. Sets in Python do not store duplicate values, so this is an easy way to remove duplicate titles from our original list.

You should get a response that’s similar to:

Screenshot of response

Finally, some websites serve the same page with different data depending on the visitor's location. Most ecommerce businesses do this to show localized prices and currencies.
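
As a rough sketch of how you could compare what each region sees, request the same page with different country_code values. The product URL below is a placeholder you'd swap for a page that actually localizes its content:

import requests

# Placeholder URL; substitute a page that localizes prices or content
URL = "https://www.example.com/product/123"

for country in ["us", "de", "jp"]:
    payload = {"api_key": "YOUR-API-KEY-HERE", "url": URL, "country_code": country}
    resp = requests.get("https://api.scraperapi.com", params=payload)
    # Compare status codes and response sizes (or parse prices) per region
    print(country, resp.status_code, len(resp.text))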

Storing data with Pandas

You'll likely want to do more than just print the data; otherwise, everything is lost as soon as you close your IDE.

For basic scraping projects, the pandas library is usually more than enough. Start by installing it:

pip install pandas

We'll also import the standard-library datetime module to add a timestamp to our CSV file name, which is highly useful if you need to return to the data later.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

Running the code will now create a CSV file and print a success message. The underlying code is quite simple: a DataFrame is created with a single column named "Headline", and each row holds one of the titles.

To add a timestamp to the file name, we call datetime.now() and turn it into a string with strftime(), passing the desired format as the argument.

Finally, the DataFrame is written to a CSV file with to_csv().

Note: We use "utf-16" encoding because some spreadsheet applications don't display "utf-8" characters correctly by default.

Your CSV file should look a little like this:

Screenshot of CSV file

Further considerations

Scraping a single website with ScraperAPI is ultimately a little too simple for any real-world project, although it serves as a great starting point. You can improve your scraping code in two primary ways.

One is to use ScraperAPI to scrape a website's homepage, collect all the links it contains, and keep following them to build up your list of URLs, as sketched below.
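
Here's a minimal sketch of that approach, assuming you only keep links that stay on the same domain; the filtering rule is just an example you'd adapt to the target site:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.chicagotribune.com/"
payload = {"api_key": "YOUR-API-KEY-HERE", "url": BASE, "country_code": "us"}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")

# Turn every anchor on the homepage into an absolute URL and keep on-site links
urls = {
    urljoin(BASE, a["href"])
    for a in soup.select("a[href]")
    if urljoin(BASE, a["href"]).startswith(BASE)
}
print(f"Collected {len(urls)} URLs to scrape next")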

Alternatively, you can create a list manually with all the URLs you want to scrape and run a loop that sends one request per entry, as shown below.
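
Here's a short sketch of that second approach, looping over a hand-picked list of placeholder URLs and sending one ScraperAPI request per entry:

import requests

# Placeholder URLs; replace with the pages you actually want to scrape
urls = [
    "https://www.chicagotribune.com/",
    "https://www.chicagotribune.com/sports/",
    "https://www.chicagotribune.com/business/",
]

for url in urls:
    payload = {"api_key": "YOUR-API-KEY-HERE", "url": url, "country_code": "us"}
    resp = requests.get("https://api.scraperapi.com", params=payload)
    print(url, resp.status_code)  # Parse resp.text here as shown earlier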

Here's the full code block that you can build upon:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
payload = {
    "api_key": "YOUR-API-KEY-HERE",
    "url": "https://www.chicagotribune.com/",
    "country_code": "us",
}
resp = requests.get("https://api.scraperapi.com", params=payload)
soup = BeautifulSoup(resp.text, "html.parser")
titles = [
    a.get_text(strip=True)
    for a in soup.select("a.article-title")
]
unique_titles = sorted(set(titles))
df = pd.DataFrame({"Headline": unique_titles})
today = datetime.now().strftime("%Y-%m-%d")
outfile = f"chicago_tribune_headlines_{today}.csv"
df.to_csv(outfile, index=False, encoding="utf-16")
print(f"✔ Saved {len(df)} headlines → {outfile}")

 

About the author


Leonardo Rodriguez

Leo is a technical content writer based in Italy with experience in Python and Node.js. He’s currently ScraperAPI's content manager and lead writer. Contact him on LinkedIn.
