Why Is My Scraper Getting Blocked?

Website owners can detect and block your scrapers. They do this by spotting patterns in your request frequency, inspecting the headers you send, tracking the IP address you're using, or challenging you with CAPTCHAs.

To bypass these techniques, you can use a web scraping API to rotate IPs, handle CAPTCHAs, and choose the right headers dynamically.

If you want to build more resilient scripts, follow these 10 web scraping tips to avoid getting blocked or blacklisted.
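
Many of those tips boil down to making your traffic look less automated. As a minimal illustration, here is a sketch that randomizes the delay between requests and rotates the User-Agent header; the delay range and the User-Agent strings are illustrative assumptions, not tuned values:

import random
import time

import requests

# A small pool of realistic browser User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

def polite_get(url):
    # Randomize the delay so requests don't arrive at a fixed frequency
    time.sleep(random.uniform(2, 6))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)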

Using ScraperAPI to Avoid Anti-Scraping Techniques

ScraperAPI is a data collection tool designed to handle the most technically demanding challenges of building automated web scrapers. To send requests through our servers, you only need to add a couple of lines to your scripts.

We’ll write this example in Python, but you can use any programming language and framework you prefer.

First, you’ll need to create a free ScraperAPI account to access your API key and pick your target URL. In this case, we want to retrieve product data from https://www.burton.com/it/en/c/mens-snowboard-boots.

import requests

payload = {
    'api_key': 'YOUR_API_KEY',  # found in your ScraperAPI dashboard
    'url': 'https://www.burton.com/it/en/c/mens-snowboard-boots'  # target page
}

# Route the request through ScraperAPI's servers
response = requests.get('https://api.scraperapi.com', params=payload)
print(response.text)

ScraperAPI uses years of statistical analysis and machine learning to choose the right combination of HTTP headers and IPs, rotating them when needed to ensure a successful response, and returns the raw HTML for you to parse.
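
Before handing the HTML to your parser, it's worth checking that the request actually succeeded. A minimal sketch of that check:

import requests

payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://www.burton.com/it/en/c/mens-snowboard-boots'
}

response = requests.get('https://api.scraperapi.com', params=payload)

# A 200 status code means ScraperAPI returned the page successfully
if response.status_code == 200:
    html = response.text  # raw HTML, ready for your parser
else:
    print('Request failed with status', response.status_code)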

Use the Async Scraper Service For Difficult-to-Scrape Websites

Some websites use more complex and advanced anti-scraping techniques. These are harder to bypass without managing retries, sessions, and timeouts yourself.

For these types of websites, we've created an additional endpoint that lets you submit web scraping jobs to ScraperAPI through a post() request.

import requests

# Submit a scraping job to the Async Scraper endpoint
submit_job = requests.post(
    url='https://async.scraperapi.com/jobs',
    json={
        'apiKey': 'YOUR_API_KEY',
        'url': 'https://www.burton.com/it/en/c/mens-snowboard-boots'
    }
)

print(submit_job.text)

After you submit the job, ScraperAPI handles the request and any challenge that comes up along the way, such as rotating IPs after a failed request, retrying failed requests, and dealing with CAPTCHAs and other anti-scraping techniques, and then returns the HTML data within a JSON response.

You can access this JSON response through the statusUrl endpoint provided in the response to the post() request:

{
   "id": "aab8a3a8-569e-4e70-9a30-0ab7b105905f",
   "status": "running",
   "statusUrl": "https://async.scraperapi.com/jobs/aab8a3a8-569e-4e70-9a30-0ab7b105905f",
   "url": "https://www.burton.com/it/en/c/mens-snowboard-boots"
}
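
One way to retrieve the result is to poll statusUrl until the job stops running. Here's a minimal sketch that reuses the statusUrl from the response above; the exact shape of the finished payload may differ, so inspect the full JSON once the status changes:

import time

import requests

# statusUrl returned by the post() request above
status_url = 'https://async.scraperapi.com/jobs/aab8a3a8-569e-4e70-9a30-0ab7b105905f'

while True:
    job = requests.get(status_url).json()
    if job['status'] != 'running':
        break
    time.sleep(5)  # wait between checks instead of hammering the endpoint

print(job['status'])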

Alternatively, you can set a webhook callback, and the Async Scraper will send the data to it once the job is done running.
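
As a sketch, that could look like the following; the shape of the callback object is an assumption, and the receiving URL is a hypothetical endpoint on your own server, so check the Async Scraper documentation for the exact format:

import requests

submit_job = requests.post(
    url='https://async.scraperapi.com/jobs',
    json={
        'apiKey': 'YOUR_API_KEY',
        'url': 'https://www.burton.com/it/en/c/mens-snowboard-boots',
        # Hypothetical receiver on your server; the Async Scraper will
        # POST the job result here once it's done
        'callback': {
            'type': 'webhook',
            'url': 'https://yourserver.example.com/scraperapi-results'
        }
    }
)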

Resource: Learn how to submit scraping jobs in bulk to the Async Scraper.

Pro Tip 💡

If you’re scraping large domains like Amazon, Google Search, or Twitter, you can use our Structured Data Endpoints to turn these websites into easy-to-use APIs and retrieve structured JSON data from any page without generating complicated URLs.

For example, the Amazon-specific API just requires you to send your API key and the target product's ASIN in a get() request:

import requests

payload = {
    'api_key': 'YOUR_API_KEY',  # never publish your real key
    'asin': 'B08LNZVQ1J',       # the product's ASIN
    'country': 'us'
}

r = requests.get('https://api.scraperapi.com/structured/amazon/product', params=payload)
product = r.json()
print('The price of ', product['name'], ' is ', product['pricing'])

Printed this way, the result looks like the following:

The price of  HB Shower Speaker Illumination 2.0 – Bluetooth Shower Speaker – IPX7 Waterproof, Shockproof, Dustproof with LED Lights – Bluetooth 4.0 Pairs with Phones, Tablets, Computer and Radio.  is  $29.99

You can find a full Amazon product response here.
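
The other Structured Data Endpoints follow the same pattern. As a sketch, a Google Search request might look like this; the endpoint path and parameter names here are assumptions based on the same convention, so verify them against the docs:

import requests

payload = {
    'api_key': 'YOUR_API_KEY',
    'query': 'mens snowboard boots'  # a search term instead of a full URL
}

r = requests.get('https://api.scraperapi.com/structured/google/search', params=payload)
print(r.json())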
