5 Tips For Web Scraping Without Getting Blocked or Blacklisted

Updated 2019-08-22 by The Scraper API Team
Web scraping can be difficult, particularly when most popular sites actively try to prevent developers from scraping them. However, there are many strategies developers can use to avoid blocks and keep their web scrapers undetected. Here are a few quick tips on how to crawl a website without getting blocked:

1. IP Rotation

The number one way sites detect web scrapers is by examining their IP address, so much of scraping without getting blocked comes down to rotating through a number of different IP addresses so that no single address gets banned. To avoid sending all of your requests from the same IP address, you can use Scraper API or another proxy service to route your requests through a series of different IP addresses. This will allow you to scrape the majority of websites without issue. For sites that use more advanced proxy blacklists, you may need to try residential or mobile proxies; if you are not familiar with these terms, you can check out our article on the different types of proxies here.
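As a minimal sketch of the idea, here is a fetch function that cycles each request through a pool of proxies using only the standard library. The proxy addresses below are hypothetical placeholders; substitute the endpoints your proxy provider gives you.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with addresses from your proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# An endless iterator that hands out proxies round-robin.
proxy_pool = itertools.cycle(PROXIES)

def fetch(url: str) -> bytes:
    """Fetch a URL, routing the request through the next proxy in the pool."""
    proxy = next(proxy_pool)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=10) as response:
        return response.read()
```

A more production-ready version would also retry a failed request through a different proxy and drop proxies that fail repeatedly, but the round-robin pool above is the core of the technique.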

2. Set a Real User Agent

Some websites will examine User Agents and block requests whose User Agent doesn't belong to a major browser. Most web scrapers don't bother setting a User Agent and are easily detected by this missing header. Don't be one of these developers! Remember to set a popular User Agent for your web crawler (you can find a list of popular User Agents here). Advanced users can also set their User Agent to the Googlebot User Agent, since most websites want to be listed on Google and therefore let Googlebot through. It's important to keep the User Agents you use relatively up to date!
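Setting the header is a one-liner in most HTTP clients. A sketch with the standard library, using an example Chrome User Agent string (the version numbers are illustrative and will age, which is exactly why you should refresh them periodically):

```python
import urllib.request

# Example Chrome-on-Windows User Agent string; update the version numbers
# periodically so it matches a current browser release.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies itself as a mainstream browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```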

3. Set other headers

Real web browsers set a whole host of headers, any of which can be checked by careful websites to block your web scraper. To make your scraper appear to be a real browser, you can navigate to https://httpbin.org/anything and simply copy the headers you see there (they are the headers your current web browser is using). Setting headers like "Accept", "Accept-Encoding", "Accept-Language", and "Cache-Control" will make your requests look like they are coming from a real browser, so your web scraping won't get blocked.
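Building on the previous sketch, you can send the whole header set at once. The values below are illustrative stand-ins; copy the actual headers your own browser reports at https://httpbin.org/anything:

```python
import urllib.request

# Illustrative browser-like headers; replace the values with the ones your
# own browser reports at https://httpbin.org/anything.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
}

def browser_like_request(url: str) -> urllib.request.Request:
    """Build a request carrying the full set of browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```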

4. Set random intervals in between your requests

It is easy to detect a web scraper that sends exactly one request each second, 24 hours a day! No real person would ever use a website like that, and such an obvious pattern is easily detectable. Use randomized delays (anywhere between 2 and 10 seconds, for example) to build a web scraper that can avoid being blocked. Also, remember to be polite: if you send requests too fast you can crash the website for everyone, and if you notice responses getting slower and slower, you should send requests more slowly so you don't overload the web server (you'll definitely want to do this in frameworks like Scrapy to help avoid being banned).
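A randomized delay is just a uniform draw between your chosen bounds. A minimal sketch, with the 2-10 second range from above as the default:

```python
import random
import time

def polite_delay(min_s: float = 2.0, max_s: float = 10.0) -> float:
    """Sleep for a random interval so requests don't arrive on a fixed cadence.

    Returns the delay actually used, which is handy for logging.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call `polite_delay()` between requests in your crawl loop; widening the bounds (or lengthening them when responses slow down) makes the traffic pattern both less detectable and kinder to the server.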

5. Use a headless browser (advanced web scraping tip)

The trickiest websites to scrape may check subtle tells like web fonts, extensions, browser cookies, and JavaScript execution to determine whether or not the request is coming from a real user. To scrape these websites you may need to deploy your own headless browser (or have Scraper API do it for you!). Tools like Selenium and Puppeteer let you write a program that controls a real web browser, identical to what a real user would use, in order to avoid detection entirely. While making Selenium or Puppeteer undetectable takes quite a bit of work, this is the most effective way to scrape websites that would otherwise give you real difficulty.
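To give a flavor of the setup, here is a sketch of launching headless Chrome with Selenium. It assumes `selenium` (4.6+) is installed and a local Chrome is available; making the resulting browser truly undetectable takes considerably more work than this.

```python
def make_headless_driver():
    """Start a headless Chrome session via Selenium.

    Assumes `pip install selenium` (4.6+, so Selenium Manager can fetch a
    matching chromedriver) and a local Chrome installation.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless=new")        # Chrome's newer headless mode
    opts.add_argument("--window-size=1920,1080")  # a realistic desktop viewport
    return webdriver.Chrome(options=opts)
```

From there, `driver = make_headless_driver()` followed by `driver.get(url)` loads pages with full JavaScript execution, cookies, and font rendering, exactly the signals the trickiest sites inspect.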

Hopefully you've learned a few useful tips for scraping popular websites without being blacklisted or IP banned. If you have a web scraping job you'd like to talk to us about, or would like help making your web scraper avoid detection, please fill out this form and we'll get back to you within 24 hours. Happy scraping!
Our Story
Having built many web scrapers, we repeatedly went through the tiresome process of finding proxies, setting up headless browsers, and handling CAPTCHAs. That's why we decided to start Scraper API: it handles all of this for you, so you can scrape any page with a simple API call!