How Does Web Scraping Work?

Every website is a unique puzzle, and so is every scraper to a certain degree.

In general, a web scraper is given a URL or list of URLs, requests each page’s HTML file, parses the response to generate a node tree, and extracts the data it needs by targeting elements using HTML tags or CSS/XPath selectors. It can then export the scraped data to a database, spreadsheet, JSON file, or any other structured format required.

How Does a Web Scraper Work?

A web scraper is an automated tool that extracts data from a list of URLs based on pre-set criteria. It gathers information such as names, prices, and addresses, then exports that information into a usable format like a spreadsheet or a database.

Common use cases include finding property listings for real estate, conducting market research, and gathering intelligence on competitors.

There are two main components to any web scraper: the web crawler and the scraper itself.

  • Web Crawler: the web crawler is functionally similar to a search engine bot. Given a seed list of URLs, it goes down the list one by one, catalogs the information it finds, and follows every link it can discover on the current page and on subsequent pages until it hits a specified limit (or there are no more links to follow).
  • Web Scraper: once the program visits a web page, it parses the page’s code to extract the information it needs. Most web scrapers parse only the HTML, but more advanced ones also fully render the CSS and JavaScript. Once the data is extracted, the scraper exports it and stores it, usually in a .sql, .xls, or .csv file. (A toy example combining both components follows this list.)
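
To make that concrete, here’s a minimal sketch of both components working together in Python; the page limit, the on-site filter, and the use of Beautiful Soup are choices made just for this example:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed_urls = ['https://quotes.toscrape.com/']  # the crawler's starting point
to_visit = list(seed_urls)
visited = set()
PAGE_LIMIT = 10  # arbitrary limit for this sketch

while to_visit and len(visited) < PAGE_LIMIT:
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)

    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Crawler: collect every link on the page and queue it for later
    for link in soup.find_all('a', href=True):
        next_url = urljoin(url, link['href'])
        if next_url.startswith('https://quotes.toscrape.com'):  # stay on the target site
            to_visit.append(next_url)

    # Scraper: extract the data we actually care about
    for quote in soup.find_all('div', class_='quote'):
        print(quote.find('span', class_='text').text)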

Of course, this is easier said than done. Extracting data from a single HTML page only takes a few lines of code; the real challenge is scaling your project to the thousands or even millions of pages you’ll need.

In those scenarios, you’ll have to manage IP addresses, browsers, and request templates, and always be ready to pick the right combination to fetch each page; and those are only a few of the roadblocks you’ll face.

Managing all of these complexities is nearly impossible for a single developer, or even for a team without dedicated resources.

For those reasons, there are essential features you need to consider when choosing or building your web scraper:

Features to Look For in a Web Scraper

Web scraping is tricky to pull off well and takes careful consideration and planning to execute successfully. Bad actors have misused web scraping tools to mine the web for personal data and other sensitive information, and in response, websites have become more sophisticated in how they filter out their traffic. This, in turn, makes it harder for web scrapers to extract data, even for ethical and legitimate purposes.

With this in mind, here are some of the features you should look out for in a web scraper, whether you subscribe to a service or build one yourself:

IP Rotation

To detect web scrapers, servers use several mechanisms to distinguish your script from a human user:

  • Servers measure the frequency of HTTP requests; if requests arrive faster than a human could make them, that’s a strong signal to block your scraper.
  • Using behavior analysis, servers collect data points about your user agent, such as the order in which pages are visited, which resources it fetches and which it skips, and mouse movement. Once the server has enough evidence, it can cut your access to the website.
  • Servers also use IP reputation checkers to ensure that only reputable IP addresses connect to the site, blocking every IP known for spammy behavior (email providers rely heavily on the same technique).

And these are just three systems a lot of websites have in place.

Note: Learn more about anti-scraping techniques.

IP rotation lets your web scraper cycle through different IP addresses, reducing the likelihood that it’ll be blocked while allowing for more simultaneous scrapes.
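
As a rough sketch, assuming you already have a pool of working proxies (the addresses below are placeholders), rotating IPs with Python’s requests library could look like this:

import itertools
import requests

# Placeholder proxy addresses -- swap in real proxies from your own pool
proxy_pool = itertools.cycle([
    'http://111.111.111.111:8080',
    'http://222.222.222.222:8080',
    'http://123.123.123.123:8080',
])

urls = [f'https://quotes.toscrape.com/page/{page}/' for page in range(1, 4)]

for url in urls:
    proxy = next(proxy_pool)  # rotate to the next IP on every request
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})
    print(url, response.status_code)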

CAPTCHA Handling

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) tests determine whether a visitor to a website is a human or a bot to detect malicious behavior and protect websites from harmful activity. CAPTCHA tests use problem-solving challenges that are easier for humans than bots or automated programs.

CAPTCHA tests are one of the biggest challenges for effective web scraping. The most common types of CAPTCHA tests include:

  • Text-based
  • Image-based
  • Audio-based
  • Social media logins

To deal with CAPTCHA challenges, your web scraper could:

  • Slow down the scraping process to simulate human behavior (see the sketch after this list)
  • Use proxy servers and rotate IP addresses
  • Avoid honeypot traps, such as CSS elements with their visibility turned off
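
Here’s a minimal sketch of the first tactic, spacing out requests with randomized delays; the 2 to 6 second range is an arbitrary assumption rather than a proven threshold:

import random
import time

import requests

urls = [
    'https://quotes.toscrape.com/page/1/',
    'https://quotes.toscrape.com/page/2/',
    'https://quotes.toscrape.com/page/3/',
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause for a random 2-6 seconds so the request pattern looks less robotic
    time.sleep(random.uniform(2, 6))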

Proxy Management

A proxy server is a third-party server that lets you route your requests through a different IP address. VPNs use proxy servers to let their users browse the internet anonymously and to keep their web activity and private data secure from hackers, governments, and advertisers. They also let users unlock gated content that’s restricted to a region.

Web scrapers use proxy management to access websites from different geographical locations, avoid honeypot traps, and sidestep CAPTCHA challenges.
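
For illustration, here’s how routing a request through a single geolocated proxy might look with Python’s requests library; the proxy address and credentials are placeholders, not a real service:

import requests

# Placeholder address and credentials for a hypothetical French proxy
proxies = {
    'http': 'http://user:password@fr.proxy.example.com:8080',
    'https': 'http://user:password@fr.proxy.example.com:8080',
}

response = requests.get('https://quotes.toscrape.com/', proxies=proxies, timeout=10)
print(response.status_code)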

HTTP Header Optimization

HTTP headers let the client and server exchange information and metadata. They add context to the data being sent between the server and the client, such as its age, cache policy, and content type, which makes for more secure, better-quality data exchanges. They also influence how the server treats your request, and therefore what data your scraper can extract from a website.

Using the optimal HTTP headers makes it less likely that your web scraper will be blocked by a given website (see the example after this list). The most useful HTTP headers for web scraping are:

  • User-Agent: indicates the device making the request, its operating system, browser version, etc.
  • Referer: gives the previous web page’s URL
  • Accept-Language: tells the server which languages the client-side browser understands, and which language to send back
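
For reference, here’s what setting those headers could look like with requests; the values below are examples, not a combination guaranteed to work on every site:

import requests

headers = {
    # Pretend to be a common desktop browser
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/120.0.0.0 Safari/537.36'
    ),
    # Make the visit look like it came from a search result
    'Referer': 'https://www.google.com/',
    # Ask for English content
    'Accept-Language': 'en-US,en;q=0.9',
}

response = requests.get('https://quotes.toscrape.com/', headers=headers)
print(response.status_code)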

JavaScript Rendering

Most web scrapers are able to parse and scrape HTML code because, traditionally, most websites were built using only HTML and CSS, with JavaScript used to make web pages more interactive and add to their functionality.

However, more and more websites now build their front end with JavaScript thanks to the rising popularity of frameworks like Angular, React, and Vue. Ideally, your web scraper should be able to render a web page fully, including content generated by JavaScript.
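
Outside of a scraping API, the usual way to do this is to drive a headless browser. Here’s a minimal Selenium sketch, assuming Selenium 4+ and Chrome are installed, pointed at the JavaScript-rendered version of Quotes to Scrape:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get('https://quotes.toscrape.com/js/')  # JavaScript-rendered version of the site

# These elements only exist after the page's JavaScript has run
for quote in driver.find_elements(By.CSS_SELECTOR, 'div.quote span.text'):
    print(quote.text)

driver.quit()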

Quick Web Scraper Breakdown

When defining what a web scraper is, we built a simple Quotes scraper using Python. Let’s break it down to get a visual representation of everything we’ve talked about so far.

1. First, we need to provide our scraper with the tools and URLs it will use to extract the data.

import requests
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com/'

2. Then, we send an HTTP GET request to download the page’s HTML file and pass the response to our parser to generate the node/parse tree; in this case, we’re using Beautiful Soup as our parsing library.

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

3. Lastly, we use the elements’ CSS selectors to traverse the node tree and pick only the elements we want to extract.

all_quotes = soup.find_all('div', class_='quote')
for quote in all_quotes:
    quote_text = quote.find('span', class_='text').text
    print(quote_text)

If we run our code as is, it will definitely work. As its name suggests, Quotes to Scrape is a website built for testing and practicing web scraping, so it’s a super easy site to access even with our half-baked scraper.

However, most data-heavy websites use strong anti-scraping systems, making it nearly impossible to access their data without implementing the features above (IP rotation, CAPTCHA handling, and so on). Otherwise, the server will quickly recognize our scraper as a bot and block our IP, often permanently, from accessing the site.

To prevent blacklists and bans while scraping, we’ll send our request through ScraperAPI’s server and let it handle these complexities for us:

Integrating ScraperAPI into Your Web Scraper

When you send your HTTP requests through ScraperAPI’s endpoint, it’ll automatically rotate your IP address between requests, use machine learning and years of statistical analysis to choose the right combination of headers, and handle any CAPTCHAs that might pop up.

The best part: all you need to do to make it work is send your requests through the following URL structure:

'https://api.scraperapi.com?api_key={Your_API_Key}&url={Your_URL}'

Here’s what your code would look like:

import requests
from bs4 import BeautifulSoup

url = 'http://api.scraperapi.com?api_key={Your_API_Key}&url=https://quotes.toscrape.com/'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

all_quotes = soup.find_all('div', class_='quote')
for quote in all_quotes:
    quote_text = quote.find('span', class_='text').text
    print(quote_text)

Customize Your API

Although ScraperAPI adds a lot of functionality automatically, you can also customize it to meet your requirements using a few parameters inside the request URL:

1. Scrape Dynamic Pages

A lot of websites now use JavaScript to add content to the page, making it impossible for your scraper to access that data because the server returns the source HTML before any rendering happens. To handle these pages, you can ask ScraperAPI to render the page before returning the HTML data by adding the &render=true parameter.

'http://api.scraperapi.com?api_key={Your_API_Key}&render=true&url=https://quotes.toscrape.com/'

2. Scrape Location-Sensitive Data

If you’re in the US but need to scrape real estate listings in France, you can use the country_code parameter and set it to fr.

'http://api.scraperapi.com?api_key={Your_API_Key}&country_code=fr&url=https://quotes.toscrape.com/'

And you can do that for 50+ countries around the globe. Here’s the full country code list.

3. Use Customized Headers

We’ve seen an increase in demand for scraping lesser-known but very well-protected websites. Because we don’t have as much data on them, it’s harder for ScraperAPI to find the right header combination on its own.

For those cases, you can use the keep_headers=true parameter to tell ScraperAPI to use the headers you specified in your code.

⚠️ Please Note

Using custom headers is what we call a “Jesus take the wheel” scenario as:

  1. It effectively overrides our bypass techniques
  2. We’ll forward every single header you pass, even those you may not be aware of (e.g., the client’s default user agent)

'http://api.scraperapi.com?api_key={Your_API_Key}&keep_headers=true&url=https://quotes.toscrape.com/'
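
In practice, that means sending your custom headers along with the request you make to the API endpoint. Here’s a sketch of how that could look (the header values are purely illustrative):

import requests

# Illustrative header values -- replace with whatever your target site expects
headers = {
    'User-Agent': 'MyCustomAgent/1.0',
    'Accept-Language': 'en-US,en;q=0.9',
}

api_url = (
    'http://api.scraperapi.com'
    '?api_key={Your_API_Key}'
    '&keep_headers=true'
    '&url=https://quotes.toscrape.com/'
)

# With keep_headers=true, ScraperAPI forwards these headers to the target site
response = requests.get(api_url, headers=headers)
print(response.status_code)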

As a best practice, add all extra parameters before the url parameter to avoid any problems with special characters.

Note: Check our documentation for an in-depth walkthrough of ScraperAPI’s customizable functionalities.
