How to Grab HTTP Headers and Cookies for Web Scraping


Are your scripts getting blocked constantly, even when using proxies? You might be forgetting an important piece of the puzzle: custom HTTP headers.

In today’s article, we’ll dive deeper into HTTP headers, why they are important for web scraping, and how we can grab and use them in our code.

What are HTTP Headers?

According to MDN, “An HTTP header is a field of an HTTP request or response that passes additional context and metadata about the request or response.” It consists of a case-insensitive name (like Age, Cache-Control, Date, Cookie, etc.) followed by a colon (:) and then its value.

In simpler terms:

  • The user/client (usually a browser) sends a request whose request headers give the server additional details.
  • Then, the server responds with the requested data in a structure that fits the specifications contained in those request headers.

For clarity, here are the request headers our browser sent while visiting Prerender at the time of writing this article:

:authority: in.hotjar.com
:method: POST
:path: /api/v2/client/sites/2829708/visit-data?sv=7
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9,it;q=0.8,es;q=0.7
content-length: 112
content-type: text/plain; charset=UTF-8
origin: https://prerender.io
referer: https://prerender.io/
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="101", "Microsoft Edge";v="101"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47

What are Web Cookies?

Web cookies, also known as HTTP cookies or browser cookies, are small pieces of data sent by a server (in an HTTP response header) to a user’s browser for later identification.

In a later request (in an HTTP request header), the browser will send the cookie back to the server, making it possible for the server to recognize the browser.

Websites and web applications use cookies for session management (like keeping you logged in), personalization (to keep settings and preferences), tracking, and, in some cases, security.
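
To see this exchange in action, here’s a minimal sketch using Python’s Requests library and httpbin’s cookie endpoints (the cookie name and value are arbitrary and only for illustration):

import requests

# A Session stores cookies set by the server (Set-Cookie response header)
# and sends them back automatically on later requests, just like a browser.
session = requests.Session()

# This endpoint responds with a Set-Cookie header
session.get('https://httpbin.org/cookies/set?session_id=abc123')

# The stored cookie is now sent back in the Cookie request header
response = session.get('https://httpbin.org/cookies')
print(response.text)  # {"cookies": {"session_id": "abc123"}}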

Let’s see what a cookie would look like in LinkedIn’s request header:

[Screenshot: the cookie field in LinkedIn’s request headers]

Why are Headers Important for Web Scraping?

A lot of website owners know their data will be scraped one way or another, so they use many different tools and strategies to identify bots and block them from their sites. And they have valid reasons to do so, as badly optimized bots can slow down or even break websites.

Note: You can avoid overwhelming web servers and decrease your chances of getting blocked by following these web scraping best practices.

However, one detection method that doesn’t get a lot of attention is the use of HTTP headers and cookies.

Because your browser/client sends HTTP headers in its requests, the server can use this information to detect non-human users and then block them from accessing the website or serve them false/bad information.

But the same can work the other way around.

By optimizing the headers sent through our bot’s requests, we can mimic the behavior of an organic user, reducing our chances of getting blacklisted and, in some cases, improving the quality of the data we collect.

Most Common HTTP Headers for Web Scraping

There is a big list of HTTP headers we could learn and use in our requests, but in most cases, there are only a few that we really care about for web scraping:

1. User-Agent

This is probably the most important header as it identifies “the application type, operating system, software vendor or software version of the requesting software user agent,” making it the first check most servers will run.

For example, when sending a request using the Requests Python library, the user-agent field will show the following information – depending on your version of Requests:

user-agent: python-requests/2.22.0

This is easy for the server to spot and block.

Instead, we want our user-agent header to look more like the one shown in our first example:

user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47
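
If you’re sending many requests, it can also help to rotate between a few realistic user-agent strings instead of reusing the same one every time. Here’s a minimal sketch in Python (the strings below are only examples and should be replaced with current browser versions):

import random
import requests

# Example desktop user-agent strings; swap in up-to-date browser versions
USER_AGENTS = [
	'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36',
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0',
]

# Pick a user-agent at random for this request
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])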

2. Accept-Language

Although it’s not always necessary, it tells the server what language version of the data to provide.

When the language requested varies wildly from one request to the next, it could tell the server there’s a bot involved.

However, technically speaking, “the server should always pay attention not to override an explicit user choice,” so if the URLs you’re scraping already have specific languages, it can still be perceived as an organic user.

Here’s how it appears in our example request:

accept-language: en-US,en;q=0.9,it;q=0.8,es;q=0.7

3. Accept-Encoding

Just as its name implies, it tells the server which compression algorithm “can be used on the resource sent back,” saving up to 70% of the bandwidth needed for certain documents and reducing the stress our scripts put on servers.

accept-encoding: gzip, deflate, br
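
If you’re using Python’s Requests library, it already advertises compression support by default and transparently decompresses the response body, so there’s usually nothing extra to configure. A quick way to confirm this, sketched against httpbin’s gzip endpoint:

import requests

# Requests sends an Accept-Encoding header by default and automatically
# decompresses gzip/deflate responses, so response.text is plain text.
response = requests.get('https://httpbin.org/gzip')
print(response.request.headers.get('Accept-Encoding'))  # e.g. 'gzip, deflate'
print(response.json()['gzipped'])  # True: the body was compressed in transit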

4. Referer

The Referer HTTP header tells the server the page the user is coming from. Although it’s used mostly for tracking, it can also help us mimic an organic user’s behavior by, for example, telling the server we’ve come from a search engine like Google.

referer: https://prerender.io/

5. Cookie

We’ve already discussed what cookies are. However, we haven’t yet stated clearly why cookies are important for web scraping.

Cookies let servers and browsers communicate using a small piece of data, but what happens when the server sends a cookie and the browser doesn’t store it and send it back in the next request? That’s where the trick lies.

Cookies can also be used to identify if the request is coming from a real user or a bot.

Conversely, we can use web cookies to mimic a user’s organic behavior when browsing a website by sending the cookies back to the server after every interaction.

By changing the cookie itself, we can also tell the server we’re a new user, making it easier for our scraping scripts to avoid getting blocked.
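
As a rough illustration, here’s how you could send a cookie with Python Requests and then drop it to look like a brand-new visitor (the cookie name and value are made up for the example):

import requests

# Made-up cookie for illustration; real sites issue their own names and values
cookies = {'session-id': 'abc123'}

# Send the cookie back with the request, as a browser would
response = requests.get('https://httpbin.org/cookies', cookies=cookies)
print(response.text)  # httpbin echoes back the cookies it received

# Requesting again without cookies (or with different values) presents us
# to the server as a new user
response = requests.get('https://httpbin.org/cookies')
print(response.text)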

How to View HTTP Headers? [Including Cookies]

Before we can use headers in our code, we need to be able to grab them from somewhere.

Let’s use our browser and navigate to the target website.

For example, go to google.com > right-click > inspect to open the developer tools.

[Screenshot: the Network tab in DevTools]

From there, we’ll navigate to the Network tab and, back on Google, search for the query “web scraping headers.”

[Screenshot: Google search results for the query "web scraping headers"]

As the page loads, we’ll see the network tab populate. Take a closer look at the Fetch/XHR tab, where we’ll be able to find the documents that are being fetched by the browser and, of course, the HTTP headers used in the request.

Although there’s no single standard name for the file we’re looking for, its name is usually relevant to what we’re trying to do on the page, or it’s the file that provides the data being rendered.

Clicking on the file opens the Headers tab by default, and by scrolling down we’ll see the Request Headers section.

[Screenshot: the Request Headers section]

Now we can copy and paste the header fields and their values into our code.

How to Use Custom Headers for Web Scraping

How you use custom headers will depend on the programming language you’re using. Still, we’ll try to add as many examples as possible.

You’ll notice that each method is different, but they share the same logic, so no matter what codebase you’re using, it’ll be very easy to translate.

For all our examples, we’ll send our request to https://httpbin.org/headers, a website designed to show how requests and responses look from the server’s point of view.

Here’s what appears if we open the link above on our browser:

[Screenshot: the headers returned by httpbin]

The website responds with the headers that our browser is sending, and we can now use the same headers when sending our request through code.
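
For reference, the body returned by https://httpbin.org/headers is a JSON object along these lines (the exact values depend on your browser, and some fields are shortened here):

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,...",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."
  }
}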

Using Custom Headers in Python

Before using our custom headers, let’s first send a test request to see what it returns:

import requests

url = 'https://httpbin.org/headers'

response = requests.get(url)

print(response.text)

[Screenshot: the default headers returned by httpbin for the Python request]

As you can see, the User-Agent being used is the default python-requests/2.26.0, making it super easy for a server to recognize our bot.

Knowing this, let’s move to the next step and add our custom headers to our request:

import requests

url = 'https://httpbin.org/headers'

headers = {
	'accept': '*/*',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
	'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
	'referer': 'https://www.google.com/',
	'cookie': 'DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0'
}

response = requests.get(url, headers=headers)

print(response.text)

  • First, we create a dictionary with our headers. Some of them we got from httpbin, but the cookie is taken from the MDN documentation.
  • Then, we added Google as the referer because, in most cases, users will come to the site by clicking a link from Google.

Here’s the result:

[Screenshot: the httpbin response showing our custom headers]

Theoretically, you could use any combination you’d like as the values of your custom headers.

However, some websites won’t allow you access to them without sending a specific set of headers.

When using custom headers while scraping, it’s better to see what headers your browser already sends when navigating the target website and use those.
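
One convenient way to do that in Python is to set the copied headers once on a requests.Session, so every request in your script reuses them. Here’s a minimal sketch with a shortened set of headers:

import requests

session = requests.Session()

# Headers copied from the browser's DevTools; every request made through
# this session will include them automatically
session.headers.update({
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
	'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
	'Referer': 'https://www.google.com/',
})

response = session.get('https://httpbin.org/headers')
print(response.text)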

Using Custom Headers in Node.JS

We’re going to do the same thing but using the Node.JS Axios package to send our request:

const axios = require('axios').default;

const url = 'https://httpbin.org/headers';

const headers = {
	'accept': '*/*',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
	'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
	'referer': 'https://www.google.com/',
	'cookie': 'DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0',
};

axios.get(url, { headers })
	.then((response) => {
		console.log(response.data);
	}, (error) => {
		console.log(error);
	});

Here’s the result:

[Screenshot: the httpbin response to the Axios request with custom headers]

Using Custom Headers in ScraperAPI

A great feature of ScraperAPI is that it uses machine learning and years of statistical analysis to determine the best header combinations for each request we send.

However, lately we’re seeing more and more small websites improving their security measures, and for some of them there is not enough data for the API to determine the optimal combination automatically.

If you’re noticing a high failure rate, it’s worth trying to use custom headers, as we’ve shown above.

In our documentation, you’ll be able to find the full list of coding examples for Node.JS, PHP, Ruby, and Java. But for the sake of time, here’s a full Python example for using ScraperAPI with custom headers:

import requests

headers = {
	'accept': '*/*',
	'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
	'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
	'referer': 'https://www.google.com/',
	'cookie': 'DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0'
}

payload = {
	'api_key': '51e43be283e4db2a5afb62660xxxxxxx',
	'url': 'https://httpbin.org/headers',
	'keep_headers': 'true',
}

response = requests.get('http://api.scraperapi.com', params=payload, headers=headers)

print(response.text)

So using custom headers in Python is quite simple:

  • We need to create our dictionary of headers like before and, right after it, create our payload
  • The first element we’ll add to the payload is our API key, which we can generate by creating a free ScraperAPI account
  • Next, we’ll add our target URL – in this case https://httpbin.org/headers
  • The last parameter will be keep_headers, set to true, to tell ScraperAPI to keep our custom headers
  • Lastly, we’ll add all elements together right inside our requests.get() method

The combination of all of these will result in a URL like this:

http://api.scraperapi.com/?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=http://httpbin.org/headers&keep_headers=true

Wrapping Up

Although HTTP headers can help you bypass certain sites’ anti-scraping mechanisms, they are not enough to collect data from hard-to-scrape websites or at an enterprise scale.

To simplify your infrastructure and never get blocked again, use ScraperAPI and enjoy a 99.99% success rate in just one line of code.

Our API will choose and send you the right headers + IP combinations, freeing your mind from the task, so you can focus on what matters: data!

If you’re working with large projects (100k+ or millions of requests), you can use our Async Scraper to submit all your requests concurrently. This will allow ScraperAPI to work on each scraping job individually and return the data to your preferred webhook.

No coding experience? DataPipeline allows you to build and automate scraping projects in just minutes without writing a single line of code.

Try ScraperAPI for free and get access to all our tools with 5,000 free API credits.

Until next time, happy scraping!

About the author

Zoltan Bettenbuk

Zoltan Bettenbuk is the CTO of ScraperAPI - helping thousands of companies get access to the data they need. He’s a well-known expert in data processing and web scraping. With more than 15 years of experience in software development, product management, and leadership, Zoltan frequently publishes his insights on our blog as well as on Twitter and LinkedIn.
