How to Scrape Amazon using Python and Beautiful Soup

Amazon hit nearly $514 billion in worldwide net sales revenue in 2022. It is the biggest e-commerce store in the world and a goldmine of data.

In this article, you’ll learn how to scrape data from Amazon’s search result pages using Python with a little help from BeautifulSoup and requests.

Towards the end, you will also learn about ScraperAPI’s Structured Data endpoint, which turns this:

Searching "Iphone 15" on Amazon

Into this:


	{
		"results":[
		   {
			  "type":"search_product",
			  "position":1,
			  "asin":"B0CD6YLMBK",
			  "name":"iPhone 15 Charger, [Apple Certified] 20W USB C Wall Charger Block with 6.6FT USB C to C Fast Charging Cord for 15 Pro/15 Pro Max/15 Plus, iPad Pro 12.9/11, iPad 10th Generation, iPad Air 5/4, iPad Min",
			  "image":"https://m.media-amazon.com/images/I/51Mp6Zpg7qL.jpg",
			  "has_prime":true,
			  "is_best_seller":false,
			  "is_amazon_choice":false,
			  "is_limited_deal":false,
			  "stars":4.3,
			  "total_reviews":34,
			  "url":"https://www.amazon.com/iPhone-Charger-Certified-Charging-Generation/dp/B0CD6YLMBK/ref=sr_1_1?keywords=iphone+15+charger&qid=1697051475&sr=8-1",
			  "availability_quantity":null,
			  "spec":{
				 
			  },
			  "price_string":"$16.99",
			  "price_symbol":"$",
			  "price":16.99
		   },
		 # ... truncated ...
		]
	}

With a simple API call.

So without wasting any more time, let’s begin!

Requirements

This tutorial is based on Python 3.10, but it should work with any Python version from 3.8 onwards. Make sure you have a supported version installed before continuing.

Additionally, you also need to install Requests and BeautifulSoup.

  • The Requests library will let you download Amazon’s search result page
  • BeautifulSoup will let you traverse through the DOM and extract the required data.

You can install both of these libraries using pip:

	$ pip install requests bs4

Note: You can also scrape Amazon using Scrapy, which allows you to build and manage several spiders from a single codebase.

Now create a new directory and a Python file to store all of the code for this tutorial:

	$ mkdir amazon_scraper
	$ touch amazon_scraper/app.py

With the requirements sorted, you are ready to head on to the next step!

Deciding What to Scrape

With every web scraping project, it is important to decide early on what you want to scrape, as this helps you plan your approach. In this tutorial, you will learn how to scrape the following attributes of each product result:

  1. Name
  2. Price
  3. Image

The screenshot below has annotations for where this information is located on a search results page:

Highlighting the elements we will scrape in this tutorial

We’ll take a look at how to extract each of these attributes in the next section. And just to make things a bit spicy, we will sort the data in ascending order according to the price before scraping it.

Fetching Amazon Search Page

Let’s start by fetching the Amazon search page.

This is a typical search page URL: https://www.amazon.com/s?k=iphone+15+charger.

However, if you want to sort the results according to the price, you must use this URL: https://www.amazon.com/s?k=iphone+15+charger&s=price-asc-rank.
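
If you would rather not assemble these query strings by hand, a small helper can build them for you. This is a minimal sketch using only the standard library. The function name build_search_url is made up for this example, and k and s are simply the two query parameters visible in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/s"

def build_search_url(keywords, sort=None):
    """Return an Amazon search URL for the given keywords.

    `sort` is optional; pass "price-asc-rank" to sort by price, low to high.
    """
    params = {"k": keywords}
    if sort:
        params["s"] = sort
    # urlencode() escapes the values and turns spaces into "+"
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("iphone 15 charger"))
print(build_search_url("iphone 15 charger", sort="price-asc-rank"))
```

The two printed URLs match the ones shown above, so you can swap in any other search term without worrying about escaping.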

You can use the following Requests code to download the HTML:

	import requests

	url = "https://www.amazon.com/s?k=iphone+15+charger&s=price-asc-rank"
	html = requests.get(url)
	print(html.text)

However, as soon as you run this code, you will realize that Amazon has put some basic anti-bot measures in place. You will receive the following response from Amazon:

“To discuss automated access to Amazon data please contact api-services-support@amazon.com. [truncated]”

You can bypass this initial anti-bot measure by sending a proper user-agent header as part of the request:

	headers = {
		'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
		'Accept-Language': 'en-US, en;q=0.5'
	}
	html = requests.get(url, headers=headers)

 

If you inspect the html variable, you will see that this time you received the complete search result page as desired.
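
Rather than eyeballing the output, you can also check for the block message in code. Here is a rough sketch: the helper name looks_blocked is made up for this example, and the marker string is just the opening of the message quoted earlier. Amazon may word its block pages differently, so treat this as a heuristic and not a guarantee.

```python
# Opening words of the block message quoted earlier in this tutorial.
BLOCK_MARKER = "To discuss automated access to Amazon data"

def looks_blocked(page_text):
    """Heuristic: True if the page contains Amazon's automated-access notice."""
    return BLOCK_MARKER.lower() in page_text.lower()

# Stand-in strings so the example runs offline; in the scraper you would
# pass html.text instead.
blocked_page = "<!-- To discuss automated access to Amazon data please contact ... -->"
normal_page = "<html><body>Search results</body></html>"

print(looks_blocked(blocked_page))  # True
print(looks_blocked(normal_page))   # False
```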

While we are at it, let’s load the response in a BeautifulSoup object as well:

	from bs4 import BeautifulSoup

	# Name the parser explicitly so BeautifulSoup does not have to guess one
	soup = BeautifulSoup(html.text, 'html.parser')

Sweet! Now that you have the complete response in a BeautifulSoup object, you can start extracting relevant data from it.

Scraping Amazon Product Attributes

The easiest way to figure out how to scrape the required data is to use the Developer Tools, which are available in most major browsers.

The Developer Tools will let you explore the DOM structure of the page.

Search result for iphone cable in Amazon

The main goal is to figure out which HTML tag attributes can let you uniquely target an HTML tag.

Most of the time, you will rely on the id and class attributes. Then you can supply these attributes to BeautifulSoup and ask it to return whatever text/attribute you want to extract from that particular tag.

Extracting the Product Name

Let’s take a look at how you can extract the product name. This will give you a good understanding of the general process.

Right-click on the product name and click on Inspect:

Inspecting Amazon search results

This will open up the developer tools:

Extracting product names from Amazon search results

As you can observe in the screenshot above, the product name is nested in a span with the following classes: a-size-medium a-color-base a-text-normal.

At this point, you have a decision to make: you can either ask BeautifulSoup to extract all the spans with these classes from the page, or you can extract each result div and then loop over those divs and extract data for each product.

I generally prefer the latter method as it helps identify products that might not have all the required data. This tutorial will also showcase this same method.

So the next step is to identify the div that wraps each result item:

Scraping Amazon listings

According to the screenshot above, each result is nested in a div tag with the data-component-type attribute of s-search-result.

Let’s use this information to extract all the result divs and then loop over them and extract the nested product titles:

	results = soup.find_all('div', attrs={'data-component-type': 's-search-result'})
	for r in results:
	   print(r.select_one('.a-size-medium.a-color-base.a-text-normal').text)

Here’s a simple breakdown of what this code is doing:

  • It uses the find_all() method provided by BeautifulSoup, which returns every div whose data-component-type attribute is s-search-result
  • It loops over those divs
  • For each div, it uses the select_one() method to extract the first element that matches the CSS selector passed in, and prints that element’s text

Notice that we prepend a dot (.) to each class name. This tells BeautifulSoup that each part of the CSS selector is a class name. There is also no space between the class names. This is important, as it tells BeautifulSoup that all of the classes belong to the same HTML tag.

If you are new to CSS selectors, you should read our CSS selectors cheat sheet. It goes over the basics of CSS selectors and gives you an easy-to-use framework to speed up the process.

Extracting the Product Price

Now that you have extracted the product name, extracting the product price is fairly straightforward.

Follow the same steps from the last section and use the Developer Tools to inspect the price:

Scraping pricing data from Amazon

The price can be extracted from the span with the class of a-offscreen. This span is itself nested in another span with a class of a-price. You can use this knowledge to craft a CSS selector:

	for r in results:
	    # -- truncated --
	    print(r.select_one('.a-price .a-offscreen').text)

As you want to target nested spans this time, you have to add a space between the class names.
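
To see the difference a single space makes, you can try both kinds of selector on a tiny HTML snippet. The markup below is simplified and made up for illustration; it only borrows the class names from this tutorial and is not Amazon's real page structure.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that reuses the class names from this tutorial.
html = """
<div data-component-type="s-search-result">
  <span class="a-size-medium a-color-base a-text-normal">Example Charger</span>
  <span class="a-price"><span class="a-offscreen">$9.99</span></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# No spaces: a single element that carries all three classes.
title = soup.select_one(".a-size-medium.a-color-base.a-text-normal")
# A space: an .a-offscreen element somewhere inside an .a-price element.
price = soup.select_one(".a-price .a-offscreen")
# Without the space, both classes must sit on the same element, so nothing matches.
missing = soup.select_one(".a-price.a-offscreen")

print(title.text)  # Example Charger
print(price.text)  # $9.99
print(missing)     # None
```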

Extracting the Product Image

Try following the steps from the previous two sections to come up with the relevant code on your own. Here is a screenshot of the image being inspected in the Developer Tools window:

Scraping product images from Amazon

The img tag has a class of s-image. You can target this img tag and extract the src attribute (the image URL) using this code:

	for r in results:
	    # -- truncated --
	    print(r.select_one('.s-image').attrs['src'])

Note: Extra points if you do it on your own!

Complete Scraper Code

You have all the bits and pieces to put together the complete scraper code.

Here is a slightly modified version of the scraper that appends all the product results into a list at the very end:

	import requests
	from bs4 import BeautifulSoup
	
	url = "https://www.amazon.com/s?k=iphone+15+charger&s=price-asc-rank"
	headers = {
		'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
		'Accept-Language': 'en-US, en;q=0.5'
	}
	html = requests.get(url, headers=headers)
	soup = BeautifulSoup(html.text, 'html.parser')
	results = soup.find_all('div', attrs={'data-component-type': 's-search-result'})
	
	result_list = []
	for r in results:
	   result_dict = {
		  "title": r.select_one('.a-size-medium.a-color-base.a-text-normal').text,
		  "price": r.select_one('.a-price .a-offscreen').text,
		  "image": r.select_one('.s-image').attrs['src']
	   }
	   result_list.append(result_dict)
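
One caveat with the loop above: select_one() returns None when nothing matches, so a listing without a price (or with a differently styled title) would raise an AttributeError on .text. A defensive variant might look like the sketch below. The helper names are made up for this example, and the selectors are the same ones used throughout this tutorial.

```python
def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

def attr_or_none(parent, selector, attr):
    """Return an attribute of the first match, or None if the tag or attribute is missing."""
    node = parent.select_one(selector)
    return node.get(attr) if node is not None else None

def parse_result(r):
    """Build the title/price/image dict from one result div, tolerating missing fields."""
    return {
        "title": text_or_none(r, ".a-size-medium.a-color-base.a-text-normal"),
        "price": text_or_none(r, ".a-price .a-offscreen"),
        "image": attr_or_none(r, ".s-image", "src"),
    }

def parse_results(results):
    """Split results into complete rows and rows with at least one missing field."""
    complete, incomplete = [], []
    for r in results:
        row = parse_result(r)
        (incomplete if None in row.values() else complete).append(row)
    return complete, incomplete
```

Calling complete, incomplete = parse_results(results) then gives you a clean list to work with, plus a second list showing which products lacked a title, price, or image.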

You can easily repurpose this result_list to power different APIs or store this information in a spreadsheet for further analysis.
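
For the spreadsheet route, Python's built-in csv module is enough. Here is a minimal sketch, assuming result_list has the shape built above. The sample rows and the write_csv helper are placeholders so the example runs on its own.

```python
import csv
import io

# Placeholder rows with the same keys as result_list above.
result_list = [
    {"title": "Example Charger A", "price": "$9.99", "image": "https://example.com/a.jpg"},
    {"title": "Example Charger B", "price": "$12.49", "image": "https://example.com/b.jpg"},
]

def write_csv(rows, fileobj):
    """Write rows (dicts with title/price/image keys) to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=["title", "price", "image"])
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_csv(result_list, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a real file such as open("results.csv", "w", newline="", encoding="utf-8"). The newline="" argument lets the csv module control line endings itself.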

Using Structured Data Endpoint

This tutorial did not focus too much on the gotchas of scraping Amazon at scale.

Amazon is notorious for banning scrapers and making it difficult to scrape data from their websites.

You already saw a glimpse of this during the very beginning of the tutorial, where a request without the proper headers was blocked by Amazon.

Luckily, there is an easy solution to this problem. Instead of sending a request directly to Amazon, you can send the request to ScraperAPI’s Structured Data endpoint, and ScraperAPI will respond with the scraped data in nicely formatted JSON.

Amazon structured data endpoint from ScraperAPI

This is a powerful feature: with it, you do not have to worry about getting blocked by Amazon or about keeping your scraper up to date with the never-ending changes in Amazon’s anti-bot techniques.

The best part is that ScraperAPI provides 5,000 free API credits for 7 days on a trial basis and then provides a generous free plan with recurring 1,000 API credits to keep you going. This is enough to scrape data for general use.

You can quickly get started by going to the ScraperAPI dashboard page and signing up for a new account:

ScraperAPI signup page

After signing up, you will see your API Key:

Highlighting where your API key is

Now you can use the following code to access the search results from Amazon using the Structured Data endpoint:

	import requests

	payload = {
	   'api_key': 'API_KEY',
	   'query': 'iphone 15 charger',
	   's': 'price-asc-rank'
	}
	 
	response = requests.get('https://api.scraperapi.com/structured/amazon/search', params=payload)
	print(response.json())

Note: Don’t forget to replace API_KEY in the code above with your own ScraperAPI API key.

As you might have already observed, you can pass in most query params that Amazon accepts as part of the payload. This means all the following sorting values are valid for the s key in the payload:

  1. Price: High to low = price-desc-rank
  2. Price: Low to high = price-asc-rank
  3. Featured = relevanceblender
  4. Avg. customer review = review-rank
  5. Newest arrivals = date-desc-rank
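
If you switch between sort orders often, it can help to keep these values in one place. The sketch below is one way to do that. The names SORT_OPTIONS and build_payload are made up for this example; the payload keys are the ones from the request above.

```python
# Human-readable labels mapped to the values Amazon expects for the s parameter.
SORT_OPTIONS = {
    "Price: High to low": "price-desc-rank",
    "Price: Low to high": "price-asc-rank",
    "Featured": "relevanceblender",
    "Avg. customer review": "review-rank",
    "Newest arrivals": "date-desc-rank",
}

def build_payload(api_key, query, sort_label=None):
    """Build the request payload, optionally adding a sort order by its label."""
    payload = {"api_key": api_key, "query": query}
    if sort_label is not None:
        if sort_label not in SORT_OPTIONS:
            raise ValueError(f"Unknown sort option: {sort_label!r}")
        payload["s"] = SORT_OPTIONS[sort_label]
    return payload

print(build_payload("API_KEY", "iphone 15 charger", "Price: Low to high"))
```

The returned dictionary can be passed straight to requests.get() as the params argument, and a mistyped label fails loudly instead of silently falling back to Amazon's default order.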

If you want the same result_list data as from the last section, you can add the following code at the end:

	result_list = []
	for r in response.json()['results']:
	   result_dict = {
		  "title": r['name'],
		  "price": r['price_string'],
		  "image": r['image']
	   }
	   result_list.append(result_dict)
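
Because the structured response carries a numeric price field next to price_string (see the sample JSON at the top of this article), you can sort or filter without parsing any strings. Here is a small sketch using made-up sample records in place of response.json()['results']. It assumes price can be null for some listings, which the sample output does not confirm.

```python
# Made-up records that mirror the field names in the sample JSON.
sample_results = [
    {"name": "Charger B", "price_string": "$16.99", "price": 16.99, "image": "https://example.com/b.jpg"},
    {"name": "Charger C", "price_string": None, "price": None, "image": "https://example.com/c.jpg"},
    {"name": "Charger A", "price_string": "$9.99", "price": 9.99, "image": "https://example.com/a.jpg"},
]

def cheapest_first(results, max_price=None):
    """Drop records without a price, apply an optional price cap, sort ascending."""
    priced = [r for r in results if r.get("price") is not None]
    if max_price is not None:
        priced = [r for r in priced if r["price"] <= max_price]
    return sorted(priced, key=lambda r: r["price"])

for r in cheapest_first(sample_results, max_price=20):
    print(r["price_string"], r["name"])
```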

You can learn more about this endpoint over at the ScraperAPI docs.

Wrapping Up

This tutorial was a quick rundown of how to go about scraping data from Amazon.

  • It taught you a simple bypass method for the bot detection system used by Amazon.
  • It showed you how to use the various methods provided by BeautifulSoup to extract the required data from the HTML document.
  • Lastly, you learned about the Structured Data endpoint offered by ScraperAPI and how it solves quite a few problems.

If you are ready to take your data collection from a couple of pages to thousands or even millions of pages, our Business Plan is a great place to start.

Need 10M+ API credits? Contact sales for a custom plan, including all premium features, premium support, and an account manager.

Until next time, happy scraping!

About the author

Yasoob Khalid

Yasoob is a renowned author, blogger and a tech speaker. He writes regularly on his personal blog and has authored the Intermediate Python and Practical Python Projects books. He is currently working on Azure at Microsoft.
