Amazon, the largest e-commerce corporation in the United States, offers the widest range of products in the world. Its product data can be useful in a variety of ways, and you can easily extract this data with web scraping. This guide will help you develop your approach for extracting product and pricing information from Amazon, and you’ll better understand how to use web scraping tools and tricks to efficiently gather the data you need.

The Benefits of Scraping Amazon

Web scraping Amazon data helps you track competitor prices, monitor costs in real time and spot seasonal shifts so you can offer consumers better deals. Web scraping allows you to extract relevant data from the Amazon website and save it in a spreadsheet or JSON format. You can even automate the process to update the data on a daily, weekly or monthly basis.

There is currently no way to simply export product data from Amazon to a spreadsheet, but this problem is easily solved with web scraping. Whether it’s for competitor research, comparison shopping, creating an API for your app project or any other business need, we’ve got you covered.

Here are some other specific benefits of using a web scraper for Amazon:

  • Utilize details from product search results to improve your Amazon SEO status or Amazon marketing campaigns
  • Compare and contrast your offering with that of your competitors
  • Use review data for review management and product optimization for retailers or manufacturers
  • Discover the products that are trending and look up the top-selling product lists for a group

Scraping Amazon is an intriguing business today, with a large number of companies offering product, pricing, analytics and other types of monitoring solutions specifically for Amazon. Attempting to scrape Amazon data on a wide scale, however, is a difficult process that often gets blocked by their anti-scraping technology. It’s no easy task to scrape such a giant site when you’re a beginner, so this step-by-step guide should help you scrape Amazon data, especially when you’re using Python Scrapy and ScraperAPI.

First, Decide On Your Web Scraping Approach

One method for scraping data from Amazon is to crawl each keyword’s category or shelf list, then request the product page for each one before moving on to the next. This is best for smaller scale, less-repetitive scraping. Another option is to create a database of products you want to track by having a list of products or ASINs (unique product identifiers), then have your Amazon web scraper scrape each of these individual pages every day/week/etc. This is the most common method among scrapers who track products for themselves or as a service.
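
If you go the ASIN route, a minimal sketch in Python might look like this (the ASINs are examples reused from later in this guide; https://www.amazon.com/dp/&lt;ASIN&gt; is Amazon’s standard product page pattern):

# Build product page URLs from the ASINs you want to track,
# then feed them to your scraper on a daily/weekly schedule.
ASINS = ['B087Z6SNC1', '1844076342']  # example ASINs; swap in your own

def product_url(asin):
    return f'https://www.amazon.com/dp/{asin}'

urls = [product_url(asin) for asin in ASINS]
print(urls)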

Scrape Data From Amazon Using ScraperAPI with Python Scrapy 

ScraperAPI allows you to scrape the most challenging websites like Amazon at scale for a fraction of the cost of using residential proxies. We designed anti-bot bypasses right into the API, and you can access additional features like IP geotargeting (&country_code=us) for over 50 countries, JavaScript rendering (&render=true), JSON parsing (&autoparse=true) and more by simply adding extra parameters to your API requests. Send your requests to our single API endpoint or proxy port, and we’ll provide a successful HTML response.
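
As a quick illustration, here’s a minimal sketch of calling the API endpoint with Python’s requests library (YOUR_API_KEY is a placeholder for your own key):

import requests

payload = {
    'api_key': 'YOUR_API_KEY',  # placeholder: your ScraperAPI key
    'url': 'https://www.amazon.com/dp/B087Z6SNC1',  # the Amazon page to scrape
    'country_code': 'us',  # geotarget the request to the US
}

response = requests.get('http://api.scraperapi.com', params=payload)
print(response.status_code)  # 200 on a successful scrape
html = response.text  # the HTML of the Amazon page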

Start Scraping with Scrapy

Scrapy is a web crawling and data extraction platform that can be used for a variety of applications such as data mining, information retrieval and historical archiving. Since Scrapy is written in the Python programming language, you’ll need to install Python before you can use pip (Python’s package manager).

To install Scrapy using pip, run:

pip install scrapy

Then go to the folder where you want your project saved and run the “startproject” command along with the project name, “amazon_scraper”. Scrapy will construct a web scraping project folder for you, with everything already set up:

scrapy startproject amazon_scraper

The result should look like this:

├── scrapy.cfg                # deploy configuration file
└── amazon_scraper            # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── amazon.py         # the spider you'll create in the next step

Scrapy creates all of the files you’ll need, and each file serves a particular purpose:

  1. items.py – Can be used to build your base dictionary of item fields, which you can then import into the spider (see the example after this list).
  2. settings.py – All of your request settings, pipeline and middleware activation happens in settings.py. You can adjust the delays, concurrency and several other parameters here.
  3. pipelines.py – The item yielded by the spider is transferred to pipelines.py, which is mainly used to clean the text and bind it to databases (Excel, SQL, etc.).
  4. middlewares.py – When you want to change how the request is made or how Scrapy handles the response, middlewares.py comes in handy.
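
For example, an items.py for this project might look like the sketch below. The field names mirror the keys used in the pipeline later in this guide, and AmazonItem is just an assumed class name:

## items.py
import scrapy

class AmazonItem(scrapy.Item):
    # Field names mirror the keys the pipeline below expects
    Title = scrapy.Field()
    Rating = scrapy.Field()
    AvailableSizes = scrapy.Field()
    AvailableColors = scrapy.Field()
    BulletPoints = scrapy.Field()
    SellerRank = scrapy.Field()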

Create an Amazon Spider

You’ve established the project’s overall structure, so now you’re ready to start working on the spiders that will do the scraping. Scrapy has a variety of spider types, but we’ll focus on the most popular one, the generic spider, in this tutorial.

Simply run the “genspider” command to make a new spider:

# syntax is --> scrapy genspider name_of_spider website.com
scrapy genspider amazon amazon.com
Scrapy now creates a new file with a spider template, and you’ll gain a new file called “amazon.py” in the spiders folder. Your code should look like the following:
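
# generated by "scrapy genspider amazon amazon.com"; the exact
# boilerplate varies slightly between Scrapy versions
import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://amazon.com/']

    def parse(self, response):
        pass

The generated parse method is empty, so the next step is filling it in with selectors for the fields you want. As a hedged sketch of what a filled-in spider might look like, the version below yields the keys the pipeline in the next section expects. Every CSS selector here is an assumption: Amazon’s markup changes frequently, so verify each one against the live page before relying on it.

import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B087Z6SNC1']  # example product page

    def parse(self, response):
        # All selectors below are illustrative guesses, not guaranteed to match
        yield {
            'Title': response.css('#productTitle::text').get(default=''),
            'Rating': response.css('span.a-icon-alt::text').get(default=''),
            'AvailableSizes': response.css('#variation_size_name li::attr(title)').getall(),
            'AvailableColors': response.css('#variation_color_name li::attr(title)').getall(),
            'BulletPoints': response.css('#feature-bullets li span::text').getall(),
            'SellerRank': response.css('#SalesRank::text').getall(),
        }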

Don’t Forget to Clean Up Your Data With Pipelines

As a final step, clean up the data using the pipelines.py file, since the scraped text can be messy and some of the values come back as lists:

class AmazonScraperPipeline:
    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item

The item is transferred to the pipeline for cleaning after the spider has yielded it. We need to add the pipeline to the settings.py file to make it work:

## settings.py
ITEM_PIPELINES = {'amazon_scraper.pipelines.AmazonScraperPipeline': 300}

Now you’re good to go, and you can use the following command to run the spider and save the result to a CSV file:

scrapy crawl amazon -o test.csv

How to Scrape Other Popular Amazon Pages

You can modify the language, response encoding and other aspects of the data returned by Amazon by adding extra parameters to these URLs. Remember to always make sure these URLs are safely encoded. We already went over how to scrape an Amazon product page, but you can also try scraping the search and sellers pages by adding the following modifications to your script.

Search Page

  • To get the search results, simply pass a keyword in the URL’s k parameter and safely encode it (see the snippet after this list)
    • Format: https://www.amazon.com/s?k=<SEARCH KEYWORD>
  • You may add extra parameters to the search to filter the results by price, brand and other factors.
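
For instance, a minimal sketch of building a safely encoded search URL with Python’s standard library:

from urllib.parse import urlencode

keyword = 'wireless headphones'  # a search term containing spaces
params = {'k': keyword}  # the k parameter carries the search keyword
url = 'https://www.amazon.com/s?' + urlencode(params)
print(url)  # https://www.amazon.com/s?k=wireless+headphones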

Sellers Page

  • Amazon no longer uses a dedicated page to show what other sellers offer for a product; these pages were recently updated so that the offers load in a slide-in component instead. To scrape this data, you must now submit a request to the AJAX endpoint that populates the slide-in.
    • Format: https://www.amazon.com/gp/aod/ajax/ref=dp_aod_NEW_mbc?asin=<ASIN>
    • Example: https://www.amazon.com/gp/aod/ajax/ref=dp_aod_NEW_mbc?asin=B087Z6SNC1
  • You can refine these results by using additional parameters such as the item’s condition (see the snippet after this list)
    • Example: https://www.amazon.com/gp/aod/ajax/ref=tmm_pap_new_aod_0?filters={"all":true,"new":true}&condition=new&asin=1844076342&pc=dp
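
The filters value is JSON, so it needs to be URL-encoded. A minimal sketch using the example ASIN above:

import json
from urllib.parse import urlencode

filters = {'all': True, 'new': True}  # restrict the offers to new items
params = {
    'filters': json.dumps(filters),  # the JSON value gets URL-encoded below
    'condition': 'new',
    'asin': '1844076342',  # the example ASIN from above
    'pc': 'dp',
}
url = 'https://www.amazon.com/gp/aod/ajax/ref=tmm_pap_new_aod_0?' + urlencode(params)
print(url)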

Forget Headless Browsers and Use the Right Amazon Proxy

99.9% of the time you don’t need to use a headless browser. In most cases you can scrape Amazon more quickly, cheaply and reliably with standard HTTP requests than with a headless browser. If you opt for this approach, don’t enable JavaScript rendering (&render=true) when using the API.

Residential Proxies Aren’t Essential

Scraping Amazon at scale can be done without having to resort to residential proxies, so long as you use high quality datacenter IPs and carefully manage the proxy and user agent rotation.
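
As one hedged illustration of the Scrapy side of this, here’s a tiny downloader middleware that assigns a random user agent to every request. The user-agent strings are placeholders (use a maintained, realistic list in practice), and you’d still need to enable the middleware under DOWNLOADER_MIDDLEWARES in settings.py:

## middlewares.py
import random

USER_AGENTS = [
    # Placeholder strings; use a maintained, realistic list in practice
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a fresh user agent for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)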

Don’t Forget About Geotargeting

Geotargeting is a must when you’re scraping a site like Amazon. Make sure your requests are geotargeted correctly, or Amazon can return incorrect information, such as prices in the wrong currency.

Previously, you could rely on cookies to geotarget your requests; however, Amazon has improved its detection and blocking of these types of requests. As a result, you must use proxies located in the target country to geotarget it. To do this with ScraperAPI, for example, set country_code=us.

If you want to see results that Amazon would show to a person in the U.S., you’ll need a US proxy, and if you want to see results that Amazon would show to a person in Germany, you’ll need a German proxy. You must use proxies located in that region if you want to accurately geotarget a specific state, city or postcode.
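
With ScraperAPI that comes down to changing a single parameter. A minimal sketch (YOUR_API_KEY is a placeholder):

import requests

def fetch(url, country):
    # country_code routes the request through proxies in that country
    payload = {'api_key': 'YOUR_API_KEY', 'url': url, 'country_code': country}
    return requests.get('http://api.scraperapi.com', params=payload).text

us_html = fetch('https://www.amazon.com/dp/B087Z6SNC1', 'us')  # US results
de_html = fetch('https://www.amazon.com/dp/B087Z6SNC1', 'de')  # German results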

Scrape Data From Amazon With the Help of ScraperAPI

Web scraping software such as ScraperAPI is easy to use and efficient enough to handle even the most complex scraping requirements. With this guide, scraping Amazon doesn’t have to be difficult, whatever your coding ability, scraping needs or budget. Thanks to the many scraping tools and tips available, you’ll be able to obtain complete data and put it to good use.