Scrapy is an open-source Python framework designed for web scraping at scale. It gives us all the tools needed to extract, process, and store data from any website.
The beauty of this framework is how easy it makes it to build custom spiders at scale, collect specific elements using CSS or XPath selectors, export data to files (JSON, CSV, etc.), and maintain our projects. If you’ve ever wanted to build a web scraper but wondered how to get started with Scrapy, you’re in the right place.
In this Scrapy tutorial, you’ll learn how to:
- Install Scrapy on your machine
- Create a new project
- Use Scrapy Shell to test selectors
- Build a custom spider
- Extract specific bits of data
- Import your scraped data to a JSON or a CSV file
Although it would be good to have some previous knowledge of how Python works, we’re writing this tutorial for complete beginners. So you can be sure you’ll be able to follow each step of the process.
Note: If you don’t feel comfortable going through this article without some knowledge of Python syntax, we recommend W3Schools’ Python tutorial as a starting point.
1. How to Install Scrapy on Your Machine
The Scrapy team recommends installing their framework in a virtual environment (VE) instead of system-wide, so that’s exactly what we’re going to do.
Open your command prompt on your desktop (or the directory where you want to create your virtual environment) and type `python -m venv scrapy_tutorial`.

The `venv` command will create a VE using the path you provided – in this case, scrapy_tutorial – based on the Python version used to run the command. Additionally, it will add a few directories inside with a copy of the Python interpreter, the standard library, and various supporting files.
If you want to verify it was created, enter `dir` in your command prompt and it will list all the directories you have.

To activate your new environment, type `scrapy_tutorial\scripts\activate.bat` and run it.

Now that we’re inside our environment, we’ll run `pip3 install scrapy` to download the framework and install it within our virtual environment.
And that’s it. We’re now ready to start our project.
2. Create a Scrapy Project
On your command prompt, move into the environment folder with `cd scrapy_tutorial` and then type `scrapy startproject scrapytutorial`:
This command will set up all the project files within a new directory automatically:

- scrapytutorial/ (project folder)
  - scrapy.cfg
  - scrapytutorial/
    - spiders/ (folder)
      - `__init__.py`
    - `__init__.py`
    - `items.py`
    - `middlewares.py`
    - `pipelines.py`
    - `settings.py`
3. Use Scrapy Shell to Test Selectors
Before jumping into writing a spider, we first need to take a look at the website we want to scrape and find which element we can latch on to extract the data we want.
Loading Scrapy Shell
For this project, we’ll crawl https://www.wine-selection.com/shop to collect the product name, link, and selling price.

To begin the test, let’s run `scrapy shell` and let it load.
This will allow us to download the HTML page we want to scrape and interrogate it to figure out what commands we want to use when writing our scraper script.
After the shell finishes loading, we’ll use the fetch command and enter the URL we want to download like so: `fetch('https://www.wine-selection.com/shop')`, and hit enter.

It should return a 200 status, meaning that the website is working, and save the downloaded page within the `response` variable.

Note: We can check this by typing `response` on our command line.
Inspecting the page
Perfect. Our page is now ready for inspection.
We can type `view(response)` and it will open the downloaded page in our default browser, or we can just open our browser and navigate to our target page. Either way is fine.
By using the inspection tool in Chrome (ctrl + shift + c), we identify the classes or IDs we can use to select each element within the page.
Upon closer look, all the information we want to scrape is wrapped within a `<div>` with `class="txt-wrap"` across all product cards.
Let’s use this class as a selector by typing `response.css('div.txt-wrap')`, and it will return all the elements that match this class – which, to be honest, is really overwhelming and won’t be of much help.

To take it a bit further, let’s type the command again, but this time adding the `.get()` method at the end: `response.css('div.txt-wrap').get()`.

This returns all the HTML code within the first div that meets the criteria, and we can see the price, the name, and the link within that code.
Using CSS selectors in Scrapy
To make our process more efficient, we’ll save this last selection as a variable. Just enter `wines = response.css('div.txt-wrap')` and now we can call this variable in the next line.
Because we want to get the name of the product, we need to check where the name lives within the card. It seems that the title of the product is contained in a link tag with no class. So what can we do now?

The good news is that there’s only one link within our `div`, so we can just use `wines.css('a::text').get()` to tell our program to bring back the text that’s inside the `<a></a>` tags.
Note: when using CSS selectors, we can add `::text` after our selector and it’ll return just the text inside the element. For XPath, add `/text()` – e.g. `response.xpath('//*[@id="content"]/div[1]/div/div[2]/h2/a/text()').get()`.
Now we can do the same process for the rest of our elements:

- Getting the price: `wines.css('strong.price::text').get()`
- Getting the product link: `wines.css('a').attrib['href']`

Tip: you might not want the space and dollar symbol ($) in your data. To quickly eliminate them, add `.replace('$ ', '')` after `.get()`.
4. Create a Custom Spider
First, open the project folder in VS Code (or your preferred code editor) and create a new file within the spiders folder called `winespider.py`.

In the file, write the following code:
```python
import scrapy


class WinesSpider(scrapy.Spider):
    name = "winy"
    start_urls = ['https://www.wine-selection.com/shop']

    def parse(self, response):
        for wines in response.css('div.txt-wrap'):
            yield {
                'name': wines.css('a::text').get(),
                'price': wines.css('strong.price::text').get().replace('$ ', ''),
                'link': wines.css('a').attrib['href'],
            }
```
Let’s break this code down:

- We imported Scrapy into our project at the top of the file.
- We defined a new class (our spider) that subclasses `scrapy.Spider`.
- We then gave it a unique name (winy) that cannot be the same as any other spider’s within the project.
- To give our spider a target page, we used `start_urls = ['https://www.wine-selection.com/shop']`. We could have added a list of URLs separated by commas, but we’re going to make our spider move through the website’s pagination later on, so we just provided the first page.
- Last, we told Scrapy what information we wanted it to find within the HTML. If you noticed, we used the same logic we defined in Scrapy Shell before, inside the `parse()` function that handles the downloaded page.
5. Run Your Scraper and Save the Data to a JSON File
To run your scraper, exit Scrapy Shell, move to the project folder on your command prompt, and type `scrapy crawl` followed by your spider’s name: `scrapy crawl winy`.
If everything is working, the data scraped will be logged into your command line:
Now that we know it’s working, we can go ahead and run it again, but this time with `-o winy.json` to tell Scrapy to store the scraped data in a JSON file called winy.json.

Scrapy then handles everything for you, so you don’t have to worry about writing your own output configuration.
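Scrapy’s JSON feed export writes a JSON array with one object per item the spider yielded, so the file is easy to load anywhere. A quick sketch of what working with that output looks like (the sample record below is illustrative, not real scraped data):

```python
import json

# Illustrative sample of what `scrapy crawl winy -o winy.json` writes:
# a JSON array with one object per yielded item
sample = '[{"name": "Example Wine 2019", "price": "23.50", "link": "/shop/example-wine"}]'

wines = json.loads(sample)
print(wines[0]["name"])  # Example Wine 2019
print(len(wines))        # 1
```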
6. Make Your Scraper Crawl the Pagination
If you’ve been following along, congratulations, you just wrote your first web crawling spider with Scrapy! That’s impressive.
Let’s add some more functionality to our spider by making it follow the next page of the pagination.
This page has a “next” button – which could be good news. After inspection, it seems like the target link our spider needs to follow is wrapped between link tags inside a list.
Before adding it to our main file we’ll test the selector using Scrapy Shell to verify our logic.
As stated above, we’ll need to `fetch()` the URL and then use the response to extract our data, but there’s one catch: all the links in the pagination use the same classes.
So if we try to write:

```python
next_page = response.css('span.page-link > a').attrib['href']

if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
```

It will use the first link it finds matching that selector, making our scraper go in circles.
Here is the good news: if we pay close attention to the structure of the button, there’s a `rel="next"` attribute that only this button has. That has to be our target!
Time to update our code one last time…
```python
import scrapy


class WinesSpider(scrapy.Spider):
    name = "winy"
    start_urls = ['https://www.wine-selection.com/shop']

    def parse(self, response):
        for wines in response.css('div.txt-wrap'):
            yield {
                'name': wines.css('a::text').get(),
                'price': wines.css('strong.price::text').get().replace('$ ', ''),
                'link': wines.css('a').attrib['href'],
            }

        # ::attr(href) with .get() returns None on the last page,
        # so the check below ends the crawl cleanly
        next_page = response.css('a[rel=next]::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
… and run it using `scrapy crawl winy -o winy.csv` to save our results in an easier-to-use format.
As you can see, it now returns 148 products.
Tip: If you want to append more information to an existing file, all you need to do is run your scraper with a lower-case “-o” (e.g. `scrapy crawl winy -o winy.csv`). If you want to overwrite the entire file, use a capital “-O” instead (e.g. `scrapy crawl winy -O winy.csv`).
ScraperAPI and Scrapy Integration
Great job! You just created your first Scrapy web scraper.
Of course, there’s no one-size-fits-all scraper. Every website is structured differently, so you’ll have to find your way around them.
However, thanks to Scrapy, you can create custom spiders for every page and maintain them without affecting the rest. Scrapy makes your projects easier to manage and scale.
That said, if you are planning to build a large project, you’ll quickly find that just using Scrapy isn’t enough.
For some pages, you’ll need to tell your spiders how to handle bans, CAPTCHAs, execute JavaScript, and apply geotargeting.
These challenges add additional coding time and, in some cases, can make a simple project an absolute nightmare.
That’s what ScraperAPI was built for. Using years of statistical analysis, machine learning, huge browser farms, and third-party proxies, it prevents your scraper from getting blocked by anti-scraping techniques.
There are several ways to use ScraperAPI. The easiest is to send your request through ScraperAPI’s endpoint, passing your key and target URL as query parameters. Something like this:

`start_urls = ['http://api.scraperapi.com?api_key={yourApiKey}&url=https://www.wine-selection.com/shop']`
This will send your request through ScraperAPI’s servers, where the tool will choose the most appropriate headers, rotate the IP address between every request, and handle CAPTCHAs automatically.
To get your API key, just create a free ScraperAPI account. You’ll find your key in your dashboard, along with 5000 free API credits to test the full functionality of the tool.
However, this method won’t work for our scraper because we need it to scrape several URLs by following the links. Using this logic will only work for the initial request but not for the rest of them.
The first step for this integration is constructing the URL.
Define the `get_scraperapi_url()` Method
To construct our new URL, we’ll need a payload dictionary and the `urlencode` function.
We’ll start by adding a new dependency to the top of our file and, underneath it, define a constant called API_KEY:
```python
import scrapy
from urllib.parse import urlencode

API_KEY = '51e43be283e4db2a5afb62660fcxxxxx'
```
Note: Remember to use your API key when writing your scraper.
Now we’re ready to define our new method:
```python
def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
This method will tell our scraper how to construct the URL for the request, adding our API key and the target URL.
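To see what the method actually produces, here’s a self-contained sketch using an obviously fake placeholder key; `urlencode` percent-escapes the target URL so it travels safely inside the query string:

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key


def get_scraperapi_url(url):
    # Wrap the target URL and key into ScraperAPI's query-string format
    payload = {"api_key": API_KEY, "url": url}
    return "http://api.scraperapi.com/?" + urlencode(payload)


url = get_scraperapi_url("https://www.wine-selection.com/shop")
print(url)
# http://api.scraperapi.com/?api_key=YOUR_API_KEY&url=https%3A%2F%2Fwww.wine-selection.com%2Fshop
```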
Send a Request Through ScraperAPI Servers
We used the `start_urls` attribute to send our request before, which automatically stores the returned information in a response variable.

In this case, we’ll need to define a `start_requests()` method that uses `get_scraperapi_url()` to add the additional strings to our URL when sending the request.
```python
def start_requests(self):
    urls = ['https://www.wine-selection.com/shop']
    for url in urls:
        yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)
```
As we can see, our scraper uses `get_scraperapi_url(url)` and the URLs inside the `urls` variable to send the request. So after our spider runs through all the code and finds a new URL, it will loop back and construct the API URL in the same way for each new request.
The rest of our code stays the same:
```python
import scrapy
from urllib.parse import urlencode

API_KEY = '51e43be283e4db2a5afb62660xxxxxxx'


def get_scraperapi_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


class WinesSpider(scrapy.Spider):
    name = "winy"

    def start_requests(self):
        urls = ['https://www.wine-selection.com/shop']
        for url in urls:
            yield scrapy.Request(url=get_scraperapi_url(url), callback=self.parse)

    def parse(self, response):
        for wines in response.css('div.txt-wrap'):
            yield {
                'name': wines.css('a::text').get(),
                'price': wines.css('strong.price::text').get().replace('$ ', ''),
                'link': wines.css('a').attrib['href'],
            }

        # Returns None on the last page, which stops the crawl
        next_page = response.css('a[rel=next]::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
If we run our scraper now, we’ll get the same results as before, but without risking our machine getting banned or blacklisted.
How cool’s that?
Looking for a way to scrape LinkedIn with Python? We have an easy guide you can follow.