Why scrape Etsy? Since its founding in 2005, Etsy has grown into a roughly $30 billion eCommerce company with over 4.4 million active sellers and 81.9 million active buyers, according to Statista. That's a lot of users.
Although less well-known than the ecommerce juggernaut Amazon, Etsy is one of the largest marketplaces for non-conventional and creative goods there is.
For us, it also means there's a lot of information we can use to make informed decisions.
Whether you’re trying to break into a specific industry by collecting pricing information, or you want to analyze product trends to find profitable niches, building an Etsy web scraper can save you a lot of time and potentially make you good money.
Another potential use of a script like this is to help sellers and store owners better understand the competitive landscape, find industry shifts in price or demand, and much more.
If you want to supercharge your Etsy or ecommerce presence with customer data, then this is the tutorial for you!
Step-by-Step Guide to Build a Rvest Etsy Scraper
For this tutorial, we will scrape Etsy's costumes section for boys using the Rvest and Dplyr packages.
Before we start writing our script, we need to install two things: R and RStudio.
Once you've installed and opened RStudio, it's time to download and install our dependencies.
1. Install Rvest and Dplyr
In RStudio, let's create a new project called rvest-etsy-scraper and save it to an easily accessible folder on your machine, just to make it easier to find when needed.
With our new project open, we can create a new script and call it whatever we want. In our case, we called it etsy-rvest for simplicity.
Now, we can install our dependencies with two simple lines of code:
install.packages("rvest")
install.packages("dplyr")
After installation is complete, we can make a little change to our lines to import the packages into the project:
library("rvest")
library("dplyr")
2. Download Etsy’s HTML for Parsing
Without HTML, there is no parsing. With that in mind, let’s send an HTTP request to the server and store the response in a variable called ‘page’.
link = "https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page=1"
page = read_html(link)
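Before moving on, it's worth running a quick sanity check to confirm the download worked. Here's a minimal sketch that simply prints the page's title:

# If the request succeeded, this prints the text inside the page's <title> tag
html_text(html_node(page, "title"))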
Notice that we're using a different version of the category page's URL.
If you navigate to the website manually, you’ll see that this is the original version of the URL:
https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=catnav-10923
However, when we navigate through the pagination – which we'll definitely want to do to scrape all pages from this category – we'll need to give our script access to each page.
To make it more visual, here’s what the second URL of the pagination looks like:
https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&page=2
Therefore, we can access any page in the category simply by changing the 'page' parameter within the URL, as shown in the quick sketch below.
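For example, here's a minimal sketch of how swapping out the page number builds each paginated URL (the variable names are just for illustration); we'll automate this with a loop later on:

# The base URL stays the same; only the number after 'page=' changes
base_url = "https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page="
page_3_url = paste0(base_url, 3) # ...&explicit=1&page=3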
We’ll go deeper into that later on in this tutorial but for now, let’s try to get some data out of this first page.
3. Finding CSS Properties to Target
There are two common ways to define the CSS selectors we’ll need to extract the product’s name, price, and URL.
The first way is to manually go to the page itself and use the Inspector Tool. You can open it with CTRL + Shift + C on Windows, or CMD + Shift + C on Mac.
Or we can use a tool like SelectorGadget to select the correct CSS class for the element we want to scrape.
After some testing, we’ve determined that there are 64 product listings per page, and these are the CSS classes for each element:
- Product name: .v2-listing-card h3
- Product price: .wt-text-title-01 .wt-text-title-01 .currency-value
- Product URL: li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link
If you're following along, you might wonder how we got the properties for the product URL. That's where things get tricky.
Using SelectorGadget for Logic Testing
Not every website is built the same way, so we can't always get the CSS selector directly from SelectorGadget; the URL isn't attached to the same element on every site.
For example, the product’s URL tends to be inside the same element as the product’s name on many websites. That’s not the case for Etsy, though.
However, there’s another way we can use SelectorGadget to find CSS elements like URLs: testing the properties.
With SelectorGadget open, we first found the highest-level element for the product cards themselves.
Inside that <li> element, we want to find the main <div> where the product's URL is contained.
Next, we target the <a> tag within that div to grab the URL. We then used SelectorGadget to confirm that our selector matched 64 elements, which told us our logic was correct.
That’s why it’s so important to understand how web pages are structured when web scraping.
Tip: You can use our CSS selectors cheat sheet to learn more about CSS selectors and to help you speed up scraper development time.
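You can also double-check these selectors from the R console once the page is downloaded. Here's a minimal sketch that counts how many nodes each selector matches, reusing the 'page' variable from step 2:

# Each of these should return 64 if the selectors are correct
length(html_nodes(page, ".v2-listing-card h3"))
length(html_nodes(page, ".wt-text-title-01 .wt-text-title-01 .currency-value"))
length(html_nodes(page, "li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link"))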
Now that we know how to select each element to get the data we want, let’s test our script.
4. Testing Our Parsing Script
For this test, we want to run the whole script to extract the name of the 64 products on the page.
Here's where Dplyr will come in handy, as it lets us use the Pipe Operator (%>%) to make the process a lot easier.
In simple terms, the Pipe Operator (%>%) takes whatever value is on its left, evaluates it, and passes the result as the first argument to the function that comes after the pipe.
Note: If you want to learn more about the Pipe Operator, check our beginner R web scraping tutorial.
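In other words, the two calls below do exactly the same thing; here's a quick illustration using the product name selector:

# Without the pipe: 'page' is passed explicitly as the first argument
html_nodes(page, ".v2-listing-card h3")
# With the pipe: 'page' is passed implicitly as the first argument
page %>% html_nodes(".v2-listing-card h3")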
product_name = page %>% html_nodes(".v2-listing-card h3") %>% html_text()
As you can see, we’re passing the value stored in the ‘page’ variable – which is the HTML we downloaded before – to our ‘html_nodes()’ function.
The result of that operation is then passed to the following function – ‘html_text()’ – so we only extract the text inside the element without bringing tags and everything else with it.
To test it out, type 'product_name' in the console.
There you go, 64 product names pulled in just a few seconds.
Unfortunately, there's a lot of white space in the output, along with those annoying \n's at the beginning and end of each string.
To clean those up, let's chain one more function onto our 'product_name' pipeline:
%>% stringr::str_trim()
Here’s the result:
Cleaner and better-looking product names that anyone can read without the extra code clutter.
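For reference, here's the full product name assignment with the trim step chained on (combining the two snippets above):

product_name = page %>% html_nodes(".v2-listing-card h3") %>% html_text() %>% stringr::str_trim()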
5. Get the Rest of the Elements Using the Same Logic
Alright, now that we know our script is working, let’s scrape the price and URL of each of these elements on the page.
product_price = page %>% html_nodes(".wt-text-title-01 .wt-text-title-01 .currency-value") %>% html_text()
product_url = page %>% html_nodes("li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link") %>% html_attr("href")
In the case of attributes, instead of using the html_text() function, we used html_attr("href") to extract the value stored in the selected attribute – in this case, 'href', which holds the product URL.
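To make the difference clearer, here's a small sketch that runs both functions on the same set of link nodes:

# Grab the <a> nodes once so we can compare the two extraction functions
links = page %>% html_nodes("li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link")
html_text(links)         # the text inside each <a> tag
html_attr(links, "href") # the value of each tag's href attribute, i.e. the product URL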
6. Sending Our Data to a Data Frame
Alright, so far our scraper is working perfectly. However, having the data logged into our console might not be the best way to handle it. Let’s create a data frame to organize our data for later analysis.
To do this, we'll call base R's data.frame() function and use our variables as columns.
costumes_ideas = data.frame(product_name, product_price, product_url, stringsAsFactors = FALSE)
We can type View(costumes_ideas) to look at the data frame we just created.
Note: the View command is case sensitive, so make sure you capitalize the V.
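If you'd rather inspect it from the console without opening the viewer, a couple of base R functions work just as well:

head(costumes_ideas) # prints the first six rows
nrow(costumes_ideas) # should be 64 for a single category page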
7. Making Our Script Navigate Paginated Pages
Now that we’ve tested our scraper on one page, we’re ready to scale things a little by letting it move to other pages and extract the same data.
Although there are several ways to do this, we'll create a for loop to scrape two pages first. Of course, you can scale it to bigger projects that scrape many more pages, but for the sake of time and simplicity, let's keep the scope small.
A few things to consider when writing a for loop:
- Analyze the URL of the paginated pages to determine how many of them there are – in our example category, there are 194 pages in total.
- Take a look at how the numbers increase – in our case, the page number in the URL increases by one with each page, but that's not true for every website out there.
With these two things in mind, we can move on to building the loop.
costumes_ideas = data.frame()

for (page_result in seq(from = 1, to = 2, by = 1)) {
  link = paste0("https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page=", page_result)
  page = read_html(link)
  product_name = page %>% html_nodes(".v2-listing-card h3") %>% html_text() %>% stringr::str_trim()
  product_price = page %>% html_nodes(".wt-text-title-01 .wt-text-title-01 .currency-value") %>% html_text()
  product_url = page %>% html_nodes("li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link") %>% html_attr("href")
  costumes_ideas = rbind(costumes_ideas, data.frame(product_name, product_price, product_url, stringsAsFactors = FALSE))
}
Alright, there’s a lot going on in here, so let’s go over each part:
- We created an empty data frame outside the for loop to avoid overwriting the data from a previous iteration with the data from the next one.
- Then we wrapped our data frame call in rbind(), using the accumulating data frame as the first argument. Instead of overwriting it, this appends each round of data to what's already there.
- page_result will start with a value of 1 and then increase by one until hitting 2. This will be our primary way to move through the navigation.
- The paste0() function glues our base URL and the page_result value together without adding any separator between them, unlike paste(), which inserts a space by default (see the quick comparison below).
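Here's a short comparison of paste() and paste0() that shows why the latter is the right choice for building URLs:

base = "https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page="
paste(base, 2)  # inserts a space before the page number, breaking the URL
paste0(base, 2) # "...&explicit=1&page=2" – exactly what we need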
The rest of our script will stay the same; we just cut it and paste it into the for loop.
Time for a test run!
There we go! 128 elements were scraped (64 per page across two pages) in a matter of seconds.
However, scraping 2 pages out of 194 isn’t that amazing, is it? Let’s scale our project even further.
8. Using ScraperAPI for Scalability
Websites are meant to be used by users, not robots. For that reason, many websites have different methods to keep scripts from accessing and extracting too many pages in a short amount of time.
To avoid these measures, we would have to create a function that changes our IP address, have access to a pool of IP addresses for our script to rotate between, create some way to deal with CAPTCHAs, and handle JavaScript-rendered pages – which are becoming more common.
Or we could just send our HTTP request through ScraperAPI’s server and let them handle everything automatically.
We’ll only need to create a free ScraperAPI account to redeem 5000 free API credits and get access to our API key.
Then, we’ll make three little adjustments to our code.
library("rvest")
library("dplyr")

costumes_ideas = data.frame()

for (page_result in seq(from = 1, to = 194, by = 1)) {
  link = paste0("http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page=", page_result, "&country_code=us")
  page = read_html(link)
  product_name = page %>% html_nodes(".v2-listing-card h3") %>% html_text() %>% stringr::str_trim()
  product_price = page %>% html_nodes(".wt-text-title-01 .wt-text-title-01 .currency-value") %>% html_text()
  product_url = page %>% html_nodes("li.wt-list-unstyled div.js-merch-stash-check-listing a.listing-link") %>% html_attr("href")
  costumes_ideas = rbind(costumes_ideas, data.frame(product_name, product_price, product_url, stringsAsFactors = FALSE))
}
First, we increased the loop's upper limit to 194, the total number of pages in this category.
Then we pass our target URL as a parameter of the ScraperAPI request:
“http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://www.etsy.com/c/clothing/boys-clothing/costumes?ref=pagination&explicit=1&page=”
Finally, we added the “&country_code=us” parameter after page_result to tell ScraperAPI to send our request from US IP addresses.
At the time of writing, our script was sending requests from an Italian IP address, so the results would have differed from what a shopper in the US would see; the prices, for example, were shown in euros instead of USD, which is an important distinction to make.
Let's run one last test to see what we're getting – but remember, this will take a while, as our script needs to visit each of the 194 pages and scrape every element:
You’ll also notice the activity on your ScraperAPI dashboard:
Awesome! We now have 12,415 data points, but we might want to save it as more than just a data frame for easier analysis, right?
9. Pushing Our Data Into an Excel File
This is one of the reasons we love R. Exporting your data frame as an Excel file is as simple as installing a package.
install.packages("writexl")
Note: Type the installation command into your console.
Add our new dependency to the top of our script.
library("writexl")
Then, call the write_xlsx() function outside the loop, where the first argument is the name of our data frame and the second one is the path and file name.
write_xlsx(costumes_ideas,"/Users/lyns/Documents/Coding/costumes-data.xlsx")
Note: Here’s a quick guide on how to use the write_xlsx() function.
Finally, we just run the two new lines of our scraper (no need to re-run the entire script), and that’s it.
If we go to the path we specified inside the function, we’ll find our new Excel file inside.
We can now take a breath because it worked without hiccups!
Wrapping Up
Thank you so much for following along with us on this exciting project. R is a simple yet powerful language, and combined with the capabilities of ScraperAPI, we can create efficient and virtually unstoppable scripts to scrape almost any website.
In addition to preventing us from getting blocked or banned, ScraperAPI also adds delays between requests to avoid overloading or damaging the websites we're crawling and scraping.
If you want to learn more about the other applications of our API, our documentation is ready with useful code snippets and everything you need to create the best bots possible at a fraction of the cost of other web scraping tools.
If you want to learn how to scrape other websites and build scrapers in different languages, here are a few of our top resources to keep improving your craft:
- How To Scrape Amazon Product Data
- Scrape Data from Google Search Using Python and Scrapy [Step by Step Guide]
- Build a Python Web Scraper Step by Step Using Beautiful Soup
- Web Scraping Best Practices: ScraperAPI Cheat Sheet
Until next time, happy scraping!