eBay is one of the largest eCommerce platforms in the world. With over 1 billion active listings on its site, it is also one of the largest data gold mines for pricing analysis, online purchase trends, and more. However, before analyzing their data, you need to extract it.
Today, we’ll build an eBay web scraper using Node.js and Cheerio and show you the step-by-step process behind it – from idea to execution.
Is It Legal to Scrape eBay?
Yes, scraping eBay is totally legal if you’re not accessing data behind login walls or personal data without consent.
There’s still a lot of discussion around the legality of web scraping, and the fact that there are so many conflicting interests makes it hard to find reliable information. However, as it is today, it all depends on the type of data you’re scraping, how you’re extracting it, and its end-use.
In this article and all our tutorials, we’ll only show you ethical web scraping processes, so you can be confident you can apply these strategies without repercussions.
It’s important to know and understand the legal nuances of web scraping, so we built a complete guide to ensure legal and ethical web scraping practices.
Now, let’s start coding, shall we?
Scrape eBay Product Data with Cheerio
loop and the network tab in Chrome’s DevTools.
Note: You don’t need to read those first to understand this tutorial, but it might help to get a clearer picture of our thought process.
To build on top of that, we’ll create an async function to scrape the name, price, and link of 4k TVs on eBay and then export the data into a CSV using the ObjectsToCsv package.
1. Understanding eBay’s URL Structure
Let’s go to https://www.ebay.com/ and search for “4k smart tv” in the search bar to grab our initial URL.
It sends us to the following URL:
If we were to scrape just this page, we could stop at this point and start writing our code. However, we want to see how it changes when moving through the pagination to understand how we can tell our script to do the same.
At first glance, it seems like the _sacat parameter stores the page number, so changing it would be enough. To check, since eBay uses numbered pagination, we can click on the “Next” button and see how the URL changes.
Here’s the resulting URL:
This is quite different from what we had before. So let’s go back and see if it returns to the previous version.
No, it uses the “new” version of the URL when we use the pagination. This is great! All we need to do is change the _pgn parameter, and it will move to the next page. We confirmed this by changing the number in the address bar.
Awesome, we’ll use this new version as our base URL for the HTTP request and later on to allow us to scrape every page in the series.
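The pagination behavior above can be sketched with Node’s built-in URL class. This is only a minimal sketch, not the exact URL eBay returns: the https://www.ebay.com/sch/i.html path and the buildSearchUrl helper are assumptions for illustration, and the real URL in your address bar will likely carry extra tracking parameters. Only _nkw (the search term) and _pgn (the page number) are set here.

```javascript
// Hypothetical helper: builds a search URL for a given term and page.
// The base path is an assumption; copy the real one from your address bar.
const EBAY_SEARCH_BASE = 'https://www.ebay.com/sch/i.html';

function buildSearchUrl(searchTerm, pageNumber) {
  const url = new URL(EBAY_SEARCH_BASE);
  url.searchParams.set('_nkw', searchTerm); // search keywords
  url.searchParams.set('_pgn', String(pageNumber)); // page in the pagination
  return url.toString();
}

console.log(buildSearchUrl('4k smart tv', 2));
// prints https://www.ebay.com/sch/i.html?_nkw=4k+smart+tv&_pgn=2
```

Building the URL from parameters instead of concatenating strings keeps the search term properly encoded and makes it easy to change the page number later.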
2. Checking the Page Source for Our Data
Go to the page you want to scrape, right-click and click on “View Page Source”.
It will show you the site’s source code before any AJAX injection. As a test, we’ll copy a product’s name and search for it in the page source.
And next is the price.
We did the same with a few other products just to be sure, and we could find the element every time.
3. Sending Our HTTP Request with Axios
The fun part begins! Let’s create a new folder for our project, open it in VS Code (or your favorite editor), and initialize it with npm init -y to create the initial package.json file. From there, we can install Axios, a simple tool for sending HTTP requests with Node.js, with npm install axios.
To send and test the request, let’s create a new file called index.js (original, we know), require Axios at the top, and create an async function. Inside it, we’ll send our request using Axios and store the response’s data inside a variable called html for clarity.
Then we’ll log the html variable to verify that our request is working:
Ah yes! A bunch of nonsense, as expected.
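As a minimal sketch of the request step, assuming Axios is installed: the fetchHtml name and the httpClient parameter are our own illustrative choices, not part of Axios. Passing the client as a parameter is only there so the function can be tried out without network access.

```javascript
// Minimal sketch: download the raw HTML of a page.
// `httpClient` defaults to Axios (install it with `npm install axios`);
// any object with a compatible get(url) method can be passed instead.
async function fetchHtml(url, httpClient = require('axios')) {
  const response = await httpClient.get(url);
  // Axios stores the response body in the `data` property.
  return response.data;
}

// Usage (needs network access and Axios installed):
// fetchHtml('https://www.ebay.com/').then((html) => console.log(html));
```

Logging the returned string should print the same wall of raw markup described above.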
4. Parsing the Raw HTML with Cheerio
Before extracting our elements, we need to parse the raw data we downloaded to give it a structure we can navigate. That’s where Cheerio comes in!
We’ll create a new variable and pass html to Cheerio using cheerio.load().
From there, let’s head back to the website to find our selectors.
Note: Don’t forget to install Cheerio with npm install cheerio before using it.
5. Picking the Right Selectors
The first thing we’re looking for is the element that contains all the data we’re looking for. So every product seems to be contained within a card, right?
We want to grab the element that contains all cards so we can then iterate through the list and extract the information we want (name, price, and URL – to refresh your memory).
A ul element wraps all product cards, so it is a great starting point to explore the HTML structure a little bit more.
Like we thought, every product card is an li element within the ul. All we need to do is grab all the li elements and assign them to a variable, effectively creating an array we can then go through to extract the data.
For testing, let’s open the browser’s console, query the li element’s class, and see what gets returned:
Unlike Python’s Scrapy, Cheerio doesn’t have a built-in shell for testing, but we can use the console to test the selectors without having to send a request every time. We did the same thing with the rest of the elements.
We’ll pick the only h3 tag inside of each element within the list for the name.
In the case of the price, it’s wrapped within a span element with the class
Lastly, for the URL we needed to do something a little different. Although the a tag had a class we could use, it was shared by other elements outside our list. Notice how it returned 64 nodes instead of 59, which is the correct number of
Technically it would still work for us because we’ll be telling our scraper to look for the element inside the list. However, just to be sure, we’ll go up in the hierarchy, grab the div containing the URL, and then move down to the a tag itself like this: 'div.s-item__info.clearfix > a'.
6. Extracting eBay Data
So the logic would be like this:
Let’s put it all together now, as we already know the selectors:
However, we’re not done yet. We need to add a new method at the end of each selection; otherwise, we’ll get the whole element rather than the values we want.
We want the text inside the element for the name and price, so all we need to do is add the text() method at the end. For the URL, we want the value stored inside the href attribute, so we use the attr() method and pass the attribute we want the value from.
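Putting the parsing and extraction steps together, a minimal sketch could look like the one below. The 'li.s-item' and 'span.s-item__price' selectors and the extractTvs name are assumptions for illustration, so confirm the class names in DevTools before relying on them. Only the h3 tag and the 'div.s-item__info.clearfix > a' selector come from the steps above. The results go into a local array so they can be inspected.

```javascript
// Minimal sketch of the extraction step. Requires `npm install cheerio`.
// The 'li.s-item' and 'span.s-item__price' selectors are assumptions;
// confirm them in DevTools, since eBay's markup changes over time.
function extractTvs(html, cheerio = require('cheerio')) {
  const $ = cheerio.load(html); // parse the raw HTML into a queryable tree
  const results = [];

  $('li.s-item').each((index, element) => {
    const card = $(element); // wrap the li so we can search inside it
    results.push({
      // text() returns the text content of the matched element
      name: card.find('h3').text().trim(),
      price: card.find('span.s-item__price').text().trim(),
      // attr('href') returns the value of the href attribute
      link: card.find('div.s-item__info.clearfix > a').attr('href'),
    });
  });

  return results;
}
```

Searching with find() from each card, rather than from the whole document, is what keeps elements outside the list from leaking into the results.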
We could log each variable to the console but we would be getting a lot of messy data. Instead, let’s give it some structure before testing the scraper.
7. Pushing the Extracted Data to an Empty Array
This is actually quite an easy process that will help us organize the data and make it ready to export.
First, we’ll create an empty array outside our function.
From inside the function, we can call this variable and use the push() method to add all elements to our empty array. We’ll add the following snippet of code inside tvs.each(), right after the
Run the test with a console.log(scrapedTVs) and see what we get:
Nothing can beat the feeling of our code working! Our data is structured, clean, and in perfect shape to be exported.
8. Exporting Our Data to a CSV
Exporting data into a CSV is made super simple with the ObjectsToCsv package. Just run npm i objects-to-csv and add it to the dependencies at the top.
ObjectsToCsv has an easy syntax:
To initiate the export, we need to create a new ObjectsToCsv() instance and pass it our dataset. Then, we’ll await the promise to resolve and save the result into a CSV file by giving it the path. We’re also setting append to true (it’s false by default) because we’re going to be adding more data to it from each page of the pagination.
For testing, we’ll log “Saved to CSV” to the console:
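A minimal sketch of the export step, assuming the objects-to-csv package is installed: the saveToCsv name, the default file path, and the CsvWriter parameter are illustrative assumptions. The class is passed as a parameter only so the function can be exercised without touching the disk.

```javascript
// Minimal sketch of the export step. Requires `npm i objects-to-csv`.
// `CsvWriter` defaults to the ObjectsToCsv class; the default output
// path is an illustrative assumption.
async function saveToCsv(rows, filePath = './4k-tvs.csv', CsvWriter = require('objects-to-csv')) {
  const csv = new CsvWriter(rows);
  // append: true adds rows to an existing file instead of overwriting it.
  await csv.toDisk(filePath, { append: true });
  console.log('Saved to CSV');
}

// Usage (inside an async function, with objects-to-csv installed):
// await saveToCsv(scrapedTVs);
```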
9. Scrape eBay’s Pagination
We already know we can scrape all pages inside the pagination by changing the _pgn parameter in the URL. So for this project, we can implement a for loop that changes this number after every iteration.
But we also need to know when to stop. Let’s head back to the website and see how many pages the pagination has.
It caps at 21. However, if we edit the URL and put 22 in the parameter, the server still responds, but it loads the last page of the series; in other words, page 21.
We can now write the three statements for the for loop and put everything inside of it.
Here’s how it should look:
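As a minimal sketch of that loop, assuming the pagination ends at page 21 as observed above: scrapeAllPages and the scrapePage callback are hypothetical names standing in for the request, parsing, and export steps.

```javascript
// Minimal sketch of the pagination loop. `scrapePage` is a hypothetical
// callback standing in for the request, parsing, and export steps.
const LAST_PAGE = 21; // the last page number observed for this search

async function scrapeAllPages(scrapePage, lastPage = LAST_PAGE) {
  // Three statements: start at page 1, continue while we have not passed
  // lastPage, and move forward one page per iteration.
  for (let pageNumber = 1; pageNumber <= lastPage; pageNumber += 1) {
    // Await each page so requests run one at a time, in order.
    await scrapePage(pageNumber);
  }
}

// Example: log the page numbers that would be requested for 3 pages.
scrapeAllPages(async (pageNumber) => console.log(`Scraping page ${pageNumber}`), 3);
```

Because requesting a page past the end returns the last page again, a fixed upper bound is the simplest way to keep the loop from collecting duplicates.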
If we put all our previous code inside this for loop (which is inside our async function), it’ll keep running until the condition is no longer met. Still, there are two changes we need to make before we call it a day.
The second change we’ll want to make is sending our request through ScraperAPI servers to handle IP rotation and headers automatically. To do so, we’ll need to create a free ScraperAPI account. It’ll provide us with an API key and the string we’ll need to add to the URL for it to work:
This way we can avoid any kind of anti-scraping mechanism that could block our script.
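One way to sketch this is to wrap the target URL in a ScraperAPI request URL. The endpoint and the api_key and url parameter names below reflect our understanding of ScraperAPI’s request format, so treat them as assumptions and confirm the exact string in your dashboard. The buildProxiedUrl helper and the 'YOUR_API_KEY' placeholder are illustrative.

```javascript
// Minimal sketch: route a target URL through ScraperAPI.
// The endpoint and parameter names are assumptions; confirm them in
// your ScraperAPI dashboard. 'YOUR_API_KEY' is a placeholder.
const SCRAPERAPI_ENDPOINT = 'http://api.scraperapi.com/';

function buildProxiedUrl(apiKey, targetUrl) {
  const proxied = new URL(SCRAPERAPI_ENDPOINT);
  proxied.searchParams.set('api_key', apiKey);
  // searchParams.set percent-encodes the target URL, so its own query
  // string (?_nkw=...&_pgn=...) is not mistaken for our parameters.
  proxied.searchParams.set('url', targetUrl);
  return proxied.toString();
}

console.log(
  buildProxiedUrl('YOUR_API_KEY', 'https://www.ebay.com/sch/i.html?_nkw=4k+smart+tv&_pgn=1')
);
```

The resulting string is what we would pass to Axios in place of the plain eBay URL.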
10. eBay Web Scraper Finished Code
Here’s the finished code ready to use:
Note: Keep in mind that you’ll need to replace the API key in the code with your own key for it to work.
Great work! You now have a fast and effective eBay web scraper ready to be deployed.
To make it even more powerful, you could use the _nkw parameter inside the URL to make it easier to add a new term to the search, and with ScraperAPI by your side, you won’t have to worry about your IP getting blocked. However, that’s something we’ll leave to your imagination.
We hope you enjoyed this tutorial and learnt a trick or two. Don’t hesitate to contact us if you have any questions, and tag us in your projects on social media. We’re always interested to see what awesome things you build.