Web Scraping Glossary

There are many moving parts involved in web scraping—including jargon-filled terms and confusing methods. 

 

With new technologies, strategies, and terms popping up daily, we created this shared vocabulary guide to make communicating with colleagues and stakeholders easier. Now you can avoid the back-and-forth with our detailed A-Z scraping glossary.

 

*Bookmark this page to stay up to date on our terminology.


A

Web scraping terms that start with A

API Key

A unique identifier that allows access to an application programming interface (API). It's a secret token, a short string that must be included in API requests, similar to a password, to authenticate and authorize the requester.

API Credits

These are a type of currency that some API providers use to limit usage or charge for access to their API. Each API call may require a certain number of credits, and users may need to purchase more credits to continue using the API.

API Call

A request for information or action sent from one software application to another through an API. It's like asking a question or making a request of another computer program.

API Endpoint

A specific URL within an API that can be used to access a particular resource or perform a specific action. It's the target destination for a request sent to an API.

For example, ScraperAPI offers several endpoints, such as its standard API endpoint, its Async endpoint for submitting scraping jobs, and its structured data endpoints.

API Parameters

These are a series of values that are passed along with a request to an API endpoint. They can be used to specify options or criteria for the request, such as filtering, sorting, or pagination. For example, when sending a request to ScraperAPI, you can use these parameters:
  • country_code – to set the location you want the request to be made from – it can be US, CA, FR, etc.
  • render – to tell ScraperAPI to render the website before returning the HTML data.
  • api_key – to authenticate yourself and get access to the tool.
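As a rough sketch of how parameters like these travel with a request, here's how they end up encoded into the final URL. This uses only Python's standard library, and the values shown are placeholders:

```python
from urllib.parse import urlencode

# Placeholder values for illustration only
payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://example.com',
    'country_code': 'us',
}

# Parameters are appended to the endpoint as a query string
query = urlencode(payload)
full_url = f'https://api.scraperapi.com/?{query}'
print(full_url)
```

Note that `urlencode` also percent-encodes unsafe characters, so the `url` parameter's `https://` becomes `https%3A%2F%2F` in the query string.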

Axios

It's a popular JavaScript library for making HTTP and API requests from a Node.js environment, allowing you to access static websites programmatically.

Alternative Data

Any type of data collected from non-traditional sources such as satellite imagery, social media activity, weather patterns, etc. It's often used in quantitative finance and investment research to gain a competitive advantage, but its use has grown in industries like marketing and machine learning.

Asynchronous APIs

Unlike traditional APIs, these APIs allow for asynchronous communication between client and server, letting the client send requests without having to wait for a response. This is especially useful for reducing data collection complexity by sending a massive number of requests simultaneously.

An array is a collection of values or objects stored in a single variable. It allows for easy manipulation and organization of collected data and simplifies exporting data to other systems and databases in a structured format.

Anti-Scraping Techniques

These are methods websites or APIs use to prevent automated scraping of their content. They may include measures such as rate limiting, IP blocking, or CAPTCHAs, and they are designed to protect against programmatic access to data.

B

Web scraping terms that start with B

Beautiful Soup

It's a Python library that helps you extract useful information from websites programmatically. It was designed to make it easy to find and parse the important parts of a website, like text, links, and images, so they can be used for other purposes.
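A minimal sketch of how Beautiful Soup parses HTML; the markup here is invented for illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <h1>Product list</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
</body></html>
"""

# Parse the raw HTML into a navigable tree
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)  # the heading's text
links = [a['href'] for a in soup.find_all('a')]
print(links)         # every link's href attribute
```

In a real scraper, the `html` string would come from an HTTP response instead of a literal.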

Browser behavior profiling is a technique websites use to track and analyze their visitors’ behavior. By monitoring things like which pages someone visits, how long they stay on each page, and how fast they move through the site, websites can build a profile of each visitor and use that information to block bots and unwanted visitors.

C

Web scraping terms that start with C

CAPTCHA

It's an interactive test used by websites to verify that the user agent trying to access the site is a human, not a bot or script, by asking them to complete a simple task, such as typing in letters or clicking on pictures.

Cascading Style Sheets (CSS) are used to tell the browser how to display an HTML element by creating rules for a specific set of elements. CSS targets specific HTML tags or attributes like class and ID (also known as CSS selectors) to specify which elements are affected by which rules.

CSS Selectors

These are specific HTML elements' attributes or tags we can use for targeting data points on a website using the CSS rules established on the site. Here's an example targeting a <div> with the class s-item__info.clearfix:

javascript
const tvLink = $(element).find('div.s-item__info.clearfix')

Class Attribute

An attribute that can be added to HTML elements to provide additional information about the element's purpose or styling. For example:

html
<p class="highlight">This paragraph will be highlighted.</p>
<p class="highlight">So will this one.</p>

Cheerio

It's a fast and efficient library used for parsing and interpreting HTML content for later processing in a Node.js environment. It's designed to parse and manipulate HTML or XML documents in a simple, easy-to-understand way, letting us extract specific pieces of data by targeting HTML tags, CSS selectors, or XPath selectors.

Colly

A simple-to-use web scraping library for Golang, letting you extract data programmatically with an easy-to-understand syntax and improved efficiency.

Concurrency

In web scraping, concurrency refers to the ability to perform multiple tasks or requests at the same time. APIs need to allocate more resources to handle more simultaneous requests, so they usually set concurrency limits based on your subscription plan.
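The idea can be sketched with Python's ThreadPoolExecutor, where max_workers plays the role of a concurrency limit. The fetch function here is a stand-in for a real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real scraping call (e.g., requests.get); it just echoes the URL
def fetch(url):
    return f'fetched {url}'

urls = [f'https://example.com/page/{n}' for n in range(1, 6)]

# max_workers caps how many "requests" run at the same time,
# similar to an API's concurrency limit
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))

print(results[0])
```

`pool.map` preserves input order, so results line up with the URL list even though the calls overlap in time.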

Crawlers

Also known as spiders, crawlers are computer programs designed to scan the web by following links within web pages. They are designed to visit multiple pages across multiple domains based on a set of rules. Crawlers usually rely on scrapers to work effectively, e.g., using a scraper to find all links within a page.

CSV

A CSV (comma-separated values) file stores tabular data as plain text: each line represents a table row, and commas separate the values within it, representing the table's columns. This format is widely used because it is easy for machines to understand, making CSVs a convenient way to export information from one system to another.
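A short sketch of writing scraped rows to CSV with Python's standard library; the sample data is invented, and an in-memory buffer stands in for a file on disk:

```python
import csv
import io

rows = [
    {'name': 'LOW SNEAKERS', 'price': '59.99'},
    {'name': 'HIGH BOOTS', 'price': '89.99'},
]

# Write the scraped rows out as CSV text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'price'])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

To write to an actual file, replace the buffer with `open('products.csv', 'w', newline='')`.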

Client-Side Rendering

A way of building websites where most of the work of loading and rendering the page is done on the user's computer rather than on the server. It's like having a website that can update itself on the fly without needing to refresh the page.

To scrape a website using client-side rendering, you might need to use a headless browser or a scraping tool like ScraperAPI to render the page before returning the HTML file – which would be empty before rendering happens.

Side note: Another very common technique of scraping such pages is to understand what network requests they do in the background and imitate them. This way, the user usually receives structured data responses out of the box, as those background requests usually invoke some structured data endpoints.

Callback

A function or piece of code that's called when a certain event happens, like an API returning the data requested through an async API call. Callbacks are primarily used in asynchronous programming.

D

Web scraping terms that start with D

Data Analysis

Refers to the process of examining and interpreting data in order to extract useful insights and information. This can involve techniques such as statistical analysis, visualization, and machine learning, and it is used in a variety of fields such as business, science, and social research. Web scraping is widely used to collect the data necessary for this process.

Data Reuse

Refers to taking data that has already been collected and using it for a different purpose. This can involve analyzing the data in a new way, combining it with other data sources, or using it for a different application or research question. It's also one of the main uses of scraped data.

The Document Object Model (DOM) is a programming interface for HTML and XML documents. It provides a way for programs to interact with and manipulate the structure and content of a web page.

DOM Tree

It's a hierarchical structure that represents the elements and content of a web page in a tree-like format. Each node in the tree represents an element, attribute, or piece of text content of the web page.

Libraries like Cheerio and Colly parse the raw HTML of a website to generate a DOM tree, allowing you to target specific elements using CSS selectors or targeting the order the elements are represented in the DOM tree.

Datacenter Proxies

These are a type of proxy server that uses IP addresses from data centers to mask the user's real IP address; these IP addresses are not associated with residential (home) addresses or internet service providers (ISPs). They are commonly used for web scraping and other online activities that require anonymity and bypassing IP restrictions.

Data center proxies tend to be more affordable and faster than residential proxies, but they are also more likely to be detected and blocked by websites and services.

Dataframe

A two-dimensional, table-like data structure that stores data in rows and columns, similar to a spreadsheet, where each row represents an observation or instance and each column represents a variable or attribute.

Dataframes are commonly used in data analysis and manipulation tasks, as they provide a convenient way to organize and analyze large and complex datasets. They can be easily created and manipulated using data analysis tools and programming languages like Python, R, and MATLAB.

Dependency

Refers to a component or library imported into a project so the program or script can function properly, perform more complex tasks, or simplify processes. Libraries like Beautiful Soup and Cheerio are common dependencies of web scraping projects.

Dynamic Page

A webpage that can change its content and appearance in response to user interactions or other events. These are typically harder to scrape because the initial HTML doesn't contain the data you can see in the browser.

To scrape these pages, you’d need to use a headless browser – which increases the complexity of the code – or a scraping tool like ScraperAPI to render the page’s content and have access to the data you need.

Note: For more context, see Client-Side Rendering.

F

Web scraping terms that start with F

For Loop

It's a programming statement that lets you repeat a set of instructions a certain number of times or until a certain condition is met. This could be, for example, going through a list of pages and extracting a specific set of information from each one.
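For instance, a minimal Python sketch of looping over a list of (hypothetical) page URLs:

```python
# Hypothetical page URLs; in a real scraper each one would be requested and parsed
page_urls = [f'https://example.com/products?page={n}' for n in range(1, 4)]

collected = []
for url in page_urls:
    # The request and parsing would happen here; we just record the URL visited
    collected.append(url)

print(len(collected))  # one entry per page visited
```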

Fetching

Refers to the process of retrieving data or information from a particular source or location. When a fetch request is made, the client (in this case, a scraper) sends a request to the server, specifying the URL of the resource to be fetched and any other necessary parameters or headers. The server then responds with the requested data, which can include text, HTML, JSON, or other formats.

G

Web scraping terms that start with G

Geotargeting

By checking your IP address, servers can tell which country or region you're sending your request from, so they can show you the most relevant information.

In web scraping, we use geotargeting to extract data that is relevant to specific regions or countries (localized data), such as local news articles, weather forecasts, or product prices, by using IP addresses (proxies) based on those locations, making us look like a local visitor.

Goutte

It's the most popular web scraping library for PHP. It allows developers to extract data from HTML and XML documents and provides a simple and intuitive interface for navigating web pages and selecting elements to scrape.

GET Request

A request method used to retrieve data from a server.

Note: Although a get() request is the most common method in web scraping, there are other request methods, like POST, DELETE, and PATCH, that are usually used to send data modification requests to a database or server.

H

Web scraping terms that start with H

HTML Tag

It's a piece of code used to define the structure and content of a web page. HTML tags are enclosed in angle brackets and can be used to define headings, paragraphs, images, links, and other elements. Because of this, we can target these tags and their attributes to extract specific bits of information programmatically.

html
<h1>This is a heading</h1>
<p>This is a paragraph</p>

HTML Attribute

A modifier applied to an HTML tag to provide additional information about the element. For example, the href attribute in an anchor tag specifies the URL the link should point to.

html
<a href="https://example.com/">this is a link pointing to example.com</a>

HTTP Request

A message sent from a client (in this case, a web scraper) to a server, requesting a specific action or resource. These can be, for example, a GET request (used to retrieve data from a server) or a POST request (used to submit data to a server).

HTTP Headers

Additional pieces of information sent along with an HTTP request or response. Headers can be used to provide information about the client or server, specify the content type or encoding of the data being sent, or control caching and security settings.
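For example, a scraper might define headers like these; the values are illustrative, not required:

```python
# Illustrative header values; real scrapers often rotate these to look
# like a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
}

# With the Python requests library, they would be passed along like this:
# requests.get('https://example.com', headers=headers)
print(sorted(headers))
```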

HTML Parsers

These are tools or libraries used to analyze and interpret unstructured data, such as HTML or XML, and transform it into a parse tree (think of a DOM tree), which is easier to read, understand, and navigate, allowing scrapers to target specific data points for extraction.

Headless Browser

A web browser without a graphical user interface (GUI) that is commonly used for web scraping or automated testing. This means the browser runs in the background without displaying the web page being accessed.

In web scraping, these are mostly used to extract data from dynamic pages requiring the script to interact with the page in order to access certain data points.

HTTParty

A Ruby gem used for making HTTP requests from Ruby code. It provides a simple and intuitive interface for making GET and POST requests, handling authentication and redirects, and parsing JSON and XML responses.

Html Agility Pack

A .NET library used for parsing HTML documents in C#. It provides a simple and flexible API for navigating and manipulating HTML elements, and it can be used for web scraping, screen scraping, and data mining.

Honeypots

Traps set up on a web page to detect and block web scraping bots or malicious activity. Honeypots can be implemented using hidden form fields and links or JavaScript challenges.

HTTPS

The secure version of the HTTP protocol, used to transfer data between a web server and a client (such as a web browser). HTTPS uses encryption to ensure that data is transmitted securely, protecting it from interception or modification by third parties.

ScraperAPI supports both HTTP and HTTPS, but we recommend always using the HTTPS version.

I

Web scraping terms that start with I

ID Attribute

A special attribute added to an HTML element to give it a unique name. Think of it as a special name tag for a specific element; unlike the class attribute, an ID must only be used once per page.

IP Address

A unique numerical identifier assigned to every device that connects to the internet. This includes computers, smartphones, and other devices.

IP Blocking

A technique used by websites to restrict access to certain users or devices based on their IP address. This is often done to prevent web scraping or to protect against malicious activity, such as DDoS attacks.

J

Web scraping terms that start with J

JSON

An acronym for JavaScript Object Notation. It's a structured data format used for exchanging data between different systems. It is a lightweight, human-readable format that can be easily understood and parsed by both machines and humans, making it one of the most commonly used formats in web scraping.

Here’s an example of a JSON file:

json
{
       "name": "LOW SNEAKERS",
       "price": "€ 59,99",
       "url": "https://www.jackjones.com/nl/en/jj/shoes/trainers/low-sneakers-12203668.html?cgid=jj-shoes-trainers\u0026dwvar_colorPattern=12203668_White_915599"
   }
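In Python, a JSON string like the one above can be parsed into a dictionary with the standard library; the sample here is simplified:

```python
import json

# A simplified version of the JSON snippet above
raw = '{"name": "LOW SNEAKERS", "price": "59.99"}'
product = json.loads(raw)  # JSON text -> Python dict

print(product['name'])
print(product['price'])
```

Going the other way, `json.dumps(product)` serializes the dictionary back to a JSON string.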

M

Web scraping terms that start with M

Mobile Proxies

IP addresses associated with actual mobile devices that can be used to send HTTP requests, sending information to the server like screen resolution, user agent string, and other mobile-specific attributes. These proxies are usually sourced from actual people who resell their unused mobile traffic to a provider, which then resells this service as mobile proxies.

Using mobile proxies allows you to collect data from sites that behave differently or provide different data depending on the type of device accessing them.

Note: There are unethically sourced mobile proxies too, where the mobile user is not aware that their phone and internet connection is used this way, which is highly illegal. Always make sure you’re buying from a trusted provider to avoid issues.

It’s a technique for extracting data from websites that uses asynchronous programming, allowing for multiple requests to be made simultaneously, improving the speed and efficiency of the scraping process.

Note: For this to be effective, you’ll need a scraping tool like ScraperAPI to avoid getting your IP blocked due to the high number of requests sent.

N

Web scraping terms that start with N

Noisy Data

Refers to data that contains errors, inaccuracies, or inconsistencies.

Nokogiri

A Ruby gem used for web scraping and parsing HTML and XML documents. It provides tools and methods for accessing and manipulating HTML and XML data, making extracting data from web pages easier.

P

Web scraping terms that start with P

Pagination

Websites like eCommerce stores or blogs have a large number of items for visitors to see, but showing the entire dataset on one page would be overwhelming and counterproductive, so websites split their catalog of items across multiple pages. Each page typically contains a subset of the data, with links or buttons to navigate to the next or previous pages.
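A scraper often walks through those pages by building each page's URL in turn. Here is a minimal sketch against a hypothetical catalog site:

```python
from urllib.parse import urlencode

BASE = 'https://example.com/catalog'  # hypothetical site

def page_url(page):
    """Build the URL for one page of the paginated catalog."""
    return f'{BASE}?{urlencode({"page": page})}'

# A real scraper would request and parse each of these in a loop
urls = [page_url(n) for n in range(1, 4)]
print(urls)
```

Sites that paginate through "next" links instead of page numbers require following the link found on each page rather than generating URLs up front.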

Payload

A block of data sent from one system to another, typically as part of an HTTP request or response. The payload may contain various types of data, such as user input, configuration settings, or scraped data.

You can use a payload to set the ScraperAPI’s parameters to simplify get() requests, make your code easier to read, and avoid setup errors because of conflicting parameters.

python
payload = {
   'api_key': 'YOUR_API_KEY',
   'asin': 'B08LNZVQ1J',
   'country': 'us'
   }

r = requests.get('https://api.scraperapi.com/structured/amazon/product', params=payload)

Puppeteer

A Node.js library for automating web browser interactions. It provides a set of tools and methods for interacting with and controlling a (potentially headless) browser programmatically, allowing developers to scrape data from websites and perform other automated tasks.

Puppeteer is widely used to collect data from dynamic websites. Because it opens a Chrome instance in the background, it can render JavaScript content and interact with JS elements like buttons and forms.

Proxy

A proxy is a server that acts as an intermediary between a web scraper and its target website. Proxies can be used to hide the IP address and location of the scraper to avoid detection or blocking by the server.

Proxy Manager

It's a software tool or library used to manage a pool of proxy servers. It may handle tasks such as rotating through a pool of proxies, checking the status of proxies, or automatically selecting the most appropriate proxy for a given scraping task.

POST Request

A type of HTTP request used to send data from a client (in this case, a web scraper) to a server. The data is typically sent as part of the request payload and can be used to perform various actions on the server, such as submitting a form or updating a database.

For example, you can send post() requests to our Async endpoint to submit scraping jobs:

python
initial_request = requests.post(
   url = 'https://async.scraperapi.com/jobs',
   json={
   'apiKey': 'YOUR_API_KEY',
   'url': 'https://quotes.toscrape.com/'
   })

R

Web scraping terms that start with R

Requests

A popular Python library used for sending HTTP requests.

Rendering

Refers to the process of loading and displaying a webpage in a web browser. In web scraping, rendering can be important for accessing dynamic content that isn't immediately available in the initial HTML source code.

Residential Proxies

Unlike datacenter proxies, residential proxies are IP addresses associated with real residential locations and devices. They are more expensive than datacenter proxies but less likely to be flagged by servers, making them the best option for scraping complex sites.

Rvest

An R package used for web scraping and data extraction. It provides all the tools needed to scrape web data using R, including methods to send HTTP requests and a parser to manipulate HTML and XML data.

Rotating Proxies

A type of proxy service that automatically rotates the IP address used for each request. For example, ScraperAPI uses machine learning and statistical analysis to determine the best IP address (proxy) to ensure a successful request (smart IP rotation).

Retry

Refers to a mechanism for automatically repeating a failed request or operation. This can be useful for dealing with temporary network errors or other issues that may prevent a request from succeeding on the first attempt.
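A simple retry loop can be sketched like this; the flaky function is a stand-in for a request that fails twice before succeeding:

```python
import time

def fetch_with_retries(fetch, retries=3, delay=0.01):
    """Call fetch(); if it raises, wait briefly and try again, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # give up after the final attempt
            time.sleep(delay)

# Stand-in for a request that hits temporary network errors
calls = {'count': 0}
def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary network error')
    return 'ok'

print(fetch_with_retries(flaky))  # succeeds on the third attempt
```

Production scrapers usually grow the delay between attempts (exponential backoff) rather than using a fixed pause.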

Rate Limiting

An anti-scraping technique used by websites to limit the number of requests a single user or IP address can make within a given period of time. If your script surpasses this limit, your IP will be banned temporarily or even permanently from accessing the site.

We recommend using a web scraping tool like ScraperAPI to avoid putting your IP at risk.

Request Rate

The number of HTTP requests sent by a web scraper within a given period of time.

S

Web scraping terms that start with S

Scraper

A program or script used to extract data from websites. It works by sending requests to a website and then parsing the HTML or other data returned in the response.

Scrapy

A Python framework used for web scraping and crawling. It provides powerful tools for managing web scraping projects, including built-in support for handling common issues like pagination.

ScrapySharp

A .NET port of the Scrapy framework, designed to provide similar functionality for web scraping in the .NET ecosystem.

Server Response

This is the HTTP response sent back to a client (e.g., a web scraper) by a web server. It typically contains data like HTML code, images, or other resources requested by the client.

Also known as a crawler, a spider is a program or script used to crawl websites automatically and extract data from them – usually to add them to a directory or index. It typically works by following links on a webpage to discover and index subsequent pages.

Status Code

A numeric code returned by a web server to indicate the status of a request. Common status codes include:

  • 200 Status Code – indicates that a request has been successful
  • 404 Status Code – indicates that the requested resource was not found on the server
  • 403 Status Code – indicates that access to the requested resource is forbidden
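A tiny helper mapping those codes to descriptions; the wording is ours, not an official registry:

```python
def describe_status(code):
    """Translate a few common HTTP status codes into short descriptions."""
    statuses = {
        200: 'success',
        403: 'forbidden',
        404: 'not found',
    }
    return statuses.get(code, 'other')

print(describe_status(200))
print(describe_status(404))
```

In practice, a scraper checks `response.status_code` after each request and branches on it, e.g., retrying on anything other than 200.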

Structured Data

Refers to data extracted from websites and organized in a structured format like JSON or XML.

Sentiment Analysis

The process of analyzing the emotional tone of a piece of text (e.g., a customer review). It's used alongside web scraping to analyze the sentiment of user comments, reviews, or online conversations on a website to gather insights for product launches, brand monitoring, etc.

Selenium is a common library for interacting with browsers programmatically. It is used to scrape dynamic content with languages like Python, Java, C#, etc., making it possible to control a headless browser and interact with the target website.

Static Page

A page that delivers all of its content within the initial HTML file. You can also think of static pages as pages that don't change their content through AJAX.

Session

It's a method of storing information between HTTP requests. In web scraping, sessions can be useful for maintaining login credentials or other context across multiple requests, like using the same proxy for several requests in a row.
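With Python's requests library, for example, headers (and cookies) set on a Session persist across every request it makes; the User-Agent value below is made up:

```python
import requests  # third-party: pip install requests

session = requests.Session()
session.headers.update({'User-Agent': 'my-scraper/1.0'})  # illustrative value

# Every request made through this session now carries that header,
# and cookies set by the server are remembered between calls:
# session.get('https://example.com/page/1')
# session.get('https://example.com/page/2')
print(session.headers['User-Agent'])
```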

Success Rate

Refers to the percentage of requests that successfully retrieve the desired data.

T

Web scraping terms that start with T

Tree Traversal

It's the process of navigating through the HTML code of a webpage to find specific elements or data. This can involve moving up and down the HTML tree (parse tree or DOM tree), accessing child or parent elements, and using selectors like class names or IDs to identify specific elements.

Timeout

A mechanism used to limit the amount of time a web scraper waits for a response from a server. If a server fails to respond within the specified time limit, the scraper may either retry the request or move on to the next request in the queue. Timeouts are important for preventing a scraper from getting stuck waiting indefinitely for a slow or unresponsive server.

U

Web scraping terms that start with U

URL

An acronym for Uniform Resource Locator; it's the address of a specific webpage or resource on the internet. It typically starts with "http://" or "https://" and includes the domain name, subdomain, and any path or query parameters needed to access the resource.

URL Encoding

It's the process of converting special characters in a URL into a format that can be safely transmitted over the internet. This is necessary because certain characters, such as spaces, symbols, and non-ASCII characters, can cause errors or be misinterpreted by servers or web browsers. URL encoding replaces these characters with a percent sign followed by their hexadecimal ASCII code.

For example, if we want to link to a particular phrase within a URL, we can use “%20” to represent the spaces between words:

https://www.scraperapi.com/documentation/node/#:~:text=retrieve%20product%20data%20from%20an%20amazon%20product%20page
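Python's standard library does this conversion for you:

```python
from urllib.parse import quote, unquote

phrase = 'retrieve product data'
encoded = quote(phrase)     # spaces become %20
print(encoded)

decoded = unquote(encoded)  # and back again
print(decoded)
```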

W

Web scraping terms that start with W

Web Scraping

An automated process of extracting data from websites using software tools or scripts. It involves sending requests to web servers, parsing the HTML or XML response, and extracting the desired information using selectors or regular expressions.

Web scraping is used for a wide range of applications, including market research, data analysis, and content aggregation.

Webhook

A method for web applications to send real-time notifications or data to other applications or servers. It works by setting up a URL endpoint on the receiving server, which the sending application can then call whenever a certain event occurs. This allows for automated data transfer and communication between different applications.

When using ScraperAPI’s Async endpoint, you can set a Webhook to receive the HTML response once the submitted scraping job is resolved.

X

Web scraping terms that start with X

XPath Selectors

Just like CSS selectors, XPath selectors are used to identify specific elements or nodes within an HTML or XML document. They use a language called XPath to navigate the hierarchical structure of the document and select elements based on their attributes, location in the tree, or other criteria.

They're usually used as an alternative to CSS selectors or to increase precision during web testing tasks; however, they are also more susceptible to failure if the website changes the order of its elements.
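Python's standard library supports a small subset of XPath, which is enough to sketch the idea; full XPath support usually comes from the third-party lxml library, and the markup below is invented:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html><body>
  <div class="item"><a href="/item/1">First</a></div>
  <div class="item"><a href="/item/2">Second</a></div>
</body></html>
""")

# Select every <a> inside a <div> whose class attribute is "item"
links = doc.findall('.//div[@class="item"]/a')
hrefs = [a.get('href') for a in links]
print(hrefs)
```

Note that ElementTree requires well-formed XML; real-world HTML usually needs a forgiving parser (such as lxml.html) before XPath expressions can be applied.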

Talk to an expert and learn how to build a scalable scraping solution.