Scraped HTML data can be difficult to use and analyze in its raw form. In some cases, you might want to remove tags like span and script from your HTML documents, making them tighter and easier to work with.
In this comprehensive guide, we’ll show you how to 1) remove unwanted HTML elements using simple BeautifulSoup methods and 2) clean and structure scraped data using data classes and data pipelines.
Turning pages into structured JSON and CSV data speeds up development and cuts down on data cleaning time.
Eliminating HTML Tags With BeautifulSoup
BeautifulSoup provides several methods to manipulate HTML documents, which are essential for cleaning scraped data. Our focus will be on five main techniques:
- Unwrapping tag contents with the unwrap() method
- Deleting tags with the decompose() method
- Replacing tags with the replace_with() method
- Extracting inner text with the get_text() method
- Prettifying HTML with the soup.prettify() method
1. Unwrap Tag Contents With the unwrap() Method
The unwrap() method in BeautifulSoup allows you to remove a tag from the HTML document while keeping its contents. This method is helpful when you want to remove formatting tags like span but retain the text within them.
For example, say we have an HTML document with b tags we wish to remove. We would approach this using a few easy steps:
Step 1: Import BeautifulSoup and parse your HTML document
<pre class="wp-block-syntaxhighlighter-code">from bs4 import BeautifulSoup
html_doc = """
< p class="title"><b>The Dormouse's story</b>< /p>
< p class="story">Once upon a time there were three little sisters; and their names were
<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>,
<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and
<a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>;
and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')</pre>
Step 2: Use the unwrap() method to remove the b tag but keep the text
# Find the b tag and unwrap it
soup.b.unwrap()
print(soup)
This code locates the first b tag (inside the paragraph with class "title") and unwraps it, effectively removing the b tag while keeping the text "The Dormouse's story":
<pre class="wp-block-syntaxhighlighter-code">< p class="title">The Dormouse's story< /p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p></pre>
2. Delete a Tag With the decompose() Method
The decompose() method entirely removes a tag and its contents from the document. This approach is useful for eradicating unnecessary or spammy content.
Assume we need to delete the span tags inside the a elements of this HTML document entirely. We would do so in a couple of easy steps:
Step 1: Import BeautifulSoup and parse your HTML document
<pre class="wp-block-syntaxhighlighter-code">from bs4 import BeautifulSoup
html_doc = """
< p class="title"><b>The Dormouse's story</b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie(Link)</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie(Link)</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie(Link)</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
Step 2: Use the decompose() method to remove the span tags along with their contents
a_tags = soup.find_all('a')

for a_tag in a_tags:
    a_tag.span.decompose()

print(soup)
The code above iterates over each a tag, finds the span tag inside it, and removes it along with its contents. The output is shown below:
<pre class="wp-block-syntaxhighlighter-code">< p class="title"><b>The Dormouse's story</b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p></pre>
3. Replace a Tag With the replace_with() Method
Sometimes you may not want to delete an HTML element entirely; instead, you might want to replace it with another tag or text. The replace_with() method enables this.
Step 1: Import BeautifulSoup and parse your HTML document
<pre class="wp-block-syntaxhighlighter-code">from bs4 import BeautifulSoup
html_doc = """
< p class="title"><b>The Dormouse's story</b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie(Link)</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie(Link)</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie(Link)</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')</pre>
Step 2: Use the replace_with() method to replace the span tags with b tags
for span_tag in soup.find_all('span'):
    new_tag = soup.new_tag("b")
    new_tag.string = "[Click Here]"
    span_tag.replace_with(new_tag)

print(soup)
This code finds all the span tags in the document and, for each one, creates a new b tag with the text "[Click Here]" and swaps it in using replace_with().
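Assuming the sample document above (with each "(Link)" wrapped in a span tag), printing the soup afterwards should produce output along these lines:

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie<b>[Click Here]</b></a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie<b>[Click Here]</b></a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie<b>[Click Here]</b></a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>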
4. Extract Inner Text With the get_text() Method
The get_text() method extracts all the text within a tag, including the text within its child tags. Passing strip=True strips leading and trailing whitespace from each piece of text, and passing a separator (such as a single space) keeps those pieces from running together once the tags are removed.
The following steps can be used to extract text from an HTML document:
Step 1: Import BeautifulSoup and parse your HTML document
<pre class="wp-block-syntaxhighlighter-code">from bs4 import BeautifulSoup
html_doc = """
< p class="title"><b>The Dormouse's story</b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')</pre>
Step 2: Use the get_text() method to extract all the text within a p tag
# Use get_text with a space separator and strip=True to clean the text
story_text = soup.find('p', class_='story').get_text(' ', strip=True)
print(story_text)

This prints the story paragraph's text, devoid of any HTML tags and stray whitespace (the extra spaces before the comma and semicolon appear because the separator is inserted between every text fragment):
"Once upon a time there were three little sisters; and their names were Elsie , Lacie and Tillie ; and they lived at the bottom of a well."
Using get_text() with strip=True and a separator is especially useful when dealing with scraped HTML content that has irregular spacing or newline characters within the text, ensuring the extracted text is as clean and usable as possible.
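If the source markup contains newlines or runs of spaces inside the text itself (common in real scraped pages), a simple follow-up step, not part of the original example, is to collapse the whitespace after extraction:

text = soup.find('p', class_='story').get_text()   # raw text; may contain newlines in messier documents
clean_text = " ".join(text.split())                 # collapse every run of whitespace into a single space
print(clean_text)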
5. Prettify HTML With the soup.prettify() Method
The soup.prettify() method formats the HTML document in a more readable way. This can be particularly useful when dealing with messy or poorly formatted HTML.
Step 1: Import BeautifulSoup and parse your HTML document
<pre class="wp-block-syntaxhighlighter-code">from bs4 import BeautifulSoup
html_doc = """
< p class="title"><b>The Dormouse's story</b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')</pre>
Step 2: Use the soup.prettify() method to format the HTML document
print(soup.prettify())
prettify() takes the current state of the soup object and returns a string with a formatted, indented version of the HTML. This makes the document easier to read and debug, especially when working with complex or deeply nested HTML structures.
< p class="title"><b> The Dormouse's story </b></ p>
< p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie"> Elsie </a> , <a id="link2" class="sister" href="http://example.com/lacie"> Lacie </a> and <a id="link3" class="sister" href="http://example.com/tillie"> Tillie </a> ; and they lived at the bottom of a well.< /p>
< p class="story">...< /p>
</pre>
By applying these techniques with BeautifulSoup, you can transform cluttered and nested HTML into clean, structured data ready for analysis or further processing!
Cleaning Dirty Data and Dealing With Edge Cases
Data scraped from the internet is often inconsistent or incomplete, presenting challenges in most scraping projects. This section will discuss strategies to handle such edge cases, including missing data, varying data formats, and duplicate entries.
Here are some strategies to deal with edge cases (a minimal sketch combining the first two follows the list):
- Try/Except – Useful to handle errors gracefully.
- Conditional Parsing – Implement conditional logic to parse data differently based on its structure.
- Data Classes – Use Python’s data classes to structure your data, making it easier to clean and manipulate.
- Data Pipelines – Implement a data pipeline to clean, validate, and transform data before storage.
- Clean During Data Analysis – Perform data cleaning as part of your data analysis process.
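As a rough illustration of the first two strategies, the sketch below wraps price parsing in try/except and branches on which element is present. The selectors and the parse_price helper name are made up for this example; adapt them to the markup of the page you are actually scraping.

def parse_price(product_card):
    """Hypothetical helper: extract a price (as a float) from a scraped product card element."""
    try:
        # Conditional parsing: sale items use different markup than regular items on our imaginary page
        sale_tag = product_card.find("span", class_="sale-price")
        if sale_tag is not None:
            raw_price = sale_tag.get_text(strip=True)
        else:
            raw_price = product_card.find("span", class_="price").get_text(strip=True)
        return float(raw_price.replace("£", ""))
    except (AttributeError, ValueError):
        # Missing element or unparseable text: return a sentinel instead of crashing the whole scrape
        return 0.0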
Implementing Data Classes for Structured Data
In this section, we’ll use data classes to structure our scraped data, ensuring consistency and ease of manipulation.
Before cleaning and processing the data, let’s define the necessary imports from the dataclasses module and set up our Product data class. This class will serve as the blueprint for structuring our scraped product data.
from dataclasses import dataclass, field, InitVar

@dataclass
class Product:
    name: str = ""
    price: InitVar[str] = ""
    url: str = ""
    price_gbp: float = field(init=False)
    price_usd: float = field(init=False)

    def __post_init__(self, price):
        self.url = self.normalize_url()
        self.price_gbp = self.clean_price(price)
        self.price_usd = self.price_to_usd()
Here, we import dataclass, field, and InitVar from the dataclasses module. The @dataclass decorator automatically adds special methods to the Product class, such as __init__ and __repr__, based on the class attributes.
The Product class is defined with several attributes:
- name: A string representing the product name.
- price: An initialization variable (InitVar) that holds the raw price string. It is accepted by __init__ and passed on to __post_init__ for processing, but it is not stored as a field on the instance.
- url: A string representing the product URL.
- price_gbp: A float representing the product price in GBP. It is not a parameter of __init__ (init=False) and is computed during initialization in __post_init__.
- price_usd: A float representing the product price in USD. Like price_gbp, it is not a parameter of __init__ and is computed during initialization.
This setup provides a structured way to manage product data, including cleaning and converting prices, normalizing URLs, and more. The next steps involve implementing methods within the Product class to perform these operations.
Clean the Price
With our Product data class defined, the next step is implementing methods for cleaning the price data. We’ll clean the price string by removing unnecessary characters like the currency symbol (“£”) and sale price indicators (“Sale price£”, “Sale priceFrom £”).
Define a clean_price method within the Product class that takes a price string, strips out the currency and sale-price text, and returns the cleaned price as a float.
def clean_price(self, price_string: str):
    price_string = price_string.strip()
    price_string = price_string.replace("Sale price£", "")
    price_string = price_string.replace("Sale priceFrom £", "")
    price_string = price_string.replace("£", "")
    if price_string == "":
        return 0.0
    return float(price_string)
- This method first strips any leading or trailing whitespace from the price string.
- It then removes any instances of “Sale price£” and “Sale priceFrom £” from the string.
- After that, it removes the “£” symbol.
If the resulting string is empty, it returns 0.0, indicating that the price is missing or not available. Otherwise, it converts the cleaned string to a float and returns it.
Convert the Price
After cleaning the price, we need to convert it from GBP to USD to standardize the currency across our dataset, especially when dealing with international data.
def price_to_usd(self):
    return self.price_gbp * 1.28
This method multiplies the cleaned GBP price by the conversion rate (1.28 in this example) to calculate the price in USD. This conversion rate can be dynamically updated based on current exchange rates.
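If you want the rate to be configurable rather than hard-coded, one light-touch option (an illustrative variation, not the article’s code) is to read it from an environment variable with a fallback:

import os

# Illustrative: read the rate from a GBP_TO_USD environment variable, falling back to 1.28
GBP_TO_USD = float(os.environ.get("GBP_TO_USD", "1.28"))

def price_to_usd(self):
    return self.price_gbp * GBP_TO_USD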
Normalize the URL
Another common edge case with scraped data is inconsistent URL formats. Some URLs might be relative paths, while others are absolute URLs. Normalizing URLs ensures they are consistently formatted, making them easier to work with.
We will define a normalize_url method within the Product class that checks whether the URL starts with http:// or https://. If not, it prepends “http://example.com” to the URL.
def normalize_url(self):
    if self.url == "":
        return "missing"
    if not self.url.startswith("http://") and not self.url.startswith("https://"):
        return "http://example.com" + self.url
    return self.url
- This method first checks if the URL is empty. If it is, it returns “missing”.
- Then, it checks if the URL does not start with “http://” or “https://”.
- If this is the case, it prepends “http://example.com” to the URL to ensure it has a valid format.
- If the URL already starts with “http://” or “https://”, it returns the URL as is.
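As an aside, Python’s standard library can handle most of this for you. The sketch below is an alternative (not the article’s method) using urllib.parse.urljoin; the base URL “http://example.com” is just a stand-in for whichever site the data was scraped from.

from urllib.parse import urljoin

def normalize_url(self, base_url: str = "http://example.com"):
    # urljoin leaves absolute URLs untouched and resolves relative paths against base_url
    if self.url == "":
        return "missing"
    return urljoin(base_url, self.url)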
Test the Product Data Class
Finally, test the Product data class with some sample data.
# Sample scraped data
scraped_products = [
    {"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
    {"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
    {"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
]

# Process and clean the data
processed_products = [Product(name=product["name"], price=product["price"], url=product["url"]) for product in scraped_products]

# Display the processed products
for product in processed_products:
    print(f"Name: {product.name}, GBP_Price: £{product.price_gbp}, USD_Price: ${product.price_usd}, URL: {product.url}")
This code creates a list of dictionaries representing some scraped product data. It then iterates over this list, creating a Product instance for each dictionary and collecting the results in the processed_products list.
Finally, it iterates over the processed_products list and prints out the name, GBP price, USD price, and URL of each product.
Name: Delicious Chocolate, GBP_Price: £1.99, USD_Price: $2.5472, URL: http://example.com/delicious-chocolate
Name: Yummy Cookies, GBP_Price: £2.5, USD_Price: $3.2, URL: http://example.com/yummy-cookies
Name: Tasty Candy, GBP_Price: £0.99, USD_Price: $1.2672, URL: http://example.com/apple-pies
This verifies that the Product data class correctly cleans and processes the scraped data.
Full Data Classes Code
Here’s the complete code for the Product data class example.
from dataclasses import dataclass, field, InitVar

@dataclass
class Product:
    name: str = ""
    price: InitVar[str] = ""
    url: str = ""
    price_gbp: float = field(init=False)
    price_usd: float = field(init=False)

    def __post_init__(self, price):
        self.url = self.normalize_url()
        self.price_gbp = self.clean_price(price)
        self.price_usd = self.price_to_usd()

    def clean_price(self, price_string: str):
        price_string = price_string.strip()
        price_string = price_string.replace("Sale price£", "")
        price_string = price_string.replace("Sale priceFrom £", "")
        price_string = price_string.replace("£", "")
        if price_string == "":
            return 0.0
        return float(price_string)

    def normalize_url(self):
        if self.url == "":
            return "missing"
        if not self.url.startswith("http://") and not self.url.startswith("https://"):
            return "http://example.com" + self.url
        return self.url

    def price_to_usd(self):
        return self.price_gbp * 1.28
# Sample scraped data
sample_products = [
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
    {"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
    {"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
]

# Process and clean the data
processed_products = []
for product in sample_products:
    prod = Product(name=product["name"],
                   price=product["price"],
                   url=product["url"])
    processed_products.append(prod)

# Display the processed products
for product in processed_products:
    print(f"Name: {product.name}, GBP_Price: £{product.price_gbp}, USD_Price: ${product.price_usd}, URL: {product.url}")
Process and Store Scraped Data with Data Pipelines
After cleaning our data using the structured approach provided by the Product data class, we proceed to the next crucial step: processing and storing this data efficiently.
Data pipelines play a pivotal role in this phase, enabling us to systematically process the data before saving it. The operations within our data pipeline will include:
- Duplicate Check: Verify if an item is already present in the dataset to prevent redundancy.
- Data Queue Management: Temporarily store processed data before saving it, managing flow and volume.
- Periodic Data Storage: Save the data to a CSV file at regular intervals or based on specific triggers.
Setting Up the ProductDataPipeline Class
First, let’s define the structure of our ProductDataPipeline class, focusing on initialization and the foundational methods that support the operations mentioned above:
import csv
from dataclasses import asdict, fields

class ProductDataPipeline:
    def __init__(self, csv_filename='product_data.csv', storage_queue_limit=10):
        self.names_seen = set()
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        pass

    def clean_raw_product(self, scraped_data):
        pass

    def is_duplicate(self, product_data):
        pass

    def add_product(self, scraped_data):
        pass

    def close_pipeline(self):
        pass
The __init__ method sets up initial conditions, including a set for tracking seen product names (to check duplicates), a queue for temporarily storing products, and configuration for CSV file output.
Cleaning Raw Product Data
Before a product can be added to our processing queue, it must first be sanitized and structured. The clean_raw_product method accomplishes this by converting raw scraped data into an instance of the Product data class, ensuring that our data conforms to the expected structure and types.
def clean_raw_product(self, scraped_data):
    cleaned_data = {
        "name": scraped_data.get("name", ""),
        "price": scraped_data.get("price", ""),
        "url": scraped_data.get("url", "")
    }
    return Product(**cleaned_data)
Once the raw product data is cleaned, it must be checked for duplicates and, if unique, added to the processing queue. This is handled by the is_duplicate and add_product methods.
Adding Products and Checking for Duplicates
The is_duplicate() method checks whether a product’s name is already in the names_seen set. If it is, it prints a message and returns True, indicating that the product is a duplicate. If it’s not, it adds the product name to names_seen and returns False.
def is_duplicate(self, product_data):
    if product_data.name in self.names_seen:
        print(f"Duplicate item found: {product_data.name}. Item dropped.")
        return True
    self.names_seen.add(product_data.name)
    return False
The add_product method first cleans the scraped data and creates a Product object. Then, it checks if the product is a duplicate using the is_duplicate method. If it’s not a duplicate, it adds the product to the storage_queue. Finally, it checks if the storage queue has reached its limit and, if so, calls the save_to_csv method to save the products to the CSV file.
def add_product(self, scraped_data):
    product = self.clean_raw_product(scraped_data)
    if not self.is_duplicate(product):
        self.storage_queue.append(product)
        if len(self.storage_queue) >= self.storage_queue_limit:
            self.save_to_csv()
Periodic Saving of Data to CSV
The save_to_csv method is triggered either when the storage queue reaches its limit or when the pipeline is closing, ensuring data persistence.
def save_to_csv(self):
    if not self.storage_queue:
        return
    self.csv_file_open = True
    with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=[field.name for field in fields(self.storage_queue[0])])
        if csvfile.tell() == 0:
            writer.writeheader()
        for product in self.storage_queue:
            writer.writerow(asdict(product))
    self.storage_queue.clear()
    self.csv_file_open = False
The save_to_csv method executes only when the storage queue is not empty. It marks the CSV file as open (to manage concurrent access), then iterates over each product in the queue, serializing it to a dictionary (using asdict) and writing it to the CSV file.
The CSV file is opened in append mode (a), which allows data to be added without overwriting existing information. The method checks whether the file is empty (csvfile.tell() == 0) and writes the column headers if so.
After saving the queued products, it clears the queue and marks the CSV file as closed, ready for the next batch of data.
Closing the Pipeline
Ensuring no data is left unsaved, the close_pipeline method handles the final data flush to the CSV file.
def close_pipeline(self):
    if self.storage_queue:
        self.save_to_csv()
Testing the Data Pipeline
To demonstrate the effectiveness of our ProductDataPipeline, we’ll simulate the process of adding several products, including a duplicate, to see how our pipeline manages data cleaning, duplicate detection, and CSV storage.
data_pipeline = ProductDataPipeline(csv_filename='product_data.csv', storage_queue_limit=3)

# Sample scraped data to add to our pipeline
sample_products = [
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
    {"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
    {"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
]

# Add each product to the pipeline
for product in sample_products:
    data_pipeline.add_product(product)

# Ensure all remaining data in the pipeline gets saved to the CSV file
data_pipeline.close_pipeline()

print("Data processing complete. Check the product_data.csv file for output.")
This test script initializes the ProductDataPipeline with a specified CSV file name and a storage queue limit of three. It then adds six products, one of which is a duplicate, to see how our pipeline handles them. The close_pipeline method is called last to ensure all data is flushed to the CSV file, demonstrating the pipeline’s capability to manage data end-to-end.
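With the sample list above, you should see console output roughly like the following, and product_data.csv should end up with five data rows plus a header:

Duplicate item found: Artisanal Chocolate. Item dropped.
Data processing complete. Check the product_data.csv file for output.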
Full Data Pipeline Code
Here’s the complete code for the ProductDataPipeline class, integrating all the steps mentioned in this article:
import csv
from dataclasses import asdict, fields

# Note: this assumes the Product data class from the previous section is defined in the same file or imported

class ProductDataPipeline:
    def __init__(self, csv_filename='product_data.csv', storage_queue_limit=10):
        self.names_seen = set()
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        if not self.storage_queue:
            return
        self.csv_file_open = True
        with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=[field.name for field in fields(self.storage_queue[0])])
            if csvfile.tell() == 0:
                writer.writeheader()
            for product in self.storage_queue:
                writer.writerow(asdict(product))
        self.storage_queue.clear()
        self.csv_file_open = False

    def clean_raw_product(self, scraped_data):
        cleaned_data = {
            "name": scraped_data.get("name", ""),
            "price": scraped_data.get("price", ""),
            "url": scraped_data.get("url", "")
        }
        return Product(**cleaned_data)

    def is_duplicate(self, product_data):
        if product_data.name in self.names_seen:
            print(f"Duplicate item found: {product_data.name}. Item dropped.")
            return True
        self.names_seen.add(product_data.name)
        return False

    def add_product(self, scraped_data):
        product = self.clean_raw_product(scraped_data)
        if not self.is_duplicate(product):
            self.storage_queue.append(product)
            if len(self.storage_queue) >= self.storage_queue_limit:
                self.save_to_csv()

    def close_pipeline(self):
        if self.storage_queue:
            self.save_to_csv()

data_pipeline = ProductDataPipeline(csv_filename='product_data.csv', storage_queue_limit=3)

# Sample scraped data to add to our pipeline
sample_products = [
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
    {'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
    {"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
    {"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
    {"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
]

# Add each product to the pipeline
for product in sample_products:
    data_pipeline.add_product(product)

# Ensure all remaining data in the pipeline gets saved to the CSV file
data_pipeline.close_pipeline()

print("Data processing complete. Check the product_data.csv file for output.")
Keep Learning
In this article, we have provided a comprehensive guide to cleaning and structuring scraped data using data classes and data pipelines. By following these techniques, you can ensure that your scraped data is accurate, consistent, and ready to be used for analysis.
If you are looking to scrape larger amounts of data without getting blocked, consider using ScraperAPI. It provides a simple API that allows you to fetch fully rendered HTML responses, including those from dynamic websites.
Until next time, happy scraping!