
Data Cleaning 101: How To Remove Tags with BeautifulSoup


Scraped HTML data can be difficult to use and analyze in its raw form. In some cases, you might want to remove tags like span and script from your HTML documents, making them tighter and easier to work with.

In this comprehensive guide, we’ll show you how to 1) remove unwanted HTML elements using simple BeautifulSoup methods and 2) clean and structure scraped data using data classes and data pipelines.


Eliminating HTML Tags With BeautifulSoup

BeautifulSoup provides several methods to manipulate HTML documents, which are essential for cleaning scraped data. Our focus will be on five main techniques:

  1. Unwrapping tag contents with the unwrap() method
  2. Deleting tags with the decompose() method
  3. Replacing tags with the replace_with() method
  4. Extracting inner text with the get_text() method
  5. Prettifying HTML with the prettify() method

1. Unwrap Tag Contents With the unwrap() Method

The unwrap() method in BeautifulSoup allows you to remove a tag from the HTML document while keeping its contents. This method is helpful when you want to remove formatting tags like span but retain the text within them.

For example, say we have an HTML document with b tags we wish to remove. We would approach this using a few easy steps:

Step 1: Import BeautifulSoup and parse your HTML document

	from bs4 import BeautifulSoup

	html_doc = """
	<p class="title"><b>The Dormouse's story</b></p>

	<p class="story">Once upon a time there were three little sisters; and their names were
	<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>,
	<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and
	<a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>;
	and they lived at the bottom of a well.</p>

	<p class="story">...</p>
	"""

	soup = BeautifulSoup(html_doc, 'html.parser')

Step 2: Use the unwrap() method to remove the b tag but keep the text

	# Find the b tag and unwrap it
	soup.b.unwrap()

	print(soup)

This code locates the first b tag (inside the paragraph with class title) and unwraps it, removing the tag while keeping the “The Dormouse’s story” text:

	<p class="title">The Dormouse's story</p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>

2. Delete Tags With the decompose() Method

The decompose() method entirely removes a tag and its contents from the document. This approach is useful for eradicating unnecessary or spammy content.

Assume we need to entirely delete the span tags inside the a elements of this HTML document. We can do so in a couple of easy steps:

Step 1: Import BeautifulSoup and parse your HTML document

	from bs4 import BeautifulSoup

	html_doc = """
	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie<span>(Link)</span></a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie<span>(Link)</span></a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie<span>(Link)</span></a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>
	"""

	soup = BeautifulSoup(html_doc, 'html.parser')

Step 2: Use the decompose() method to remove span tags along with their contents

	a_tags = soup.find_all('a')
	for a_tag in a_tags:
		a_tag.span.decompose()

	print(soup)

The code above iterates over each a tag, finds the span tag, and removes it along with its contents. The output is shown below:

	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>

3. Replace Tags With the replace_with() Method

Sometimes you may not want to delete an HTML element entirely; instead of removing the tag, you might want to replace it with another tag or with text. The replace_with() method enables this.

Step 1: Import BeautifulSoup and parse your HTML document

	from bs4 import BeautifulSoup

	html_doc = """
	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie<span>(Link)</span></a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie<span>(Link)</span></a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie<span>(Link)</span></a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>
	"""

	soup = BeautifulSoup(html_doc, 'html.parser')

Step 2: Use the replace_with() method to replace span tags with b tags

	for span_tag in soup.find_all('span'):
		new_tag = soup.new_tag("b")
		new_tag.string = "[Click Here]"
		span_tag.replace_with(new_tag)

	print(soup)

This code finds every span tag in the document and, for each one, creates a new b tag with the text "[Click Here]" and swaps it in using replace_with().
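
With the sample document above, the printed result should look roughly like this:

	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie<b>[Click Here]</b></a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie<b>[Click Here]</b></a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie<b>[Click Here]</b></a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>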

4. Extract Inner Text With the get_text() Method

The get_text() method extracts all the text within a tag, including the text within its child tags. Passing strip=True further cleans the output by stripping leading and trailing whitespace from each piece of text, making the result cleaner and more readable.

The following steps can be used to extract text from an HTML document:

Step 1: Import BeautifulSoup and parse your HTML document

	from bs4 import BeautifulSoup

	html_doc = """
	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>
	"""

	soup = BeautifulSoup(html_doc, 'html.parser')

Step 2: Use the get_text() method to extract all text within p tags

	# Using get_text with strip=True to clean the text
	story_text = soup.find('p', class_='story').get_text(strip=True)
	print(story_text)

This would print the story paragraph's text, devoid of any HTML tags and additional spaces at the beginning and end, providing a cleaner output:

"Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well."

Using get_text(strip=True) is especially useful when dealing with scraped HTML content that may have irregular spacing or newline characters within the text, ensuring the extracted text is as clean and usable as possible.
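
One caveat: strip=True strips each text fragment and joins them with no separator, so words that sit directly against a tag boundary can run together. If that happens, pass a separator as the first argument, for example:

	# Join the stripped text fragments with a single space to keep words apart
	story_text = soup.find('p', class_='story').get_text(" ", strip=True)
	print(story_text)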

5. Prettify HTML With the prettify() Method

The soup.prettify() method is used to format the HTML document in a more readable format. This can be particularly useful when dealing with messy or poorly formatted HTML.

Step 1: Import BeautifulSoup and parse your HTML document

	from bs4 import BeautifulSoup

	html_doc = """
	<p class="title"><b>The Dormouse's story</b></p>
	<p class="story">Once upon a time there were three little sisters; and their names were <a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>, <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a> and <a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>; and they lived at the bottom of a well.</p>
	<p class="story">...</p>
	"""

	soup = BeautifulSoup(html_doc, 'html.parser')

Step 2: Use the soup.prettify() method to format the HTML document

	print(soup.prettify())

prettify() takes the current state of the soup object and returns a string with a formatted, indented version of the HTML.

This makes the document easier to read and debug, especially when working with complex or deeply nested HTML structures.

	<p class="title">
	 <b>
	  The Dormouse's story
	 </b>
	</p>
	<p class="story">
	 Once upon a time there were three little sisters; and their names were
	 <a id="link1" class="sister" href="http://example.com/elsie">
	  Elsie
	 </a>
	 ,
	 <a id="link2" class="sister" href="http://example.com/lacie">
	  Lacie
	 </a>
	 and
	 <a id="link3" class="sister" href="http://example.com/tillie">
	  Tillie
	 </a>
	 ; and they lived at the bottom of a well.
	</p>
	<p class="story">
	 ...
	</p>

By applying these techniques with BeautifulSoup, you can transform cluttered and nested HTML into clean, structured data ready for analysis or further processing!
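
To tie these methods together, here is a short, self-contained sketch (the raw_html snippet is invented for illustration) that uses decompose() to drop script and style tags and get_text() to extract the remaining readable text:

	from bs4 import BeautifulSoup

	raw_html = """
	<html><head><style>p {color: red;}</style></head>
	<body><p>Useful text.</p><script>console.log("noise");</script></body></html>
	"""

	soup = BeautifulSoup(raw_html, 'html.parser')

	# Remove tags whose contents we never want to keep
	for tag in soup.find_all(['script', 'style']):
		tag.decompose()

	# Extract the remaining visible text
	clean_text = soup.get_text(" ", strip=True)
	print(clean_text)  # -> Useful text.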


Cleaning Dirty Data and Dealing With Edge Cases

Data scraped from the internet is often inconsistent or incomplete, presenting challenges in most scraping projects. This section will discuss strategies to handle such edge cases, including missing data, varying data formats, and duplicate entries.

Here are some strategies to deal with edge cases:

  • Try/Except – Useful for handling errors gracefully (see the short sketch after this list).
  • Conditional Parsing – Implement conditional logic to parse data differently based on its structure.
  • Data Classes – Use Python’s data classes to structure your data, making it easier to clean and manipulate.
  • Data Pipelines – Implement a data pipeline to clean, validate, and transform data before storage.
  • Clean During Data Analysis – Perform data cleaning as part of your data analysis process.
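
For instance, here is a minimal, hypothetical sketch of the first two strategies: wrapping a fragile lookup in try/except and falling back to a different selector when the page structure varies (the product_card argument and both class names are placeholders):

	def extract_price(product_card):
		"""Parse a price that may live in different elements on different page versions."""
		try:
			# Newer layout: price sits in a dedicated span
			price_tag = product_card.find('span', class_='price')
			if price_tag is None:
				# Older layout: fall back to a generic price container
				price_tag = product_card.find('div', class_='product-price')
			return price_tag.get_text(strip=True)
		except AttributeError:
			# Neither selector matched; record the gap instead of crashing
			return ""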

Implementing Data Classes for Structured Data

In this section, we’ll use data classes to structure our scraped data, ensuring consistency and ease of manipulation.

Before cleaning and processing the data, let’s define the necessary imports from the dataclasses module and set up our Product data class. This class will serve as the blueprint for structuring our scraped product data.

	from dataclasses import dataclass, field, InitVar

	@dataclass
	class Product:
		name: str = ""
		price: InitVar[str] = ""
		url: str = ""
		price_gbp: float = field(init=False)
		price_usd: float = field(init=False)
	
		def __post_init__(self, price):
			self.url = self.normalize_url()
			self.price_gbp = self.clean_price(price)
			self.price_usd = self.price_to_usd()

Here, we import dataclass, field, and InitVar from the dataclasses module. The @dataclass decorator automatically adds special methods to the Product class, such as __init__ and __repr__, based on the class attributes.

The Product class is defined with several attributes:

  • name: A string representing the product name.
  • price: An initialization-only variable (InitVar) that holds the raw price string. It is accepted as a parameter by __init__ and passed to __post_init__ for processing, but it is not stored as a field on the instance.
  • url: A string representing the product URL.
  • price_gbp: A float representing the product price in GBP. It is excluded from the generated __init__ (init=False) and is computed in __post_init__.
  • price_usd: A float representing the product price in USD. Like price_gbp, it is excluded from the generated __init__ and computed in __post_init__.

This setup provides a structured way to manage product data, including cleaning and converting prices, normalizing URLs, and more.

The next steps will involve implementing methods within the Product class to perform these operations.

Clean the Price

With our Product data class defined, the next step is implementing methods for cleaning the price data. We’ll clean the price string by removing unnecessary characters like the currency symbol (“£”) and sale price indicators (“Sale price£”, “Sale priceFrom £”).

Define a method clean_price within the Product class that takes a price string, removes any non-numeric characters, and returns the cleaned price as a float.

	def clean_price(self, price_string: str):
		price_string = price_string.strip()
		price_string = price_string.replace("Sale price£", "")
		price_string = price_string.replace("Sale priceFrom £", "")
		price_string = price_string.replace("£", "")
		if price_string == "":
			return 0.0
		return float(price_string)

  • This method first strips any leading or trailing whitespace from the price string.
  • It then removes any instances of “Sale price£” and “Sale priceFrom £” from the string.
  • After that, it removes the “£” symbol.

If the resulting string is empty, it returns 0.0, indicating that the price is missing or not available. Otherwise, it converts the cleaned string to a float and returns it.

Convert the Price

After cleaning the price, we need to convert it from GBP to USD to standardize the currency across our dataset, especially when dealing with international data.

	def price_to_usd(self):
		return self.price_gbp * 1.28

This method multiplies the cleaned GBP price by the conversion rate (1.28 in this example) to calculate the price in USD. This conversion rate can be dynamically updated based on current exchange rates.
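
Hard-coding the rate is fine for a demo. If you want it configurable, one option (a small sketch using a hypothetical PricedItem class, assuming you pass the rate in rather than fetch it live) is to keep it as a class-level ClassVar that you update in one place:

	from dataclasses import dataclass
	from typing import ClassVar

	@dataclass
	class PricedItem:
		# ClassVar keeps the rate shared by all instances and out of the generated __init__
		GBP_TO_USD: ClassVar[float] = 1.28  # hypothetical rate; refresh from your own source
		price_gbp: float = 0.0

		def price_to_usd(self) -> float:
			return self.price_gbp * self.GBP_TO_USD

	print(PricedItem(price_gbp=1.99).price_to_usd())  # 2.5472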

Normalize the URL

Another common edge case with scraped data is inconsistent URL formats. Some URLs might be relative paths, while others are absolute URLs. Normalizing URLs ensures they are consistently formatted, making them easier to work with.

We will define a normalize_url method within the Product class that checks if the URL starts with http:// or https://. If not, it prepends “http://example.com” to the URL.

	def normalize_url(self):
		if self.url == "":
			return "missing"
		if not self.url.startswith("http://") and not self.url.startswith("https://"):
			return "http://example.com" + self.url
		return self.url

  • This method first checks if the URL is empty. If it is, it returns “missing”.
  • Then, it checks if the URL does not start with “http://” or “https://”.
  • If this is the case, it prepends “http://example.com” to the URL to ensure it has a valid format.
  • If the URL already starts with “http://” or “https://”, it returns the URL as is.

Test the Product Data Class

Finally, test the Product data class with some sample data.

	# Sample scraped data
	scraped_products = [
		{"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
		{"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
		{"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
	]
	
	# Process and clean the data
	processed_products = [Product(name=product["name"], price=product["price"], url=product["url"]) for product in scraped_products]
	
	# Display the processed products
	for product in processed_products:
		print(f"Name: {product.name}, GBP_Price: £{product.price_gbp}, USD_Price: ${product.price_usd}, URL: {product.url}")

This code creates a list of dictionaries representing some scraped product data. It then iterates over this list, creating a Product instance for each dictionary and appending it to the processed_products list.

Finally, it iterates over the processed_products list and prints out the name, GBP price, USD price, and URL of each product.

	Name: Delicious Chocolate, GBP_Price: £1.99, USD_Price: $2.5472, URL: http://example.com/delicious-chocolate
	Name: Yummy Cookies, GBP_Price: £2.5, USD_Price: $3.2, URL: http://example.com/yummy-cookies
	Name: Tasty Candy, GBP_Price: £0.99, USD_Price: $1.2672, URL: http://example.com/apple-pies

This verifies that the Product data class correctly cleans and processes the scraped data.

Full Data Classes Code

Here’s the complete code for the Product data class example.

	from dataclasses import dataclass, field, InitVar

	@dataclass
	class Product:
		name: str = ""
		price: InitVar[str] = ""
		url: str = ""
		price_gbp: float = field(init=False)
		price_usd: float = field(init=False)
	
		def __post_init__(self, price):
			self.url = self.normalize_url()
			self.price_gbp = self.clean_price(price)
			self.price_usd = self.price_to_usd()
	
	
		def clean_price(self, price_string: str):
			price_string = price_string.strip()
			price_string = price_string.replace("Sale price£", "")
			price_string = price_string.replace("Sale priceFrom £", "")
			price_string = price_string.replace("£", "")
			if price_string == "":
				return 0.0
			return float(price_string)
		
		def normalize_url(self):
			if self.url == "":
				return "missing"
			if not self.url.startswith("http://") and not self.url.startswith("https://"):
				return "http://example.com" + self.url
			return self.url
		
		def price_to_usd(self):
			return self.price_gbp * 1.28
	
	# Sample scraped data
	sample_products = [
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
		{"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
		{"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
	]
	
	
	# Process and clean the data
	processed_products = []
	for product in sample_products:
		prod = Product(name=product["name"],
					   price=product["price"],
					   url=product["url"])
		processed_products.append(prod)
	
	# Display the processed products
	for product in processed_products:
		print(f"Name: {product.name}, GBP_Price: £{product.price_gbp},  USD_Price: ${product.price_usd},URL: {product.url}")

Process and Store Scraped Data with Data Pipelines

After cleaning our data using the structured approach provided by the Product data class, we proceed to the next crucial step: processing and storing this data efficiently.

Data pipelines play a pivotal role in this phase, enabling us to systematically process the data before saving it. The operations within our data pipeline will include:

  • Duplicate Check: Verify if an item is already present in the dataset to prevent redundancy.
  • Data Queue Management: Temporarily store processed data before saving it, managing flow and volume.
  • Periodic Data Storage: Save the data to a CSV file at regular intervals or based on specific triggers.

Setting Up the ProductDataPipeline Class

First, let’s define the structure of our ProductDataPipeline class, focusing on initialization and the foundational methods that support the operations mentioned above:

	import csv
	from dataclasses import asdict, fields
	
	class ProductDataPipeline:
		def __init__(self, csv_filename='product_data.csv', storage_queue_limit=10):
			self.names_seen = set()
			self.storage_queue = []
			self.storage_queue_limit = storage_queue_limit
			self.csv_filename = csv_filename
			self.csv_file_open = False
	
		def save_to_csv(self):
			pass
		
		def clean_raw_product(self, scraped_data):
			pass
		
		def is_duplicate(self, product_data):
			pass
		
		def add_product(self, scraped_data):
			pass
		
		def close_pipeline(self):
			pass

The __init__ method sets up initial conditions, including a set for tracking seen product names (to check duplicates), a queue for temporarily storing products, and configuration for CSV file output.

Cleaning Raw Product Data

Before a product can be added to our processing queue, it must first be sanitized and structured. The clean_raw_product method accomplishes this by converting raw scraped data into an instance of the Product data class, ensuring that our data conforms to the expected structure and types.

	def clean_raw_product(self, scraped_data):
		cleaned_data = {
			"name": scraped_data.get("name", ""),
			"price": scraped_data.get("price", ""),
			"url": scraped_data.get("url", "")
		}
		return Product(**cleaned_data)

Once the raw product data is cleaned, it must be checked for duplicates and, if unique, added to the processing queue. This is handled by the is_duplicate and add_product methods.

Adding Products and Checking for Duplicates

The is_duplicate() method checks whether a product’s name is already in the names_seen set. If it is, it prints a message and returns True, indicating that the product is a duplicate. If it’s not, it adds the product name to the names_seen set and returns False.

	def is_duplicate(self, product_data):
		if product_data.name in self.names_seen:
			print(f"Duplicate item found: {product_data.name}. Item dropped.")
			return True
		self.names_seen.add(product_data.name)
		return False

The add_product method first cleans the scraped data and creates a Product object. Then, it checks if the product is a duplicate using the is_duplicate method. If it’s not a duplicate, it adds the product to the storage_queue. Finally, it checks if the storage queue has reached its limit and, if so, calls the save_to_csv method to save the products to the CSV file.

	def add_product(self, scraped_data):
		product = self.clean_raw_product(scraped_data)
		if not self.is_duplicate(product):
			self.storage_queue.append(product)
			if len(self.storage_queue) >= self.storage_queue_limit:
				self.save_to_csv()

Periodic Saving of Data to CSV

The save_to_csv method is triggered either when the storage queue reaches its limit or when the pipeline is closing, ensuring data persistence.

	def save_to_csv(self):
		if not self.storage_queue:
			return

		self.csv_file_open = True
		with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as csvfile:
			writer = csv.DictWriter(csvfile, fieldnames=[field.name for field in fields(self.storage_queue[0])])
			if csvfile.tell() == 0:
				writer.writeheader()
			for product in self.storage_queue:
				writer.writerow(asdict(product))
		self.storage_queue.clear()
		self.csv_file_open = False

The save_to_csv method is designed to execute when the storage queue is not empty. It marks the CSV file as open (to manage concurrent access), then iterates over each product in the queue, serializing it to a dictionary (using asdict) and writing it to the CSV file.

The CSV file is opened in append mode (a), which allows data to be added without overwriting existing information. The method checks whether the file is still empty (csvfile.tell() == 0) and, if so, writes the column headers first.

After saving the queued products, it clears the queue and marks the CSV file as closed, ready for the next batch of data.

Closing the Pipeline

Ensuring no data is left unsaved, the close_pipeline method handles the final data flush to the CSV file.

	def close_pipeline(self):
		if self.storage_queue:
			self.save_to_csv()

Testing the Data Pipeline

To demonstrate the effectiveness of our ProductDataPipeline, we’ll simulate the process of adding several products, including a duplicate, to see how our pipeline manages data cleaning, duplicate detection, and CSV storage.

	data_pipeline = ProductDataPipeline(csv_filename='product_data.csv', storage_queue_limit=3)

	# Sample scraped data to add to our pipeline
	sample_products = [
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
		{"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
		{"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
	]
	
	# Add each product to the pipeline
	for product in sample_products:
		data_pipeline.add_product(product)
	
	# Ensure all remaining data in the pipeline gets saved to the CSV file
	data_pipeline.close_pipeline()
	
	print("Data processing complete. Check the product_data.csv file for output.")

This test script initializes the ProductDataPipeline with a specified CSV file name and a storage queue limit. It then adds six products, including one duplicate, to see how our pipeline handles them.

The close_pipeline method is called last to ensure all data is flushed to the CSV file, demonstrating the pipeline’s capability to manage data end-to-end.
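
If everything is wired up as above, the console output should look something like this, with the five unique products written to product_data.csv:

	Duplicate item found: Artisanal Chocolate. Item dropped.
	Data processing complete. Check the product_data.csv file for output.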

Full Data Pipeline Code

Here’s the complete code for the ProductDataPipeline class, integrating all the steps mentioned in this article (it assumes the Product data class defined earlier is available in the same script):

	import csv
	from dataclasses import asdict, fields
	
	class ProductDataPipeline:
		def __init__(self, csv_filename='product_data.csv', storage_queue_limit=10):
			self.names_seen = set()
			self.storage_queue = []
			self.storage_queue_limit = storage_queue_limit
			self.csv_filename = csv_filename
			self.csv_file_open = False
	
		def save_to_csv(self):
			if not self.storage_queue:
				return
			
			self.csv_file_open = True
			with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as csvfile:
				writer = csv.DictWriter(csvfile, fieldnames=[field.name for field in fields(self.storage_queue[0])])
				if csvfile.tell() == 0:
					writer.writeheader()
				for product in self.storage_queue:
					writer.writerow(asdict(product))
			self.storage_queue.clear()
			self.csv_file_open = False
	
		def clean_raw_product(self, scraped_data):
			cleaned_data = {
				"name": scraped_data.get("name", ""),
				"price": scraped_data.get("price", ""),
				"url": scraped_data.get("url", "")
			}
			return Product(**cleaned_data)
	
		def is_duplicate(self, product_data):
			if product_data.name in self.names_seen:
				print(f"Duplicate item found: {product_data.name}. Item dropped.")
				return True
			self.names_seen.add(product_data.name)
			return False
	
		def add_product(self, scraped_data):
			product = self.clean_raw_product(scraped_data)
			if not self.is_duplicate(product):
				self.storage_queue.append(product)
				if len(self.storage_queue) >= self.storage_queue_limit:
					self.save_to_csv()
	
		def close_pipeline(self):
			if self.storage_queue:
				self.save_to_csv()
	
	
	data_pipeline = ProductDataPipeline(csv_filename='product_data.csv', storage_queue_limit=3)
	
	# Sample scraped data to add to our pipeline
	sample_products = [
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{'name': 'Gourmet Cookies', 'price': '£5.50', 'url': 'http://example.com/gourmet-cookies'},
		{'name': 'Artisanal Chocolate', 'price': 'Sale price£2.99', 'url': '/artisanal-chocolate'},
		{"name": "Delicious Chocolate", "price": "Sale priceFrom £1.99", "url": "/delicious-chocolate"},
		{"name": "Yummy Cookies", "price": "£2.50", "url": "http://example.com/yummy-cookies"},
		{"name": "Tasty Candy", "price": "Sale price£0.99", "url": "/apple-pies"}
	]
	
	# Add each product to the pipeline
	for product in sample_products:
		data_pipeline.add_product(product)
	
	# Ensure all remaining data in the pipeline gets saved to the CSV file
	data_pipeline.close_pipeline()
	
	print("Data processing complete. Check the product_data.csv file for output.")

Keep Learning

In this article, we provided a comprehensive guide to removing unwanted HTML tags with BeautifulSoup and to cleaning and structuring scraped data using data classes and data pipelines. By following these techniques, you can ensure your scraped data is accurate, consistent, and ready for analysis.

If you are looking to scrape larger amounts of data without getting blocked, consider using ScraperAPI. It provides a simple API that allows you to fetch fully rendered HTML responses, including those from dynamic websites.

Until next time, happy scraping!

About the author


John Fawole

John Fáwọlé is a technical writer and developer. He currently works as a freelance content marketer and consultant for tech startups.
