
How to Scrape Reddit Web Data with Python [Detailed Guide]


Want to scrape data from Reddit? You’ve come to the right place.

This web scraping guide will walk you through the process of collecting Reddit posts and comments using Python and ScraperAPI as your data scraping tool. We’ll also show you how to export the scraped data into JSON for easy usage.

Let’s start with our tutorial on how to scrape Reddit with Python!

TL;DR: Full Reddit Data Scraper

For those in a hurry, here is the complete code for building a Reddit scraper in Python:

import json
from datetime import datetime
import requests
from bs4 import BeautifulSoup
  
scraper_api_key = 'YOUR API KEY'
  
def fetch_comments_from_post(post_data):
    # Build the full post URL from the post's permalink attribute
    post_url = f"https://www.reddit.com{post_data['permalink']}"
    payload = {'api_key': scraper_api_key, 'url': post_url}
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')

    # Find all comment elements
    comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})

    # Initialize a list to store parsed comments
    parsed_comments = []
    for comment_element in comment_elements:
        try:
            # Extract relevant information from the comment element, handling potential NoneType errors
            author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
            dislikes = comment_element.find('span', class_='score dislikes').text.strip() if comment_element.find('span', class_='score dislikes') else None
            unvoted = comment_element.find('span', class_='score unvoted').text.strip() if comment_element.find('span', class_='score unvoted') else None
            likes = comment_element.find('span', class_='score likes').text.strip() if comment_element.find('span', class_='score likes') else None
            timestamp = comment_element.find('time')['datetime'] if comment_element.find('time') else None
            text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None

            # Skip comments with missing text
            if not text:
                continue  # Skip to the next comment in the loop

            # Append the parsed comment to the list
            parsed_comments.append({
                'author': author,
                'dislikes': dislikes,
                'unvoted': unvoted,
                'likes': likes,
                'timestamp': timestamp,
                'text': text
            })
        except Exception as e:
            print(f"Error parsing comment: {e}")
    return parsed_comments
  
reddit_query = f"https://www.reddit.com/t/valheim/"
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')
# Initialize a list to store parsed posts
parsed_posts = []
for article in articles:
    post = article.find('shreddit-post')

    # Extract post details
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']

    author_id = post.get('author-id', 'N/A')
    author_name = post['author']

    # Extract subreddit details
    subreddit_id = post['subreddit-id']
    post_id = post["id"]
    subreddit_name = post['subreddit-prefixed-name']

    comments = fetch_comments_from_post(post)

    # Append the parsed post to the list
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name,
        'comments': comments
    })
# Save the parsed posts to a JSON file
output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
   json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)
   
print(f"Data has been saved to {output_file_path}")

After running the code, the scraped data will be saved to the parsed_posts.json file.

The image below shows what your parsed_posts.json file should look like.

Curious to know how it all happened? Keep reading to get the step-by-step guide on how to build a website scraper for Reddit with Python!

Web Scraping Reddit Project Requirements

For this Reddit web scraping tutorial, you will need Python 3.8 or newer. Ensure you have a supported version before continuing.

Also, you will need to install Requests and BeautifulSoup4.

  • The Requests library lets you download Reddit’s post search results page.
  • BeautifulSoup4 makes it simple to extract information from the downloaded HTML.

Install both of these libraries using pip:

pip install requests beautifulsoup4

Now create a new directory and a Python file to store all of the code for this tutorial:

mkdir reddit_scraper
cd reddit_scraper
echo. > app.py

You are now ready to go on to the next steps!

Deciding What to Scrape From Reddit

Before diving headfirst into code, define your target data carefully. Ask yourself what specific insights you want to collect. Are you after:

  • Sentiment analysis on trending topics?
  • User behavior in a niche subreddit?
  • Post titles, comments, or both?

Pinpointing your desired data points will guide your scraping strategy and prevent an overload of irrelevant information.

Consider specific keywords, subreddits, or even post IDs for a focused harvest.

For this article, we’ll focus on this specific subreddit: https://www.reddit.com/t/valheim/ and extract its posts and comments.
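
If it helps to make these decisions explicit, you can jot them down as a small configuration block at the top of your script. The names below are purely illustrative (they aren’t used by the scraper we build later):

# Hypothetical planning config - adjust to your own project
TARGETS = {
    'topic_url': 'https://www.reddit.com/t/valheim/',  # page we'll scrape in this guide
    'collect_posts': True,
    'collect_comments': True,
    'keywords': [],  # optional keyword filter; empty means keep everything
}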

Using ScraperAPI for Reddit Scraping

Reddit is known for blocking scrapers from its website, making collecting data at any meaningful scale challenging. For that reason, we’ll be sending our get() requests through ScraperAPI, effectively bypassing Reddit’s anti-scraping mechanisms without complicated workarounds.
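
In practice, this just means pointing your requests at ScraperAPI’s endpoint and passing the target URL as a parameter. Here’s a minimal sketch of the pattern (the render parameter is optional and only needed for pages that require JavaScript rendering):

import requests

payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://www.reddit.com/t/valheim/',
    # 'render': 'true',  # uncomment only if the page needs JavaScript rendering
}
response = requests.get('https://api.scraperapi.com/', params=payload)
print(response.status_code)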

To get started quickly, create a free ScraperAPI account to access your API key.

You’ll receive 5,000 free API credits for a seven-day trial – which will start whenever you’re ready.

Now that you have your ScraperAPI account, let’s get down to business!

Step 1: Importing Your Libraries

For any of this to work, you need to import the necessary Python libraries: json, datetime, requests, and BeautifulSoup.

import json
from datetime import datetime
import requests
from bs4 import BeautifulSoup

Next, create a variable to store your API key:

scraper_api_key = 'ENTER KEY HERE'

Step 2: Fetching Reddit Posts

Let’s start by adding our initial URL to reddit_query and then constructing our get() requests using ScraperAPI’s standard endpoint.

reddit_query = f"https://www.reddit.com/t/valheim/"
scraper_api_url = f'http://api.scraperapi.com/?api_key={scraper_api_key}&url={reddit_query}'
r = requests.get(scraper_api_url)

Next, let’s parse Reddit’s HTML (using BeautifulSoup), creating a soup object we can now use to select specific elements from the page:

soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('article', class_='m-0')

Specifically, we’re searching for all article elements with a CSS class of m-0 using the find_all() method. The resulting list is assigned to the articles variable; each element contains one post.
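
Because Reddit occasionally serves pages with a slightly different layout, it’s worth checking that the selector actually matched something before moving on. A small, optional sanity check:

# Optional sanity check: confirm the selector matched something
print(f"Found {len(articles)} posts on the page")
if not articles:
    print("No articles found - the page layout may have changed or the request was blocked")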

Step 3: Extracting Reddit Post Information

Now that you have scraped the article elements containing each subreddit post on the Reddit page, it’s time to extract each post along with its information.

# Initialize a list to store parsed posts
parsed_posts = []
 
 
for article in articles:
    post = article.find('shreddit-post')
    
    # Extract post details
    post_title = post['post-title']
    post_permalink = post['permalink']
    content_href = post['content-href']
    comment_count = post['comment-count']
    score = post['score']
    
    author_id = post.get('author-id', 'N/A')
    author_name = post['author']
    
    # Extract subreddit details
    subreddit_id = post['subreddit-id']
    post_id = post["id"]
    subreddit_name = post['subreddit-prefixed-name']
 
 
    # Append the parsed post to the list
    parsed_posts.append({
        'post_title': post_title,
        'post_permalink': post_permalink,
        'content_href': content_href,
        'comment_count': comment_count,
        'score': score,
        'author_id': author_id,
        'author_name': author_name,
        'subreddit_id': subreddit_id,
        'post_id': post_id,
        'subreddit_name': subreddit_name
    }) 

In the code above, you identified individual posts by looking for the shreddit-post element inside each article. For each of these posts, you then extracted details like:

  • Title
  • Permalink
  • Content link
  • Comment count
  • Score
  • Author ID
  • Author name

Each of these values is read directly from the attributes of the shreddit-post element.

Moreover, when dealing with subreddit post details, the code also reads attributes such as subreddit-id, id, and subreddit-prefixed-name to capture relevant information.

In essence, the code programmatically navigates the HTML structure of Reddit posts, gathering essential details for each post and storing them in a list for further use or analysis.
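
One caveat worth noting: indexing an attribute directly (for example, post['post-title']) raises a KeyError if Reddit omits that attribute on a given post. If you want a more defensive variant, you can use .get() everywhere, the same way the code above already does for author-id. A short sketch of that approach:

for article in articles:
    post = article.find('shreddit-post')
    if post is None:
        continue  # skip articles that don't wrap a shreddit-post element

    # .get() returns a default instead of raising KeyError for missing attributes
    post_title = post.get('post-title', 'N/A')
    comment_count = post.get('comment-count', '0')
    score = post.get('score', '0')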

Step 4: Fetching Reddit Post Comments

To extract comments from each post, you also need to send the request through ScraperAPI, or you risk getting blocked by Reddit’s anti-scraping mechanisms.

First, let’s create a fetch_comments_from_post() function that sends a request to ScraperAPI using your API key (don’t forget to replace YOUR_SCRAPER_API_KEY with your actual key) and the post URL, which we build from the post’s permalink attribute.

def fetch_comments_from_post(post_data):
    # Build the full post URL from the post's permalink attribute
    post_url = f"https://www.reddit.com{post_data['permalink']}"
    payload = {'api_key': 'YOUR_SCRAPER_API_KEY', 'url': post_url}
    r = requests.get('https://api.scraperapi.com/', params=payload)
    soup = BeautifulSoup(r.content, 'html.parser')

The response is then passed to BeautifulSoup for parsing.

Then, we can identify all comments on the post by looking for div elements with the thing class and a data-type attribute set to comment.

# Find all comment elements
comment_elements = soup.find_all('div', class_='thing', attrs={'data-type': 'comment'})

Now that you have found the comments, it’s time to extract them.

# Initialize a list to store parsed comments
parsed_comments = []
 
for comment_element in comment_elements:
    try:
  
        # Extract relevant information from the comment element, handling potential NoneType errors
        author = comment_element.find('a', class_='author').text.strip() if comment_element.find('a', class_='author') else None
        dislikes = comment_element.find('span', class_='score dislikes').text.strip() if comment_element.find('span', class_='score dislikes') else None
        unvoted = comment_element.find('span', class_='score unvoted').text.strip() if comment_element.find('span', class_='score unvoted') else None
        likes = comment_element.find('span', class_='score likes').text.strip() if comment_element.find('span', class_='score likes') else None
        timestamp = comment_element.find('time')['datetime'] if comment_element.find('time') else None
        text = comment_element.find('div', class_='md').find('p').text.strip() if comment_element.find('div', class_='md') else None
  
        # Skip comments with missing text
        if not text:
            continue  # Skip to the next comment in the loop
  
        # Append the parsed comment to the list
        parsed_comments.append({
            'author': author,
            'dislikes': dislikes,
            'unvoted': unvoted,
            'likes': likes,
            'timestamp': timestamp,
            'text': text
        })
    except Exception as e:
        print(f"Error parsing comment: {e}")
 
return parsed_comments

In the code above, we created the parsed_comments list to store the extracted information. The code then iterates through each comment_element in the collection of comment elements.

Inside the loop, we used find() to extract relevant details from each comment, such as the:

  • Author’s name
  • Dislikes
  • Unvoted count
  • Likes
  • Timestamps
  • Text content of the comment

To handle cases where some information might be missing (which would otherwise raise NoneType errors), each find() call is wrapped in a conditional expression. If a comment has no text content, it is skipped, and the loop moves on to the next comment. This keeps the scraper from crashing on malformed comments.

The extracted data is then organized into a dictionary for each comment and appended to the parsed_comments list.

Any errors that occur during the parsing process are caught and printed.

Finally, the list containing the parsed comments is returned, providing a structured representation of essential comment information for further use or analysis.

After the comments for each subreddit post are extracted, you can save them alongside the post data by adding the line below inside the post loop from Step 3:

comments = fetch_comments_from_post(post)
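
This call goes inside the post loop, just before the parsed_posts.append() call. Then include the result under a comments key in the dictionary you append, exactly as in the full script at the top of this guide:

    parsed_posts.append({
        'post_title': post_title,
        # ...the other post fields from Step 3...
        'comments': comments
    })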

Step 5: Converting to JSON

Now that you have extracted all the data you need, it’s time to save it to a JSON file for easy use.

# Save the parsed posts to a JSON file
output_file_path = 'parsed_posts.json'
with open(output_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(parsed_posts, json_file, ensure_ascii=False, indent=2)
  
print(f"Data has been saved to {output_file_path}")

The output_file_path variable is set to the file name parsed_posts.json. The code then opens this file in write mode ('w') and uses the json.dump() function to write the content of the parsed_posts list to the file in JSON format.

The ensure_ascii=False argument ensures that non-ASCII characters are handled properly, and indent=2 adds indentation for better readability in the JSON file.

After writing the data, a message is printed to the console indicating that the data has been successfully saved to the specified JSON file (parsed_posts.json).
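
If you want to quickly confirm the export worked, you can load the file back with json.load() (a small optional check, assuming the script above has already run):

import json

with open('parsed_posts.json', 'r', encoding='utf-8') as f:
    posts = json.load(f)

print(f"Loaded {len(posts)} posts")
if posts:
    print(posts[0]['post_title'])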

Testing the Scraper

Testing the scraper is easy. When you run your Python script, all the posts and comments in the subreddit you provided will be scraped and saved to a JSON file named parsed_posts.json.
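
Assuming the code lives in the app.py file we created earlier, run it from your terminal:

python app.py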

Congratulations, you have successfully scraped data from Reddit, and you are free to use the data any way you want!

How to Turn Reddit Pages into LLM-Ready Data

Now that you’ve seen how to collect Reddit posts and comments using BeautifulSoup and ScraperAPI, here’s one more powerful technique that’s especially helpful if you’re working with large language models like Gemini.

Instead of manually parsing HTML or handling edge cases in the DOM, you can use ScraperAPI’s output_format=markdown to return Reddit pages as clean, readable text. This format is ideal for LLMs, making it easier to analyse comment sentiment, summarise discussions, or extract key ideas from a thread without writing complex parsing logic.

Step 1: Obtain and Secure Your API Keys

To get started, ensure you have a ScraperAPI key and a Google Gemini API key. If you already have them, you can skip to the next step. You can get your ScraperAPI key from your dashboard or sign up here for a free one with 5,000 API credits. 

To get a Gemini API key:

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click on Create API Key
  4. Follow the prompts to generate your key

Step 2: Scrape Reddit as Markdown

Next, use ScraperAPI to fetch the Reddit thread in Markdown format with output_format=markdown in the request payload:

import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"
url = "https://www.reddit.com/r/valheim/comments/15o9jfh/shifte_chest_reason_for_removal_from_valheim/"

payload = {
    "api_key": API_KEY,
    "url": url,
    "output_format": "markdown"
}

response = requests.get("http://api.scraperapi.com", params=payload)
markdown_data = response.text
print(markdown_data)

This will return the Reddit page data in markdown:

62K Members Online
[  Compliment System ](/r/PUBGConsole/comments/159c6qn/compliment%5Fsystem/)
10 upvotes· 17 comments
---
* [  Valheim reddit these days ](/r/valheim/comments/1gep5ey/valheim%5Freddit%5Fthese%5Fdays/)
[ ![r/valheim icon](https://styles.redditmedia.com/t5_jb05l/styles/communityIcon_s3236o15ufk41.png?width=48&height=48&frame=1&auto=webp&crop=48:48,smart&s=76b31c7a1180e13f2464fa0f6f7c9da604646ee9) r/valheim ](/r/valheim) • 8 mo. ago
![A banner for the subreddit](https://styles.redditmedia.com/t5_jb05l/styles/bannerBackgroundImage_yuxsts7tw0e61.png)
![r/valheim icon](https://styles.redditmedia.com/t5_jb05l/styles/communityIcon_s3236o15ufk41.png?width=96&height=96&frame=1&auto=webp&crop=96:96,smart&s=ea689e814ccc285f779ea9dc551a2f0d9c76ddc1) [r/valheim](/r/valheim/)
 Valheim is a brutal exploration and survival game for solo play or 2-10 (Co-op PvE) players, set in a procedurally-generated purgatory inspired by viking culture. It's available in Steam Early Access, developed by Iron Gate and published by Coffee Stain. \* https://discord.com/invite/valheim \* https://steamcommunity.com/app/892970/discussions/ \* https://www.facebook.com/valheimgame \* https://twitter.com/valheimgame
---
535K Members Online
[  Valheim reddit these days ](/r/valheim/comments/1gep5ey/valheim%5Freddit%5Fthese%5Fdays/)
[ ![r/valheim - Valheim reddit these days](https://external-preview.redd.it/ZnBucGExZnJnbnhkMW_hCrglJ_gMXFxGaHL55yq6gBZIqzazQmPH6JeYsKRE.png?width=140&height=140&crop=140:140,smart&format=jpg&v=enabled&lthumb=true&s=f472a3f1126d909f9f082f642a76da97d10aa01f) ](/r/valheim/comments/1gep5ey/valheim%5Freddit%5Fthese%5Fdays/)     
0:15
2.5K upvotes· 29 comments
---
* [  I know there's nothing quite like Valheim, but how is V Rising? ](/r/valheim/comments/1lapv1i/i%5Fknow%5Ftheres%5Fnothing%5Fquite%5Flike%5Fvalheim%5Fbut%5Fhow/)
[ ![r/valheim icon](https://styles.redditmedia.com/t5_jb05l/styles/communityIcon_s3236o15ufk41.png?width=48&height=48&frame=1&auto=webp&crop=48:48,smart&s=76b31c7a1180e13f2464fa0f6f7c9da604646ee9) r/valheim ](/r/valheim) • 5 days ago
![A banner for the subreddit](https://styles.redditmedia.com/t5_jb05l/styles/bannerBackgroundImage_yuxsts7tw0e61.png)
![r/valheim icon](https://styles.redditmedia.com/t5_jb05l/styles/communityIcon_s3236o15ufk41.png?width=96&height=96&frame=1&auto=webp&crop=96:96,smart&s=ea689e814ccc285f779ea9dc551a2f0d9c76ddc1) [r/valheim](/r/valheim/)
 Valheim is a brutal exploration and survival game for solo play or 2-10 (Co-op PvE) players, set in a procedurally-generated purgatory inspired by viking culture. It's available in Steam Early Access, developed by Iron Gate and published by Coffee Stain. \* https://discord.com/invite/valheim \* https://steamcommunity.com/app/892970/discussions/ \* https://www.facebook.com/valheimgame \* https://twitter.com/valheimgame
---

The response will return the whole Reddit thread in Markdown—including the post title, community banner, upvotes, comment counts, and comment text—all in a clean, readable structure that LLMs like Gemini can easily analyse or summarise.

Step 3: Summarize the Reddit Page with Gemini

Now that you have the Markdown, you can feed it to Google Gemini and extract a clean summary.

Start by installing the Gemini SDK if you haven’t already:

pip install google-generativeai

Then you can send the markdown straight to Gemini with a custom prompt to summarise the discussion or analyse the tone of the conversation.

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel(model_name="gemini-2.0-flash")

prompt = f"""
You are an online community analyst. Based on the Reddit thread below, provide:
- A short summary of the discussion
- The top 3 user opinions or concerns
- Overall sentiment (positive, negative, or mixed)

Here's the thread:
{markdown_data}
"""

response = model.generate_content(prompt)
print(response.text)

Gemini will give you a summary similar to this:

Okay, here's an analysis of the provided Reddit thread:
**Short Summary of the Discussion:**
The Reddit thread on r/valheim discusses the removal of the "Shift+E" chest interaction feature. The original poster shared a screenshot from the Valheim Twitter account explaining the removal was due to data showing players interacted "too little" with the inventory system, which the developers saw as a problem. The thread then devolves into discussion and disagreement about the developers' reasoning and game design philosophy.
**Top 3 User Opinions/Concerns:**
1.  **Disagreement with the Developer's Reasoning:**  Many users disagree with the developers' conclusion that players avoiding the inventory system means players *should* be forced to use it. The sentiment is that if players are avoiding the feature, it may indicate a flaw in the inventory system itself.
2.  **Concern About Tedious Gameplay:**  A key concern is that the developers are adding arbitrary "time gates" and making the game more tedious by forcing unnecessary inventory management. Users worry this is done for the wrong reasons (perhaps to artificially inflate playtime) when there are no microtransactions or in-game shop incentives.
3.  **Appreciation for Quality of Life Improvements:** Users point out that features like "Shift+E" (or similar mods) greatly improve the quality of life in the game, and that inventory management is frequently mentioned as one of the most tedious aspects of the game.
**Overall Sentiment:**
The overall sentiment of the thread is **negative**...
[TRUNCATED]

Gemini provides an easy way to understand the key themes and sentiment within a Reddit thread, without needing to scroll through dozens of comments or parse complex HTML. You can extract the most discussed points, user opinions, and overall community sentiments with a single API request.

One of the main advantages of this approach is its flexibility. You can easily adjust your prompt depending on what you need; for example, identifying controversial topics, spotting patterns in feedback, or comparing perspectives across subreddits.
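
For example, a prompt aimed at surfacing points of disagreement rather than a general summary might look like this (a hypothetical variation that reuses the markdown_data and model objects from the previous step):

prompt = f"""
You are an online community analyst. From the Reddit thread below, list:
- The most controversial or divisive points in the discussion
- Any recurring feature requests or complaints
- How opinions differ between casual and long-time players

Here's the thread:
{markdown_data}
"""

response = model.generate_content(prompt)
print(response.text)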

If your goal is to understand online communities, test language models, or build insight-driven tools from Reddit conversations, this method is a simple way to go from raw Reddit data to actual insights, without the usual scraping challenges.

Successfully Scrape Reddit Data with ScraperAPI

This tutorial presented a step-by-step approach to scraping data from Reddit, showing you how to:

  • Collect the 25 most recent posts within a subreddit
  • Loop through all posts to collect comments
  • Send your requests through ScraperAPI to avoid getting banned
  • Export all extracted data into a structured JSON file

As we conclude this insightful journey into the realm of Reddit scraping, we hope you’ve gained valuable insights into the power of Python and ScraperAPI in unlocking the treasure trove of information Reddit holds.

Stay curious, and remember, with great scraping capabilities come great responsibilities.

Until next time, happy scraping!

FAQs About Scraping Reddit

Why Should I Scrape Reddit?

Scraping Reddit is valuable for diverse purposes, such as market research, competitor analysis, content curation, and SEO optimization. It provides real-time insights into user preferences, allows businesses to stay competitive, and aids in identifying trending topics and keywords.

What Can You Do with Reddit’s Data?

The possibilities with Reddit data are endless! Here are just a few examples:

  • Sentiment analysis: Analyze public opinion on current events, products, or brands by examining comments and post reactions.
  • Social network analysis: Understand how communities form and interact within specific subreddits, identifying influencers and key topics.
  • Behavioral analysis: Analyze user behavior patterns like voting trends, content engagement, and topic preferences.
  • Trend analysis: Identify emerging trends and topics by analyzing post frequency and engagement metrics.
  • Data for machine learning models: Scraped Reddit data can fuel powerful machine learning models for tasks like sentiment analysis, identifying trends and communities, and even chatbot training.


By using Reddit data thoughtfully and ethically, you can unlock valuable insights and drive positive outcomes in various fields. Just remember to choose your target data wisely and consider the potential impact of your scraping activities.

Does Reddit Block Web Scraping?

Yes, Reddit blocks web scrapers directly, using techniques like IP blocking and fingerprinting. To keep your scrapers from breaking, we advise using a tool like ScraperAPI to bypass anti-bot mechanisms and automatically handle most of the complexities involved in web scraping.

Can I Scrape Private or Restricted Subreddits?

No, scraping data from private or restricted subreddits on Reddit is explicitly prohibited and goes against Reddit’s terms of service.
Private and restricted subreddits have limited access intentionally to protect the privacy and exclusivity of their content. It is essential to respect the rules and privacy settings of each subreddit and refrain from scraping any data from private or restricted communities without explicit permission.
Remember, the best way to keep your scrapers legal is only to collect publicly available data.

About the author


Prince-Joel Ezimorah

Prince-Joel Ezimorah is a technical content writer based in Nigeria with experience in Python, Javascript, SQL, and Solidity. He’s a Google development student club member and is currently working at a gamification group as a technical content writer. Contact him on LinkedIn.