Twitter is still one of the most popular social media platforms in the world, with around 396 million users consuming and generating content at a massive scale.
At its core, Twitter is about sharing opinions and creating conversations around politics, culture, jobs, industries, brands, products, and more. Because of this, Twitter is also a gold mine of data for sentiment analysis, forecasting, and more.
In essence, businesses can scrape Twitter to find important trends, understand what people think about specific products, campaigns, and subjects, and build solutions on those findings to increase their chances of success.
In today’s tutorial, we’re going to show you how to scrape thousands of tweets in seconds without any big limitations. Still, prudence is advised.
What Is the Twitter API?
As Twitter itself puts it, the Twitter API is a set of endpoints that can be used to “programmatically retrieve and analyze Twitter data, as well as build for the conversation on Twitter.”
With their API, researchers and developers can use Twitter’s data to build new applications or for further analysis. Their endpoints give you access to Tweets, users, spaces, direct messages, lists, and more.
So why not just use their API? Well, you should if it covers your needs. However, the API does come with a few limitations, like only being able to retrieve up to 3,200 tweets per user, only accessing tweets from the last seven days, and having to authenticate.
Note: Check the Twitter API limitations page for a more detailed explanation of these restrictions.
These restrictions heavily slow down our projects or make them impossible to complete.
Instead, we want full access to historical data to ensure we’re creating our models with all the data we can get and not just partial information – which would corrupt the results in many cases.
Twitter API Libraries for Web Scraping
When working with Twitter, we can use three popular solutions instead of calling the Twitter API directly:
1. Tweepy
Tweepy is a great library built on top of the Twitter API that allows for easy access to Twitter data and makes it possible to run complex queries. It also lets you take full advantage of all the Twitter API’s features, making it a popular option.
To get started, you can use the `pip install tweepy` command to install the library.
Here’s an example from Tweepy’s documentation that downloads the tweets from your home timeline:
```python
import tweepy

# Authenticate with your Twitter developer credentials
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret, access_token, access_token_secret
)
api = tweepy.API(auth)

# Fetch and print the tweets from your home timeline
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
```
However, we still run into common problems like not having access to historical data (Tweepy only lets you retrieve tweets from the last seven days) and the requirement to authenticate.
Because of these limitations, we won’t be using this solution for this tutorial. However, if you need some of the unique features the Twitter API provides, it’s worth considering.
2. Twint
Unlike Tweepy, Twint is a complete Twitter scraping tool able to scrape tweets from specific users, topics, hashtags, locations, and more, without needing to connect to the Twitter API. This prevents you from hitting any rate limits or having to create a Twitter-approved application beforehand.
Something worth mentioning is that Twint can also perform “special queries to scrape user’s followers, tweets users have liked and who they follow” without using headless browsers or other more complex solutions.
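To give you an idea of the workflow, here’s a minimal sketch based on Twint’s documented Config/run interface; the search term and limit are just placeholders:

```python
import twint

# Configure the search: no API keys or authentication needed
c = twint.Config()
c.Search = "web scraping"  # placeholder search term
c.Limit = 20               # cap the number of tweets to fetch

# Run the search; matching tweets are printed to the console
twint.run.Search(c)
```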
3. Snscrape
Similar to Twint, Snscrape doesn’t go through Twitter API, allowing you to scrape historical data without any rate limits.
However, Snscrape isn’t designed solely for Twitter. You can actually use it to scrape social platforms like Mastodon, Reddit, and more.
Of course, the scraping capabilities are different for every platform, so check their GitHub repository to ensure you can access what you’re looking for.
In the case of Twitter, we can scrape users, user profiles, hashtags, searches, tweets (single or surrounding thread), list posts, and trends.
Using Snscrape to Scrape Twitter Data in Python
The best part of Snscrape is how easy it is to use, making it the best starting point for anyone wanting to scrape data from Twitter.
Getting the Project Ready
To start the project, let’s create a new directory, open it in VS Code (or your favorite IDE), and open a terminal. From there, just install Snscrape using pip:
```bash
pip install snscrape
```
It’ll automatically download all the necessary files. For it to work, you’ll need Python 3.8 or higher installed. To verify your Python version, run `python --version` in your terminal.
Note: If you don’t have it already, also install Pandas using `pip install pandas`. We’ll use it to visualize the scraped data and export everything to a CSV file.
Next, create a new file called `tweet-scraper.py` and import the dependencies at the top:
```python
import snscrape.modules.twitter as sntwitter
import pandas as pd
```
And now we’re ready to start scraping the data!
Understanding the Structure of the Response
Just like with websites, we need to understand the structure of the Twitter data Snscrape provides, so we can pick and choose the bits of data we’re actually interested in.

Let’s say we want to know what people are saying about web scraping. To make it happen, we’ll send a query to Twitter through Snscrape’s Twitter module like this:
```python
# Search for tweets matching the query and inspect the first result
query = "web scraping"
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    print(vars(tweet))
    break
```
The `.TwitterSearchScraper()` method is basically like using Twitter’s search bar on the website. We pass it a query (in our case, web scraping) and get the items resulting from the search.
More Snscrape Methods
Here’s a list of all other methods you can use to query Twitter using the sntwitter module:
- `TwitterSearchScraper`
- `TwitterUserScraper`
- `TwitterProfileScraper`
- `TwitterHashtagScraper`
- `TwitterTweetScraperMode`
- `TwitterTweetScraper`
- `TwitterListPostsScraper`
- `TwitterTrendsScraper`
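For instance, here’s a minimal sketch using `TwitterUserScraper` to pull a user’s most recent tweets; the handle is just a placeholder:

```python
# Scrape the five most recent tweets from a specific user
# ("jack" is a placeholder handle)
for i, tweet in enumerate(sntwitter.TwitterUserScraper("jack").get_items()):
    if i == 5:
        break
    print(tweet.date, tweet.content)
```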
Going back to our first snippet, the `vars()` function returns all the attributes of an element, in this case our tweet object.
The returned JSON has all the information associated with the tweet. The three fields that matter most for us are the date of the tweet, the content of the tweet (the tweet text itself), and the user who tweeted it.
If you’ve worked with JSON data before, accessing the value of these fields is simple:
- `tweet.date`
- `tweet.content`
- `tweet.user.username`
The third one is different because we don’t want the entire value of `user`; instead, we access the `user` field and then move down to its `username` field.
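For example, building on the search loop from before, we can print just those three fields:

```python
# Print only the date, username, and text of the first matching tweet
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    print(tweet.date, tweet.user.username, tweet.content)
    break
```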
Scraping Complex Queries in Snscrape
Although there are several ways to construct your queries, the easiest is to use Twitter’s search bar to generate a query with the exact parameters we need.
First, go to Twitter and enter whatever query you like, then click on Advanced Search.
Now fill in the form with the parameters that match your needs. For this example, we’ll use the following information:
| Field | Value |
|---|---|
| These exact words | web scraping |
| Language | English |
| From date | January 01, 2022 |
| To date | September 30, 2022 |
Note: You can also set specific accounts and filters and use less restrictive word combinations.

Once that’s ready, click the search button in the top right corner.
Twitter will generate a custom query we can pass to the `TwitterSearchScraper()` method in our code, as shown below.
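Copied from the search bar, the generated query should look something like this (the exact operator format is assumed from Twitter’s search syntax):

```python
# Query generated by Twitter's advanced search (format assumed)
query = '"web scraping" lang:en until:2022-09-30 since:2022-01-01'
```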
Setting a Limit to Your Scraper
There are A LOT of tweets on Twitter; a ridiculous number are generated every day. So let’s set a limit on the number of tweets we want to scrape and break the loop once we reach it.
Setting the limit is super simple:
```python
limit = 1000
```
However, if we just print the tweets out, the script will never actually reach any limit because nothing is being counted. To make it work, we’ll store the tweets in a list.
```python
tweets = []
```
With these two elements, we can add the following logic to our for loop without issues:
```python
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    # Stop scraping once we hit the limit
    if len(tweets) == limit:
        break
    else:
        tweets.append([tweet.date, tweet.user.username, tweet.content])
```
Creating a Dataframe With Pandas
Just for testing, let’s change the limit to 10 and print the list to see what it returns.
The raw output is a little hard to read, but you can clearly see the usernames, the dates, and the tweets. Perfect!
Now, let’s give it a better structure before exporting the data. With Pandas, all we need to do is pass our list to the `.DataFrame()` method and name the columns.
```python
df = pd.DataFrame(tweets, columns=['Date', 'User', 'Tweet'])
# print(df)
```
Note: The column names should match the data we’re scraping, in the same order it’s appended.
You can print the dataframe to ensure you’re getting all the tweets specified in the limit variable, but it should work just fine.
Exporting Your Dataframe to a CSV/JSON File
How could you not love Python when it makes exporting so easy?
Let’s set the limit to 1000 and run our script!
Exporting to CSV:
```python
df.to_csv('scraped-tweets.csv', index=False, encoding='utf-8')
```
Exporting to JSON:
```python
df.to_json('scraped-tweets.json', orient='records', lines=True)
```
Note: Of course, the script will take longer than before to scrape all the data, so don’t worry if it takes a few minutes to return the tweets.
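For reference, here’s the full script assembled from the snippets above:

```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Query built with Twitter's advanced search (see above)
query = '"web scraping" lang:en until:2022-09-30 since:2022-01-01'
limit = 1000
tweets = []

# Collect the date, username, and text of each matching tweet
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.user.username, tweet.content])

# Structure the data and export it
df = pd.DataFrame(tweets, columns=['Date', 'User', 'Tweet'])
df.to_csv('scraped-tweets.csv', index=False, encoding='utf-8')
df.to_json('scraped-tweets.json', orient='records', lines=True)
```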
Congratulations! You just scraped 1,000 tweets from January to September 2022 in a few minutes. You can use this method to scrape tweets from any time range, using any filters you want, to make your research laser-focused.
Wrapping Up: Considerations Before You Scrape Twitter
Scraping Twitter has become super simple thanks to libraries like Snscrape and Twint, but that doesn’t mean you should extract all the information you can without thinking beyond the implementation.
Twitter holds a lot of sensitive data that you’ll need to respect. Things like emails and addresses can be exposed in tweets, and those need to be filtered out before using the data. This is what’s called sensitive data, and misusing it can bring serious consequences.
You’ll also want to plan the whole scope of the project before you start scraping content. If you don’t have a clear objective and a definition of the models you’ll use, then no matter how much data you extract, it won’t be useful at all.
One of the best use cases for Twitter data is sentiment analysis. With this process, you can pull insights from conversations around specific topics, brands, and products to determine how people feel about them.
This is invaluable data for things like reputation management, product launch performance, preventing PR catastrophes and much more.
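As a quick illustration, here’s a minimal sketch that scores the exported tweets with the TextBlob library; TextBlob is our assumption here (any sentiment library would do), and you’d need to install it first with `pip install textblob`:

```python
import pandas as pd
from textblob import TextBlob  # assumed dependency: pip install textblob

# Load the tweets we exported earlier
df = pd.read_csv('scraped-tweets.csv')

# Polarity ranges from -1 (negative) to 1 (positive)
df['Polarity'] = df['Tweet'].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
print(df[['Tweet', 'Polarity']].head())
```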
However, not all platforms and websites are as open as Twitter when it comes to automation. Just like Snscrape makes it easy to scrape social media, ScraperAPI allows you to scrape millions of websites without getting blocked or putting your project at risk.
Visit our documentation to learn how ScraperAPI simplifies web scraping in Python at scale in just one line of code.
Until next time, happy scraping!