Newspaper3k: Scrape Newspapers and Articles with Python

A tutorial on how to use the Newspaper3k library to scrape news articles.

Struggling to navigate the overwhelming flood of online news? Newspaper3k is here to help. This powerful Python library empowers developers with a robust extraction toolkit for getting information specifically from news media.

This guide will cover:

  • An introduction to Newspaper3k
  • How to use Newspaper3k
  • Scraping news articles with Newspaper3k
  • Implementing proxy rotation in Newspaper3k using ScraperAPI
  • Using Newspaper3k's NLP methods
  • Using Newspaper3k's multi-threaded article download feature

TL;DR: Using Newspaper3k

To get started, install the Newspaper3k package:

	pip install newspaper3k
Here are some code snippets that come in handy when using the package:

  • article.download() - downloads the page's HTML for parsing
  • article.parse() - parses the HTML response to extract specific data points
  • article.authors - extracts all the authors from an article page
  • article.publish_date - extracts the article's published date
  • article.text - extracts the article's text from the HTML
  • article.top_image - extracts the article's feature image
  • article.movies - extracts the URLs of any videos embedded in the article
  • article.nlp() - runs natural language processing (NLP) on the article, allowing you to get keywords and a summary
  • article.keywords - extracts relevant search terms from the article (after calling nlp())
  • article.summary - generates a concise summary of the article (after calling nlp())
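
Put together, those snippets form a complete script. Here's a minimal sketch using the CNN article that is scraped later in this guide:

	from newspaper import Article

	url = 'https://edition.cnn.com/2023/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'
	article = Article(url)
	article.download()  # fetch the page's HTML
	article.parse()     # extract title, authors, text, and other fields
	article.nlp()       # compute keywords and summary (requires parse() first)

	print(article.title)
	print(article.keywords)
	print(article.summary)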

Want to learn how to use all of this in a real project? Keep reading!

What is Newspaper3k?

Newspaper3k is a Python library for scraping web pages structured like online articles. Beyond downloading pages, it can parse the HTML to extract data such as the author, title, main text, publication date, and, when present, the images or videos associated with the article.

How to Scrape News Articles with Newspaper3k

To scrape a news article using the Newspaper3k package, start by creating a project folder and a file named index.py within it. Then, follow the steps below.

Step 1: Install the Newspaper3k Package

Install the Newspaper3k package by running the command below:


	pip install newspaper3k

Step 2: Download and Parse the Article

Import the package and grab the URL of the news page you want to scrape. First, download the article by calling the download() method on the Article object. Then, parse it by calling the parse() method, as shown below.


	from newspaper import Article

	url = 'https://edition.cnn.com/2023/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'
	article = Article(url)
	article.download()
	article.parse()

Step 3: Extract the Desired Data

The parse() method we called in the previous step extracts the data from the HTML page. This data includes:

  • title - extracts the article title
  • authors - extracts the author or list of authors of the article and returns the result as a list
  • publish_date - extracts the date and time the article was published
  • text - extracts the article's textual content
  • html - returns the full HTML of the page
  • top_image - returns the featured image of the article (if present)
  • images - returns the URLs of all the images in the article
  • movies - returns the URLs of any videos in the article (if present)

	# Extract and print the desired data
	print("**Headline:**", article.title)
	print("**Authors:**", article.authors)
	print("**Publication Date:**", article.publish_date)
	print("**Main Text:**", article.text)

Run the script with the command below.


	python index.py

The result should be similar to the image below.

[Image: scraped news data printed to the terminal by Newspaper3k]

Step 4: Specify the Article's Language

The Newspaper3k package also includes built-in language detection and language-aware extraction. You can specify the language to use for data extraction; if no language is specified, Newspaper3k auto-detects one and uses it by default.

Let's see how to specify the language when scraping data:


	from newspaper import Article

	url = 'https://www.bbc.com/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics'
	article = Article(url, language='zh')  # Chinese
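
From here, the workflow is the same as before:

	article.download()
	article.parse()
	print(article.title)  # fields are extracted in the article's own language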

At the time of writing, Newspaper3k supports the following languages:


	Your available languages are:

	input code              full name
	  ar                      Arabic
	  be                      Belarusian
	  bg                      Bulgarian
	  da                      Danish
	  de                      German
	  el                      Greek
	  en                      English
	  es                      Spanish
	  et                      Estonian
	  fa                      Persian
	  fi                      Finnish
	  fr                      French
	  he                      Hebrew
	  hi                      Hindi
	  hr                      Croatian
	  hu                      Hungarian
	  id                      Indonesian
	  it                      Italian
	  ja                      Japanese
	  ko                      Korean
	  mk                      Macedonian
	  nb                      Norwegian (Bokmål)
	  nl                      Dutch
	  no                      Norwegian
	  pl                      Polish
	  pt                      Portuguese
	  ro                      Romanian
	  ru                      Russian
	  sl                      Slovenian
	  sr                      Serbian
	  sv                      Swedish
	  sw                      Swahili
	  tr                      Turkish
	  uk                      Ukrainian
	  vi                      Vietnamese
	  zh                      Chinese
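
That table is Newspaper3k's own output; you can reproduce it at any time with:

	import newspaper

	newspaper.languages()  # prints the table of supported language codes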

Add Proxy Rotation in Newspaper3k Using ScraperAPI

Many newspapers and article websites use anti-bot technologies, making it difficult for the package to scrape them at scale. In these situations, optimizing your request headers and routing traffic through proxies is usually unavoidable.

However, a major limitation of Newspaper3k is that its download functionality has no built-in proxy support. The workaround is to fetch the raw HTML with an HTTP client like Python's Requests library and then hand that HTML to Newspaper3k for parsing.

For this example, we'll use ScraperAPI as a proxy solution: it fetches the HTML for us, and we then pass that HTML into Newspaper3k for parsing.

Note: To use the ScraperAPI proxy endpoint method, you must create a free ScraperAPI account to access your API key from your dashboard. Your free trial includes 5,000 API credits, which in most cases lets you scrape up to 5,000 pages for free.


	import requests
	from newspaper import Article

	url = 'https://edition.cnn.com/2023/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'
	article = Article(url)

	## Download HTML using ScraperAPI Proxy
	payload = {'api_key': 'API-KEY', 'url': url}  # replace API-KEY with your key
	response = requests.get('https://api.scraperapi.com', params=payload)
	
	## Insert HTML into the Newspaper3k article object and parse the article
	article.download(input_html=response.text)
	article.parse()
	
	print("Headline: ", article.title)
	print("Authors: ", article.authors)
	print("Publication Date: ", article.publish_date)

With this combination, you’ll be able to scale your scrapers to millions of pages without worrying about CAPTCHAs, rate limiting, and other potential challenges.
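
If you're scaling this to many URLs, it helps to wrap the fetch-and-parse steps in a small helper function. Here's a minimal sketch, assuming the same ScraperAPI endpoint and a placeholder API key:

	import requests
	from newspaper import Article

	API_KEY = 'API-KEY'  # placeholder: use the key from your ScraperAPI dashboard

	def fetch_article(url):
	    """Fetch a page through ScraperAPI and parse it with Newspaper3k."""
	    payload = {'api_key': API_KEY, 'url': url}
	    response = requests.get('https://api.scraperapi.com', params=payload)
	    response.raise_for_status()  # fail fast on blocked or failed requests
	    article = Article(url)
	    article.download(input_html=response.text)
	    article.parse()
	    return article

	for url in [
	    'https://edition.cnn.com/2023/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html',
	]:
	    article = fetch_article(url)
	    print(article.title, '-', article.publish_date)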

How to Use Newspaper3k's NLP Methods

Newspaper3k also offers natural language processing (NLP) features that let you analyze an article's content after parsing it. Calling the nlp() method populates the article's summary and keywords.

The nlp() method is just as computationally expensive as parse(), so it's best to call it only when you actually need the keywords or summary. Below is a demo of how to use it.


	from newspaper import Article

	url = 'https://edition.cnn.com/2023/06/10/sport/manchester-city-wins-champions-league-for-first-time-beating-inter-milan-1-0-in-tense-istanbul-final/index.html'
	article = Article(url)
	article.download()
	article.parse()
	article.nlp()
	
	
	# Extract and print the desired data
	print("**Text Summary:**", article.summary)
	print("**Keywords: **", article.keywords)

The result should be similar to the image below.

[Image: article summary and keywords printed to the terminal by Newspaper3k]

Troubleshooting Tips for Newspaper3k NLP Method

The first time you call the nlp() method, you may run into an error like the one below: an NLTK LookupError complaining that the punkt tokenizer resource is missing.

[Image: NLTK LookupError raised when calling nlp()]

This error can be resolved by adding the code below to the top of the script:


	import nltk
	nltk.download('punkt')

After doing this, rerun the script. These two lines download the punkt tokenizer models that the nlp() function needs; once the download has completed, you can delete them without affecting the script.
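
If you'd rather not delete the lines afterward, you can make the download conditional. Here's a small sketch using NLTK's resource lookup:

	import nltk

	# Download the punkt tokenizer only if it isn't installed yet
	try:
	    nltk.data.find('tokenizers/punkt')
	except LookupError:
	    nltk.download('punkt')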

Using Newspaper3k's Multi-Threading Article Download Feature

This feature lets developers download articles from multiple news sources concurrently. Hammering a single news source with many simultaneous threads or async requests would trigger rate limiting, so Newspaper3k allocates only 1-2 threads per news source.

To implement the multi-threading feature for the article download feature, use the following code:


	import newspaper
	from newspaper import news_pool
	
	ted = newspaper.build('https://ted.com')
	cnbc = newspaper.build('https://cnbc.com')
	fox_news = newspaper.build('https://foxnews.com/')
	
	papers = [ted, cnbc, fox_news]
	news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
	news_pool.join()
	
	# At this point, you can safely assume that download() has been
	# called on every single article for all three sources.
	
	print(cnbc.size())

The set() and join() calls download every article from each source, blocking until all downloads are complete. Each source then exposes its downloaded articles through its articles list, which you can iterate as shown below. Note that the pool only downloads; you still need to parse each article before reading fields like the title.


	for article in cnbc.articles:
	    article.parse()  # the pool only downloads; parse before reading fields
	    print(article.title)

Keep Learning

In this article, we have learned how to:

  • Use Newspaper3k package for scraping newspaper and article websites
  • Integrate Newspaper3k with ScraperAPI to scale your infrastructure in seconds
  • Use the nlp() method to extract relevant search terms and generate a summary for each article

The package provides a few more features, including the hot() and popular_urls() functions on the newspaper package. These return Google's trending terms and a list of popular news source URLs, respectively. You can check out its official documentation to learn more about this package.
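
For example, a quick sketch of those two functions:

	import newspaper

	print(newspaper.hot())           # currently trending terms on Google
	print(newspaper.popular_urls())  # popular news source URLs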

If you want to learn more about scraping with Python, check out some of these resources:

Until next time, happy scraping!

About the author

John Fawole

John Fáwọlé is a technical writer and developer. He currently works as a freelance content marketer and consultant for tech startups.
