The All The News (ATN) Corpus 2.0 contains 2,688,878 news articles from 27 news organizations, collected by Andrew Thompson. According to the creator: "Publications were scraped with Python according to the publications' sitemaps, with a few exceptions (like Vox) involving RSS feeds. The last day of scraping was on April 2, 2020."

Usage

data("corpus_atn2")

Format

A data frame with 2688879 rows and 11 variables.

Source

https://components.one/datasets/all-the-news-2-news-articles-dataset/

Details

News organizations include (with the number of articles from each):

  • Axios: 47815

  • Business Insider: 57953

  • Buzzfeed News: 32819

  • CNBC: 238096

  • CNN: 127602

  • Economist: 26227

  • Fox News: 20144

  • Gizmodo: 27228

  • Hyperallergic: 13551

  • Mashable: 94107

  • New Republic: 11809

  • New Yorker: 4701

  • People: 136488

  • Politico: 46377

  • Refinery 29: 111433

  • Reuters: 840094

  • TMZ: 49595

  • TechCrunch: 52095

  • The Hill: 208411

  • The New York Times: 252259

  • The Verge: 52424

  • Vice: 101137

  • Vice News: 15539

  • Vox: 47272

  • Washington Post: 40882

  • Wired: 20243
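
Per-organization counts like those listed above can be recomputed from the data itself. The following is a minimal sketch in base R, assuming the `publication` and `date` columns are stored as described under Variables. It uses a small made-up data frame named `toy` (a hypothetical stand-in, not part of the package) so that it runs without loading the full corpus; with the real data, `corpus_atn2` would take the place of `toy`.

```r
# Hypothetical four-row stand-in for corpus_atn2 (illustrative values only).
toy <- data.frame(
  doc_id = c("a1", "a2", "a3", "a4"),
  date = c("2016-12-09 18:31:00", "2017-01-15 08:00:00",
           "2018-06-01 12:45:00", "2018-11-20 09:10:00"),
  publication = c("Reuters", "Vox", "Reuters", "CNN"),
  stringsAsFactors = FALSE
)

# Number of articles per news organization, largest first.
counts <- sort(table(toy$publication), decreasing = TRUE)
print(counts)

# Assumption: dates are character strings shaped like "2016-12-09 18:31:00".
# Parse them to date-times, then count articles per publication year.
toy$date <- as.POSIXct(toy$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
year_counts <- table(format(toy$date, "%Y"))
print(year_counts)
```

On the full corpus the same `table()` call may take a moment, since it scans roughly 2.7 million rows. If `date` is already stored as a date-time class rather than as text, the parsing step can be skipped.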

Variables

  • doc_id. Unique ID for each article

  • date. Date of publication, e.g., "2016-12-09 18:31:00"

  • title. Title of the article

  • author. Author of the article (if provided)

  • article. Full text of the article

  • publication. News organization publishing the article

  • section. Section of the news site (if applicable), e.g., "World News," "Financials"

  • url. URL of article (many do not have URLs)

  • digital.