The All The News (ATN) Corpus 1.0 contains 204,135 news articles from 15 news organizations, collected in 2017 by Andrew Thompson. According to the creator: "For each publication, I used archive.org to grab the past year-and-a-half of either home-page headlines or RSS feeds and ran those links through the scraper. That is, the articles are not the product of scraping an entire site, but rather their more prominently placed articles."

data("corpus_atn")

Format

A data frame with 204,135 rows and 13 variables.

Source

https://www.kaggle.com/datasets/snapcrack/all-the-news

Details

News organizations include:

  • New York Times

  • CNN

  • Breitbart

  • Business Insider

  • The Atlantic

  • Fox News

  • Talking Points Memo

  • New York Post

  • Buzzfeed News

  • National Review

  • The Guardian

  • NPR

  • Reuters

  • Vox

  • The Washington Post

Variables

  • doc_id. Unique ID for each article

  • title. Title of the article

  • author. Author of the article (if provided)

  • date. Date of publication

  • content. Full text of the article

  • publication. News organization publishing the article

  • category.

  • section.

  • url. URL of article (many do not have URLs)

  • digital.

  • year. Year of publication

  • month. Month of publication
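
The variables above lend themselves to simple tabulations by outlet and date. The following is a minimal sketch in base R. It uses a small invented stand-in data frame (the hypothetical `toy_atn`) so that it runs without the corpus; the column types shown are assumptions and the values are not drawn from the real data.

```r
# Toy stand-in with a few of the documented columns.
# Values and column types are invented for illustration only.
toy_atn <- data.frame(
  doc_id = c("1", "2", "3", "4"),
  publication = c("NPR", "Vox", "NPR", "Reuters"),
  year = c(2016, 2017, 2017, 2016),
  month = c(11, 1, 2, 12),
  stringsAsFactors = FALSE
)

# Number of articles per news organization
pub_counts <- table(toy_atn$publication)

# Number of articles per news organization and year
pub_year <- table(toy_atn$publication, toy_atn$year)

# Keep only the articles from a single outlet
npr <- toy_atn[toy_atn$publication == "NPR", ]
```

With the real data loaded through data("corpus_atn"), the same calls should apply with corpus_atn in place of toy_atn, provided the columns are stored as described above.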