The All The News (ATN) Corpus 2.0 contains 2,688,878 news articles from 27 news organizations, collected by Andrew Thompson. According to the creator: "Publications were scraped with Python according to the publications' sitemaps, with a few exceptions (like Vox) involving RSS feeds. The last day of scraping was on April 2, 2020."
Usage
data("corpus_atn2")
Details
News organizations include (with number of articles):
Axios 47815
Business Insider 57953
Buzzfeed News 32819
CNBC 238096
CNN 127602
Economist 26227
Fox News 20144
Gizmodo 27228
Hyperallergic 13551
Mashable 94107
New Republic 11809
New Yorker 4701
People 136488
Politico 46377
Refinery 29 111433
Reuters 840094
TMZ 49595
TechCrunch 52095
The Hill 208411
The New York Times 252259
The Verge 52424
Vice 101137
Vice News 15539
Vox 47272
Washington Post 40882
Wired 20243
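A per-organization tally like the one above could be reproduced from the publication column. The following is a minimal sketch, assuming the corpus loads as a data frame; the `toy` object is a hypothetical stand-in with made-up rows so the example runs without the full corpus.

```r
# Hypothetical stand-in for corpus_atn2; the values are invented for illustration.
toy <- data.frame(
  doc_id = 1:5,
  publication = c("Reuters", "Reuters", "CNN", "Vox", "Reuters"),
  stringsAsFactors = FALSE
)

# Count articles per news organization, largest first.
pub_counts <- sort(table(toy$publication), decreasing = TRUE)
print(pub_counts)
```

With the real data, replacing `toy` with `corpus_atn2` should give the counts listed above, provided the object is a data frame with a `publication` column.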
Variables
doc_id. Unique ID for each article
date. Date of publication, e.g., "2016-12-09 18:31:00"
title. Title of the article
author. Author of the article (if provided)
article. Full text of the article
publication. News organization publishing the article
section. Section of the news site (if applicable), e.g., "World News," "Financials"
url. URL of article (many do not have URLs)
digital.
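The variables above lend themselves to some routine preparation before analysis. The sketch below assumes that `date` is stored as a character string in the format shown in the example and that a missing `url` appears as either `NA` or an empty string; both are assumptions, and the `toy` data frame is a hypothetical stand-in with invented values.

```r
# Hypothetical stand-in with a few of the documented columns.
toy <- data.frame(
  doc_id = c("a1", "a2", "a3"),
  date = c("2016-12-09 18:31:00", "2019-03-01 08:00:00", "2016-01-15 12:45:00"),
  publication = c("Vox", "Reuters", "CNN"),
  url = c("https://example.com/a1", NA, ""),
  stringsAsFactors = FALSE
)

# Parse the timestamp string; the time zone is an assumption, not documented.
toy$date_parsed <- as.POSIXct(toy$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the publication year and keep only articles from 2016.
toy$year <- as.integer(format(toy$date_parsed, "%Y"))
articles_2016 <- toy[toy$year == 2016, ]

# Flag rows that actually carry a URL (neither NA nor an empty string).
has_url <- !is.na(toy$url) & nzchar(toy$url)
```

On the full corpus it may be worth checking how missing URLs and dates are actually encoded before relying on these filters.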