Corpora (included)

Corpora installed with the package

corpus_senti_bench4k
Subset of 6 Corpora for the SentiStrength Benchmark
corpus_annual_review
Abstracts from the Annual Review of Sociology, 2020
corpus_atn_immigr
Balanced Sample of Immigration-Related Articles from the All the News Corpus
corpus_beyonce
Lyrics of Beyonce's Songs
corpus_cmu_blogs100
Sample of 100 Blog Posts from the CMU 2008 Political Blog Corpus
corpus_envsociology
Environmental Sociology Article Abstracts, 1990-2014
corpus_europarl_subset
Sample from European Parliament Proceedings Parallel Corpus
corpus_finefoods10k
Subset of Amazon Fine Food Reviews Corpus, 2011-2012
corpus_isot_fake_news2k
Sample of 2,000 Articles from the ISOT Fake News Dataset
corpus_ittpr
Immigration Think Tank Press Release (ITTPR) Corpus, 1998-2020
corpus_presidential
U.S. Presidential Speeches, 1952-1996
corpus_reddit_aita10k
Subset of Community Ethical Judgements on Real-Life Anecdotes Corpus
corpus_taylor_swift
Lyrics of Taylor Swift's Songs
corpus_tng_season5
Lines from Star Trek: The Next Generation, Season 5
corpus_usnss
National Security Strategy of the United States, 1987-2017

Corpora (downloaded)

Corpora which must be downloaded first

corpus_senti_bench
6 Corpora for the SentiStrength Benchmark
corpus_disaster
Figure Eight Disaster Tweets
corpus_enron
Internal Emails from Enron Email Corpus
corpus_nytimes_covid
New York Times Articles about COVID-19, 2020
corpus_web_dubois
Lines from three books by W.E.B. Du Bois
corpus_isot_fake_news
ISOT Fake News Dataset
corpus_dsj_vox
DSJ VOX Articles Corpus, 2014-2017
corpus_pitchfork
Pitchfork Reviews, 1999-2019
corpus_atn
All The News (ATN) Corpus 1.0, 2015-2017
corpus_atn2
All The News (ATN) Corpus 2.0, 2016-2020
corpus_finefoods
Amazon Fine Food Reviews Corpus, 2011-2012
corpus_reddit_aita
Community Ethical Judgements on Real-Life Anecdotes Corpus
corpus_black_mirror
Lines from Black Mirror
corpus_scifi_pulp
20th Century Science Fiction Pulp Magazines
corpus_moral_stories
Moral Stories

Functions

Helper functions

download_corpus()
Download specified corpus

Tweet IDs

Tweet IDs which can be “rehydrated”

tweetids_covid
Tweet IDs for 1,922 tweets using #Covid19 collected in 2021
tweetids_covid_geo
Tweet IDs for 1,999 geo-tagged tweets using #Covid19 collected in 2021
tweetids_gme
Tweet IDs for 15,594 tweets using $GME (the GameStop ticker)
tweetids_stayhome
Tweet IDs for 23,737 tweets using #StayHome collected in 2021