Corpora (included)

Corpora installed with the package

corpus_senti_bench4k

Subset of 6 Corpora for the SentiStrength Benchmark

corpus_annual_review

Abstracts from the Annual Review of Sociology, 2020

corpus_atn_immigr

Balanced Sample of Immigration-Related Articles from the All the News Corpus

corpus_beyonce

Lyrics of Beyoncé's Songs

corpus_cmu_blogs100

Sample of 100 Blog Posts from the CMU 2008 Political Blog Corpus

corpus_envsociology

Environmental Sociology Article Abstracts, 1990-2014

corpus_europarl_subset

Sample from European Parliament Proceedings Parallel Corpus

corpus_finefoods10k

Subset of Amazon Fine Food Reviews Corpus, 2011-2012

corpus_isot_fake_news2k

Sample of 2,000 Articles from the ISOT Fake News Dataset

corpus_ittpr

Immigration Think Tank Press Release (ITTPR) Corpus, 1998-2020

corpus_presidential

U.S. Presidential Speeches, 1952-1996

corpus_reddit_aita10k

Subset of Community Ethical Judgements on Real-Life Anecdotes Corpus

corpus_taylor_swift

Lyrics of Taylor Swift's Songs

corpus_tng_season5

Lines from Star Trek: The Next Generation, Season 5

corpus_usnss

National Security Strategy of the United States, 1987-2017

Corpora (downloaded)

Corpora that must be downloaded before use (see download_corpus())

corpus_senti_bench

6 Corpora for the SentiStrength Benchmark

corpus_disaster

Figure Eight Disaster Tweets

corpus_enron

Internal Emails from Enron Email Corpus

corpus_nytimes_covid

New York Times Articles about COVID-19, 2020

corpus_web_dubois

Lines from three books by W.E.B. Du Bois

corpus_isot_fake_news

ISOT Fake News Dataset

corpus_dsj_vox

DSJ VOX Articles Corpus, 2014-2017

corpus_pitchfork

Pitchfork Reviews, 1999-2019

corpus_atn

All The News (ATN) Corpus 1.0, 2015-2017

corpus_atn2

All The News (ATN) Corpus 2.0, 2016-2020

corpus_finefoods

Amazon Fine Food Reviews Corpus, 2011-2012

corpus_reddit_aita

Community Ethical Judgements on Real-Life Anecdotes Corpus

corpus_black_mirror

Lines from Black Mirror

Functions

Helper functions

download_corpus()

Download specified corpus

Tweet IDs

Tweet IDs that can be “rehydrated” to retrieve the full tweets

tweetids_covid

Tweet IDs for 1,922 tweets using #Covid19 collected in 2021

tweetids_covid_geo

Tweet IDs for 1,999 geo-tagged tweets using #Covid19 collected in 2021

tweetids_gme

Tweet IDs for 15,594 tweets using $GME (the GameStop ticker)

tweetids_stayhome

Tweet IDs for 23,737 tweets using #StayHome collected in 2021