Corpora for Text Analysis

This is an R package with a collection of corpora for text analysis. Some corpora are included when the package is installed (see the first table below). Others must be downloaded first (see the second table). This allows us to continue adding new corpora without the initial package ballooning!

You can install the package using:

remotes::install_gitlab("culturalcartography/text2map.corpora")

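Then load the package and a corpus:

library(text2map.corpora)

# for corpora installed with the package
data("corpus_finefoods")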

# for corpora that need to be downloaded first
download_corpus("corpus_web_dubois")
data("corpus_web_dubois")
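
Once a corpus is loaded, a quick look at its structure helps before any analysis. The sketch below is only an illustration: it assumes that data() creates an object with the same name as the corpus, and it skips the package calls when text2map.corpora is not installed. str() is used because it works whatever class the object turns out to have.

```r
# Check whether the package is installed before calling into it
pkg_available <- requireNamespace("text2map.corpora", quietly = TRUE)

if (pkg_available) {
  library(text2map.corpora)
  data("corpus_finefoods")

  # Assumption: data() creates an object named corpus_finefoods
  str(corpus_finefoods, max.level = 1)
} else {
  message("text2map.corpora is not installed; install it first.")
}
```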

Available Corpora

The following corpora are installed with the package and can be loaded into your R session using data(), provided text2map.corpora is also loaded.

Corpora Installed with Package
NAME                      N VARS  N DOCS     TOKENS    TYPES      SIZE
corpora_senti_bench            6   11557     308830    56492    2.8 Mb
corpus_annual_review           7      70       9982     1770   56.2 Kb
corpus_atn_immigr              8    3230    4235162   216471   24.7 Mb
corpus_beyonce                10      83      38240     4465  213.4 Kb
corpus_cmu_blogs100            6     100      46808    11919  299.1 Kb
corpus_envsociology            8     817     126729    16492    1.1 Mb
corpus_europarl_subset         4   10000     261904    26792    2.4 Mb
corpus_finefoods               9   50000    4119699   140842   29.4 Mb
corpus_isot_fake_news2k        5    2000     833437    67987    5.3 Mb
corpus_ittpr                   7     976     455733    38173    3.3 Mb
corpus_presidential           13    2475    4930817   145616   27.8 Mb
corpus_reddit_aita            18   32766   11056240   267134   73.9 Mb
corpus_taylor_swift           10     120      44488     5033  263.2 Kb
corpus_tng_season5             5   10834     118671    15661    1.6 Mb
corpus_usnss                   2      18     405556    23035    2.6 Mb
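
The TOKENS and TYPES columns presumably follow the usual sense of those terms: tokens are all word occurrences, and types are distinct words. Here is a minimal sketch in base R, using two invented documents and an assumed tokenizer (lowercase, then split on anything that is not a letter). The counts in the tables may come from a different tokenizer, so this shows the idea rather than the package's method.

```r
# Toy documents, invented for illustration (not taken from any corpus here)
docs <- c(
  "The cat sat on the mat.",
  "The dog sat on the log."
)

# Assumed tokenizer: lowercase, then split on runs of non-letters
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]  # drop any empty strings left by the split

n_docs   <- length(docs)            # N DOCS: number of documents
n_tokens <- length(tokens)          # TOKENS: every word occurrence
n_types  <- length(unique(tokens))  # TYPES: distinct words

print(c(docs = n_docs, tokens = n_tokens, types = n_types))
```

With these two sentences, the 12 word occurrences reduce to 7 distinct words, since "the", "sat", and "on" repeat.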

The following corpora are currently available for download with download_corpus(). Once downloaded, they can be loaded using data(), so long as text2map.corpora is loaded. Each corpus needs to be downloaded only once per machine, not once per session.

Corpora That Must Be Downloaded
NAME                      N VARS  N DOCS     TOKENS    TYPES      SIZE
corpus_disaster                3   10860     161285    41853    2.5 Mb
corpus_enron                   7   30965    6353609   243605   39.3 Mb
corpus_nytimes_covid          24     982      18974     5968   40.6 Mb
corpus_web_dubois              5   12757     143081    13841    2.3 Mb
corpus_isot_fake_news          5   44244   18196332   396170   99.8 Mb
corpus_dsj_vox                 8   22789   25410700  1358106  205.7 Mb
corpus_pitchfork              13   20873   13921384   666134   91.7 Mb
corpus_atn                    12  204135  156294551  2507849  943.1 Mb