This is an R package with a collection of corpora for text analysis. Some corpora are included when installing the package (see table below). Others must be downloaded first. This allows us to continue adding new corpora without the initial package ballooning!
You can install the package using:
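```r
remotes::install_gitlab("culturalcartography/text2map.corpora")
```

```r
# load the package so that data() can find its corpora
library(text2map.corpora)

# for corpora included with the package
data("corpus_finefoods")

# for corpora that need to be downloaded first
download_corpus("corpus_web_dubois")
data("corpus_web_dubois")
```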
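Once a corpus is loaded, it can be inspected like any other R object. The sketch below assumes each corpus loads as a data frame with one row per document, which the N VARS and N DOCS columns in the tables suggest but which is not confirmed here. `corpus_dims()` is a hypothetical helper, not part of the package, and it is demonstrated on the built-in `mtcars` data so that it runs without the package installed.

```r
# Hypothetical helper (not part of text2map.corpora): report the number of
# rows and columns of a data frame.
corpus_dims <- function(corpus) {
  stopifnot(is.data.frame(corpus))
  c(n_docs = nrow(corpus), n_vars = ncol(corpus))
}

# Demonstrated on a built-in data set
corpus_dims(mtcars)

# With the package loaded, the same call would apply to a corpus
# (assuming it is a data frame), e.g.:
# data("corpus_finefoods")
# corpus_dims(corpus_finefoods)
```

If the data frame assumption holds, the two values should correspond to the N DOCS and N VARS columns listed for that corpus.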
The following corpora can be loaded into your R session using `data()`, provided `text2map.corpora` is also loaded.
NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpora_senti_bench | 6 | 11557 | 308830 | 56492 | 2.8 Mb |
corpus_annual_review | 7 | 70 | 9982 | 1770 | 56.2 Kb |
corpus_atn_immigr | 8 | 3230 | 4235162 | 216471 | 24.7 Mb |
corpus_beyonce | 10 | 83 | 38240 | 4465 | 213.4 Kb |
corpus_cmu_blogs100 | 6 | 100 | 46808 | 11919 | 299.1 Kb |
corpus_envsociology | 8 | 817 | 126729 | 16492 | 1.1 Mb |
corpus_europarl_subset | 4 | 10000 | 261904 | 26792 | 2.4 Mb |
corpus_finefoods | 9 | 50000 | 4119699 | 140842 | 29.4 Mb |
corpus_isot_fake_news2k | 5 | 2000 | 833437 | 67987 | 5.3 Mb |
corpus_ittpr | 7 | 976 | 455733 | 38173 | 3.3 Mb |
corpus_presidential | 13 | 2475 | 4930817 | 145616 | 27.8 Mb |
corpus_reddit_aita | 18 | 32766 | 11056240 | 267134 | 73.9 Mb |
corpus_taylor_swift | 10 | 120 | 44488 | 5033 | 263.2 Kb |
corpus_tng_season5 | 5 | 10834 | 118671 | 15661 | 1.6 Mb |
corpus_usnss | 2 | 18 | 405556 | 23035 | 2.6 Mb |
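To check which data sets an installed package ships, base R's `data(package = ...)` returns an index whose `results` matrix has an `Item` column. A minimal sketch follows. `list_datasets()` is a hypothetical helper, not part of text2map.corpora, and it is shown on the base `datasets` package so it runs anywhere. Whether corpora fetched later also appear in this index is not stated here, so treat it as a way to list the bundled corpora only.

```r
# Hypothetical helper (not part of text2map.corpora): return the names of
# the data sets listed in an installed package's data index.
list_datasets <- function(pkg) {
  index <- utils::data(package = pkg)$results
  unname(index[, "Item"])
}

# Demonstrated on the base 'datasets' package
head(list_datasets("datasets"))

# With text2map.corpora installed, the same call would be:
# list_datasets("text2map.corpora")
```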
The following corpora are currently available to be downloaded. Once downloaded, they can be loaded using `data()` so long as `text2map.corpora` is loaded. They need only be downloaded once per machine (not per session).
NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpus_disaster | 3 | 10860 | 161285 | 41853 | 2.5 Mb |
corpus_enron | 7 | 30965 | 6353609 | 243605 | 39.3 Mb |
corpus_nytimes_covid | 24 | 982 | 18974 | 5968 | 40.6 Mb |
corpus_web_dubois | 5 | 12757 | 143081 | 13841 | 2.3 Mb |
corpus_isot_fake_news | 5 | 44244 | 18196332 | 396170 | 99.8 Mb |
corpus_dsj_vox | 8 | 22789 | 25410700 | 1358106 | 205.7 Mb |
corpus_pitchfork | 13 | 20873 | 13921384 | 666134 | 91.7 Mb |
corpus_atn | 12 | 204135 | 156294551 | 2507849 | 943.1 Mb |