Corpora for Text Analysis

This is an R package with a collection of corpora for text analysis. Some corpora are included when you install the package (see the tables below); others must be downloaded first. This lets us keep adding new corpora without the initial package ballooning!

You can install the package using:

remotes::install_gitlab("culturalcartography/text2map.corpora")

Load corpora installed with the package:

data("corpus_finefoods10k", package = "text2map.corpora")
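
To check from within R which corpora data() can currently find, something like the following may help. It is a minimal sketch: list_corpora() is a hypothetical helper, not a function exported by text2map.corpora, and it relies only on base R's data(package = ...) listing.

```r
# Hypothetical helper: list the data sets that data() can find in an
# installed package. Returns an empty character vector if the package
# is not installed.
list_corpora <- function(pkg = "text2map.corpora") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  unname(utils::data(package = pkg)$results[, "Item"])
}

list_corpora()
```

The same call works for any installed package, for example list_corpora("datasets").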

Download an additional corpus and then load it:

text2map.corpora::download_corpus("corpus_web_dubois")
data("corpus_web_dubois", package = "text2map.corpora")
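
The two steps above can be wrapped so that the download only happens when it is needed. This is a sketch under assumptions: load_or_download() is a hypothetical helper, and it relies on base R's data() issuing a warning when a data set is not found. The downloader argument exists only so the helper can be tried without network access.

```r
# Hypothetical helper (not part of text2map.corpora): try to load a corpus
# with data(), and download it first only if data() cannot find it.
load_or_download <- function(name,
                             pkg = "text2map.corpora",
                             envir = globalenv(),
                             downloader = NULL) {
  try_load <- function() {
    tryCatch(
      {
        utils::data(list = name, package = pkg, envir = envir)
        TRUE
      },
      warning = function(w) FALSE # data() warns when the data set is absent
    )
  }

  if (try_load()) {
    return(invisible(TRUE))
  }

  # Fall back to the package's own downloader unless one was supplied.
  if (is.null(downloader)) {
    downloader <- text2map.corpora::download_corpus
  }
  downloader(name)

  # TRUE if the corpus can be loaded after the download attempt.
  invisible(try_load())
}

# Example call (needs the package and, on first use, a network connection):
# load_or_download("corpus_web_dubois")
```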

Available Corpora

The following corpora can be loaded into your R session using data(), provided text2map.corpora is also loaded. Note: tokens and types are counted without any preprocessing, using spaces to mark word boundaries.

Corpora Installed with Package
NAME N_VARS N_DOCS TOKENS TYPES SIZE
corpus_senti_bench4k 6 4044 113066 26426 1 Mb
corpus_annual_review 7 70 9982 1770 56.2 Kb
corpus_atn_immigr 8 3230 4235162 216471 24.7 Mb
corpus_beyonce 10 83 38240 4465 213.4 Kb
corpus_cmu_blogs100 6 100 46808 11919 299.1 Kb
corpus_envsociology 8 817 126729 16492 1.1 Mb
corpus_europarl_subset 4 10000 261904 26792 2.4 Mb
corpus_finefoods10k 9 9999 827039 55006 6.8 Mb
corpus_isot_fake_news2k 5 2000 833437 67987 5.3 Mb
corpus_ittpr 7 976 455733 38173 3.3 Mb
corpus_presidential 13 2475 4930817 145616 27.8 Mb
corpus_reddit_aita10k 18 10157 3407207 122317 22.9 Mb
corpus_taylor_swift 10 120 44488 5033 263.2 Kb
corpus_tng_season5 5 10834 118671 15661 1.6 Mb
corpus_usnss 2 18 405556 23035 2.6 Mb
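
For reference, the TOKENS and TYPES columns can be approximated as follows. This is a sketch on toy text, assuming a plain split on single spaces; the exact rule used to build the table is not documented here, so counts may differ slightly.

```r
# Toy documents standing in for a corpus text column (hypothetical data).
docs <- c("The cat sat on the mat.", "The dog sat.")

# Split on spaces only: no lowercasing, no punctuation removal.
words <- unlist(strsplit(docs, " ", fixed = TRUE))
words <- words[nzchar(words)] # drop empty strings from repeated spaces

n_tokens <- length(words)         # total word occurrences
n_types  <- length(unique(words)) # distinct word forms

c(tokens = n_tokens, types = n_types)
```

Because nothing is normalized, "The" and "the" count as two types, as do "sat" and "sat.".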

The following corpora are currently available to be downloaded. They need only be downloaded once per machine (not per session). Once downloaded they can be loaded using data() so long as text2map.corpora is loaded.

Corpora That Must Be Downloaded
NAME N_VARS N_DOCS TOKENS TYPES SIZE
corpus_senti_bench 6 11557 308830 56492 2.8 Mb
corpus_disaster 3 10860 161285 41853 2.5 Mb
corpus_enron 7 30965 6353609 243605 39.3 Mb
corpus_nytimes_covid 24 982 18974 5968 40.6 Mb
corpus_web_dubois 5 12757 143081 13841 2.3 Mb
corpus_isot_fake_news 5 44244 18196332 396170 99.8 Mb
corpus_dsj_vox 8 22789 25410700 1358106 205.7 Mb
corpus_pitchfork 13 20873 13921384 666134 91.7 Mb
corpus_atn 12 204135 156294551 2507849 943.1 Mb
corpus_atn2 11 2688879 1344232395 3049120 8.6 Gb
corpus_finefoods 9 50000 4119699 140842 29.4 Mb
corpus_reddit_aita 18 32766 11056240 267134 73.9 Mb
corpus_black_mirror 5 18972 113323 22025 2.2 Mb
corpus_scifi_pulp 10 2110 160189078 3857877 740.7 Mb
corpus_moral_stories 9 24000 1469341 33765 15.1 Mb
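
Some of these downloads are large, so it can be worth filtering by size before fetching several at once. The sketch below uses a hypothetical helper, size_to_bytes(), and assumes binary units (1 Kb = 1024 bytes); the table does not say which convention it follows.

```r
# Hypothetical helper: convert a SIZE entry such as "2.8 Mb" to bytes.
# Assumes binary units, which may not match how the table was produced.
size_to_bytes <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)
  value <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit  <- vapply(parts, `[`, character(1), 2)
  multiplier <- c(Kb = 1024, Mb = 1024^2, Gb = 1024^3)
  unname(value * multiplier[unit])
}

# A few rows copied from the table above.
sizes <- c(corpus_web_dubois = "2.3 Mb",
           corpus_enron      = "39.3 Mb",
           corpus_atn2       = "8.6 Gb")

budget <- 100 * 1024^2 # illustrative 100 Mb limit
small_enough <- names(sizes)[size_to_bytes(sizes) <= budget]

# Each selected corpus could then be fetched once per machine:
# for (name in small_enough) text2map.corpora::download_corpus(name)
```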

There are four related packages hosted on GitLab: text2map, text2map.dictionaries, text2map.pretrained, and text2map.theme.

The above packages can be installed using the following:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")

Contributions and Support

We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues