This is an R package with a collection of corpora for text analysis. Some corpora are included when installing the package (see the first table below); others must be downloaded first. This allows us to keep adding new corpora without the initial package ballooning!
You can install the package from GitLab using the remotes package:
remotes::install_gitlab("culturalcartography/text2map.corpora")
Load corpora installed with the package:
data("corpus_finefoods10k", package = "text2map.corpora")
Download an additional corpus and then load it:
text2map.corpora::download_corpus("corpus_web_dubois")
data("corpus_web_dubois", package = "text2map.corpora")
The following corpora can be loaded into your R session using data(), provided text2map.corpora is also loaded. Note: tokens and types are measured without preprocessing, using spaces to mark word boundaries.
NAME | N_VARS | N_DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpus_senti_bench4k | 6 | 4044 | 113066 | 26426 | 1 Mb |
corpus_annual_review | 7 | 70 | 9982 | 1770 | 56.2 Kb |
corpus_atn_immigr | 8 | 3230 | 4235162 | 216471 | 24.7 Mb |
corpus_beyonce | 10 | 83 | 38240 | 4465 | 213.4 Kb |
corpus_cmu_blogs100 | 6 | 100 | 46808 | 11919 | 299.1 Kb |
corpus_envsociology | 8 | 817 | 126729 | 16492 | 1.1 Mb |
corpus_europarl_subset | 4 | 10000 | 261904 | 26792 | 2.4 Mb |
corpus_finefoods10k | 9 | 9999 | 827039 | 55006 | 6.8 Mb |
corpus_isot_fake_news2k | 5 | 2000 | 833437 | 67987 | 5.3 Mb |
corpus_ittpr | 7 | 976 | 455733 | 38173 | 3.3 Mb |
corpus_presidential | 13 | 2475 | 4930817 | 145616 | 27.8 Mb |
corpus_reddit_aita10k | 18 | 10157 | 3407207 | 122317 | 22.9 Mb |
corpus_taylor_swift | 10 | 120 | 44488 | 5033 | 263.2 Kb |
corpus_tng_season5 | 5 | 10834 | 118671 | 15661 | 1.6 Mb |
corpus_usnss | 2 | 18 | 405556 | 23035 | 2.6 Mb |
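
The note above says tokens and types are counted without preprocessing, splitting on spaces. As a rough illustration of what that means, here is a minimal base R sketch. The function name `count_tokens_types` and the toy documents are hypothetical, and dropping empty strings is an assumption; the package's own counting code may differ in such details.

```r
# Hypothetical helper, not part of text2map.corpora.
# Tokens: every space-separated string. Types: the unique strings.
# No lowercasing or punctuation removal, so "The" and "the" are distinct types.
count_tokens_types <- function(docs) {
  # Split each document on single spaces and pool the pieces
  words <- unlist(strsplit(docs, " ", fixed = TRUE))
  # Assumption: drop empty strings left by leading or repeated spaces
  words <- words[nzchar(words)]
  c(tokens = length(words), types = length(unique(words)))
}

toy_docs <- c("the cat sat", "The cat sat on the mat.")
count_tokens_types(toy_docs)
# tokens = 9, types = 6
```

Because nothing is normalized, counts produced this way will usually be higher than those from a tokenizer that lowercases text or strips punctuation.
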
The following corpora are currently available to be downloaded. They need only be downloaded once per machine (not per session). Once downloaded, they can be loaded using data() so long as text2map.corpora is loaded.
NAME | N_VARS | N_DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpus_senti_bench | 6 | 11557 | 308830 | 56492 | 2.8 Mb |
corpus_disaster | 3 | 10860 | 161285 | 41853 | 2.5 Mb |
corpus_enron | 7 | 30965 | 6353609 | 243605 | 39.3 Mb |
corpus_nytimes_covid | 24 | 982 | 18974 | 5968 | 40.6 Mb |
corpus_web_dubois | 5 | 12757 | 143081 | 13841 | 2.3 Mb |
corpus_isot_fake_news | 5 | 44244 | 18196332 | 396170 | 99.8 Mb |
corpus_dsj_vox | 8 | 22789 | 25410700 | 1358106 | 205.7 Mb |
corpus_pitchfork | 13 | 20873 | 13921384 | 666134 | 91.7 Mb |
corpus_atn | 12 | 204135 | 156294551 | 2507849 | 943.1 Mb |
corpus_atn2 | 11 | 2688879 | 1344232395 | 3049120 | 8.6 Gb |
corpus_finefoods | 9 | 50000 | 4119699 | 140842 | 29.4 Mb |
corpus_reddit_aita | 18 | 32766 | 11056240 | 267134 | 73.9 Mb |
corpus_black_mirror | 5 | 18972 | 113323 | 22025 | 2.2 Mb |
corpus_scifi_pulp | 10 | 2110 | 160189078 | 3857877 | 740.7 Mb |
corpus_moral_stories | 9 | 24000 | 1469341 | 33765 | 15.1 Mb |
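
Since a corpus only needs to be downloaded once per machine, a script can try data() first and call download_corpus() only when that fails. The sketch below is one possible way to do this. `load_or_download` is a hypothetical helper, not a function from the package. It assumes that data() signals a warning when a data set is not found (standard base R behavior) and that the loaded object has the same name as the corpus.

```r
# Hypothetical helper, not part of text2map.corpora.
# Tries to load a corpus; downloads it first only if data() cannot find it.
load_or_download <- function(name, pkg = "text2map.corpora") {
  env <- new.env()
  found <- tryCatch(
    {
      utils::data(list = name, package = pkg, envir = env)
      exists(name, envir = env, inherits = FALSE)
    },
    # data() warns when the data set is missing; treat that as "not found"
    warning = function(w) FALSE
  )
  if (!found) {
    text2map.corpora::download_corpus(name)
    utils::data(list = name, package = pkg, envir = env)
  }
  # Assumption: the object is stored under the corpus name
  get(name, envir = env)
}

# Usage (requires text2map.corpora to be installed):
# corpus_web_dubois <- load_or_download("corpus_web_dubois")
```

Check the SIZE column before calling this on the larger corpora, since the first call triggers the full download.
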
There are four related packages hosted on GitLab:
- text2map: text analysis functions
- text2map.dictionaries: norm dictionaries and word frequency lists
- text2map.pretrained: pretrained embeddings and topic models
- text2map.theme: changes ggplot2 aesthetics and loads viridis color scheme as default

The above packages can be installed using the following:
install.packages("text2map")
library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")
We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues