Corpora for Text Analysis
This is an R package with a collection of corpora for text analysis. Some corpora are included when installing the package (see table below). Others must be downloaded first. This allows us to continue adding new corpora without the initial package ballooning!
You can install the package using:
remotes::install_gitlab("culturalcartography/text2map.corpora")
Load corpora installed with the package:
data("corpus_finefoods10k", package = "text2map.corpora")
Download an additional corpus and then load it:
text2map.corpora::download_corpus("corpus_web_dubois")
data("corpus_web_dubois", package = "text2map.corpora")
Available Corpora
The following corpora can be loaded into your R session using data(), provided text2map.corpora is also loaded. Note: tokens and types are measured without preprocessing, using spaces to mark word boundaries.
NAME | N_VARS | N_DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpus_senti_bench4k | 6 | 4044 | 113066 | 26426 | 1 Mb |
corpus_annual_review | 7 | 70 | 9982 | 1770 | 56.2 Kb |
corpus_atn_immigr | 8 | 3230 | 4235162 | 216471 | 24.7 Mb |
corpus_beyonce | 10 | 83 | 38240 | 4465 | 213.4 Kb |
corpus_cmu_blogs100 | 6 | 100 | 46808 | 11919 | 299.1 Kb |
corpus_envsociology | 8 | 817 | 126729 | 16492 | 1.1 Mb |
corpus_europarl_subset | 4 | 10000 | 261904 | 26792 | 2.4 Mb |
corpus_finefoods10k | 9 | 9999 | 827039 | 55006 | 6.8 Mb |
corpus_isot_fake_news2k | 5 | 2000 | 833437 | 67987 | 5.3 Mb |
corpus_ittpr | 7 | 976 | 455733 | 38173 | 3.3 Mb |
corpus_presidential | 13 | 2475 | 4930817 | 145616 | 27.8 Mb |
corpus_reddit_aita10k | 18 | 10157 | 3407207 | 122317 | 22.9 Mb |
corpus_taylor_swift | 10 | 120 | 44488 | 5033 | 263.2 Kb |
corpus_tng_season5 | 5 | 10834 | 118671 | 15661 | 1.6 Mb |
corpus_usnss | 2 | 18 | 405556 | 23035 | 2.6 Mb |
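
The note above on tokens and types can be illustrated with a minimal base R sketch. The helper name `count_tokens_types` and the toy documents are hypothetical, and dropping the empty strings left by repeated spaces is an assumption; the exact procedure used to build the table may differ.

```r
# Hypothetical helper: count space-delimited tokens and unique types
# with no lowercasing, punctuation removal, or other preprocessing.
count_tokens_types <- function(texts) {
  # Split every document on a literal single space
  words <- unlist(strsplit(texts, " ", fixed = TRUE))
  # Assumption: discard empty strings produced by consecutive spaces
  words <- words[nzchar(words)]
  c(tokens = length(words), types = length(unique(words)))
}

# Toy documents for illustration only (not taken from any corpus)
toy_docs <- c("the cat sat", "The cat sat down")
counts <- count_tokens_types(toy_docs)
print(counts)
```

Because nothing is lowercased or stripped, "the" and "The" count as two different types here, which is why raw type counts tend to run higher than counts taken after preprocessing.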
The following corpora are currently available to be downloaded. They need only be downloaded once per machine (not per session). Once downloaded, they can be loaded using data() so long as text2map.corpora is loaded.
NAME | N_VARS | N_DOCS | TOKENS | TYPES | SIZE |
---|---|---|---|---|---|
corpus_senti_bench | 6 | 11557 | 308830 | 56492 | 2.8 Mb |
corpus_disaster | 3 | 10860 | 161285 | 41853 | 2.5 Mb |
corpus_enron | 7 | 30965 | 6353609 | 243605 | 39.3 Mb |
corpus_nytimes_covid | 24 | 982 | 18974 | 5968 | 40.6 Mb |
corpus_web_dubois | 5 | 12757 | 143081 | 13841 | 2.3 Mb |
corpus_isot_fake_news | 5 | 44244 | 18196332 | 396170 | 99.8 Mb |
corpus_dsj_vox | 8 | 22789 | 25410700 | 1358106 | 205.7 Mb |
corpus_pitchfork | 13 | 20873 | 13921384 | 666134 | 91.7 Mb |
corpus_atn | 12 | 204135 | 156294551 | 2507849 | 943.1 Mb |
corpus_atn2 | 11 | 2688879 | 1344232395 | 3049120 | 8.6 Gb |
corpus_finefoods | 9 | 50000 | 4119699 | 140842 | 29.4 Mb |
corpus_reddit_aita | 18 | 32766 | 11056240 | 267134 | 73.9 Mb |
corpus_black_mirror | 5 | 18972 | 113323 | 22025 | 2.2 Mb |
corpus_scifi_pulp | 10 | 2110 | 160189078 | 3857877 | 740.7 Mb |
corpus_moral_stories | 9 | 24000 | 1469341 | 33765 | 15.1 Mb |
Related Packages
There are four related packages hosted on GitLab:
- text2map: text analysis functions
- text2map.dictionaries: norm dictionaries and word frequency lists
- text2map.pretrained: pretrained embeddings and topic models
- text2map.theme: changes ggplot2 aesthetics and loads viridis color scheme as default
The above packages can be installed using the following:
install.packages("text2map")
library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")
Contributions and Support
We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues