# Corpora for Text Analysis
This is an R package with a collection of corpora for text analysis. Some corpora are bundled with the package (see table below); others must be downloaded first. This lets us keep adding new corpora without ballooning the initial install!

Please check out our book, *Mapping Texts: Computational Text Analysis for the Social Sciences*.
## Installation

```r
library(remotes)
install_gitlab("culturalcartography/text2map.corpora")
```

## Usage
A number of smaller corpora are included with the package and can be loaded immediately:
```r
library(text2map.corpora)

# Load a bundled corpus using data()
data("corpus_beyonce")

# Or use load_corpus() for any corpus
beyonce <- load_corpus("corpus_beyonce")
```

Larger corpora must be downloaded once per machine, then loaded each session:
```r
# Download once per machine
download_corpus("corpus_web_dubois")

# Load each session
dubois <- load_corpus("corpus_web_dubois")
```

## Bundled Corpora
The following corpora are included with the package and can be loaded immediately with `data()` or `load_corpus()`:
| NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
|---|---|---|---|---|---|
| corpus_senti_bench4k | 6 | 4044 | 113066 | 26426 | 1 Mb |
| corpus_annual_review | 7 | 70 | 9982 | 1770 | 56.2 Kb |
| corpus_atn_immigr | 8 | 3230 | 4235162 | 216471 | 24.7 Mb |
| corpus_beyonce | 10 | 83 | 38240 | 4465 | 213.4 Kb |
| corpus_cmu_blogs100 | 6 | 100 | 46808 | 11919 | 299.1 Kb |
| corpus_envsociology | 8 | 817 | 126729 | 16492 | 1.1 Mb |
| corpus_europarl_subset | 4 | 10000 | 261904 | 26792 | 2.4 Mb |
| corpus_finefoods10k | 9 | 9999 | 827039 | 55006 | 6.8 Mb |
| corpus_isot_fake_news2k | 5 | 2000 | 833437 | 67987 | 5.3 Mb |
| corpus_ittpr | 7 | 976 | 455733 | 38173 | 3.3 Mb |
| corpus_presidential | 13 | 2475 | 4930817 | 145616 | 27.8 Mb |
| corpus_reddit_aita10k | 18 | 10157 | 3407207 | 122317 | 22.9 Mb |
| corpus_taylor_swift | 10 | 120 | 44488 | 5033 | 263.2 Kb |
| corpus_tng_season5 | 5 | 10834 | 118671 | 15661 | 1.6 Mb |
| corpus_usnss | 2 | 18 | 405556 | 23035 | 2.6 Mb |
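The TOKENS and TYPES columns above summarize each corpus as a whole; a similar (though not necessarily identical) summary can be computed for any loaded corpus with a simple whitespace tokenizer. A minimal sketch in base R — the package's own counts may use a different tokenizer, so treat this purely as an illustration:

```r
# Count whitespace-delimited tokens and unique types in a character vector.
# This is a rough approximation; punctuation is not stripped.
count_tokens <- function(texts) {
  toks <- unlist(strsplit(tolower(texts), "\\s+"))
  toks <- toks[nzchar(toks)]
  c(tokens = length(toks), types = length(unique(toks)))
}

count_tokens(c("to be or not to be", "that is the question"))
# tokens = 10, types = 8
```

Applied to a corpus data frame, you would pass its text column (check `names()` for the actual column name, which varies by corpus).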
## Downloadable Corpora
The following corpora are available for download. Each needs to be downloaded only once per machine; once downloaded, they can be loaded with `load_corpus()` or `data()`.
### Text Corpora
| NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
|---|---|---|---|---|---|
| corpus_senti_bench | 6 | 11557 | 308830 | 56492 | 2.8 Mb |
| corpus_disaster | 3 | 10860 | 161285 | 41853 | 2.5 Mb |
| corpus_enron | 7 | 30965 | 6353609 | 243605 | 39.3 Mb |
| corpus_nytimes_covid | 24 | 982 | 18974 | 5968 | 40.6 Mb |
| corpus_web_dubois | 5 | 12757 | 143081 | 13841 | 2.3 Mb |
| corpus_isot_fake_news | 5 | 44244 | 18196332 | 396170 | 99.8 Mb |
| corpus_dsj_vox | 8 | 22789 | 25410700 | 1358106 | 205.7 Mb |
| corpus_pitchfork | 13 | 20873 | 13921384 | 666134 | 91.7 Mb |
| corpus_atn | 12 | 204135 | 156294551 | 2507849 | 943.1 Mb |
| corpus_atn2 | 11 | 2688879 | 1344232395 | 3049120 | 8.6 Gb |
| corpus_finefoods | 9 | 50000 | 4119699 | 140842 | 29.4 Mb |
| corpus_reddit_aita | 18 | 32766 | 11056240 | 267134 | 73.9 Mb |
| corpus_black_mirror | 5 | 18972 | 113323 | 22025 | 2.2 Mb |
| corpus_scifi_pulp | 10 | 2110 | 160189078 | 3857877 | 740.7 Mb |
| corpus_moral_stories | 9 | 24000 | 1469341 | 33765 | 15.1 Mb |
### Tweet ID Lists (for rehydration)
| NAME | N IDS | SIZE |
|---|---|---|
| tweetids_covid | 1,922 | 11 Kb |
| tweetids_covid_geo | 1,999 | 12 Kb |
| tweetids_stayhome | 23,737 | 128 Kb |
| tweetids_gme | 15,594 | 82 Kb |
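These datasets contain only tweet IDs; the tweets themselves must be rehydrated through the Twitter/X API with an external tool such as twarc. A minimal sketch of exporting IDs for rehydration — assuming the loaded object is (or contains) a character vector of IDs, which is an assumption, not a documented guarantee:

```r
library(text2map.corpora)

# Download once, then load; structure of the object is assumed here
download_corpus("tweetids_gme")
ids <- load_corpus("tweetids_gme")

# Write one ID per line. Keeping IDs as character strings avoids
# precision loss (tweet IDs overflow R's doubles). The file can then
# be fed to, e.g.: twarc2 hydrate gme_ids.txt gme_tweets.jsonl
writeLines(as.character(unlist(ids)), "gme_ids.txt")
```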
## Helper Functions
The package provides several helper functions for managing corpora:
| Function | Description |
|---|---|
| `list_corpora()` | List all available corpora with metadata |
| `corpus_info()` | Get detailed info about a specific corpus |
| `corpus_exists()` | Check if a corpus is available (bundled or downloaded) |
| `corpus_path()` | Get the file path to a downloaded corpus |
| `delete_corpus()` | Remove a downloaded corpus from disk |
```r
# List all available corpora
list_corpora()

# List only bundled corpora
list_corpora(type = "bundled")

# List only downloaded corpora
list_corpora(downloaded_only = TRUE)

# List tweet ID datasets
list_corpora(category = "tweetids")

# Get info about a specific corpus
corpus_info("corpus_beyonce")

# Check if a corpus is available
corpus_exists("corpus_beyonce") # TRUE (bundled)
corpus_exists("corpus_enron")   # FALSE (not downloaded)

# Get path to a downloaded corpus
corpus_path("corpus_enron")

# Delete a downloaded corpus
delete_corpus("corpus_web_dubois")
```

## File Formats
Downloaded corpora are stored in multiple formats, tried in this priority order:

1. `.qs2`: fastest loading (~10x faster than `.rda`)
2. `.fst`: fast loading (~3x faster than `.rda`)
3. `.rda`: standard R format (fallback)
The `download_corpus()` function downloads the best available format from the repository, and `load_corpus()` automatically detects the format and uses the appropriate loader.
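If you need to bypass `load_corpus()` (say, to read a corpus from another process), the file can be read manually with the standard reader for each format. A minimal sketch, assuming `corpus_path()` returns the full path including the extension — that behavior, and the exact on-disk layout, are assumptions:

```r
path <- corpus_path("corpus_enron")

# Dispatch on the file extension; qs2::qs_read() and fst::read_fst()
# are the standard readers for those formats
corpus <- switch(tools::file_ext(path),
  qs2 = qs2::qs_read(path),
  fst = fst::read_fst(path),
  rda = {
    # load() restores objects by name, so pull the first one out of
    # a scratch environment
    e <- new.env()
    load(path, envir = e)
    get(ls(e)[1], envir = e)
  }
)
```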
## Related Packages

There are four related packages hosted on GitLab:

- `text2map`: text analysis functions
- `text2map.dictionaries`: norm dictionaries and word frequency lists
- `text2map.pretrained`: pretrained embeddings and topic models
- `text2map.theme`: changes `ggplot2` aesthetics and loads the viridis color scheme as the default
These packages can be installed as follows:

```r
install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")
```

## Contributions and Support
We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues
