Corpora for Text Analysis

This is an R package with a collection of corpora for text analysis. Some corpora are included when installing the package (see table below). Others must be downloaded first. This allows us to continue adding new corpora without the initial package ballooning!

Please check out our book Mapping Texts: Computational Text Analysis for the Social Sciences

Installation

library(remotes)
install_gitlab("culturalcartography/text2map.corpora")

Usage

A number of smaller corpora are included with the package and can be loaded immediately:

library(text2map.corpora)

# Load a bundled corpus using data()
data("corpus_beyonce")

# Or use load_corpus() for any corpus
beyonce <- load_corpus("corpus_beyonce")

Larger corpora must be downloaded once per machine, then loaded each session:

# Download once per machine
download_corpus("corpus_web_dubois")

# Load each session
dubois <- load_corpus("corpus_web_dubois")
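The two steps above can be combined so the download only runs when the corpus is not yet on disk. This is a minimal sketch, not part of the package: ensure_corpus() is a hypothetical helper name, and it assumes corpus_exists() returns a single TRUE or FALSE, as the comments in the Helper Functions section suggest.

```r
# Hypothetical helper (not provided by text2map.corpora):
# download a corpus only if it is not already available, then load it.
ensure_corpus <- function(name) {
  # Assumption: corpus_exists() returns TRUE for bundled or downloaded corpora
  if (!text2map.corpora::corpus_exists(name)) {
    text2map.corpora::download_corpus(name)
  }
  text2map.corpora::load_corpus(name)
}

# Usage (requires the package and, on first run, a network connection):
# dubois <- ensure_corpus("corpus_web_dubois")
```

Calling the wrapper at the top of an analysis script keeps the script runnable on a fresh machine without downloading again on later runs.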

Bundled Corpora

The following corpora are included with the package and can be loaded immediately with data() or load_corpus():

NAME N_VARS N_DOCS TOKENS TYPES SIZE
corpus_senti_bench4k 6 4044 113066 26426 1 Mb
corpus_annual_review 7 70 9982 1770 56.2 Kb
corpus_atn_immigr 8 3230 4235162 216471 24.7 Mb
corpus_beyonce 10 83 38240 4465 213.4 Kb
corpus_cmu_blogs100 6 100 46808 11919 299.1 Kb
corpus_envsociology 8 817 126729 16492 1.1 Mb
corpus_europarl_subset 4 10000 261904 26792 2.4 Mb
corpus_finefoods10k 9 9999 827039 55006 6.8 Mb
corpus_isot_fake_news2k 5 2000 833437 67987 5.3 Mb
corpus_ittpr 7 976 455733 38173 3.3 Mb
corpus_presidential 13 2475 4930817 145616 27.8 Mb
corpus_reddit_aita10k 18 10157 3407207 122317 22.9 Mb
corpus_taylor_swift 10 120 44488 5033 263.2 Kb
corpus_tng_season5 5 10834 118671 15661 1.6 Mb
corpus_usnss 2 18 405556 23035 2.6 Mb
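As a rough guide to reading the TOKENS and TYPES columns, the sketch below counts total words (tokens) and unique words (types) in a toy character vector using base R. The tokenizer used to produce the table is not stated here, so the lowercase-and-split-on-non-letters rule is only an assumption, and counts from a different tokenizer will differ.

```r
# Toy documents standing in for a corpus text column
texts <- c("The cat sat on the mat.", "The dog sat too.")

# Assumed tokenization: lowercase, then split on runs of non-letters
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]  # drop any empty strings

n_tokens <- length(tokens)         # total word count
n_types  <- length(unique(tokens)) # unique word count

n_tokens  # 10
n_types   # 7
```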

Downloadable Corpora

The following corpora are available to download. They need only be downloaded once per machine. Once downloaded, they can be loaded with load_corpus() or data().

Text Corpora:

NAME N_VARS N_DOCS TOKENS TYPES SIZE
corpus_senti_bench 6 11557 308830 56492 2.8 Mb
corpus_disaster 3 10860 161285 41853 2.5 Mb
corpus_enron 7 30965 6353609 243605 39.3 Mb
corpus_nytimes_covid 24 982 18974 5968 40.6 Mb
corpus_web_dubois 5 12757 143081 13841 2.3 Mb
corpus_isot_fake_news 5 44244 18196332 396170 99.8 Mb
corpus_dsj_vox 8 22789 25410700 1358106 205.7 Mb
corpus_pitchfork 13 20873 13921384 666134 91.7 Mb
corpus_atn 12 204135 156294551 2507849 943.1 Mb
corpus_atn2 11 2688879 1344232395 3049120 8.6 Gb
corpus_finefoods 9 50000 4119699 140842 29.4 Mb
corpus_reddit_aita 18 32766 11056240 267134 73.9 Mb
corpus_black_mirror 5 18972 113323 22025 2.2 Mb
corpus_scifi_pulp 10 2110 160189078 3857877 740.7 Mb
corpus_moral_stories 9 24000 1469341 33765 15.1 Mb

Tweet ID Lists (for rehydration):

NAME N_IDS SIZE
tweetids_covid 1,922 11 Kb
tweetids_covid_geo 1,999 12 Kb
tweetids_stayhome 23,737 128 Kb
tweetids_gme 15,594 82 Kb

Helper Functions

The package provides several helper functions for managing corpora:

Function Description
list_corpora() List all available corpora with metadata
corpus_info() Get detailed info about a specific corpus
corpus_exists() Check if a corpus is available (bundled or downloaded)
corpus_path() Get file path to a downloaded corpus
delete_corpus() Remove a downloaded corpus from disk

# List all available corpora
list_corpora()

# List only bundled corpora
list_corpora(type = "bundled")

# List only downloaded corpora
list_corpora(downloaded_only = TRUE)

# List tweet ID datasets
list_corpora(category = "tweetids")

# Get info about a specific corpus
corpus_info("corpus_beyonce")

# Check if a corpus is available
corpus_exists("corpus_beyonce")     # TRUE (bundled)
corpus_exists("corpus_enron")       # FALSE (not downloaded)

# Get path to a downloaded corpus
corpus_path("corpus_enron")

# Delete a downloaded corpus
delete_corpus("corpus_web_dubois")

File Formats

Downloaded corpora can be stored in any of three formats, which are tried in this priority order when loading:

  1. .qs2 — Fastest loading (~10x faster than .rda)
  2. .fst — Fast loading (~3x faster than .rda)
  3. .rda — Standard R format (fallback)

The download_corpus() function downloads the best available format from the repository. The load_corpus() function automatically detects and uses the appropriate loader.
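The priority order can be illustrated with a small base R sketch that picks the first file found among the three extensions. This is not the package's implementation: pick_corpus_file() is a hypothetical helper, and the one-file-per-format naming scheme (name plus extension in a single directory) is an assumption made for the example.

```r
# Hypothetical helper: return the first existing file for `name` in `dir`,
# trying extensions in priority order (.qs2, then .fst, then .rda).
pick_corpus_file <- function(name, dir, exts = c("qs2", "fst", "rda")) {
  candidates <- file.path(dir, paste0(name, ".", exts))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    return(NA_character_)  # nothing on disk for this corpus
  }
  found[1]
}

# Demo with empty placeholder files in a temporary directory
demo_dir <- file.path(tempdir(), "corpora_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("corpus_demo.rda", "corpus_demo.fst")))

basename(pick_corpus_file("corpus_demo", demo_dir))  # "corpus_demo.fst"
```

Because no .qs2 file exists in the demo directory, the .fst file wins over the .rda fallback.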

There are four related packages hosted on GitLab: text2map, text2map.dictionaries, text2map.pretrained, and text2map.theme.

They can be installed using the following:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")

Contributions and Support

We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues