# Corpora for Text Analysis
This is an R package with a collection of corpora for text analysis. Some corpora are bundled with the package (see table below); others must be downloaded first. This lets us keep adding new corpora without ballooning the initial install!

Please check out our book, *Mapping Texts: Computational Text Analysis for the Social Sciences*.
## Installation

```r
library(remotes)
install_gitlab("culturalcartography/text2map.corpora")
```

## Usage
A number of smaller corpora are included with the package and can be loaded immediately:
```r
library(text2map.corpora)

# Load a bundled corpus using data()
data("corpus_beyonce")

# Or use load_corpus() for any corpus
beyonce <- load_corpus("corpus_beyonce")
```

Larger corpora must be downloaded once per machine, then loaded each session:
```r
# Download once per machine
download_corpus("corpus_web_dubois")

# Load each session
dubois <- load_corpus("corpus_web_dubois")
```

## Bundled Corpora
The following corpora are included with the package and can be loaded immediately with `data()` or `load_corpus()`:
| NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
|---|---|---|---|---|---|
| corpus_senti_bench4k | 6 | 4044 | 113066 | 26426 | 1 Mb |
| corpus_annual_review | 7 | 70 | 9982 | 1770 | 56.2 Kb |
| corpus_atn_immigr | 8 | 3230 | 4235162 | 216471 | 24.7 Mb |
| corpus_beyonce | 10 | 83 | 38240 | 4465 | 213.4 Kb |
| corpus_cmu_blogs100 | 6 | 100 | 46808 | 11919 | 299.1 Kb |
| corpus_envsociology | 8 | 817 | 126729 | 16492 | 1.1 Mb |
| corpus_europarl_subset | 4 | 10000 | 261904 | 26792 | 2.4 Mb |
| corpus_finefoods10k | 9 | 9999 | 827039 | 55006 | 6.8 Mb |
| corpus_isot_fake_news2k | 5 | 2000 | 833437 | 67987 | 5.3 Mb |
| corpus_ittpr | 7 | 976 | 455733 | 38173 | 3.3 Mb |
| corpus_presidential | 13 | 2475 | 4930817 | 145616 | 27.8 Mb |
| corpus_reddit_aita10k | 18 | 10157 | 3407207 | 122317 | 22.9 Mb |
| corpus_taylor_swift | 10 | 120 | 44488 | 5033 | 263.2 Kb |
| corpus_tng_season5 | 5 | 10834 | 118671 | 15661 | 1.6 Mb |
| corpus_usnss | 2 | 18 | 405556 | 23035 | 2.6 Mb |
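The TOKENS and TYPES columns above summarize each corpus as a whole; a similar (though not necessarily identical) summary can be computed for any loaded corpus with a simple whitespace tokenizer. A minimal sketch in base R — the package's own counts may use a different tokenizer, so treat this purely as an illustration:

```r
# Count whitespace-delimited tokens and unique types in a character vector.
# This is a rough approximation; punctuation is not stripped.
count_tokens <- function(texts) {
  toks <- unlist(strsplit(tolower(texts), "\\s+"))
  toks <- toks[nzchar(toks)]
  c(tokens = length(toks), types = length(unique(toks)))
}

count_tokens(c("to be or not to be", "that is the question"))
# tokens = 10, types = 8
```

Applied to a corpus data frame, you would pass its text column (check `names()` for the actual column name, which varies by corpus).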
## Downloadable Corpora
The following corpora are available for download. Each needs to be downloaded only once per machine; once downloaded, they can be loaded with `load_corpus()` or `data()`.
### Text Corpora
| NAME | N VARS | N DOCS | TOKENS | TYPES | SIZE |
|---|---|---|---|---|---|
| corpus_senti_bench | 6 | 11557 | 308830 | 56492 | 2.8 Mb |
| corpus_disaster | 3 | 10860 | 161285 | 41853 | 2.5 Mb |
| corpus_enron | 7 | 30965 | 6353609 | 243605 | 39.3 Mb |
| corpus_nytimes_covid | 24 | 982 | 18974 | 5968 | 40.6 Mb |
| corpus_web_dubois | 5 | 12757 | 143081 | 13841 | 2.3 Mb |
| corpus_isot_fake_news | 5 | 44244 | 18196332 | 396170 | 99.8 Mb |
| corpus_dsj_vox | 8 | 22789 | 25410700 | 1358106 | 205.7 Mb |
| corpus_pitchfork | 13 | 20873 | 13921384 | 666134 | 91.7 Mb |
| corpus_atn | 12 | 204135 | 156294551 | 2507849 | 943.1 Mb |
| corpus_atn2 | 11 | 2688879 | 1344232395 | 3049120 | 8.6 Gb |
| corpus_finefoods | 9 | 50000 | 4119699 | 140842 | 29.4 Mb |
| corpus_reddit_aita | 18 | 32766 | 11056240 | 267134 | 73.9 Mb |
| corpus_black_mirror | 5 | 18972 | 113323 | 22025 | 2.2 Mb |
| corpus_scifi_pulp | 10 | 2110 | 160189078 | 3857877 | 740.7 Mb |
| corpus_moral_stories | 9 | 24000 | 1469341 | 33765 | 15.1 Mb |
### Tweet ID Lists (for rehydration)
| NAME | N IDS | SIZE |
|---|---|---|
| tweetids_covid | 1,922 | 11 Kb |
| tweetids_covid_geo | 1,999 | 12 Kb |
| tweetids_stayhome | 23,737 | 128 Kb |
| tweetids_gme | 15,594 | 82 Kb |
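These datasets contain only tweet IDs; the tweets themselves must be rehydrated through the Twitter/X API with an external tool such as twarc. A minimal sketch of exporting IDs for rehydration — assuming the loaded object is (or contains) a character vector of IDs, which is an assumption, not a documented guarantee:

```r
library(text2map.corpora)

# Download once, then load; structure of the object is assumed here
download_corpus("tweetids_gme")
ids <- load_corpus("tweetids_gme")

# Write one ID per line. Keeping IDs as character strings avoids
# precision loss (tweet IDs overflow R's doubles). The file can then
# be fed to, e.g.: twarc2 hydrate gme_ids.txt gme_tweets.jsonl
writeLines(as.character(unlist(ids)), "gme_ids.txt")
```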
## Helper Functions
The package provides several helper functions for managing corpora:
| Function | Description |
|---|---|
| `list_corpora()` | List all available corpora with metadata |
| `corpus_info()` | Get detailed info about a specific corpus |
| `corpus_exists()` | Check if a corpus is available (bundled or downloaded) |
| `corpus_path()` | Get the file path to a downloaded corpus |
| `delete_corpus()` | Remove a downloaded corpus from disk |
```r
# List all available corpora
list_corpora()

# List only bundled corpora
list_corpora(type = "bundled")

# List only downloaded corpora
list_corpora(downloaded_only = TRUE)

# List tweet ID datasets
list_corpora(category = "tweetids")

# Get info about a specific corpus
corpus_info("corpus_beyonce")

# Check if a corpus is available
corpus_exists("corpus_beyonce") # TRUE (bundled)
corpus_exists("corpus_enron")   # FALSE (not downloaded)

# Get path to a downloaded corpus
corpus_path("corpus_enron")

# Delete a downloaded corpus
delete_corpus("corpus_web_dubois")
```

## File Formats
Downloaded corpora are stored in multiple formats, tried in this priority order:

1. `.qs2`: fastest loading (~10x faster than `.rda`)
2. `.fst`: fast loading (~3x faster than `.rda`)
3. `.rda`: standard R format (fallback)
The `download_corpus()` function downloads the best available format from the repository, and `load_corpus()` automatically detects the format and uses the appropriate loader.
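If you need to bypass `load_corpus()` (say, to read a corpus from another process), the file can be read manually with the standard reader for each format. A minimal sketch, assuming `corpus_path()` returns the full path including the extension — that behavior, and the exact on-disk layout, are assumptions:

```r
path <- corpus_path("corpus_enron")

# Dispatch on the file extension; qs2::qs_read() and fst::read_fst()
# are the standard readers for those formats
corpus <- switch(tools::file_ext(path),
  qs2 = qs2::qs_read(path),
  fst = fst::read_fst(path),
  rda = {
    # load() restores objects by name, so pull the first one out of
    # a scratch environment
    e <- new.env()
    load(path, envir = e)
    get(ls(e)[1], envir = e)
  }
)
```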
## Related Packages

There are four related packages hosted on GitLab:

- `text2map`: text analysis functions
- `text2map.dictionaries`: norm dictionaries and word frequency lists
- `text2map.pretrained`: pretrained embeddings and topic models
- `text2map.theme`: changes `ggplot2` aesthetics and loads the viridis color scheme as the default
These packages can be installed as follows:

```r
install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")
```

## Contributions and Support
We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues
