Corpora for Text Analysis

This R package provides a collection of corpora for text analysis. Some corpora are included when you install the package (see the first table below); others must be downloaded first. This lets us keep adding new corpora without the initial package ballooning!

You can install the package using:

remotes::install_gitlab("culturalcartography/text2map.corpora")

Load a corpus installed with the package:

data("corpus_finefoods10k", package = "text2map.corpora")

Download an additional corpus and then load it:

text2map.corpora::download_corpus("corpus_web_dubois")
data("corpus_web_dubois", package = "text2map.corpora")

Available Corpora

The following corpora can be loaded into your R session using data(), provided text2map.corpora is also loaded. Note: tokens and types are counted without preprocessing, using spaces to delimit words.
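As a rough illustration of what "without preprocessing" means for the counts in the tables, the sketch below splits two toy documents on single spaces and counts tokens and types. The documents are made up for this example, and the exact procedure used to produce the table figures is an assumption here; this only shows the general idea in base R.

```r
# Toy documents standing in for a corpus text column (hypothetical data).
docs <- c("The cat sat on the mat.", "The dog sat.")

# Split on single spaces only: no lowercasing, no punctuation removal.
words <- unlist(strsplit(docs, " ", fixed = TRUE))

# Tokens: every space-delimited string counts.
n_tokens <- length(words)

# Types: distinct strings, so "The" and "the" differ, as do "sat" and "sat."
n_types <- length(unique(words))

n_tokens  # 9
n_types   # 8
```

Because case and punctuation are left untouched, counts obtained this way will generally be higher than counts taken after typical cleaning steps.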

Corpora Installed with Package
NAME	N_VARS	N_DOCS	TOKENS	TYPES	SIZE
corpus_senti_bench4k	6	4044	113066	26426	1 Mb
corpus_annual_review	7	70	9982	1770	56.2 Kb
corpus_atn_immigr	8	3230	4235162	216471	24.7 Mb
corpus_beyonce	10	83	38240	4465	213.4 Kb
corpus_cmu_blogs100	6	100	46808	11919	299.1 Kb
corpus_envsociology	8	817	126729	16492	1.1 Mb
corpus_europarl_subset	4	10000	261904	26792	2.4 Mb
corpus_finefoods10k	9	9999	827039	55006	6.8 Mb
corpus_isot_fake_news2k	5	2000	833437	67987	5.3 Mb
corpus_ittpr	7	976	455733	38173	3.3 Mb
corpus_presidential	13	2475	4930817	145616	27.8 Mb
corpus_reddit_aita10k	18	10157	3407207	122317	22.9 Mb
corpus_taylor_swift	10	120	44488	5033	263.2 Kb
corpus_tng_season5	5	10834	118671	15661	1.6 Mb
corpus_usnss	2	18	405556	23035	2.6 Mb

The following corpora are currently available for download. Each needs to be downloaded only once per machine (not once per session). Once downloaded, a corpus can be loaded using data() as long as text2map.corpora is loaded.

Corpora That Must Be Downloaded
NAME	N_VARS	N_DOCS	TOKENS	TYPES	SIZE
corpus_senti_bench	6	11557	308830	56492	2.8 Mb
corpus_disaster	3	10860	161285	41853	2.5 Mb
corpus_enron	7	30965	6353609	243605	39.3 Mb
corpus_nytimes_covid	24	982	18974	5968	40.6 Mb
corpus_web_dubois	5	12757	143081	13841	2.3 Mb
corpus_isot_fake_news	5	44244	18196332	396170	99.8 Mb
corpus_dsj_vox	8	22789	25410700	1358106	205.7 Mb
corpus_pitchfork	13	20873	13921384	666134	91.7 Mb
corpus_atn	12	204135	156294551	2507849	943.1 Mb
corpus_atn2	11	2688879	1344232395	3049120	8.6 Gb
corpus_finefoods	9	50000	4119699	140842	29.4 Mb
corpus_reddit_aita	18	32766	11056240	267134	73.9 Mb
corpus_black_mirror	5	18972	113323	22025	2.2 Mb
corpus_scifi_pulp	10	2110	160189078	3857877	740.7 Mb
corpus_moral_stories	9	24000	1469341	33765	15.1 Mb

Related Packages

There are four related packages hosted on GitLab:

text2map: text analysis functions
text2map.dictionaries: norm dictionaries and word frequency lists
text2map.pretrained: pretrained embeddings and topic models
text2map.theme: changes ggplot2 aesthetics and loads viridis color scheme as default

The above packages can be installed using the following:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.dictionaries")
install_gitlab("culturalcartography/text2map.pretrained")
install_gitlab("culturalcartography/text2map.theme")

Contributions and Support

We welcome new corpora. If you have a corpus you would like to make easily available to other researchers, send us an email (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.corpora/-/issues
