Pretrained Models for Text Analysis

This is an R package for downloading and loading pretrained text analysis models. Some models are quite large and must be downloaded separately before they can be loaded. See also text2map.

Please check out our book, Mapping Texts: Computational Text Analysis for the Social Sciences.

Installation

library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")

library(text2map.pretrained)

Usage

A few smaller topic models are included when the package is installed:

Structural Topic Models

MODEL               N DOCS
stm_envsoc          817
stm_fiction_cohort  1,000

These can be loaded directly with data():


data("stm_envsoc")

Word embedding models are much larger and must first be downloaded to your machine. After that, they can be loaded with data(). The names are informative, but also long, so it can be useful to assign the model to a new, shorter object and then remove the original.


## ~1 million fastText word vectors
# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")

# load the model each session
data("vecs_fasttext300_wiki_news")
dim(vecs_fasttext300_wiki_news)

# assign to new (shorter) object 
wv <- vecs_fasttext300_wiki_news
# then remove the original
rm(vecs_fasttext300_wiki_news)
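
Once a model is assigned to a short name such as wv, a common next step is comparing terms by cosine similarity. The following is a minimal base R sketch, not part of this package. It uses a tiny made-up matrix in place of a real model and assumes that the loaded models are numeric matrices with one row per term and the terms stored as row names. The function name cos_sim_to is hypothetical.

```r
# Toy stand-in for a loaded embedding matrix.
# Assumption: real models share this layout (rows = terms, columns = dimensions).
wv <- rbind(
  king  = c(0.9, 0.1, 0.4),
  queen = c(0.8, 0.2, 0.5),
  apple = c(0.1, 0.9, 0.0)
)

# Cosine similarity between one term and every term in the matrix
# (hypothetical helper, base R only)
cos_sim_to <- function(term, vecs) {
  stopifnot(term %in% rownames(vecs))
  target <- vecs[term, ]
  # dot product of each row with the target vector
  dots <- as.vector(vecs %*% target)
  # product of the row norms and the target norm
  norms <- sqrt(rowSums(vecs^2)) * sqrt(sum(target^2))
  setNames(dots / norms, rownames(vecs))
}

sims <- cos_sim_to("king", wv)
sort(sims, decreasing = TRUE)
```

With a real model, the same call would take the downloaded matrix in place of the toy wv, provided the layout assumption above holds.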

Below are the currently available word embedding models (please suggest others).

Word Embedding Models

MODEL                               LANGUAGE  N TERMS    N DIMS  METHOD
vecs_fasttext300_wiki_news          English   1,000,000  300     fastText
vecs_fasttext300_wiki_news_subword  English   1,000,000  300     fastText
vecs_fasttext300_commoncrawl        English   2,000,000  300     fastText
vecs_glove300_wiki_gigaword         English   400,000    300     GloVe
vecs_cbow300_googlenews             English   3,000,000  300     CBOW
vecs_sgns300_bnc_pos                English   163,473    300     SGNS
vecs_sgns300_googlengrams_kte_en    English   928,250    300     SGNS

Diachronic (Temporal) Word Embedding Models

MODEL                                    LANGUAGE  N TERMS  N DIMS  METHOD  YEARS
vecs_sgns300_coha_histwords              English   50,000   300     SGNS    1810-2000
vecs_sgns300_googlengrams_histwords      English   100,000  300     SGNS    1800-1990
vecs_sgns300_googlengrams_fic_histwords  English   100,000  300     SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_fr   French    100,000  300     SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_de   German    100,000  300     SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_zh   Chinese   29,701   300     SGNS    1950-1990
vecs_svd300_googlengrams_histwords       English   75,682   300     SVD     1800-1990
vecs_sgns200_british_news                English   78,879   200     SGNS    1800-1910
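
Because these models are large, it can help to estimate how much memory one will need before loading it. The sketch below uses the N TERMS and N DIMS values from the tables above and assumes each model is held in memory as a dense matrix of 8-byte doubles. That storage format is an assumption, and the estimate ignores row names and other overhead. The helper name estimate_gib is hypothetical.

```r
# Rough in-memory size of a dense numeric matrix, in GiB.
# Assumption: 8 bytes per value; row names and object overhead are ignored.
estimate_gib <- function(n_terms, n_dims, bytes_per_value = 8) {
  n_terms * n_dims * bytes_per_value / 1024^3
}

# N TERMS and N DIMS taken from the word embedding table
sizes <- c(
  vecs_glove300_wiki_gigaword = estimate_gib(400000, 300),
  vecs_fasttext300_wiki_news  = estimate_gib(1000000, 300),
  vecs_cbow300_googlenews     = estimate_gib(3000000, 300)
)
round(sizes, 2)

# Check the bytes-per-value assumption on a small dense matrix
small <- matrix(0, nrow = 1000, ncol = 300)
as.numeric(object.size(small)) / (1000 * 300)
```

Under these assumptions, a model with 1,000,000 terms and 300 dimensions comes to roughly 2.2 GiB, so the largest models may not fit comfortably on machines with limited memory.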

There are four related packages hosted on GitLab: text2map, text2map.theme, text2map.corpora, and text2map.dictionaries.

They can be installed with the following:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")

Contributions and Support

We welcome new models. If you have an embedding model or topic model you would like to be easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit pull requests.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues