# Pretrained Models for Text Analysis

This is an R package to download and load pretrained text analysis models. Some models are quite large and must be downloaded separately before they can be loaded. See also `text2map`.
Please check out our book, *Mapping Texts: Computational Text Analysis for the Social Sciences*.
## Installation
```r
library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")

library(text2map.pretrained)
```

## Usage
A few smaller topic models are included when the package is installed:
| MODEL | N DOCS |
|---|---|
| stm_envsoc | 817 |
| stm_fiction_cohort | 1,000 |
These can be loaded directly with `data()`:

```r
data("stm_envsoc")
```
Word embedding models are much larger and must first be downloaded to your machine. Then they can be loaded with `data()`. The names are informative, but also long! So, it can be useful to assign the model to a new (shorter) object and then remove the original:

```r
## ~1 million fastText word vectors
# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")
# load the model each session
data("vecs_fasttext300_wiki_news")
dim(vecs_fasttext300_wiki_news)
# assign to new (shorter) object
wv <- vecs_fasttext300_wiki_news
# then remove the original
rm(vecs_fasttext300_wiki_news)
```
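Once loaded, the object behaves like an ordinary term-by-dimension matrix (hence the `dim()` call above). As a hedged sketch, assuming terms are stored as the matrix rownames (check with `head(rownames(wv))`), you can compare two word vectors with cosine similarity:

```r
# cosine similarity between two row vectors of the embedding matrix
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(wv["king", ], wv["queen", ])
```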
Below are the currently available word embedding models (please suggest others).

| MODEL | Language | N TERMS | N DIMS | METHOD |
|---|---|---|---|---|
| vecs_fasttext300_wiki_news | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_wiki_news_subword | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_commoncrawl | English | 2,000,000 | 300 | fastText |
| vecs_glove300_wiki_gigaword | English | 400,000 | 300 | GloVe |
| vecs_cbow300_googlenews | English | 3,000,000 | 300 | CBOW |
| vecs_sgns300_bnc_pos | English | 163,473 | 300 | SGNS |
| vecs_sgns300_googlengrams_kte_en | English | 928,250 | 300 | SGNS |
The following historical (diachronic) embedding models are also available:

| MODEL | Language | N TERMS | N DIMS | METHOD | YEARS |
|---|---|---|---|---|---|
| vecs_sgns300_coha_histwords | English | 50,000 | 300 | SGNS | 1810-2000 |
| vecs_sgns300_googlengrams_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_fic_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_fr | French | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_de | German | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_zh | Chinese | 29,701 | 300 | SGNS | 1950-1990 |
| vecs_svd300_googlengrams_histwords | English | 75,682 | 300 | SVD | 1800-1990 |
| vecs_sgns200_british_news | English | 78,879 | 200 | SGNS | 1800-1910 |
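Historical models are retrieved the same way as the static models. The sketch below is hedged: the internal structure of a historical model (for example, one matrix per decade) may differ by model, so inspect the object after loading:

```r
# download once per machine, then load and inspect the structure
download_pretrained("vecs_sgns300_coha_histwords")
data("vecs_sgns300_coha_histwords")
str(vecs_sgns300_coha_histwords, max.level = 1)
```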
## Related Packages
There are four related packages hosted on GitLab:

- `text2map`: text analysis functions
- `text2map.corpora`: 13+ text datasets
- `text2map.dictionaries`: norm dictionaries and word frequency lists
- `text2map.theme`: changes `ggplot2` aesthetics and loads the viridis color scheme as default
The above packages can be installed using the following:

```r
install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")
```

## Contributions and Support
We welcome new models. If you have an embedding model or topic model you would like to make easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit a merge request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues
