# Pretrained Models for Text Analysis
This is an R package to download and load pretrained text analysis models. Some models are quite large and must be downloaded separately before use. See also text2map.
Please check out our book, *Mapping Texts: Computational Text Analysis for the Social Sciences*.
## Installation

```r
library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")

library(text2map.pretrained)
```
## Usage
A few smaller topic models are included when the package is installed:
MODEL | N DOCS |
---|---|
stm_envsoc | 817 |
stm_fiction_cohort | 1,000 |
These can be loaded directly with `data()`:

```r
data("stm_envsoc")
```
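To see every dataset a package bundles, you can query `data()` itself. This is a base-R idiom, not a package-specific function; the sketch below uses the built-in `datasets` package so it runs anywhere, but you would substitute `"text2map.pretrained"` once it is installed:

```r
# data(package = ...) lists the datasets a package ships; its "results"
# field is a matrix with one row per dataset. Shown here with base R's
# "datasets" package -- substitute "text2map.pretrained" once installed.
avail <- data(package = "datasets")
head(avail$results[, c("Item", "Title")])
```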
Word embedding models are much larger and must first be downloaded to your machine; after that, they can be loaded each session with `data()`. The names are informative but long, so it is often convenient to assign the model to a shorter object and remove the original:
```r
## ~1 million fastText word vectors

# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")

# load the model each session
data("vecs_fasttext300_wiki_news")
dim(vecs_fasttext300_wiki_news)

# assign to a new (shorter) object
wv <- vecs_fasttext300_wiki_news

# then remove the original
rm(vecs_fasttext300_wiki_news)
```
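Once loaded, an embedding model is an ordinary numeric matrix with terms as rownames. As a sketch of what you can then do (using a small random matrix in place of the real embeddings, and a hand-rolled cosine function rather than any package helper):

```r
# toy stand-in for a loaded embedding matrix: terms as rownames, 300 columns
set.seed(1)
wv <- matrix(rnorm(4 * 300), nrow = 4,
             dimnames = list(c("society", "culture", "text", "map"), NULL))

# cosine similarity between two vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cos_sim(wv["society", ], wv["culture", ])

# rank the whole (toy) vocabulary by similarity to one term
sims <- apply(wv, 1, cos_sim, b = wv["society", ])
sort(sims, decreasing = TRUE)
```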
Below are the currently available word embedding models (please suggest others).
MODEL | Language | N TERMS | N DIMS | METHOD |
---|---|---|---|---|
vecs_fasttext300_wiki_news | English | 1,000,000 | 300 | fastText |
vecs_fasttext300_wiki_news_subword | English | 1,000,000 | 300 | fastText |
vecs_fasttext300_commoncrawl | English | 2,000,000 | 300 | fastText |
vecs_glove300_wiki_gigaword | English | 400,000 | 300 | GloVe |
vecs_cbow300_googlenews | English | 3,000,000 | 300 | CBOW |
vecs_sgns300_bnc_pos | English | 163,473 | 300 | SGNS |
vecs_sgns300_googlengrams_kte_en | English | 928,250 | 300 | SGNS |
The following models are diachronic, with separate vectors trained on texts from the listed year ranges:

MODEL | Language | N TERMS | N DIMS | METHOD | YEARS |
---|---|---|---|---|---|
vecs_sgns300_coha_histwords | English | 50,000 | 300 | SGNS | 1810-2000 |
vecs_sgns300_googlengrams_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
vecs_sgns300_googlengrams_fic_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
vecs_sgns300_googlengrams_histwords_fr | French | 100,000 | 300 | SGNS | 1800-1990 |
vecs_sgns300_googlengrams_histwords_de | German | 100,000 | 300 | SGNS | 1800-1990 |
vecs_sgns300_googlengrams_histwords_zh | Chinese | 29,701 | 300 | SGNS | 1950-1990 |
vecs_svd300_googlengrams_histwords | English | 75,682 | 300 | SVD | 1800-1990 |
vecs_sgns200_british_news | English | 78,879 | 200 | SGNS | 1800-1910 |
## Related Packages
There are four related packages hosted on GitLab:
- `text2map`: text analysis functions
- `text2map.corpora`: 13+ text datasets
- `text2map.dictionaries`: norm dictionaries and word frequency lists
- `text2map.theme`: changes `ggplot2` aesthetics and loads the viridis color scheme as default
The above packages can be installed as follows:

```r
install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")
```
## Contributions and Support
We welcome new models. If you have an embedding model or topic model that you would like to make easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit a merge request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues