Pretrained Models for Text Analysis
This R package downloads and loads pretrained text analysis models. Some models are quite large and must be downloaded separately before use. See also text2map.
Please check out our book Mapping Texts: Computational Text Analysis for the Social Sciences
Installation
library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")
library(text2map.pretrained)
Usage
A few smaller topic models are included when the package is installed:
| MODEL | N Docs |
|---|---|
| stm_envsoc | 817 |
| stm_fiction_cohort | 1,000 |
These can be loaded directly with data():
data("stm_envsoc")
Word embedding models are much larger and must first be downloaded to your machine. They can then be loaded with load_pretrained().
## ~1 million fastText word vectors
# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")
# load the model each session
wv <- load_pretrained("vecs_fasttext300_wiki_news")
dim(wv)
Below are the currently available word embedding models (please suggest others).
| MODEL | Language | N TERMS | N DIMS | METHOD |
|---|---|---|---|---|
| vecs_fasttext300_wiki_news | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_wiki_news_subword | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_commoncrawl | English | 2,000,000 | 300 | fastText |
| vecs_glove300_wiki_gigaword | English | 400,000 | 300 | GloVe |
| vecs_cbow300_googlenews | English | 3,000,000 | 300 | CBOW |
| vecs_sgns300_bnc_pos | English | 163,473 | 300 | SGNS |
| vecs_sgns300_googlengrams_kte_en | English | 928,250 | 300 | SGNS |
The following models also list the range of years they cover:
| MODEL | Language | N TERMS | N DIMS | METHOD | YEARS |
|---|---|---|---|---|---|
| vecs_sgns300_coha_histwords | English | 50,000 | 300 | SGNS | 1810-2000 |
| vecs_sgns300_googlengrams_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_fic_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_fr | French | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_de | German | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_zh | Chinese | 29,701 | 300 | SGNS | 1950-1990 |
| vecs_svd300_googlengrams_histwords | English | 75,682 | 300 | SVD | 1800-1990 |
| vecs_sgns200_british_news | English | 78,879 | 200 | SGNS | 1800-1910 |
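Once a model is loaded, its vectors can be compared directly. The sketch below uses a small toy matrix in place of a real model, and it assumes that load_pretrained() returns a numeric matrix with one row per term and the terms stored as row names (the dim(wv) call above suggests a matrix, but the row-name layout is an assumption). The cos_sim() helper is hypothetical and not part of this package.

```r
# Toy stand-in for a loaded embedding matrix.
# Assumption: rows are terms (as row names), columns are dimensions.
# Real models in this package have 200 or 300 columns.
wv_toy <- matrix(
  c(1, 0, 1.0,
    1, 0, 0.9,
    0, 1, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("king", "queen", "apple"), NULL)
)

# Hypothetical helper: cosine similarity between the vectors of two terms
cos_sim <- function(m, a, b) {
  x <- m[a, ]
  y <- m[b, ]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cos_sim(wv_toy, "king", "queen")  # close to 1: nearly parallel vectors
cos_sim(wv_toy, "king", "apple")  # 0: orthogonal vectors
```

If the assumption about row names holds, the same helper should work on a real model by passing wv instead of wv_toy.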
File Formats
Models are stored in multiple formats, loaded in this priority order:
- .qs2: Fastest loading (~10x faster than .rda)
- .fst: Fast loading (~3x faster than .rda)
- .rda: Standard R format (fallback)
The download_pretrained() function downloads the best available format from the repository. The load_pretrained() function automatically detects and uses the appropriate loader.
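The priority order can be pictured as a first-match search over file extensions. The following is only a sketch of that idea, not the package's actual implementation: pick_model_file() is a hypothetical helper, and the naming scheme (model name plus extension in a single directory) is an assumption made for illustration.

```r
# Hypothetical helper, not part of text2map.pretrained: return the path of the
# highest-priority file that exists for a model, or NA if none is found.
pick_model_file <- function(model, dir, exts = c("qs2", "fst", "rda")) {
  candidates <- file.path(dir, paste0(model, ".", exts))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

# Demonstration with empty placeholder files in a temporary directory
demo_dir <- file.path(tempdir(), "pretrained_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("toy_model.rda", "toy_model.fst")))

# Both .fst and .rda exist, so the .fst file is chosen
basename(pick_model_file("toy_model", demo_dir))
```

Because the candidates are built in priority order, taking the first existing file is enough; no explicit ranking step is needed.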
Helper Functions
The package provides several helper functions for managing models:
| Function | Description |
|---|---|
| list_models() | List all available models with metadata |
| model_info() | Get detailed info about a specific model |
| model_exists() | Check if a model is downloaded |
| model_path() | Get file path to downloaded model |
| delete_model() | Remove a downloaded model |
# List all available models
list_models()
# List only downloaded models
list_models(downloaded_only = TRUE)
# Get info about a specific model
model_info("vecs_sgns300_bnc_pos")
# Check if model is downloaded
model_exists("vecs_sgns300_bnc_pos")
# Get path to downloaded model
model_path("vecs_sgns300_bnc_pos")
# Delete a downloaded model
delete_model("vecs_sgns300_bnc_pos")
Related Packages
There are four related packages hosted on GitLab:
- text2map: text analysis functions
- text2map.corpora: 13+ text datasets
- text2map.dictionaries: norm dictionaries and word frequency lists
- text2map.theme: changes ggplot2 aesthetics and loads the viridis color scheme as default
The above packages can be installed using the following:
install.packages("text2map")
library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")
Contributions and Support
We welcome new models. If you have an embedding model or topic model that you would like to make easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues
