Pretrained Models for Text Analysis
This is an R package to load and download pretrained text analysis models. Some models are bundled with the package; others are larger and must be separately downloaded. See also text2map.
Please check out our book, Mapping Texts: Computational Text Analysis for the Social Sciences.
Installation
library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")
library(text2map.pretrained)
Usage
Bundled Models
A few smaller topic models are included when the package is installed. Use load_pretrained() to load them:
stm <- load_pretrained("stm_envsoc")
stm <- load_pretrained("stm_fiction_cohort")
Downloaded Models
Word embedding models are larger and must be downloaded once per machine, then loaded with load_pretrained().
# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")
# load the model each session
wv <- load_pretrained("vecs_fasttext300_wiki_news")
dim(wv)
Available Models
Structural Topic Models (Bundled)
| MODEL | N Docs | Description |
|---|---|---|
| stm_envsoc | 817 | Environmental sociology abstracts |
| stm_fiction_cohort | 1,000 | Fiction-author cohort study |
Static Word Embedding Models (Download)
| MODEL | Language | N Terms | Dims | Method |
|---|---|---|---|---|
| vecs_fasttext300_wiki_news | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_wiki_news_subword | English | 1,000,000 | 300 | fastText |
| vecs_fasttext300_commoncrawl | English | 2,000,000 | 300 | fastText |
| vecs_glove300_wiki_gigaword | English | 400,000 | 300 | GloVe |
| vecs_cbow300_googlenews | English | 3,000,000 | 300 | CBOW |
| vecs_sgns300_bnc_pos | English | 163,473 | 300 | SGNS |
| vecs_sgns300_googlengrams_kte_en | English | 928,250 | 300 | SGNS |
| vecs_glove300_metal_lyrics | English | 52,885 | 300 | GloVe |
| vecs_svd20_metal_type | English | 54,187 | 20 | SVD |
| vecs_svd20_metal_position | English | 74 | 20 | SVD |
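Once loaded, a static embedding model can be queried with ordinary matrix operations. The sketch below assumes the loaded object is a numeric matrix with one row per term and the terms as row names; the README only shows that `dim()` works on it, so check the structure before relying on this. To keep the example self-contained, it uses a small made-up matrix (`wv_toy`) and a hypothetical helper (`cos_sim_to`) in place of a downloaded model.

```r
# Toy stand-in for a loaded embedding matrix: rows are terms, columns are
# dimensions. The values are invented for illustration only.
wv_toy <- rbind(
  king  = c(0.9, 0.1, 0.3),
  queen = c(0.8, 0.2, 0.4),
  apple = c(0.1, 0.9, 0.2)
)

# Hypothetical helper: cosine similarity between one term and every row
cos_sim_to <- function(mat, term) {
  target <- mat[term, ]
  # dot product of each row with the target, divided by the vector norms
  drop(mat %*% target) / (sqrt(rowSums(mat^2)) * sqrt(sum(target^2)))
}

sims <- cos_sim_to(wv_toy, "king")
sort(sims, decreasing = TRUE)
```

With a real model, the same function could be called as `cos_sim_to(wv, "king")`, provided the term is present in the row names.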
Diachronic (Temporal) Word Embedding Models (Download)
| MODEL | Language | N Terms | Dims | Method | Years |
|---|---|---|---|---|---|
| vecs_sgns300_coha_histwords | English | 50,000 | 300 | SGNS | 1810-2000 |
| vecs_sgns300_googlengrams_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_fic_histwords | English | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_fr | French | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_de | German | 100,000 | 300 | SGNS | 1800-1990 |
| vecs_sgns300_googlengrams_histwords_zh | Chinese | 29,701 | 300 | SGNS | 1950-1990 |
| vecs_svd300_googlengrams_histwords | English | 75,682 | 300 | SVD | 1800-1990 |
| vecs_sgns200_british_news | English | 78,879 | 200 | SGNS | 1800-1910 |
File Formats
Models are stored in multiple formats, loaded in this priority order:
- .qs2: Fastest loading (~10x faster than .rda)
- .fst: Fast loading (~3x faster than .rda)
- .rda: Standard R format (fallback)
The download_pretrained() function downloads the best available format from the repository. The load_pretrained() function handles both bundled models (via data()) and downloaded models (auto-detecting format).
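The priority order can be pictured as a simple lookup that returns the first format found on disk. The following is a minimal sketch of that idea, not the package's actual implementation: `find_model_file()` is a hypothetical helper, and the `<model>.<ext>` file naming and the directory layout are assumptions made for illustration.

```r
# Hypothetical helper, not part of text2map.pretrained: return the path of the
# highest-priority file found for a model, or NA if no file exists.
find_model_file <- function(model, dir, exts = c("qs2", "fst", "rda")) {
  # candidate paths are built in priority order
  candidates <- file.path(dir, paste0(model, ".", exts))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

# Usage with a throwaway directory and empty placeholder files
demo_dir <- file.path(tempdir(), "model_cache_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("toy_model.rda", "toy_model.fst")))

# Both .fst and .rda exist here, so the .fst file is preferred
basename(find_model_file("toy_model", demo_dir))
```

Because the candidates are checked in a fixed order, adding a faster format later only requires placing it earlier in `exts`.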
Helper Functions
The package provides several helper functions for managing models:
| Function | Description |
|---|---|
| list_models() | List all available models with metadata |
| model_info() | Get detailed info about a specific model |
| model_exists() | Check if a model is available (bundled or downloaded) |
| model_path() | Get file path to downloaded model (NA for bundled) |
| download_pretrained() | Download an on-demand model |
| delete_model() | Remove a downloaded model |
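These helpers can be combined so that a script downloads a model only when it is missing. The sketch below is one possible pattern, not something the package provides: `load_or_download()` is a hypothetical wrapper, and it assumes `model_exists()` returns a single TRUE or FALSE, which the table above implies but does not state. The three function arguments default to the package functions and are only evaluated when used, so the wrapper can also be tried with stand-in functions.

```r
# Hypothetical wrapper: download a model only if it is not yet available,
# then load it. Assumes model_exists() returns a single logical value.
load_or_download <- function(name,
                             exists_fn = text2map.pretrained::model_exists,
                             download_fn = text2map.pretrained::download_pretrained,
                             load_fn = text2map.pretrained::load_pretrained) {
  if (!isTRUE(exists_fn(name))) {
    download_fn(name)
  }
  load_fn(name)
}
```

With the package installed, `wv <- load_or_download("vecs_sgns300_bnc_pos")` would skip the download on later runs if the assumption about `model_exists()` holds.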
# List all available models
list_models()
# List only available (bundled + downloaded) models
list_models(downloaded_only = TRUE)
# Get info about a specific model
model_info("stm_fiction_cohort")
# Check if model is available
model_exists("vecs_sgns300_bnc_pos")
# Get path to downloaded model
model_path("vecs_sgns300_bnc_pos")
# Delete a downloaded model (cannot delete bundled models)
delete_model("vecs_sgns300_bnc_pos")
Related Packages
There are several related packages hosted on GitLab:
- text2map: text analysis functions
- text2map.corpora: text datasets
- text2map.dictionaries: norm dictionaries and word frequency lists
- text2map.theme: ggplot2 themes and color palettes
install.packages("text2map")
library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")
Contributions and Support
We welcome new models. If you have an embedding model or topic model that you would like to make easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit a pull request.
Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues
