Pretrained Models for Text Analysis

This is an R package to load and download pretrained text analysis models. Some models are bundled with the package; others are larger and must be separately downloaded. See also text2map.

Please check out our book, Mapping Texts: Computational Text Analysis for the Social Sciences.

Installation

library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")

library(text2map.pretrained)

Usage

Bundled Models

A few smaller topic models are included when the package is installed. Use load_pretrained() to load them:

# use distinct names so the second model does not overwrite the first
stm_envsoc <- load_pretrained("stm_envsoc")
stm_fiction <- load_pretrained("stm_fiction_cohort")

Downloaded Models

Word embedding models are larger and must be downloaded once per machine, then loaded with load_pretrained().

# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")

# load the model each session
wv <- load_pretrained("vecs_fasttext300_wiki_news")
dim(wv)
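
Once an embedding model is loaded, a common next step is comparing terms by cosine similarity. The sketch below assumes the loaded object behaves like a numeric matrix with one row per term and terms as row names, which `dim(wv)` suggests but the text above does not state. It uses a small made-up matrix (`wv_toy`) and a hypothetical helper (`cos_sim`) so that it runs without downloading anything.

```r
# Toy stand-in for a loaded embedding matrix (rows = terms, columns = dimensions).
# The values are invented for illustration only.
wv_toy <- rbind(
  king  = c(0.9, 0.1, 0.3),
  queen = c(0.8, 0.2, 0.4),
  apple = c(0.1, 0.9, 0.0)
)

# Hypothetical helper: cosine similarity between two rows looked up by term
cos_sim <- function(mat, a, b) {
  x <- mat[a, ]
  y <- mat[b, ]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cos_sim(wv_toy, "king", "queen")
cos_sim(wv_toy, "king", "apple")
```

If the assumption holds, the same helper could be called on `wv` with any two terms that appear in `rownames(wv)`.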

Available Models

Structural Topic Models (Bundled)

MODEL               N Docs  Description
stm_envsoc          817     Environmental sociology abstracts
stm_fiction_cohort  1,000   Fiction-author cohort study

Static Word Embedding Models (Download)

MODEL                               Language  N Terms    Dims  Method
vecs_fasttext300_wiki_news          English   1,000,000  300   fastText
vecs_fasttext300_wiki_news_subword  English   1,000,000  300   fastText
vecs_fasttext300_commoncrawl        English   2,000,000  300   fastText
vecs_glove300_wiki_gigaword         English   400,000    300   GloVe
vecs_cbow300_googlenews             English   3,000,000  300   CBOW
vecs_sgns300_bnc_pos                English   163,473    300   SGNS
vecs_sgns300_googlengrams_kte_en    English   928,250    300   SGNS
vecs_glove300_metal_lyrics          English   52,885     300   GloVe
vecs_svd20_metal_type               English   54,187     20    SVD
vecs_svd20_metal_position           English   74         20    SVD

Diachronic (Temporal) Word Embedding Models (Download)

MODEL                                    Language  N Terms  Dims  Method  Years
vecs_sgns300_coha_histwords              English   50,000   300   SGNS    1810-2000
vecs_sgns300_googlengrams_histwords      English   100,000  300   SGNS    1800-1990
vecs_sgns300_googlengrams_fic_histwords  English   100,000  300   SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_fr   French    100,000  300   SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_de   German    100,000  300   SGNS    1800-1990
vecs_sgns300_googlengrams_histwords_zh   Chinese   29,701   300   SGNS    1950-1990
vecs_svd300_googlengrams_histwords       English   75,682   300   SVD     1800-1990
vecs_sgns200_british_news                English   78,879   200   SGNS    1800-1910

File Formats

Models are stored in multiple formats, loaded in this priority order:

  1. .qs2 — Fastest loading (~10x faster than .rda)
  2. .fst — Fast loading (~3x faster than .rda)
  3. .rda — Standard R format (fallback)

The download_pretrained() function downloads the best available format from the repository. The load_pretrained() function handles both bundled models (via data()) and downloaded models (auto-detecting format).
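
The priority order above can be sketched in a few lines of base R. This is an illustration of the selection logic only, not the package's implementation: the helper name `pick_model_file`, the `<model>.<extension>` file naming, and the directory layout are all assumptions made for the example.

```r
# Hypothetical helper: return the highest-priority file found for a model.
# Assumes files are named "<model>.<ext>"; the package may store them differently.
pick_model_file <- function(name, dir, priority = c("qs2", "fst", "rda")) {
  candidates <- file.path(dir, paste0(name, ".", priority))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    return(NA_character_)  # nothing stored for this model
  }
  found[[1]]               # candidates are already in priority order
}

# Demo with empty placeholder files in a temporary directory
demo_dir <- tempfile("models_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("toy_model.rda", "toy_model.fst"))))

picked <- pick_model_file("toy_model", demo_dir)
basename(picked)  # "toy_model.fst": .fst outranks .rda when no .qs2 is present
```

Keeping the priority vector as an argument makes the fallback order explicit and easy to change in one place.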

Helper Functions

The package provides several helper functions for managing models:

Function               Description
list_models()          List all available models with metadata
model_info()           Get detailed info about a specific model
model_exists()         Check if a model is available (bundled or downloaded)
model_path()           Get file path to downloaded model (NA for bundled)
download_pretrained()  Download an on-demand model
delete_model()         Remove a downloaded model

# List all available models
list_models()

# List only models available locally (bundled + downloaded)
list_models(downloaded_only = TRUE)

# Get info about a specific model
model_info("stm_fiction_cohort")

# Check if model is available
model_exists("vecs_sgns300_bnc_pos")

# Get path to downloaded model
model_path("vecs_sgns300_bnc_pos")

# Delete a downloaded model (cannot delete bundled models)
delete_model("vecs_sgns300_bnc_pos")
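
The helpers above can be combined into a download-if-missing pattern, so a script works on a fresh machine without downloading again on later runs. The wrapper below is a sketch under assumptions: `load_or_download` is a hypothetical name, and it assumes `model_exists()` returns a single `TRUE` or `FALSE`. The package functions are passed in as default arguments, so the demo can substitute stand-in functions and run without the package or a network connection.

```r
# Hypothetical wrapper: download a model only if it is not yet available, then load it.
# The defaults point at the package functions; R evaluates them lazily, so they are
# only looked up when no replacement is supplied.
load_or_download <- function(name,
                             exists_fn   = text2map.pretrained::model_exists,
                             download_fn = text2map.pretrained::download_pretrained,
                             load_fn     = text2map.pretrained::load_pretrained) {
  if (!isTRUE(exists_fn(name))) {
    download_fn(name)
  }
  load_fn(name)
}

# Demo with stand-in functions that record what was called
steps <- character(0)
fake_model <- load_or_download(
  "toy_model",
  exists_fn   = function(n) FALSE,
  download_fn = function(n) { steps <<- c(steps, "download"); invisible(TRUE) },
  load_fn     = function(n) { steps <<- c(steps, "load"); matrix(0, 2, 3) }
)
steps

# With the package installed, the intended call would be:
# wv <- load_or_download("vecs_sgns300_bnc_pos")
```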

There are several related packages. text2map installs from CRAN; the others are hosted on GitLab:

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")

Contributions and Support

We welcome new models. If you have an embedding model or topic model that you would like to make easily available to other researchers in R, send us an email (maintainers [at] textmapping.com) or submit a merge request on GitLab.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues