Pretrained Models for Text Analysis

This R package downloads and loads pretrained text analysis models. Smaller models are bundled with the package; larger ones must be downloaded separately. See also text2map.

Please check out our book, Mapping Texts: Computational Text Analysis for the Social Sciences.

Installation

library(remotes)
install_gitlab("culturalcartography/text2map.pretrained")

library(text2map.pretrained)

Usage

Bundled Models

A few smaller topic models are included when the package is installed. Use load_pretrained() to load them:

# assign each model to its own object so the second does not overwrite the first
stm_envsoc <- load_pretrained("stm_envsoc")
stm_fiction <- load_pretrained("stm_fiction_cohort")

Downloaded Models

Word embedding models are larger and must be downloaded once per machine, then loaded with load_pretrained().

# download the model once per machine
download_pretrained("vecs_fasttext300_wiki_news")

# load the model each session
wv <- load_pretrained("vecs_fasttext300_wiki_news")
dim(wv)
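
Once loaded, the embeddings can be queried with ordinary matrix operations. The sketch below assumes the loaded object is a numeric matrix with one row per term and the terms as row names. That is consistent with calling dim() on it, but it is not confirmed here. A small toy matrix stands in for the downloaded model, and cos_sim() is a hypothetical helper, not a package function.

```r
# Toy stand-in for a loaded embedding matrix: rows are terms, columns are dimensions.
# Assumption: load_pretrained() returns a matrix shaped like this, with terms as row names.
wv_toy <- matrix(
  c(1, 0, 1,
    2, 0, 2,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "kitten", "car"), NULL)
)

# Hypothetical helper: cosine similarity between the vectors of two terms
cos_sim <- function(wv, a, b) {
  x <- wv[a, ]
  y <- wv[b, ]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim_close <- cos_sim(wv_toy, "cat", "kitten")  # parallel toy vectors
sim_far   <- cos_sim(wv_toy, "cat", "car")     # orthogonal toy vectors
```

With a real model, the same helper could be called as cos_sim(wv, "term1", "term2"), provided both terms appear in rownames(wv).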

Available Models

Structural Topic Models (Bundled)

MODEL	N Docs	Description
stm_envsoc	817	Environmental sociology abstracts
stm_fiction_cohort	1,000	Fiction-author cohort study

Static Word Embedding Models (Download)

MODEL	Language	N Terms	Dims	Method
vecs_fasttext300_wiki_news	English	1,000,000	300	fastText
vecs_fasttext300_wiki_news_subword	English	1,000,000	300	fastText
vecs_fasttext300_commoncrawl	English	2,000,000	300	fastText
vecs_glove300_wiki_gigaword	English	400,000	300	GloVe
vecs_cbow300_googlenews	English	3,000,000	300	CBOW
vecs_sgns300_bnc_pos	English	163,473	300	SGNS
vecs_sgns300_googlengrams_kte_en	English	928,250	300	SGNS
vecs_glove300_metal_lyrics	English	52,885	300	GloVe
vecs_svd20_metal_type	English	54,187	20	SVD
vecs_svd20_metal_position	English	74	20	SVD

Diachronic (Temporal) Word Embedding Models (Download)

MODEL	Language	N Terms	Dims	Method	Years
vecs_sgns300_coha_histwords	English	50,000	300	SGNS	1810-2000
vecs_sgns300_googlengrams_histwords	English	100,000	300	SGNS	1800-1990
vecs_sgns300_googlengrams_fic_histwords	English	100,000	300	SGNS	1800-1990
vecs_sgns300_googlengrams_histwords_fr	French	100,000	300	SGNS	1800-1990
vecs_sgns300_googlengrams_histwords_de	German	100,000	300	SGNS	1800-1990
vecs_sgns300_googlengrams_histwords_zh	Chinese	29,701	300	SGNS	1950-1990
vecs_svd300_googlengrams_histwords	English	75,682	300	SVD	1800-1990
vecs_sgns200_british_news	English	78,879	200	SGNS	1800-1910

File Formats

Models are stored in multiple formats, loaded in this priority order:

.qs2 — Fastest loading (~10x faster than .rda)
.fst — Fast loading (~3x faster than .rda)
.rda — Standard R format (fallback)

The download_pretrained() function downloads the best available format from the repository. The load_pretrained() function handles both bundled models (via data()) and downloaded models (auto-detecting format).
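
The priority order above can be illustrated with a short sketch. This is not the package's implementation: pick_model_file() is a hypothetical helper, and the layout it assumes (one directory, files named after the model plus an extension) is an assumption made only for illustration.

```r
# Hypothetical helper (not part of the package): return the path of the first
# file found for `model` in `dir`, checking formats in priority order.
# Assumption: files are named <model>.<format> and live in a single directory.
pick_model_file <- function(model, dir, formats = c("qs2", "fst", "rda")) {
  candidates <- file.path(dir, paste0(model, ".", formats))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

# Demo in a temporary directory using empty placeholder files
demo_dir <- file.path(tempdir(), "pretrained_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("toy_model.rda", "toy_model.fst"))))

chosen    <- pick_model_file("toy_model", demo_dir)    # .fst outranks .rda
not_found <- pick_model_file("other_model", demo_dir)  # NA when no file exists
```

Because no .qs2 file exists in the demo directory, the helper falls through to .fst before considering .rda.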

Helper Functions

The package provides several helper functions for managing models:

Function	Description
`list_models()`	List all available models with metadata
`model_info()`	Get detailed info about a specific model
`model_exists()`	Check if a model is available (bundled or downloaded)
`model_path()`	Get file path to downloaded model (NA for bundled)
`download_pretrained()`	Download an on-demand model
`delete_model()`	Remove a downloaded model

# List all available models
list_models()

# List only available (bundled + downloaded) models
list_models(downloaded_only = TRUE)

# Get info about a specific model
model_info("stm_fiction_cohort")

# Check if model is available
model_exists("vecs_sgns300_bnc_pos")

# Get path to downloaded model
model_path("vecs_sgns300_bnc_pos")

# Delete a downloaded model (cannot delete bundled models)
delete_model("vecs_sgns300_bnc_pos")
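
The helpers above can be combined so that a script downloads a model only when it is missing. The wrapper below is a sketch, not a package function: load_or_download() is a hypothetical name, and it assumes model_exists() returns a single TRUE or FALSE. The package functions are supplied as default arguments, which R evaluates only if they are used, so stand-in functions can be passed instead when testing.

```r
# Hypothetical wrapper (not a package function): download a model only when it
# is not already available, then load it.
# Assumption: model_exists() returns a single TRUE/FALSE value.
load_or_download <- function(model,
                             exists_fn = text2map.pretrained::model_exists,
                             download_fn = text2map.pretrained::download_pretrained,
                             load_fn = text2map.pretrained::load_pretrained) {
  if (!isTRUE(exists_fn(model))) {
    download_fn(model)
  }
  load_fn(model)
}

# With the package installed:
# wv <- load_or_download("vecs_sgns300_bnc_pos")
```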

There are several related packages. text2map is installed from CRAN; the others are hosted on GitLab:

text2map: text analysis functions
text2map.corpora: text datasets
text2map.dictionaries: norm dictionaries and word frequency lists
text2map.theme: ggplot2 themes and color palettes

install.packages("text2map")

library(remotes)
install_gitlab("culturalcartography/text2map.theme")
install_gitlab("culturalcartography/text2map.corpora")
install_gitlab("culturalcartography/text2map.dictionaries")

Contributions and Support

We welcome new models. If you have an embedding model or topic model that you would like to make easily available to other researchers in R, email us (maintainers [at] textmapping.com) or submit a pull request.

Please report any issues or bugs here: https://gitlab.com/culturalcartography/text2map.pretrained/-/issues
