75k diachronic English-language SVD word embeddings, 20 decades, Google Books corpus — vecs_svd300_googlengrams

About 75 thousand SVD word embeddings from the HistWords project, trained on the Google Books N-Gram corpus (all English) split by decade. The object is a list of 20 elements, one embedding matrix per decade from the 1800s through the 1990s. Each matrix has about 75 thousand rows (one vector per word) and 300 columns (dimensions). Every matrix shares the same vocabulary; a word that does not occur in a given decade still has a row in that decade's matrix, but the row contains only zeros.
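Because absent words are stored as all-zero rows, it is usually worth checking which words are attested in a decade before using its vectors. The following is a minimal sketch on a toy stand-in (3 "decades", 3 words, 2 dimensions) so it runs without the download. The list names (`d1800`, ...), the toy values, and the presence of row names holding the words are assumptions made for illustration, not properties confirmed for the real object.

```r
# Toy stand-in for the real list: each element is one decade's matrix.
# Names, values, and row names are hypothetical, chosen for illustration.
vocab <- c("cell", "radio", "wagon")
toy <- list(
  d1800 = rbind(c(0.2, 0.1), c(0.0, 0.0), c(0.5, 0.3)),  # "radio" absent
  d1900 = rbind(c(0.3, 0.2), c(0.4, 0.1), c(0.2, 0.6)),
  d1990 = rbind(c(0.1, 0.7), c(0.6, 0.2), c(0.0, 0.0))   # "wagon" absent
)
toy <- lapply(toy, function(m) { rownames(m) <- vocab; m })

# A word is present in a decade if its row has at least one non-zero entry.
# The result is a words-by-decades logical matrix.
present <- sapply(toy, function(m) rowSums(m != 0) > 0)

# Words attested in the first decade
attested_first <- rownames(toy[[1]])[present[, 1]]

# Number of attested words per decade
n_attested <- colSums(present)
```

With the real data, the same `sapply()` call could be applied to the loaded list in place of `toy`, assuming its matrices support `!=` and `rowSums()` in the usual way.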

Format

A list of 20 matrices, one per decade, each with 300 columns

Source

https://nlp.stanford.edu/projects/histwords/

References

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Examples


if (FALSE) { # \dontrun{


## download the model (once per machine)
download_pretrained("vecs_svd300_googlengrams_histwords")

## load the model each session
wv <- load_pretrained("vecs_svd300_googlengrams_histwords")

## check dims
length(wv) == 20L
all(dim(wv[[1]]) == c(75682, 300))

} # }
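A common use of decade-level embeddings is to track how similar two words are within each decade. The sketch below computes cosine similarity per decade and returns NA wherever either word is an all-zero (absent) row, since the cosine is undefined there. It uses a toy list rather than the downloaded model; `pair_sim`, the list names, the toy values, and the assumption that rows can be indexed by word are all hypothetical and for illustration only.

```r
# Toy stand-in: 2 "decades", 3 words, 2 dimensions (values are made up).
vocab <- c("cell", "phone", "prison")
toy <- list(
  d1800 = rbind(c(1, 0), c(0, 0), c(1, 0)),  # "phone" absent in this decade
  d1990 = rbind(c(1, 1), c(0, 1), c(1, 0))
)
toy <- lapply(toy, function(m) { rownames(m) <- vocab; m })

# Hypothetical helper: cosine similarity of two words within one matrix.
# Returns NA when either word is stored as an all-zero row.
pair_sim <- function(m, w1, w2) {
  a <- m[w1, ]
  b <- m[w2, ]
  if (all(a == 0) || all(b == 0)) return(NA_real_)
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# One similarity value per decade
sims <- sapply(toy, pair_sim, w1 = "cell", w2 = "phone")
```

Comparing words within a decade, as here, does not rely on the vector spaces of different decades being aligned with one another.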