Skip-gram with negative sampling (SGNS) embeddings for 100,000 words from the HistWords project, trained on the Google Books N-Gram corpus (all English) divided into decades. The object is a list of 20 elements, each an embedding matrix for one decade from 1800 to 1990. Each matrix has 100,000 rows (word vectors) and 300 columns (dimensions). Every matrix shares the same vocabulary; a word that does not appear in a given decade is represented by a row containing only zeros.

Format

A list of 20 matrices

Source

https://nlp.stanford.edu/projects/histwords/

References

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Examples


if (FALSE) {


## download the model (once per machine)
download_pretrained("vecs_sgns300_googlengrams_histwords")

## load the model each session
data("vecs_sgns300_googlengrams_histwords")

## check dims: 20 decades, each a 100000 x 300 matrix
length(vecs_sgns300_googlengrams_histwords) == 20L
all(dim(vecs_sgns300_googlengrams_histwords[[1]]) == c(100000, 300))

}
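
Because words absent from a decade are stored as all-zero rows, it is usually worth filtering them out before computing similarities for that decade. The following is a minimal sketch of one way to do that, run on a small toy list rather than the real object so that it works without the download. The toy data, the row names, and the decade names on the list are assumptions made for illustration; the real object may be labelled differently.

```r
## Toy stand-in for the real object: 3 "decades", 5 words, 4 dimensions.
## (Hypothetical data; the real list has 20 matrices of 100000 x 300.)
set.seed(1)
vocab <- c("the", "car", "radio", "horse", "atom")
toy <- lapply(1:3, function(i) {
  matrix(rnorm(5 * 4), nrow = 5, dimnames = list(vocab, NULL))
})
names(toy) <- c("1800", "1810", "1820")

## Simulate words missing from a decade: their rows are all zeros.
toy[["1800"]][c("car", "radio"), ] <- 0
toy[["1810"]]["radio", ] <- 0

## A word counts as present in a decade if its row has any non-zero value.
## Result: a logical matrix, words in rows and decades in columns.
present <- sapply(toy, function(m) rowSums(m != 0) > 0)

## Keep only the words that are present in a single decade.
m1800 <- toy[["1800"]][present[, "1800"], , drop = FALSE]
```

The same `rowSums(m != 0) > 0` check can be applied to any element of `vecs_sgns300_googlengrams_histwords` once it is loaded; `colSums(present)` then gives the number of attested words per decade.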