100k diachronic English-language SGNS word embeddings, 20 decades, Google Books Fiction corpus — vecs_sgns300_googlengrams_fic

100,000 SGNS (skip-gram with negative sampling) word embeddings from the HistWords project, trained on the Google Books N-Gram Fiction corpus (English) and split by decade. The object is a list of 20 elements, each an embedding matrix for one decade, from the 1800s through the 1990s. Each matrix has 100,000 rows (word vectors) and 300 columns (dimensions). All matrices share the same vocabulary; a word that does not appear in a given decade is represented by a row containing only zeros.

Format

A list of 20 matrices, each with 100,000 rows and 300 columns

Source

https://nlp.stanford.edu/projects/histwords/

References

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Examples


if (FALSE) { # \dontrun{


## download the model (once per machine)
download_pretrained("vecs_sgns300_googlengrams_fic_histwords")

## load the model each session
wv <- load_pretrained("vecs_sgns300_googlengrams_fic_histwords")

## check dims (each comparison returns a single TRUE/FALSE)
length(wv) == 20L
all(dim(wv[[1]]) == c(100000, 300))

} # }
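
Because words absent from a decade are stored as all-zero rows, it may help to check for them before comparing a word's vectors across decades. The following is a minimal sketch on a toy stand-in for `wv` (two tiny matrices with invented words and values, not the real data). It assumes the matrices carry words as row names and that the list elements are named by decade; neither is confirmed above. `is_present` and `cos_across` are hypothetical helpers, not functions from the package.

```r
## toy stand-in for the real list: 2 "decades", 3 words, 4 dimensions
## (words and values are invented for illustration)
vocab <- c("letter", "broadcast", "cell")
toy <- list(
  "1800" = matrix(c(0.1, 0.2, 0.3, 0.4,
                    0,   0,   0,   0,
                    0.5, 0.1, 0.0, 0.2),
                  nrow = 3, byrow = TRUE, dimnames = list(vocab, NULL)),
  "1990" = matrix(c(0.2, 0.1, 0.4, 0.3,
                    0.3, 0.3, 0.1, 0.1,
                    0.4, 0.2, 0.1, 0.2),
                  nrow = 3, byrow = TRUE, dimnames = list(vocab, NULL))
)

## hypothetical helper: TRUE for rows with at least one non-zero value
is_present <- function(m) rowSums(m != 0) > 0

## presence of each word by decade (words in rows, decades in columns)
present <- sapply(toy, is_present)

## hypothetical helper: cosine similarity of one word across two decades,
## returning NA when the word is an all-zero row in either matrix
cos_across <- function(word, a, b) {
  x <- a[word, ]
  y <- b[word, ]
  if (all(x == 0) || all(y == 0)) return(NA_real_)
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

present
cos_across("broadcast", toy[["1800"]], toy[["1990"]])
cos_across("cell", toy[["1800"]], toy[["1990"]])
```

Returning `NA` for absent words avoids the division by zero that a plain cosine computation would hit on a zero vector. On the real object, the same helpers could be applied to `wv` in place of `toy`, provided the row-name assumption holds.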