100k diachronic English-language SGNS word embeddings, 20 decades, Google Books corpus
vecs_sgns300_googlengrams_histwords.Rd
Skip-gram with negative sampling (SGNS) embeddings for 100,000 words from the HistWords project, trained on the Google Books N-Gram corpus (all English) split by decade. The object is a list of 20 elements, one embedding matrix per decade from the 1800s through the 1990s. Each matrix has 100,000 rows (one word vector per row) and 300 columns (dimensions). All 20 matrices share the same vocabulary; a word that does not occur in a given decade is still present in that decade's matrix, but its row contains only zeros.
References
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
Examples
if (FALSE) {
## download the model (once per machine)
download_pretrained("vecs_sgns300_googlengrams_histwords")
## load the model each session
data("vecs_sgns300_googlengrams_histwords")
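## check dimensions: 20 decades, each a 100,000 x 300 matrix
length(vecs_sgns300_googlengrams_histwords) == 20L
## all() collapses the element-wise comparison into a single TRUE/FALSE
all(dim(vecs_sgns300_googlengrams_histwords[[1]]) == c(100000, 300))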
}
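Because unattested words are stored as all-zero rows, it can help to check in which decades a word actually has a vector before comparing it across time. The following is a minimal sketch on a small toy list, not the real object. It assumes the matrices carry the vocabulary as row names and that the list elements are named by decade; neither is stated above, so verify both on the downloaded data. The helper `is_attested()` and the object `toy_vecs` are hypothetical names used only for illustration.

```r
## Toy stand-in for the real object (assumption: same structure, much smaller):
## a named list of decade matrices that share one vocabulary as row names.
vocab   <- c("ship", "radio", "internet")
decades <- c("1800", "1900", "1990")

toy_vecs <- lapply(decades, function(d) {
  matrix(1, nrow = length(vocab), ncol = 4,
         dimnames = list(vocab, NULL))
})
names(toy_vecs) <- decades

## Mark "internet" as unattested in the two earlier decades (all-zero rows)
toy_vecs[["1800"]]["internet", ] <- 0
toy_vecs[["1900"]]["internet", ] <- 0

## Hypothetical helper: TRUE for each decade in which `word` has a
## non-zero vector, FALSE where its row contains only zeros.
is_attested <- function(vecs, word) {
  vapply(vecs, function(m) any(m[word, ] != 0), logical(1))
}

is_attested(toy_vecs, "internet")
is_attested(toy_vecs, "ship")
```

On the real data the same call would take `vecs_sgns300_googlengrams_histwords` as its first argument, provided the row-name assumption holds. Filtering out decades where the result is `FALSE` avoids computing similarities against zero vectors, which are undefined under cosine similarity.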