vecs_sgns300_googlengrams_fic_histwords.Rd
Word embeddings for a 100,000-word vocabulary from the HistWords project, trained with skip-gram with negative sampling (SGNS) on the English Fiction corpus of the Google Books N-Gram data, with one model per decade. The object is a list of 20 elements, each an embedding matrix for one decade from the 1800s through the 1990s. Each matrix has 100,000 rows (word vectors) and 300 columns (dimensions). All matrices share the same vocabulary; a word that does not appear in a given decade is represented by a row containing only zeros.
A list of 20 matrices, each with 100,000 rows and 300 columns
https://nlp.stanford.edu/projects/histwords/
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489-1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
if (FALSE) {
## download the model (once per machine)
download_pretrained("vecs_sgns300_googlengrams_fic_histwords")
## load the model each session
data("vecs_sgns300_googlengrams_fic_histwords")
## check dims
length(vecs_sgns300_googlengrams_fic_histwords) == 20L
## all() collapses the element-wise comparison into a single TRUE/FALSE
all(dim(vecs_sgns300_googlengrams_fic_histwords[[1]]) == c(100000, 300))
}
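Because absent words are stored as all-zero rows, it is usually worth filtering them out before computing similarities for a given decade. The following is a minimal sketch of one way to do that, run on a small toy list rather than the real download. The toy vocabulary, the decade names on the list, and the helper `is_present()` are all hypothetical, and the sketch assumes the row names of each matrix hold the vocabulary words.

```r
## Toy stand-in for the real object: 3 "decades", 4 words, 5 dimensions.
## (The real list has 20 matrices of 100000 x 300.)
## Vocabulary and decade names below are made up for illustration.
set.seed(1)
vocab   <- c("ship", "letter", "radio", "television")
decades <- c("1800", "1900", "1990")

toy <- lapply(decades, function(d) {
  matrix(rnorm(length(vocab) * 5),
         nrow = length(vocab),
         dimnames = list(vocab, NULL))
})
names(toy) <- decades

## Mimic words that are absent in a decade: rows of zeros.
toy[["1800"]][c("radio", "television"), ] <- 0
toy[["1900"]]["television", ] <- 0

## Hypothetical helper: TRUE for rows with at least one non-zero value.
is_present <- function(m) rowSums(m != 0) > 0

## Words x decades logical matrix of where each word has a vector.
presence <- sapply(toy, is_present)
print(presence)

## Keep only the words that actually occur in the first decade.
m1800 <- toy[["1800"]][presence[, "1800"], , drop = FALSE]
print(rownames(m1800))
```

On the real data, the same helper could be applied to a single element, for example `is_present(vecs_sgns300_googlengrams_fic_histwords[[1]])`, which avoids building the full presence matrix when only one decade is needed.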