50k diachronic English-language SGNS word embeddings over 20 decades
vecs_sgns300_coha_histwords.Rd
SGNS (skip-gram with negative sampling) embeddings for 50,000 words from the HistWords project, trained on the Corpus of Historical American English (COHA) split by decade. The object is a list of 20 elements, each an embedding matrix for one decade, from the 1810s through the 2000s. Each matrix has 50,000 rows (word vectors) and 300 columns (dimensions). All matrices share the same vocabulary; a word that does not appear in a given decade is represented by a row of all zeros.
References
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
Examples
if (FALSE) {
## download the model (once per machine)
download_pretrained("vecs_sgns300_coha_histwords")
## load the model each session
data("vecs_sgns300_coha_histwords")
## check dims (each line should return a single TRUE)
length(vecs_sgns300_coha_histwords) == 20L
all(dim(vecs_sgns300_coha_histwords[[1]]) == c(50000, 300))
}
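Because absent words are stored as all-zero rows, it may help to check whether a word is attested in a decade before comparing its vectors across decades. The following is a minimal sketch on a toy stand-in, not the real object. It assumes the list elements are named by decade and that each matrix carries the words as row names; neither is confirmed by this page. `toy`, `is_attested`, and `cos_sim` are hypothetical names introduced here for illustration.

```r
## Toy stand-in (ASSUMED layout): a list of decade matrices sharing one
## vocabulary, 3 words x 4 dimensions instead of 50,000 x 300.
vocab <- c("broadcast", "seed", "radio")
toy <- list(
  "1810" = matrix(c(1, 0, 0, 0,
                    1, 1, 0, 0,
                    0, 0, 0, 0),   # "radio" absent: all-zero row
                  nrow = 3, byrow = TRUE, dimnames = list(vocab, NULL)),
  "2000" = matrix(c(0, 0, 1, 0,
                    1, 1, 0, 0,
                    0, 0, 1, 1),
                  nrow = 3, byrow = TRUE, dimnames = list(vocab, NULL))
)

## TRUE when the word has at least one non-zero value in that decade
is_attested <- function(mat, word) any(mat[word, ] != 0)

## Cosine similarity; returns NA when either vector is all zeros,
## since the similarity is undefined for an absent word
cos_sim <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) return(NA_real_)
  sum(a * b) / denom
}

## In which decades is "radio" attested?
attested <- vapply(toy, is_attested, logical(1), word = "radio")

## Compare a word with itself across two decades
seed_sim  <- cos_sim(toy[["1810"]]["seed", ],  toy[["2000"]]["seed", ])
radio_sim <- cos_sim(toy[["1810"]]["radio", ], toy[["2000"]]["radio", ])
```

Returning `NA` rather than `0` for an all-zero row keeps "word absent in this decade" distinct from "word present but orthogonal", which matters when averaging similarities over decades.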