50k diachronic English-language SGNS word embeddings over 20 decades — vecs_sgns300_coha

Fifty thousand skip-gram with negative sampling (SGNS) word embeddings from the HistWords project, trained on the Corpus of Historical American English (COHA) split by decade. The object is a list of 20 elements, each an embedding matrix for one decade, from the 1810s through the 2000s. Each matrix has 50,000 rows (word vectors) and 300 columns (dimensions). All 20 matrices share the same vocabulary; a word that does not appear in a given decade is represented in that decade's matrix by a row of zeros.

Format

A list of 20 matrices, one per decade, each with 50,000 rows and 300 columns

Source

https://nlp.stanford.edu/projects/histwords/

References

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Examples


if (FALSE) {

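## attach the package that provides download_pretrained()
## (assumed here to be text2map.pretrained)
library(text2map.pretrained)

## download the model (once per machine)
download_pretrained("vecs_sgns300_coha_histwords")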

## load the model each session
data("vecs_sgns300_coha_histwords")

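## check dims: 20 decade matrices, each 50,000 rows by 300 columns
length(vecs_sgns300_coha_histwords) == 20L
## wrap in all() so the check returns a single TRUE/FALSE
all(dim(vecs_sgns300_coha_histwords[[1]]) == c(50000L, 300L))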

}
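
## A minimal sketch (not from the package documentation) of working with
## the zero-row convention described above. It runs on a small toy list
## laid out like the real object, so no download is needed. The helper
## names, the decade labels used as list names, and the assumption that
## rownames hold the vocabulary are all illustrative, not confirmed
## properties of vecs_sgns300_coha_histwords.

set.seed(1)
vocab <- c("car", "radio", "horse")
toy <- lapply(1:3, function(i) {
  matrix(rnorm(length(vocab) * 4),
         nrow = length(vocab),
         dimnames = list(vocab, NULL))
})
names(toy) <- c("1810", "1900", "2000")

## mark words as absent from a decade by zeroing their rows
toy[["1810"]][c("car", "radio"), ] <- 0
toy[["1900"]]["radio", ] <- 0

## TRUE for each decade in which `word` has a non-zero vector
word_present <- function(embeddings, word) {
  vapply(embeddings, function(m) any(m[word, ] != 0), logical(1))
}

## cosine similarity of two words within each decade; NA where either
## word is absent (all-zero row), because the cosine is undefined there
cos_by_decade <- function(embeddings, w1, w2) {
  vapply(embeddings, function(m) {
    a <- m[w1, ]
    b <- m[w2, ]
    if (all(a == 0) || all(b == 0)) return(NA_real_)
    sum(a * b) / sqrt(sum(a^2) * sum(b^2))
  }, numeric(1))
}

word_present(toy, "radio")
cos_by_decade(toy, "car", "horse")

## if the real list has word rownames, the same helpers should apply, e.g.
## word_present(vecs_sgns300_coha_histwords, "radio")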