Approximately 30 thousand SGNS (skip-gram with negative sampling) embeddings from the HistWords project, trained on the Chinese portion of the Google Books N-Gram corpus, divided into decades. This is a list of 5 elements, in which each element is the embedding matrix for one decade, from the 1950s through the 1990s. Each matrix has approximately 30 thousand vectors (rows) and 300 dimensions (columns). Note that all five matrices share the same vocabulary; a word that does not appear in a given decade is still present in that decade's matrix, but as a row containing only zeros.

Format

A list of 5 matrices

Source

https://nlp.stanford.edu/projects/histwords/

References

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change." Pp. 1489–1501 in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Examples


if (FALSE) {


## download the model (once per machine)
download_pretrained("vecs_sgns300_googlengrams_histwords_zh")

## load the model each session
data("vecs_sgns300_googlengrams_histwords_zh")

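## check dims: 5 decades, each a 29701 x 300 matrix
length(vecs_sgns300_googlengrams_histwords_zh) == 5L
## all() collapses the element-wise comparison into a single TRUE/FALSE
all(dim(vecs_sgns300_googlengrams_histwords_zh[[1]]) == c(29701, 300))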

}
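
As the description notes, words that are absent in a given decade are stored as rows with only zero values, so it is usually worth filtering them out before comparing decades. The following is a minimal sketch of one way to do that. It runs on a small made-up list (`toy`) rather than the real embeddings, and the helper `is_attested()`, the decade names on the list, and the word labels are all assumptions made for illustration, not part of the package.

```r
## toy stand-in for the real list: 2 "decades", 4 words, 3 dimensions
## (names and values are invented for illustration only)
set.seed(1)
vocab <- c("w1", "w2", "w3", "w4")
toy <- list(
  "1950" = matrix(rnorm(12), nrow = 4, dimnames = list(vocab, NULL)),
  "1960" = matrix(rnorm(12), nrow = 4, dimnames = list(vocab, NULL))
)

## pretend "w2" does not occur in the first decade
toy[["1950"]]["w2", ] <- 0

## hypothetical helper: TRUE for rows with at least one non-zero value
is_attested <- function(mat) rowSums(mat != 0) > 0

## words attested in each decade
attested <- lapply(toy, function(m) rownames(m)[is_attested(m)])

## words attested in every decade
shared <- Reduce(intersect, attested)

## keep only the shared words in each matrix
toy_shared <- lapply(toy, function(m) m[shared, , drop = FALSE])
```

The same steps should carry over to `vecs_sgns300_googlengrams_histwords_zh`, assuming its matrices carry the vocabulary as row names; that is worth confirming with `rownames()` before relying on it.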