1 million English-language SGNS word embeddings trained on Google N-Grams — vecs_sgns300_googlengrams_kte

SGNS (skip-gram with negative sampling) embeddings trained on 5-grams from Google Books N-Grams for publications dated 2000-2012. The result is a matrix of 928,250 word vectors, each with 300 dimensions. All n-grams were lowercased "to increase the frequency of rare words."

Format

A matrix of 928,250 rows (one per word, roughly 1 million) and 300 columns (one per embedding dimension)
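
To illustrate the layout, here is a minimal sketch in Python (standard library only) of a word-by-dimension matrix like the one described above. The four words, the random values, and the helper names `get_vector` and `cosine` are all assumptions made for illustration; they are not part of this dataset or of any package API. Because the n-grams were lowercased before training, the lookup lowercases the query word too.

```python
import math
import random

random.seed(0)
N_DIMS = 300  # number of columns, matching the embedding dimensionality

# Toy stand-in for the real matrix: 4 made-up words with random values.
vocab = ["rich", "poor", "opera", "rodeo"]
rows = [[random.gauss(0.0, 1.0) for _ in range(N_DIMS)] for _ in vocab]

# Map each word to its row index, mirroring row names on a matrix.
row_of = {word: i for i, word in enumerate(vocab)}


def get_vector(word):
    """Return the row for a word, lowercasing it to match the vocabulary."""
    return rows[row_of[word.lower()]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "Rich" and "rich" resolve to the same row because of the lowercasing.
similarity = cosine(get_vector("Rich"), get_vector("poor"))
```

With the real embeddings, the same row lookup and cosine comparison would apply to any pair of words in the vocabulary; the toy values here carry no meaning.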

Source

https://github.com/KnowledgeLab/GeometryofCulture

Details

Kozlowski et al. (2019) explain:

For contemporary validation, we train an embedding model on Google Ngrams of publications dating from 2000 through 2012. We use this range of years because Google Ngrams do not include publications more recent than 2012, and this duration is similar to those used in our historical analyses.

References

Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). "The geometry of culture: Analyzing the meanings of class through word embeddings." American Sociological Review, 84(5), 905-949.