3 million English-language CBOW word embeddings trained on the Google News corpus
vecs_cbow300_googlenews.Rd
CBOW embeddings trained on the Google News dataset (a corpus of about 100 billion words). Note: although this model is well known, its actual training parameters are not obvious. The paper implies a Skip-Gram architecture; however, the released model is most likely CBOW with negative sampling. In the Google Groups conversation linked under Details, Mikolov states: "We have released additional word vectors trained on about 100 billion words from Google News. The training was performed using the continuous bag of words architecture, with sub-sampling using threshold 1e-5, and with negative sampling with 3 negative examples per each positive one. The training time was about 9 hours on multi-core machine, and the resulting vectors have dimensionality 300. Vocabulary size is 3 million, and the entities contain both words and automatically derived phrases."
Details
Mikolov goes on:
Additional info someone asked for recently - this model was trained using the following command line:
./word2vec -train train100B.txt -read-vocab voc -output vectors.bin -cbow 1 -size 300 -window 5 -negative 3 -hs 0 -sample 1e-5 -threads 12 -binary 1 -min-count 10
Source: https://groups.google.com/g/word2vec-toolkit/c/lxbl_MB29Ic/m/NDLGId3KPNEJ
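For readers who want to approximate these settings from R, the following is a minimal sketch using the CRAN word2vec package. This is an assumption for illustration only: the released vectors were produced with the original C tool shown above, and "train100B.txt" is a hypothetical placeholder since the Google News training corpus was never released.

# Approximate replication of the same hyperparameters with the CRAN 'word2vec' package
library(word2vec)
model <- word2vec(
  x         = "train100B.txt",  # hypothetical path; training corpus not public
  type      = "cbow",           # -cbow 1
  dim       = 300,              # -size 300
  window    = 5,                # -window 5
  negative  = 3,                # -negative 3
  hs        = FALSE,            # -hs 0
  sample    = 1e-5,             # -sample 1e-5
  min_count = 10,               # -min-count 10
  threads   = 12                # -threads 12
)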
References
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." Proceedings of the ICLR Workshop.
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). "The geometry of culture: Analyzing the meanings of class through word embeddings." American Sociological Review, 84(5), 905-949.
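A minimal usage sketch, assuming the loaded object behaves like a numeric matrix with one row per token and row names giving the tokens (the exact class and token set may differ):

# Hypothetical access pattern for the embedding object
emb <- as.matrix(vecs_cbow300_googlenews)
dim(emb)              # expected: 3,000,000 rows x 300 columns
emb["king", 1:5]      # first five dimensions of the vector for "king"

# cosine similarity between two word vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_sim(emb["king", ], emb["queen", ])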