Various Word Frequency/Rank Lists for English (english_freqs, text2map.dictionaries)

A dataset containing word frequencies and ranks in four corpora, two older and two more recent:

Kučera and Francis/Brown Corpus (80,271 unique terms)
British National Corpus (7,726 unique terms)
Norvig Google Trillion Word Corpus (333,932 unique terms)
2019 English Wikipedia (84,163 unique terms)

Format

A data frame with 371,938 rows and 14 variables.

Source

https://norvig.com/ngrams/, https://www.natcorp.ox.ac.uk/, https://doi.org/10.3758/s13428-013-0403-5

Details

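Some terms appear in more than one row because the British National Corpus counts a term separately for each part-of-speech it is tagged with (see pos_bnc). Code that assumes one row per term should deduplicate or aggregate first.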
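
The repeated rows can be handled along the following lines. This is a minimal sketch on a toy stand-in for english_freqs: the counts and part-of-speech tags are made up, and it assumes (not confirmed by this page) that the non-BNC columns carry the same value on every repeated row of a term.

```r
# Toy stand-in for english_freqs; every value below is invented for illustration.
toy <- data.frame(
  term     = c("run", "run", "the", "walk"),
  pos_bnc  = c("v", "n", "det", "v"),   # hypothetical tag labels
  freq_bnc = c(300, 120, 60000, 150),
  freq_kf  = c(200, 200, 70000, 100),   # assumed identical across repeated rows
  stringsAsFactors = FALSE
)

# Terms that occupy more than one row (one row per BNC part-of-speech)
dup_terms <- unique(toy$term[duplicated(toy$term)])

# Sum the BNC counts over parts of speech to get one BNC total per term
bnc_total <- aggregate(freq_bnc ~ term, data = toy, FUN = sum)

# Keep the first row per term for columns that do not depend on part-of-speech
one_per_term <- toy[!duplicated(toy$term), c("term", "freq_kf")]
```

Whether to sum the BNC counts or keep the rows separate depends on whether part-of-speech matters for the analysis at hand.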

Variables

term. unique word (only unigrams)
freq_wiki. term count in the 2019 English Wikipedia corpus
wiki. logical, is the word in the 2019 English Wikipedia corpus
rank_wiki. term rank in the 2019 English Wikipedia corpus
freq_kf. term count in the Kučera and Francis/Brown Corpus
kf. logical, is the word in the Kučera and Francis/Brown Corpus
rank_kf. term rank in the Kučera and Francis/Brown Corpus
freq_bnc. term count in the British National Corpus
pos_bnc. part-of-speech tagging for terms in the British National Corpus
bnc. logical, is the word in the British National Corpus
rank_bnc. term rank in the British National Corpus
freq_google. term count in the Norvig Google Trillion Word Corpus
google. logical, is the word in the Norvig Google Trillion Word Corpus
rank_google. term rank in the Norvig Google Trillion Word Corpus
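
The logical columns make it straightforward to restrict the data to terms attested in particular corpora before comparing ranks. The sketch below again uses a toy data frame with invented values rather than the real dataset, and it assumes that a rank is NA when a term is absent from a corpus, which this page does not state.

```r
# Toy stand-in for english_freqs; terms, flags, and ranks are invented.
toy <- data.frame(
  term        = c("the", "wiki", "thou"),
  wiki        = c(TRUE, TRUE, FALSE),
  kf          = c(TRUE, FALSE, TRUE),
  bnc         = c(TRUE, FALSE, FALSE),
  google      = c(TRUE, TRUE, TRUE),
  rank_wiki   = c(1, 500, NA),      # NA for an absent term is an assumption
  rank_google = c(1, 900, 12000),
  stringsAsFactors = FALSE
)

# Terms present in all four corpora
in_all <- toy[toy$wiki & toy$kf & toy$bnc & toy$google, ]

# Terms present in both recent corpora, with the difference in their ranks
in_both <- toy[toy$wiki & toy$google, ]
in_both$rank_diff <- in_both$rank_google - in_both$rank_wiki
```

Filtering on the flags first avoids arithmetic on missing ranks; the same pattern applies to any pair of the rank_* columns.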

References

Kučera, H. and Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press.

Burnard, L. (Ed.). (1995). Users Reference Guide for the British National Corpus. Oxford University Computing Services.

Norvig, P. (2009). "Natural Language Corpus Data." In Beautiful Data, pp. 219-242. https://norvig.com/ngrams/