A dataset containing word frequencies and ranks in four corpora, two older and two more recent.
Details
- Kučera and Francis/Brown Corpus (80,271 unique terms)
- British National Corpus (7,726 unique terms)
- Norvig Google Trillion Word Corpus (333,932 unique terms)
- 2019 English Wikipedia (84,163 unique terms)
Note that some terms appear in more than one row because the British National Corpus counts each term separately for each part of speech.
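The repeated rows can be collapsed to one row per term when part-of-speech detail is not needed. The following is a minimal Python sketch on made-up rows, not the dataset itself: only the column names (term, pos_bnc, freq_bnc, freq_wiki) come from this page, while the counts, the tag labels, and the helper collapse_by_term are hypothetical. It assumes the non-BNC columns hold the same value on every repeated row of a term.

```python
# Toy rows with made-up counts and tag labels; only the column names
# are taken from the dataset description.
rows = [
    {"term": "run", "pos_bnc": "verb", "freq_bnc": 30, "freq_wiki": 100},
    {"term": "run", "pos_bnc": "noun", "freq_bnc": 10, "freq_wiki": 100},
    {"term": "cat", "pos_bnc": "noun", "freq_bnc": 5, "freq_wiki": 40},
]


def collapse_by_term(rows):
    """Return one row per term, summing freq_bnc over part-of-speech rows.

    Hypothetical helper. Assumes freq_wiki is identical on every repeated
    row of a term, so the first value seen is kept rather than summed.
    """
    merged = {}
    for row in rows:
        term = row["term"]
        if term not in merged:
            merged[term] = {
                "term": term,
                "pos_bnc": [],
                "freq_bnc": 0,
                "freq_wiki": row["freq_wiki"],
            }
        merged[term]["pos_bnc"].append(row["pos_bnc"])
        merged[term]["freq_bnc"] += row["freq_bnc"]
    # Dicts preserve insertion order, so terms keep their first-seen order.
    return list(merged.values())


collapsed = collapse_by_term(rows)
```

Summing freq_bnc is only one possible choice. Keeping the row with the highest count, or keeping the rows separate, may suit analyses where the part of speech matters.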
Variables
term. unique word (only unigrams)
freq_wiki. term count in the 2019 English Wikipedia corpus
wiki. logical, is the word in the 2019 English Wikipedia corpus
rank_wiki. term rank in the 2019 English Wikipedia corpus
freq_kf. term count in the Kučera and Francis/Brown Corpus
kf. logical, is the word in the Kučera and Francis/Brown Corpus
rank_kf. term rank in the Kučera and Francis/Brown Corpus
freq_bnc. term count in the British National Corpus
pos_bnc. part-of-speech tagging for terms in the British National Corpus
bnc. logical, is the word in the British National Corpus
rank_bnc. term rank in the British National Corpus
freq_google. term count in the Norvig Google Trillion Word Corpus
google. logical, is the word in the Norvig Google Trillion Word Corpus
rank_google. term rank in the Norvig Google Trillion Word Corpus
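The logical columns (wiki, kf, bnc, google) make it straightforward to select terms by corpus membership. As a rough illustration, the Python sketch below filters made-up rows for terms present in all four corpora. The rows, the FLAGS tuple, and the helper in_all_corpora are hypothetical; only the column names come from the list above.

```python
# Toy rows with made-up membership flags; only the column names are
# taken from the variable list.
rows = [
    {"term": "the", "wiki": True, "kf": True, "bnc": True, "google": True},
    {"term": "selfie", "wiki": True, "kf": False, "bnc": False, "google": True},
    {"term": "whilst", "wiki": True, "kf": False, "bnc": True, "google": True},
]

# The four logical membership columns described above.
FLAGS = ("wiki", "kf", "bnc", "google")


def in_all_corpora(row):
    """Return True when every membership flag in the row is True."""
    return all(row[flag] for flag in FLAGS)


# Terms that occur in all four corpora, in their original order.
shared_terms = [row["term"] for row in rows if in_all_corpora(row)]

# Number of corpora each term appears in (True counts as 1 when summed).
corpus_counts = {row["term"]: sum(row[flag] for flag in FLAGS) for row in rows}
```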
References
Kučera, H. and Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press.
Burnard, L. (Ed.). (1995). Users reference guide for the British National Corpus. Oxford University Computing Services.
Norvig, P. (2009). "Natural language corpus data." Beautiful Data. Pp. 219-242. https://norvig.com/ngrams/