A dataset containing word frequencies and ranks in four corpora, two older and two more recent:

english_freqs

Format

A data frame with 371,939 rows and 14 variables.

Details

  • Kučera and Francis/Brown Corpus (80,271 unique terms)

  • British National Corpus (7,726 unique terms)

  • Norvig Google Trillion Word Corpus (333,932 unique terms)

  • 2019 English Wikipedia (84,163 unique terms)

Note that some terms appear in more than one row because the British National Corpus counts each term separately for each of its parts of speech.
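Because of these repeated rows, per-term analyses may need to collapse the data to one row per term first. The following is a minimal base R sketch of one way to do that. It uses a small stand-in data frame named toy with the same column names as english_freqs; the rows, counts, and part-of-speech labels are invented for illustration, and summing the British National Corpus counts across parts of speech is an assumed choice, not something the dataset prescribes.

```r
# Toy stand-in for english_freqs; all values here are invented for illustration.
toy <- data.frame(
  term     = c("run", "run", "cat"),
  pos_bnc  = c("v", "n", "n"),   # hypothetical part-of-speech labels
  freq_bnc = c(120, 45, 60),
  stringsAsFactors = FALSE
)

# Terms that occupy more than one row (one row per BNC part of speech).
repeated_terms <- unique(toy$term[duplicated(toy$term)])

# Collapse to one row per term by summing BNC counts over parts of speech.
bnc_totals <- aggregate(freq_bnc ~ term, data = toy, FUN = sum)
```

With the real data, toy would be replaced by english_freqs. Columns that are constant within a term, such as freq_wiki, would more likely be deduplicated than summed.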

Variables

  • term. unique word (only unigrams)

  • freq_wiki. term count in the 2019 English Wikipedia corpus

  • wiki. logical, is the word in the 2019 English Wikipedia corpus

  • rank_wiki. term rank in the 2019 English Wikipedia corpus

  • freq_kf. term count in the Kučera and Francis/Brown Corpus

  • kf. logical, is the word in the Kučera and Francis/Brown Corpus

  • rank_kf. term rank in the Kučera and Francis/Brown Corpus

  • freq_bnc. term count in the British National Corpus

  • pos_bnc. part-of-speech tagging for terms in the British National Corpus

  • bnc. logical, is the word in the British National Corpus

  • rank_bnc. term rank in the British National Corpus

  • freq_google. term count in the Norvig Google Trillion Word Corpus

  • google. logical, is the word in the Norvig Google Trillion Word Corpus

  • rank_google. term rank in the Norvig Google Trillion Word Corpus
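The logical columns (wiki, kf, bnc, google) can be combined to select terms by corpus membership. The sketch below shows the idea in base R on a small stand-in data frame named toy; the rows are invented for illustration, and the code assumes the logical columns contain no missing values, which this page does not state.

```r
# Toy stand-in for english_freqs; all values here are invented for illustration.
toy <- data.frame(
  term   = c("the", "wiki", "thou"),
  wiki   = c(TRUE, TRUE, FALSE),
  kf     = c(TRUE, FALSE, TRUE),
  bnc    = c(TRUE, FALSE, FALSE),
  google = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Terms present in all four corpora.
in_all <- toy$term[toy$wiki & toy$kf & toy$bnc & toy$google]

# Terms found in both more recent corpora but in neither older one.
recent_only <- toy$term[toy$wiki & toy$google & !toy$kf & !toy$bnc]
```

With the real data, toy would be replaced by english_freqs. Wrapping the result in unique() avoids repeated terms caused by the part-of-speech rows.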

References

Kučera, H. and Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press.

Burnard, L. (Ed.). (1995). Users reference guide for the British National Corpus. Oxford University Computing Services.

Norvig, P. (2009). "Natural language corpus data." Beautiful Data, pp. 219-242. https://norvig.com/ngrams/