A dataset of English bigram (two-word sequence) frequencies derived from the Google Web Trillion Word Corpus, as compiled by Peter Norvig. Rows are ordered by frequency, most common first, and each row carries its rank. The counts are useful for contextual disambiguation in text normalization (e.g., long-s correction), collocation analysis, and language modeling.
Variables
w1. first word of the bigram
w2. second word of the bigram
freq. frequency count in the Google Web Trillion Word Corpus
rank. rank by frequency (1 = most common)
source. data source attribution
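As a rough illustration of how these columns might be used for contextual disambiguation, the Python sketch below picks between two candidate readings of a word by comparing bigram counts with the preceding word. It assumes the data has been exported as CSV with the column names listed above; the export format, the helper names load_bigrams and pick_candidate, and the counts in the sample rows are all invented for the example and are not real corpus values.

```python
import csv
import io

# Toy rows in the documented column layout. The counts are invented for
# illustration and are NOT real corpus values.
SAMPLE_CSV = """w1,w2,freq,rank,source
of,the,1000,1,example
in,the,800,2,example
the,same,300,3,example
the,fame,5,4,example
"""


def load_bigrams(text):
    """Parse CSV text into a {(w1, w2): freq} lookup (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["w1"], row["w2"]): int(row["freq"]) for row in reader}


def pick_candidate(prev_word, candidates, bigrams):
    """Return the candidate that most often follows prev_word.

    Unseen bigrams count as 0; ties go to the earliest candidate.
    """
    return max(candidates, key=lambda w: bigrams.get((prev_word, w), 0))


bigrams = load_bigrams(SAMPLE_CSV)

# A long s can be misread as "f", so "same" may appear as "fame";
# the preceding word helps decide between the two readings.
best = pick_candidate("the", ["fame", "same"], bigrams)
print(best)
```

Raw counts are enough here because both candidates share the same preceding word; comparing across different contexts would normally call for conditional probabilities and some smoothing for unseen pairs.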
References
Norvig, P. (2009). "Natural language corpus data." In T. Segaran and J. Hammerbacher (eds.), Beautiful Data, pp. 219-242. O'Reilly Media. https://norvig.com/ngrams/
Brants, T. and Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium.
