A dataset of English bigram (two-word sequence) frequencies from the Google Web Trillion Word Corpus, as compiled by Peter Norvig. Bigrams are ordered by frequency (most common first) and include a rank column. The counts are useful for contextual disambiguation in text normalization (e.g., long-S correction, where the neighboring word helps choose between candidate readings), collocation analysis, and language modeling.

Format

A data frame with 286,357 rows and 5 variables.

Source

https://norvig.com/ngrams/

Variables

  • w1. first word of the bigram

  • w2. second word of the bigram

  • freq. frequency count in the Google Web Trillion Word Corpus

  • rank. rank by frequency (1 = most common)

  • source. data source attribution
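As a rough illustration of the contextual-disambiguation use case, the sketch below picks the candidate word that forms the more frequent bigram with the preceding word. It only assumes the columns w1, w2 and freq described above. The data frame `bigrams` is a toy stand-in with invented counts, and `pick_by_bigram()` is a hypothetical helper, not a function provided by this package.

```r
# Toy stand-in with the documented columns. The counts are invented
# placeholders for illustration, not values from the real dataset.
bigrams <- data.frame(
  w1   = c("the", "the", "of", "of"),
  w2   = c("same", "fame", "fame", "same"),
  freq = c(900, 40, 300, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: among `candidates` for the second word, return the one
# that forms the most frequent bigram with the preceding word `prev`.
# Returns NA when none of the candidate bigrams is attested.
pick_by_bigram <- function(prev, candidates, bigrams) {
  hits <- bigrams[bigrams$w1 == prev & bigrams$w2 %in% candidates, ]
  if (nrow(hits) == 0) {
    return(NA_character_)
  }
  hits$w2[which.max(hits$freq)]
}

# A long-S misreading can produce "fame" where "same" was printed;
# the preceding word decides which reading is more plausible.
pick_by_bigram("the", c("fame", "same"), bigrams)  # "same" (toy counts)
pick_by_bigram("of",  c("fame", "same"), bigrams)  # "fame" (toy counts)
```

With the real data, the same filter applies unchanged. Unattested bigrams are simply absent from the table, which is why the helper returns NA rather than guessing.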

References

Norvig, P. (2009). "Natural language corpus data." In Beautiful Data, pp. 219-242. https://norvig.com/ngrams/

Brants, T. and Franz, A. (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium.