Log-frequency values for 811 English words, used as Bayesian priors in long-S/f disambiguation. Each value is the base-10 logarithm of the word's raw frequency count in a large corpus, so higher values indicate more common words and a difference of 1 corresponds to a tenfold difference in frequency.

Format

A data frame with 811 rows and 3 variables.

Source

Ted Underwood / DataMunging, derived from corpus frequency analysis (CC-BY)

Variables

  • word: the English word

  • log_freq: log10-transformed frequency count

  • source: data source attribution
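
As a rough illustration of how log-frequency values like these could act as priors, the sketch below picks the most frequent known reading of a token in which any "f" might be a misread long s. It is written in Python for illustration only and is not this package's API: the names `TOY_LOG_FREQ`, `variants`, `best_reading`, and `prior_odds` are hypothetical, and the words and values in the toy table are invented rather than taken from the dataset. The decision uses the prior alone; a fuller approach would presumably also weigh context or OCR evidence.

```python
from itertools import product

# Invented values for illustration; NOT rows from the actual dataset.
TOY_LOG_FREQ = {"same": 5.2, "fame": 4.1, "first": 5.6, "song": 4.4}


def variants(token):
    """Return every reading of token where each 'f' is kept or read as 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    return ["".join(chars) for chars in product(*options)]


def best_reading(token, log_freq):
    """Pick the known variant with the highest log-frequency prior.

    Falls back to the token itself when no variant is in the table.
    """
    known = [w for w in variants(token) if w in log_freq]
    if not known:
        return token
    return max(known, key=lambda w: log_freq[w])


def prior_odds(a, b, log_freq):
    """Prior odds of word a over word b implied by their log10 frequencies."""
    return 10 ** (log_freq[a] - log_freq[b])


print(best_reading("fame", TOY_LOG_FREQ))   # "same" outranks "fame" in the toy table
print(best_reading("firft", TOY_LOG_FREQ))  # only "first" is a known variant
```

Because the values are on a log10 scale, subtracting two of them gives the log of the frequency ratio, which is why `prior_odds` exponentiates the difference instead of dividing the values directly.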