SUBTLEXus Word Frequency/Ranks for English — subtlexus

A dataset containing word frequencies and ranks from the SUBTLEXus corpus, which comprises American English subtitles downloaded from www.opensubtitles.org. The corpus covers 8,388 films and television episodes and totals about 51 million words: 16.1 million from television series, 14.3 million from films released before 1990, and 20.6 million from films released after 1990.

Format

A data frame with 74,286 rows and 9 variables.

Source

https://doi.org/10.3758/BRM.41.4.977

Details

Note that this frequency list is partly case sensitive. The `term` variable starts with a capital letter if the word starts with an uppercase letter more often than with a lowercase letter in the corpus. The `freq_low` variable gives the number of times the term appears starting with a lowercase letter.
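
The capitalization rule described above can be sketched from the two count columns. This is a minimal illustration, not the procedure used to build the list: the function names are hypothetical, the counts are made up, and it assumes every occurrence that does not start with a lowercase letter starts with an uppercase one.

```python
def lowercase_share(freq, freq_low):
    """Fraction of a term's occurrences that start with a lowercase letter."""
    return freq_low / freq


def listed_capitalized(freq, freq_low):
    """True if uppercase-initial occurrences outnumber lowercase-initial ones.

    Assumption: freq - freq_low counts the uppercase-initial occurrences.
    """
    return (freq - freq_low) > freq_low


# Made-up counts for two hypothetical terms, not values from the dataset.
name_like = listed_capitalized(freq=1000, freq_low=40)    # mostly uppercase-initial
common_word = listed_capitalized(freq=1000, freq_low=900)  # mostly lowercase-initial
print(name_like, common_word)
```

A low `lowercase_share` suggests a term that is mostly used as a proper noun or at the start of a sentence.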

Variables

term. unique word (only unigrams)
freq. term count in the corpus
cd_count. number of films/episodes in which the term appears (max 8,388)
freq_low. number of times the word appears in the corpus starting with a lowercase letter
cd_low. number of films/episodes in which the term appears starting with a lowercase letter
subtl_wf. word frequency per million words
lg10_wf. log10(freq + 1)
subtl_cd. percent of the films/episodes in which the term appears
lg10_cd. log10(cd_count + 1)
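
The derived columns follow from the raw counts, and the relationships can be sketched as below. This is an illustration under stated assumptions: the function name `derived_columns` is hypothetical, the corpus size is the rounded 51 million figure from the description, and the counts are made up. Values recomputed this way may differ slightly from the published columns, which are based on the exact corpus size.

```python
import math

# Corpus totals taken from the dataset description (word total is rounded).
TOTAL_WORDS = 51_000_000  # approximate corpus size in words
TOTAL_DOCS = 8_388        # films and television episodes


def derived_columns(freq, cd_count):
    """Recompute the derived columns from raw counts (illustrative only)."""
    return {
        "subtl_wf": freq / TOTAL_WORDS * 1_000_000,  # frequency per million words
        "lg10_wf": math.log10(freq + 1),
        "subtl_cd": cd_count / TOTAL_DOCS * 100,     # percent of films/episodes
        "lg10_cd": math.log10(cd_count + 1),
    }


# Made-up counts for a hypothetical term, not values from the dataset.
row = derived_columns(freq=5100, cd_count=4194)
print(row)
```

Adding 1 before taking the logarithm keeps the value defined for a count of zero.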

References

Brysbaert, M., and New, B. (2009). "Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English." Behavior Research Methods, 41(4), 977-990.