
A dataset containing word frequencies and ranks from the SUBTLEXus corpus, which comprises American English subtitles downloaded from www.opensubtitles.org. The corpus covers 8,388 films and television episodes and totals 51 million words: 16.1 million from television episodes, 14.3 million from films before 1990, and 20.6 million from films after 1990.

Usage

subtlexus_freqs

Format

A data frame with 74,286 rows and 9 variables.

Details

Note that this frequency list is partly case sensitive. The `term` variable starts with a capital letter if the word starts with an uppercase letter more often than with a lowercase letter in the corpus. The `FREQlow` variable gives the number of times the term appears starting with a lowercase letter.
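The capitalization rule above can be sketched as follows. This is a minimal illustration and not the package's own code: the `toy` data frame and its counts are made up, and it assumes that `freq` counts every occurrence of a term regardless of case, so that `freq - FREQlow` is the number of capitalized occurrences.

```r
# Toy rows with made-up counts (NOT real SUBTLEXus values), using the
# column names of subtlexus_freqs.
toy <- data.frame(
  term    = c("the", "Paris"),
  freq    = c(1000, 200),
  FREQlow = c(900, 0),
  stringsAsFactors = FALSE
)

# Share of occurrences that start with a lowercase letter
# (assumes freq is the case-insensitive total).
toy$prop_lower <- toy$FREQlow / toy$freq

# TRUE when capitalized occurrences outnumber lowercase ones, which is
# the condition under which `term` is stored with a capital letter.
toy$mostly_upper <- toy$FREQlow < (toy$freq - toy$FREQlow)

# Case-insensitive lookup: lowercase the stored term before comparing.
hit <- toy[tolower(toy$term) == "paris", ]
```

Lowercasing `term` before matching is one way to avoid missing words that are stored capitalized; the same pattern should apply to `subtlexus_freqs` itself.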

Variables

  • term. unique word (only unigrams)

  • freq. term count in the corpus

  • CDcount. number of films/episodes in which the term appears (max 8,388)

  • FREQlow. number of times the word appears in the corpus starting with a lowercase letter

  • CDlow. number of films/episodes in which the term appears starting with a lowercase letter

  • SUBTLwf. word frequency per million words

  • Lg10WF. log10(freq + 1)

  • SUBTLcd. percent of the films/episodes in which the term appears

  • Lg10CD. log10(CDcount + 1)
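As a rough sketch of how the derived variables relate to the raw counts, the snippet below recomputes them for a toy data frame. The rows and counts are made up, `corpus_millions` and `n_documents` are illustrative names, and the corpus size is taken as the rounded figure of 51 million words, so values computed this way may differ slightly from the published columns.

```r
# Toy rows with made-up counts (NOT real SUBTLEXus values), using the
# column names documented above.
toy <- data.frame(
  term    = c("alpha", "Beta"),
  freq    = c(5100, 51),
  CDcount = c(4194, 8),
  stringsAsFactors = FALSE
)

corpus_millions <- 51    # assumed corpus size, in millions of words
n_documents     <- 8388  # films/episodes in the corpus

# Frequency per million words.
toy$SUBTLwf <- toy$freq / corpus_millions

# Log-transformed frequency; the + 1 keeps zero counts defined.
toy$Lg10WF <- log10(toy$freq + 1)

# Percent of films/episodes in which the term appears.
toy$SUBTLcd <- 100 * toy$CDcount / n_documents

# Log-transformed document count.
toy$Lg10CD <- log10(toy$CDcount + 1)
```

The same expressions can be applied to `subtlexus_freqs$freq` and `subtlexus_freqs$CDcount` to check the stored columns, bearing in mind the rounding caveat above.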

References

Brysbaert, M., and New, B. (2009). "Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English." Behavior Research Methods, 41(4):977-990.