A dataset containing word frequencies and ranks from the SUBTLEXus corpus, which comprises American English subtitles downloaded from www.opensubtitles.org. The corpus covers 8,388 films and television episodes and totals 51 million words: 16.1 million from television episodes, 14.3 million from films released before 1990, and 20.6 million from films released after 1990.
Details
Note that this frequency list is partly case sensitive. The `term` variable starts with a capital letter if the word starts with an uppercase letter more often than with a lowercase letter in the corpus. The `FREQlow` variable gives the number of times the term appears starting with a lowercase letter.
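The capitalization rule can be checked from the counts themselves. The following is a minimal sketch, not the corpus authors' code. It assumes that `freq` counts every occurrence of the term regardless of case, so that `freq - FREQlow` is the number of uppercase-initial occurrences. The function name and the example counts are hypothetical.

```python
def mostly_capitalized(freq, freq_low):
    """Return True if the term starts with an uppercase letter more often
    than with a lowercase letter.

    Assumption: freq is the total count in any case, so the uppercase-initial
    count is freq - freq_low.
    """
    upper_count = freq - freq_low
    return upper_count > freq_low


# Hypothetical counts, not taken from the dataset.
name_like = mostly_capitalized(freq=1000, freq_low=50)     # mostly uppercase-initial
common_word = mostly_capitalized(freq=1000, freq_low=900)  # mostly lowercase-initial
```

Under this assumption, a term with equal uppercase and lowercase counts would be listed in lowercase, since the rule requires "more often".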
Variables
term. unique word (only unigrams)
freq. term count in the corpus
CDcount. number of films/episodes in which the term appears (max 8,388)
FREQlow. number of times the word appears in the corpus starting with a lowercase letter
CDlow. number of films/episodes in which the term appears starting with a lowercase letter
SUBTLwf. word frequency per million words
Lg10WF. log10(freq + 1)
SUBTLcd. percent of the films/episodes in which the term appears
Lg10CD. log10(CDcount + 1)
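
The derived variables follow from `freq` and `CDcount`. The sketch below recomputes them under stated assumptions: the film/episode total of 8,388 comes from the description, while the corpus size is taken as the rounded figure of 51 million words, so `SUBTLwf` values computed this way may differ slightly from those in the dataset. The function name, constant names, and example counts are hypothetical.

```python
import math

TOTAL_FILMS = 8388            # films/episodes in the corpus (from the description)
TOTAL_WORDS_MILLIONS = 51.0   # rounded corpus size; an assumption for this sketch


def derived_columns(freq, cd_count):
    """Recompute the derived variables from the raw counts."""
    return {
        "SUBTLwf": freq / TOTAL_WORDS_MILLIONS,      # frequency per million words
        "Lg10WF": math.log10(freq + 1),              # log10(freq + 1)
        "SUBTLcd": 100.0 * cd_count / TOTAL_FILMS,   # percent of films/episodes
        "Lg10CD": math.log10(cd_count + 1),          # log10(CDcount + 1)
    }


# Hypothetical counts, not taken from the dataset.
row = derived_columns(freq=999, cd_count=4194)
```

Adding 1 before taking the logarithm keeps the value defined for a count of zero.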