A dataset of German word frequencies and ranks drawn from three sources: news from the Leipzig Corpora Collection, Wikipedia, and subtitles from OpenSubtitles.

Usage

german_freqs

Format

A data frame with 743,574 rows and 10 variables.

Variables

  • term. unique word (unigrams only)

  • freq_wiki. term count in the 2021 German Wikipedia corpus

  • wiki. logical; whether the term occurs in the 2021 German Wikipedia corpus

  • rank_wiki. term rank in the 2021 German Wikipedia corpus

  • freq_news. term count in the 2024 Leipzig news corpus

  • news. logical; whether the term occurs in the 2024 Leipzig news corpus

  • rank_news. term rank in the 2024 Leipzig news corpus

  • freq_subs. term count in the 2018 OpenSubtitles corpus

  • subtitles. logical; whether the term occurs in the 2018 OpenSubtitles corpus

  • rank_subs. term rank in the 2018 OpenSubtitles corpus
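
Examples

The following is a minimal sketch of how the logical columns and the count columns might be combined. So that it runs without the package installed, it uses a small made-up data frame, toy_freqs (a hypothetical stand-in, not part of the package), with the same column names as german_freqs. All values are invented for illustration, and representing a missing count or rank as NA is an assumption, not something the description above states.

```r
# Toy stand-in for german_freqs: same column names, invented values.
# Assumption: a term absent from a corpus has NA as its count and rank.
toy_freqs <- data.frame(
  term      = c("und", "haus", "hallo", "bundestag"),
  freq_wiki = c(900L, 120L, NA, 40L),
  wiki      = c(TRUE, TRUE, FALSE, TRUE),
  rank_wiki = c(1L, 2L, NA, 3L),
  freq_news = c(800L, 90L, NA, 70L),
  news      = c(TRUE, TRUE, FALSE, TRUE),
  rank_news = c(1L, 2L, NA, 3L),
  freq_subs = c(700L, 60L, 50L, NA),
  subtitles = c(TRUE, TRUE, TRUE, FALSE),
  rank_subs = c(1L, 2L, 3L, NA),
  stringsAsFactors = FALSE
)

# Terms attested in all three corpora, selected via the logical columns.
in_all <- toy_freqs[toy_freqs$wiki & toy_freqs$news & toy_freqs$subtitles, ]

# Relative frequency within the subtitles corpus (count / total count).
in_all$rel_freq_subs <- in_all$freq_subs / sum(toy_freqs$freq_subs, na.rm = TRUE)

# Terms that occur only in subtitles, e.g. colloquial vocabulary.
subs_only <- toy_freqs$term[
  toy_freqs$subtitles & !toy_freqs$wiki & !toy_freqs$news
]

print(in_all[, c("term", "rank_wiki", "rank_news", "rank_subs", "rank_subs", "rel_freq_subs")[-5]])
print(subs_only)
```

With the package loaded, replacing toy_freqs with german_freqs should apply the same filters to the full data, provided the columns behave as assumed here.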

References

Hermit Dave (2020). FrequencyWords. https://github.com/hermitdave/FrequencyWords

D. Goldhahn, T. Eckart & U. Quasthoff (2012). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12).