Provides access to 8 precompiled stoplists, including the most commonly used stoplist from the Snowball stemming package ("snowball2014"), text2map's tiny stoplist ("tiny2020"), and a few historically important stoplists. This aims to be a transparent and well-documented collection of stoplists. Only English-language stoplists are included at the moment.

get_stoplist(source = "tiny2020", language = "en", tidy = FALSE)

Arguments

source

Character indicating source, default = "tiny2020"

language

Character (default = "en") indicating the language of the stopwords by ISO 639-1 code. Currently only English is supported.

tidy

Logical (default = FALSE). If TRUE, a tibble is returned instead of a character vector.

Value

Character vector of words to be stopped. If tidy = TRUE, a tibble is returned instead.
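
A minimal usage sketch of the two return types, assuming the text2map package is installed (the guard simply skips the calls otherwise). It uses only the arguments documented above; the variable names are illustrative.

```r
# Skip quietly if text2map is not installed.
if (requireNamespace("text2map", quietly = TRUE)) {
  # Default call: the "tiny2020" list, returned as a character vector.
  stop_vec <- text2map::get_stoplist()
  print(head(stop_vec))

  # tidy = TRUE returns a tibble instead of a character vector.
  stop_tbl <- text2map::get_stoplist(source = "snowball2014", tidy = TRUE)
  print(stop_tbl)
}
```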

Details

There is no such thing as a stopword! But there are tons of precompiled lists of words that someone thinks we should remove from our texts (see, for example, https://github.com/igorbrigadir/stopwords). One of the first stoplists is from C. J. van Rijsbergen's "Information retrieval: theory and practice" (1979) and includes 250 words. text2map's very own stoplist tiny2020 is a lean 34 words.

Below are stoplists available with get_stoplist:

  • "tiny2020": Tiny (2020) list of 33 words (Default)

  • "snowball2001": Snowball stemming package's (2001) list of 127 words

  • "snowball2014": Updated Snowball (2014) list of 175 words

  • "van1979": C. J. van Rijsbergen's (1979) list of 250 words

  • "fox1990": Christopher Fox's (1990) list of 421 words

  • "smart1993": Original SMART (1993) list of 570 words

  • "onix2000": ONIX (2000) list of 196 words

  • "nltk2001": Python's NLTK (2009) list of 179 words
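
The sizes listed above can be checked directly. The following sketch, assuming text2map is installed, loops over the eight source names and counts the words each call returns; `sources` and `n_words` are illustrative names, not part of the package.

```r
# The eight source names listed above.
sources <- c("tiny2020", "snowball2001", "snowball2014", "van1979",
             "fox1990", "smart1993", "onix2000", "nltk2001")

# Skip quietly if text2map is not installed.
if (requireNamespace("text2map", quietly = TRUE)) {
  # Length of the character vector returned for each source,
  # as a named integer vector.
  n_words <- vapply(
    sources,
    function(src) length(text2map::get_stoplist(source = src)),
    integer(1)
  )
  print(n_words)
}
```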

The Snowball (2014) stoplist is likely the most commonly used: it is the default in the stopwords package, which is used by the quanteda, tidytext, and tokenizers packages. It is followed closely by the SMART (1993) stoplist, the default in the tm package. The word counts for SMART (1993) and ONIX (2000) differ slightly from those reported elsewhere because of duplicate words.
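
Because a stoplist is returned as a plain character vector, removing stopwords from a tokenized text needs only base R. The sketch below uses a hypothetical helper, `remove_stopwords()`, which is not part of text2map, and a toy stoplist so that it runs without the package. The final call assumes text2map is installed.

```r
# Hypothetical helper (not part of text2map): drop tokens that appear
# in a stoplist, ignoring case.
remove_stopwords <- function(tokens, stoplist) {
  tokens[!tolower(tokens) %in% tolower(stoplist)]
}

tokens <- c("The", "cat", "sat", "on", "the", "mat")

# Toy stoplist, used here only for illustration.
toy_stoplist <- c("the", "on")
print(remove_stopwords(tokens, toy_stoplist))
# "cat" "sat" "mat"

# The same helper with a precompiled list, if text2map is available.
if (requireNamespace("text2map", quietly = TRUE)) {
  snowball <- text2map::get_stoplist(source = "snowball2014")
  print(remove_stopwords(tokens, snowball))
}
```

Lowercasing both sides is a design choice for this sketch. Whether it is appropriate depends on how the text was tokenized and on the casing used in the chosen stoplist.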

Author

Dustin Stoltz