Provides access to 8 precompiled stoplists, including the most commonly used
stoplist from the Snowball stemming package ("snowball2014"), text2map's
tiny stoplist ("tiny2020"), and a few historically important stoplists. This
aims to be a transparent and well-documented collection of stoplists. Only
English-language stoplists are included at the moment.
get_stoplist(source = "tiny2020", language = "en", tidy = FALSE)
source: Character indicating the source of the stoplist (default = "tiny2020").
language: Character (default = "en") indicating the language of the stopwords by ISO 639-1 code; currently only English is supported.
tidy: Logical (default = FALSE); if TRUE, the stoplist is returned as a tibble.
Returns a character vector of words to be stopped; if tidy = TRUE, a tibble is returned instead.
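For example, a minimal usage sketch (assuming text2map is installed; the token-filtering step at the end is plain base R, not part of the package):

library(text2map)

# Default: the tiny2020 stoplist as a character vector
stops <- get_stoplist()

# A larger list, returned as a tibble instead
snow <- get_stoplist(source = "snowball2014", tidy = TRUE)

# Typical use: drop stopped words from a vector of tokens
tokens <- c("the", "cat", "sat", "on", "the", "mat")
tokens[!tokens %in% stops]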
There is no such thing as a stopword! But there are tons of
precompiled lists of words that someone thinks we should remove from
our texts (see, for example: https://github.com/igorbrigadir/stopwords).
One of the first stoplists is from C.J. van Rijsbergen's "Information
retrieval: theory and practice" (1979) and includes 250 words.
text2map's very own stoplist, tiny2020, is a lean 34 words.
Below are stoplists available with get_stoplist:
"tiny2020": Tiny (2020) list of 33 words (Default)
"snowball2001": Snowball stemming package's (2001) list of 127 words
"snowball2014": Updated Snowball (2014) list of 175 words
"van1979": C. J. van Rijsbergen's (1979) list of 250 words
"fox1990": Christopher Fox's (1990) list of 421 words
"smart1993": Original SMART (1993) list of 570 words
"onix2000": ONIX (2000) list of 196 words
"nltk2001": Python's NLTK (2009) list of 179 words
The Snowball (2014) stoplist is likely the most commonly used: it is the
default in the stopwords package, which is used by the quanteda, tidytext,
and tokenizers packages. It is followed closely by the SMART (1993)
stoplist, the default in the tm package. The word counts for SMART (1993)
and ONIX (2000) are slightly different from those reported elsewhere
because of duplicate words.
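As a rough cross-check (assuming the stopwords package is installed), one could compare text2map's copy of the Snowball list against the copy shipped with stopwords:

library(stopwords)

# Snowball list as shipped with text2map
snowball_t2m <- get_stoplist(source = "snowball2014")

# Snowball list as shipped with the stopwords package
snowball_pkg <- stopwords::stopwords(language = "en", source = "snowball")

# Words in one copy but not the other (empty if identical)
setdiff(snowball_t2m, snowball_pkg)
setdiff(snowball_pkg, snowball_t2m)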