Provides access to 8 precompiled stoplists, including the most commonly used
stoplist from the Snowball stemming package ("snowball2014"), text2map's
tiny stoplist ("tiny2020"), and a few historically important stoplists. This
aims to be a transparent and well-documented collection of stoplists. Only
English-language stoplists are included at the moment.
Details
There is no such thing as a stopword! But there are tons of
precompiled lists of words that someone thinks we should remove from
our texts. (See, for example: https://github.com/igorbrigadir/stopwords)
One of the first stoplists comes from C. J. van Rijsbergen's "Information
retrieval: theory and practice" (1979) and includes 250 words.
text2map's very own stoplist, tiny2020, is a lean 34 words.
Below are the stoplists available with get_stoplist:
"tiny2020": Tiny (2020) list of 34 words (Default)
"snowball2001": Snowball stemming package's (2001) list of 127 words
"snowball2014": Updated Snowball (2014) list of 175 words
"van1979": C. J. van Rijsbergen's (1979) list of 250 words
"fox1990": Christopher Fox's (1990) list of 421 words
"smart1993": Original SMART (1993) list of 570 words
"onix2000": ONIX (2000) list of 196 words
"nltk2009": Python's NLTK (2009) list of 179 words
The Snowball (2014) stoplist is likely the most commonly used: it is the
default in the stopwords package, which is used by the quanteda, tidytext,
and tokenizers packages. It is followed closely by the SMART (1993)
stoplist, the default in the tm package. The word counts for SMART (1993)
and ONIX (2000) differ slightly from those reported elsewhere because
duplicate words were removed.
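A minimal sketch of the typical use of a stoplist, which is to filter tokens out of a text. It assumes only that get_stoplist() returns a character vector of words, as shown in the Examples; the tokens vector here is made up for illustration:

```r
stops <- get_stoplist("snowball2014")

# A toy tokenized text (hypothetical; any character vector of words works)
tokens <- c("the", "cat", "sat", "on", "the", "mat")

# Keep only the tokens that are not in the stoplist
tokens[!tokens %in% stops]
#> [1] "cat" "sat" "mat"
```

The same logical-subsetting pattern works with any of the stoplists listed above; only the name passed to get_stoplist() changes.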
Examples
# \donttest{
stops <- get_stoplist("snowball2014")
head(stops)
#> [1] "a" "about" "above" "after" "again" "against"
stops_tb <- get_stoplist("snowball2014", tidy = TRUE)
head(stops_tb)
#> # A tibble: 6 × 2
#> word lexicon
#> <chr> <chr>
#> 1 a snowball2014
#> 2 about snowball2014
#> 3 above snowball2014
#> 4 after snowball2014
#> 5 again snowball2014
#> 6 against snowball2014
# }
