A dataset containing eight English stoplists, used with the get_stoplist() function.

Format

A data frame with 1775 rows and 2 variables.

Source

  • tiny2020: Stoltz and Taylor (2020)

  • snowball2001: Porter (2001) Snowball stemming algorithm

  • snowball2014: Porter (2014) updated Snowball stoplist

  • van1979: van Rijsbergen (1979) "Information Retrieval"

  • fox1990: Fox (1990) "A Stop List for General Text"

  • smart1993: Salton and Buckley (1993) SMART retrieval system

  • onix2000: ONIX (2000) "Oxford English Dictionary" stoplist

  • nltk2009: Bird, Loper and Klein (2009) NLTK

Details

The stoplists include:

  • "tiny2020": Tiny (2020) list of 33 words (Default)

  • "snowball2001": Snowball (2001) list of 127 words

  • "snowball2014": Updated Snowball (2014) list of 175 words

  • "van1979": van Rijsbergen's (1979) list of 250 words

  • "fox1990": Christopher Fox's (1990) list of 421 words

  • "smart1993": Original SMART (1993) list of 570 words

  • "onix2000": ONIX (2000) list of 196 words

  • "nltk2009": Python's NLTK (2009) list of 179 words

Tiny 2020 is a very small stoplist of the most frequent English conjunctions, articles, prepositions, and demonstratives (N=17). It also includes the 8 forms of the copular verb "to be" and the 8 most frequent personal pronouns (singular and plural), excluding gendered and possessive pronouns.

No contractions are included.
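
Because the list has no contractions, a token such as "isn't" will not be matched by exact lookup. The sketch below illustrates one possible workaround, assuming a hand-written expansion table; the stop words and the `expansions` lookup are hypothetical and are not the actual contents of any stoplist.

```r
# Illustrative stop words only; NOT the real contents of any stoplist.
stop_words <- c("it", "is", "not")

tokens <- c("it", "isn't", "raining")

# Exact matching leaves the contraction untouched.
kept_raw <- tokens[!tokens %in% stop_words]

# Hypothetical lookup that expands contractions before filtering.
expansions <- c("isn't" = "is not")
expanded <- ifelse(tokens %in% names(expansions), expansions[tokens], tokens)

# Split expanded entries back into single tokens, then filter again.
expanded_tokens <- unlist(strsplit(expanded, " ", fixed = TRUE))
kept_expanded <- expanded_tokens[!expanded_tokens %in% stop_words]
```

Whether to expand contractions, or to leave them in the text, depends on the analysis at hand.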

Variables

  • words. words to be stopped

  • source. source of the list
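
The two columns described above make it possible to select a single stoplist by filtering on source. The following is a minimal base R sketch using a toy data frame with the same column names; `stoplists_demo` and its rows are assumptions for illustration and do not reproduce the dataset's actual contents.

```r
# Toy stand-in with the same two columns as the documented data frame.
# The rows are illustrative only, not the dataset's real contents.
stoplists_demo <- data.frame(
  words  = c("the", "and", "of", "the", "however"),
  source = c("tiny2020", "tiny2020", "tiny2020", "smart1993", "smart1993"),
  stringsAsFactors = FALSE
)

# Select the words belonging to one stoplist by filtering on `source`.
tiny_words <- stoplists_demo$words[stoplists_demo$source == "tiny2020"]

# Count how many words each stoplist contributes.
list_sizes <- table(stoplists_demo$source)

# Drop stop words from a tokenized text.
tokens <- c("the", "cat", "and", "the", "hat")
kept <- tokens[!tokens %in% tiny_words]
```

In practice the get_stoplist() function mentioned above is the intended way to retrieve a list; the manual filter only shows how the data frame is laid out.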