dtm_stopper
will "stop" terms from the analysis by removing columns in a
DTM based on stop rules. Rules include matching terms in a precompiled or
custom list, terms meeting an upper or lower document frequency threshold,
or terms meeting an upper or lower term frequency threshold.
dtm_stopper(
  dtm,
  stop_list = NULL,
  stop_termfreq = NULL,
  stop_termrank = NULL,
  stop_termprop = NULL,
  stop_docfreq = NULL,
  stop_docprop = NULL,
  stop_hapax = FALSE,
  stop_null = FALSE,
  omit_empty = FALSE,
  dense = FALSE,
  ignore_case = TRUE
)
dtm: Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder function.
stop_list: Vector of terms, from a precompiled stoplist or a custom list such as c("never", "gonna", "give").
stop_termfreq: Vector of two numbers indicating the lower and upper term frequency thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_termrank: Single integer indicating the upper term rank threshold for exclusion (see details).
stop_termprop: Vector of two numbers indicating the lower and upper term proportion thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_docfreq: Vector of two numbers indicating the lower and upper document frequency thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_docprop: Vector of two numbers indicating the lower and upper document proportion thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_hapax: Logical (default = FALSE) indicating whether to remove terms occurring one time (or zero times), a.k.a. hapax legomena.
stop_null: Logical (default = FALSE) indicating whether to remove terms that occur zero times in the DTM.
omit_empty: Logical (default = FALSE) indicating whether to omit rows that are empty after stopping any terms.
dense: The default (FALSE) returns a matrix of class "dgCMatrix". Setting dense to TRUE returns a normal base R dense matrix.
ignore_case: Logical (default = TRUE) indicating whether to ignore capitalization.
Returns a document-term matrix of class "dgCMatrix" (or a base R dense matrix if dense = TRUE).
Stopping terms by removing their respective columns in the DTM is
significantly more efficient than searching raw text with string matching
and deletion rules. Behind the scenes, the function relies on
the fastmatch
package to quickly match/not-match terms.
The stop_list argument takes a vector of terms which are matched and removed from the DTM. If ignore_case = TRUE (the default), then word case will be ignored.
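For instance, a minimal sketch (assuming a DTM like the one built in the examples below, whose columns are lowercase) showing that matching is case-insensitive by default:

## "World" and "Babies" still match the lowercase columns "world" and "babies"
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("World", "Babies"),
  ignore_case = TRUE
)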
The stop_termfreq argument provides rules based on a term's occurrences in the DTM as a whole -- regardless of its within-document frequency. If real numbers between 0 and 1 are provided, then terms will be removed by corpus proportion. For example, with c(0.01, 0.99), terms that make up less than 1% of the total tokens or more than 99% of the total tokens will be removed. If integers are provided, then terms will be removed by total count. For example, with c(100, 9000), terms occurring fewer than 100 or more than 9000 times in the corpus will be removed. This also means that if c(0, 1) is provided, then the function will only keep terms occurring once.
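A minimal sketch (assuming a DTM of raw counts, like the one built in the examples below):

## keep only terms occurring between 2 and 4 times in the corpus
dtm_st <- dtm_stopper(
  dtm,
  stop_termfreq = c(2, 4)
)

## or, by corpus proportion, remove terms below 1% or above 99% of all tokens
dtm_st <- dtm_stopper(
  dtm,
  stop_termfreq = c(0.01, 0.99)
)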
The stop_termrank argument provides the upper threshold for a term's rank in the corpus. For example, 5L will remove the five most frequent terms.
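As a sketch (again assuming the dtm from the examples below):

## remove the five most frequent terms in the corpus
dtm_st <- dtm_stopper(
  dtm,
  stop_termrank = 5L
)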
The stop_docfreq argument provides rules based on a term's document frequency -- i.e., the number of documents within which it occurs, regardless of how many times it occurs. If real numbers between 0 and 1 are provided, then terms will be removed by document proportion. For example, with c(0.01, 0.99), terms in less than 1% of all documents or more than 99% of all documents will be removed. If integers are provided, then terms will be removed by document count. For example, with c(100, 9000), terms occurring in fewer than 100 documents or more than 9000 documents will be removed. This means that if c(0, 1) is provided, then the function will only keep terms occurring in exactly one document, and remove terms in more than one.
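A sketch using document proportions (assuming the dtm from the examples below):

## remove terms appearing in less than 20% or more than 80% of all documents
dtm_st <- dtm_stopper(
  dtm,
  stop_docfreq = c(0.2, 0.8)
)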
The stop_hapax argument is a shortcut for removing terms occurring just one time in the corpus -- called hapax legomena. Typically, a sizeable portion of a corpus tends to be hapax terms, and removing them is a quick way to reduce the dimensions of a DTM. The DTM must contain frequency counts (not relative frequencies).
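For instance (assuming a dtm of raw counts, like the one in the examples below):

## remove hapax legomena (terms occurring once, or not at all)
dtm_st <- dtm_stopper(
  dtm,
  stop_hapax = TRUE
)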
The stop_null
argument removes terms that do not occur at all.
In other words, there is a column for the term, but the entire column
is zero. This can occur for a variety of reasons, such as starting with
a predefined vocabulary (e.g., using dtm_builder's vocab
argument) or
through some cleaning processes.
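A minimal sketch; note this has no effect unless the DTM actually contains all-zero columns (e.g., when it was built from a predefined vocabulary):

## drop any columns whose counts are all zero
dtm_st <- dtm_stopper(
  dtm,
  stop_null = TRUE
)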
The omit_empty argument will remove documents that are empty -- that is, rows in which every cell is zero after the terms are stopped.
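For example, a sketch (assuming the corpus and dtm from the examples below), where stopping every term in a given line leaves that row empty, and omit_empty then drops it:

## stop terms and drop any documents left with no remaining terms
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("what", "a", "wonderful", "world"),
  omit_empty = TRUE
)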
# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)
## example 1 with R 4.1 pipe
# \donttest{
dtm_st <- dtm |>
  dtm_stopper(stop_list = c("world", "babies"))
# }
## example 2 without pipe
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("world", "babies")
)
## example 3 precompiled stoplist
dtm_st <- dtm_stopper(
  dtm,
  stop_list = get_stoplist("snowball2014")
)
## example 4, stop top 2
dtm_st <- dtm_stopper(
  dtm,
  stop_termrank = 2L
)
## example 5, stop docfreq
dtm_st <- dtm_stopper(
  dtm,
  stop_docfreq = c(2, 5)
)
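## example 6, return a base R dense matrix instead of a sparse "dgCMatrix"
## (a sketch; dense = TRUE only changes the class of the returned object)
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("world", "babies"),
  dense = TRUE
)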