dtm_stopper
will "stop" terms from the analysis by removing columns in a
DTM based on stop rules. Rules include matching terms in a precompiled or
custom list, terms meeting an upper or lower document frequency threshold,
or terms meeting an upper or lower term frequency threshold.
dtm_stopper(
  dtm,
  stop_list = NULL,
  stop_termfreq = NULL,
  stop_termrank = NULL,
  stop_termprop = NULL,
  stop_docfreq = NULL,
  stop_docprop = NULL,
  stop_hapax = FALSE,
  stop_null = FALSE,
  omit_empty = FALSE,
  dense = FALSE,
  ignore_case = TRUE
)
dtm: Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder function.
stop_list: Vector of terms, from a precompiled stoplist or a custom list such as c("never", "gonna", "give").
stop_termfreq: Vector of two numbers indicating the lower and upper term frequency thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_termrank: Single integer indicating the upper term rank threshold for exclusion (see details).
stop_termprop: Vector of two numbers indicating the lower and upper term proportion thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_docfreq: Vector of two numbers indicating the lower and upper document frequency thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_docprop: Vector of two numbers indicating the lower and upper document proportion thresholds for exclusion (see details). Use Inf for max or min, respectively.
stop_hapax: Logical (default = FALSE) indicating whether to remove terms occurring one time (or zero times), a.k.a. hapax legomena.
stop_null: Logical (default = FALSE) indicating whether to remove terms that occur zero times in the DTM.
omit_empty: Logical (default = FALSE) indicating whether to omit rows that are empty after stopping any terms.
dense: The default (FALSE) returns a matrix of class "dgCMatrix". Setting dense to TRUE returns a normal base R dense matrix.
ignore_case: Logical (default = TRUE) indicating whether to ignore capitalization.
Returns a document-term matrix of class "dgCMatrix" (or a base R dense matrix if dense = TRUE).
Stopping terms by removing their respective columns in the DTM is
significantly more efficient than searching raw text with string matching
and deletion rules. Behind the scenes, the function relies on
the fastmatch
package to quickly match/not-match terms.
The stop_list argument takes a vector of terms which are matched and removed from the DTM. If ignore_case = TRUE (the default), then word case will be ignored.
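For instance, a minimal sketch (assuming a DTM like the one built in the examples below, whose columns are lowercase) showing that matching is case-insensitive by default:

## "World" and "Babies" still match the lowercase columns "world" and "babies"
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("World", "Babies"),
  ignore_case = TRUE
)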
The stop_termfreq argument provides rules based on a term's occurrences in the DTM as a whole -- regardless of its within-document frequency. If real numbers between 0 and 1 are provided, then terms will be removed by corpus proportion. For example, with c(0.01, 0.99), terms that make up less than 1% of the total tokens or more than 99% of the total tokens will be removed. If integers are provided, then terms will be removed by total count. For example, with c(100, 9000), terms occurring fewer than 100 or more than 9000 times in the corpus will be removed. This also means that if c(0, 1) is provided, then the function will only keep terms occurring once.
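A minimal sketch (assuming a DTM of raw counts, like the one built in the examples below):

## keep only terms occurring between 2 and 4 times in the corpus
dtm_st <- dtm_stopper(
  dtm,
  stop_termfreq = c(2, 4)
)

## or, by corpus proportion, remove terms below 1% or above 99% of all tokens
dtm_st <- dtm_stopper(
  dtm,
  stop_termfreq = c(0.01, 0.99)
)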
The stop_termrank argument provides the upper threshold for a term's rank in the corpus. For example, 5L will remove the five most frequent terms.
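As a sketch (again assuming the dtm from the examples below):

## remove the five most frequent terms in the corpus
dtm_st <- dtm_stopper(
  dtm,
  stop_termrank = 5L
)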
The stop_docfreq argument provides rules based on a term's document frequency -- i.e., the number of documents within which it occurs, regardless of how many times it occurs. If real numbers between 0 and 1 are provided, then terms will be removed by document proportion. For example, with c(0.01, 0.99), terms in less than 1% of all documents or more than 99% of all documents will be removed. If integers are provided, then terms will be removed by document count. For example, with c(100, 9000), terms occurring in fewer than 100 documents or more than 9000 documents will be removed. This means that if c(0, 1) is provided, then the function will only keep terms occurring in exactly one document, and remove terms in more than one.
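A sketch using document proportions (assuming the dtm from the examples below):

## remove terms appearing in less than 20% or more than 80% of all documents
dtm_st <- dtm_stopper(
  dtm,
  stop_docfreq = c(0.2, 0.8)
)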
The stop_hapax argument is a shortcut for removing terms occurring just one time in the corpus -- called hapax legomena. Typically, a sizeable portion of a corpus tends to be hapax terms, and removing them is a quick way to reduce the dimensions of a DTM. The DTM must contain frequency counts (not relative frequencies).
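For instance (assuming a dtm of raw counts, like the one in the examples below):

## remove hapax legomena (terms occurring once, or not at all)
dtm_st <- dtm_stopper(
  dtm,
  stop_hapax = TRUE
)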
The stop_null
argument removes terms that do not occur at all.
In other words, there is a column for the term, but the entire column
is zero. This can occur for a variety of reasons, such as starting with
a predefined vocabulary (e.g., using dtm_builder's vocab
argument) or
through some cleaning processes.
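A minimal sketch; note this has no effect unless the DTM actually contains all-zero columns (e.g., when it was built from a predefined vocabulary):

## drop any columns whose counts are all zero
dtm_st <- dtm_stopper(
  dtm,
  stop_null = TRUE
)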
The omit_empty argument will remove documents that are empty -- that is, rows in which every cell is zero after the terms are stopped.
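For example, a sketch (assuming the corpus and dtm from the examples below), where stopping every term in a given line leaves that row empty, and omit_empty then drops it:

## stop terms and drop any documents left with no remaining terms
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("what", "a", "wonderful", "world"),
  omit_empty = TRUE
)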
# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)
## example 1 with R 4.1 pipe
# \donttest{
dtm_st <- dtm |>
  dtm_stopper(stop_list = c("world", "babies"))
# }
## example 2 without pipe
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("world", "babies")
)
## example 3 precompiled stoplist
dtm_st <- dtm_stopper(
  dtm,
  stop_list = get_stoplist("snowball2014")
)
## example 4, stop top 2
dtm_st <- dtm_stopper(
  dtm,
  stop_termrank = 2L
)
## example 5, stop docfreq
dtm_st <- dtm_stopper(
  dtm,
  stop_docfreq = c(2, 5)
)
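## example 6, return a base R dense matrix instead of a sparse "dgCMatrix"
## (a sketch; dense = TRUE only changes the class of the returned object)
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("world", "babies"),
  dense = TRUE
)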