Takes any DTM and randomly resamples from each row, creating a new DTM

dtm_resampler(dtm, alpha = NULL, n = NULL)

Arguments

dtm

Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function.

alpha

Number indicating proportion of document lengths, e.g., alpha = 1 returns resampled rows that are the same lengths as the original DTM.

n

Integer indicating the length of documents to be returned, e.g., n = 100L will bring documents shorter than 100 tokens up to 100, while bringing documents longer than 100 tokens down to 100.

Value

Returns a document-term matrix of class "dgCMatrix".

Details

Using the row counts as probabilities, each document's tokens are resampled with replacement up to a certain proportion of the row count (set by alpha). This function can be used with iteration to "bootstrap" a DTM without returning to the raw text. It does not iterate, however, so operations can be performed on one DTM at a time without storing multiple DTMs in memory.
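The resampling idea described above can be sketched in base R. This is an illustration only, not the function's actual implementation: `resample_rows()` and `toy_dtm` are hypothetical names, the DTM is a small dense matrix rather than a "dgCMatrix", and the use of `rmultinom()` and `round()` is an assumption about how the draw and the target length might be computed. The loop at the end shows the "bootstrap" pattern, where only one resampled DTM exists at a time and only the statistic of interest is stored.

```r
set.seed(42)  # only so this illustration is reproducible

# Toy DTM: 2 documents (rows), 3 terms (columns)
toy_dtm <- matrix(
  c(2, 0, 3,
    1, 4, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("a", "b", "c"))
)

# Hypothetical helper (not part of any package): redraw each row's tokens
# with replacement, using the row's counts as sampling weights.
resample_rows <- function(dtm, alpha = 1) {
  out <- dtm
  for (i in seq_len(nrow(dtm))) {
    total <- sum(dtm[i, ])
    if (total == 0) next  # empty documents stay empty
    # rmultinom() normalizes `prob` internally, so raw counts can be passed
    out[i, ] <- rmultinom(1, size = round(total * alpha), prob = dtm[i, ])
  }
  out
}

# Bootstrap a statistic: one resampled DTM in memory per iteration
n_iter <- 200
share_a <- numeric(n_iter)
for (b in seq_len(n_iter)) {
  boot <- resample_rows(toy_dtm, alpha = 1)
  share_a[b] <- boot["doc1", "a"] / sum(boot["doc1", ])
}

# Percentile interval for the share of term "a" in doc1
quantile(share_a, c(0.025, 0.975))
```

Terms with a count of zero in a row can never be drawn for that row, so the resampled DTM keeps the same vocabulary and the same zero cells as the original.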

If alpha is less than 1, each resampled document is shorter than the original by that proportion. For example, alpha = 0.50 returns a resampled DTM in which each row has half the tokens of the corresponding row in the original DTM. If alpha = 2, then each row in the resampled DTM has twice the number of tokens of the original. If an integer is provided to n, then all documents are resampled to that length. For example, n = 2000L resamples each document until it is 2000 tokens long, meaning documents shorter than 2000 tokens are lengthened and documents longer than 2000 tokens are shortened. alpha and n should not be specified at the same time.
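To make the arithmetic concrete, the following base R sketch computes the target document lengths implied by alpha or n. `target_lengths()` is a hypothetical helper written for this illustration, not a function from the package. Rounding with `round()`, falling back to alpha = 1 when neither argument is given, and raising an error when both are given are all assumptions, not documented behavior.

```r
# Hypothetical helper: target length of each resampled document,
# given the original row totals and either `alpha` or `n`.
target_lengths <- function(row_lengths, alpha = NULL, n = NULL) {
  if (!is.null(alpha) && !is.null(n)) {
    # assumption: treat setting both arguments as an error
    stop("Specify either alpha or n, not both.")
  }
  if (!is.null(n)) {
    # every document is resampled to the same fixed length
    return(rep(as.integer(n), length(row_lengths)))
  }
  if (is.null(alpha)) alpha <- 1  # assumption: same lengths as the original
  # each document is scaled by the same proportion
  round(row_lengths * alpha)
}

lens <- c(doc1 = 50, doc2 = 4000)

target_lengths(lens, alpha = 0.5)  # half of each original length
target_lengths(lens, alpha = 2)    # twice each original length
target_lengths(lens, n = 2000L)    # all documents brought to 2000 tokens
```

With alpha, the relative differences in document length are preserved. With n, they are removed: the 50-token document is lengthened and the 4000-token document is shortened to the same size.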