Gets DTM summary statistics — dtm_stats

dtm_stats() provides a summary of corpus-level statistics for any document-term matrix (DTM). These include (1) basic information on size (total documents, total unique terms, total tokens), (2) lexical richness, (3) distribution information, (4) central tendency, and (5) character-level information.

dtm_stats(
  dtm,
  richness = TRUE,
  distribution = TRUE,
  central = TRUE,
  character = TRUE,
  simplify = FALSE
)

Arguments

dtm: Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or with one built using the dtm_builder() function.
richness: Logical (default = TRUE), whether to include statistics about lexical richness, i.e. terms that occur once, twice, and three times (hapax, dis, tris), and the total type-token ratio.
distribution: Logical (default = TRUE), whether to include statistics about the distribution, i.e. min, max, standard deviation, skewness, and kurtosis.
central: Logical (default = TRUE), whether to include statistics about the central tendencies, i.e. mean and median for types and tokens.
character: Logical (default = TRUE), whether to include statistics about the character lengths of terms, i.e. min, max, and mean.
simplify: Logical (default = FALSE), whether to return statistics as a data frame where each statistic is a column. Default returns a list of small data frames.
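
To make the terminology in the arguments above concrete, the following base R sketch computes a few of the named quantities by hand on a toy matrix. It is an illustration only, not the package's implementation: the toy matrix, the variable names, and the exact definitions (for example, type-token ratio as unique terms divided by total tokens) are assumptions, and dtm_stats() may define, round, or label its statistics differently.

```r
# Toy document-term matrix: 3 documents (rows) by 5 terms (columns).
# The counts and names are made up for illustration.
dtm <- matrix(
  c(2, 1, 0, 0, 1,
    0, 1, 1, 0, 0,
    1, 0, 0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("d1", "d2", "d3"),
    c("the", "cat", "sat", "mat", "a")
  )
)

# Basic size information
n_docs   <- nrow(dtm)   # total documents
n_types  <- ncol(dtm)   # total unique terms
n_tokens <- sum(dtm)    # total tokens

# Lexical richness: terms occurring once, twice, and three times corpus-wide
term_freq <- colSums(dtm)
hapax <- sum(term_freq == 1)
dis   <- sum(term_freq == 2)
tris  <- sum(term_freq == 3)
ttr   <- n_types / n_tokens   # assumed definition of the type-token ratio

# Central tendency of per-document token and type counts
doc_tokens <- rowSums(dtm)       # tokens in each document
doc_types  <- rowSums(dtm > 0)   # distinct terms in each document
mean_tokens   <- mean(doc_tokens)
median_tokens <- median(doc_tokens)
mean_types    <- mean(doc_types)
median_types  <- median(doc_types)

# Character-level information: lengths of the terms themselves
term_nchar <- nchar(colnames(dtm))
char_summary <- c(min = min(term_nchar),
                  max = max(term_nchar),
                  mean = mean(term_nchar))
```

Each block of the sketch corresponds to one of the logical arguments, so setting an argument to FALSE would simply omit that group of statistics from the result.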

Value

If simplify = FALSE (the default), a list of one to five data frames with summary statistics; otherwise, a single data frame where each statistic is a column.
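
A minimal usage sketch of the two return shapes follows. It assumes the text2map package is installed and that dtm_builder() takes a data frame followed by the unquoted names of the text and document ID columns; the toy corpus and its column names are hypothetical. The guard skips the example when the package is unavailable.

```r
# Run only if text2map is installed (assumption: it provides both functions).
if (requireNamespace("text2map", quietly = TRUE)) {

  # Hypothetical toy corpus; column names are illustrative
  corpus <- data.frame(
    doc_id = c("d1", "d2", "d3"),
    text = c("the cat sat on the mat",
             "the dog sat",
             "a cat and a dog"),
    stringsAsFactors = FALSE
  )

  # Build a DTM with documents as rows and terms as columns
  dtm <- text2map::dtm_builder(corpus, text, doc_id)

  # Default: a list of small data frames, one per group of statistics
  stats_list <- text2map::dtm_stats(dtm)

  # Drop two groups and request a single data frame instead
  stats_df <- text2map::dtm_stats(
    dtm,
    richness = FALSE,
    character = FALSE,
    simplify = TRUE
  )
}
```

The list form is convenient for printing each group separately, while the simplified form is easier to bind row-wise when comparing several corpora.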

Author

Dustin Stoltz