dtm_stats() provides a summary of corpus-level statistics
for any document-term matrix. These include (1) basic information
on size (total documents, total unique terms, total tokens),
(2) lexical richness, (3) distribution information,
(4) central tendency, and (5) character-level information.
dtm_stats(
dtm,
richness = TRUE,
distribution = TRUE,
central = TRUE,
character = TRUE,
simplify = FALSE
)
dtm: Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can build one with the dtm_builder() function.
richness: Logical (default = TRUE), whether to include statistics about lexical richness, i.e. terms that occur once, twice, and three times (hapax, dis, tris), and the total type-token ratio.
distribution: Logical (default = TRUE), whether to include statistics about the distribution, i.e. min, max, st. dev., skewness, kurtosis.
central: Logical (default = TRUE), whether to include statistics about the central tendencies, i.e. mean and median for types and tokens.
character: Logical (default = TRUE), whether to include statistics about the character lengths of terms, i.e. min, max, mean.
simplify: Logical (default = FALSE), whether to return statistics as a single data frame where each statistic is a column. The default returns a list of small data frames.
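To make the optional statistics concrete, the base R sketch below computes a few of them by hand on a toy matrix. It is an illustration only, not the package's implementation: the toy data and all variable names are hypothetical, and dtm_stats() may report some values differently (for example as proportions rather than counts, or with different rounding).

```r
# Toy document-term matrix: 3 documents (rows) by 5 terms (columns).
# The data and names here are made up for illustration.
dtm <- matrix(
  c(1, 0, 2, 0, 1,
    0, 1, 1, 0, 0,
    0, 0, 3, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("doc1", "doc2", "doc3"),
    c("apple", "kiwi", "the", "banana", "fig")
  )
)

# Basic size information.
term_freq <- colSums(dtm)        # corpus-wide count of each term
n_docs    <- nrow(dtm)           # total documents
n_types   <- sum(term_freq > 0)  # unique terms that occur at least once
n_tokens  <- sum(term_freq)      # total tokens

# Lexical richness: terms occurring exactly once, twice, or three times,
# plus the corpus-level type-token ratio.
hapax <- sum(term_freq == 1)
dis   <- sum(term_freq == 2)
tris  <- sum(term_freq == 3)
ttr   <- n_types / n_tokens

# Central tendency of tokens and types per document.
tokens_per_doc <- rowSums(dtm)
types_per_doc  <- rowSums(dtm > 0)
central <- c(
  mean_tokens   = mean(tokens_per_doc),
  median_tokens = median(tokens_per_doc),
  mean_types    = mean(types_per_doc),
  median_types  = median(types_per_doc)
)

# Character-level information: lengths of the terms themselves.
term_chars <- nchar(colnames(dtm))
char_stats <- c(min = min(term_chars), max = max(term_chars),
                mean = mean(term_chars))
```

In this toy corpus, "apple", "kiwi", and "fig" each occur once and "banana" occurs twice, so most of the vocabulary is rare relative to the single frequent term "the". That kind of imbalance is what the richness statistics are meant to surface.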
A list of one to five data frames with summary statistics (if simplify = FALSE), otherwise a single data frame where each statistic is a column.
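A minimal usage sketch follows, assuming the text2map package (which provides dtm_stats() and dtm_builder()) is installed. The toy corpus and its column names doc_id and text are hypothetical, and the statistics from such a tiny corpus are not meaningful; the point is only to show the two return shapes and how the logical flags trim the output.

```r
library(text2map)

# Hypothetical toy corpus; the column names are illustrative.
corpus <- data.frame(
  doc_id = c("d1", "d2", "d3"),
  text = c("the cat sat", "the dog sat down", "a cat and a dog"),
  stringsAsFactors = FALSE
)

# Build a document-term matrix with terms as columns.
dtm <- dtm_builder(corpus, text, doc_id)

# Default: a list of small data frames, one per group of statistics.
stats_list <- dtm_stats(dtm)

# simplify = TRUE: a single data frame where each statistic is a column.
stats_df <- dtm_stats(dtm, simplify = TRUE)

# Keep only the basic size information and lexical richness.
stats_small <- dtm_stats(
  dtm,
  distribution = FALSE,
  central = FALSE,
  character = FALSE
)
```

The list form is convenient for printing each group separately, while the single data frame is easier to bind row-wise when comparing several corpora.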