dtm_stats() summarizes corpus-level statistics
for any document-term matrix. These include (1) basic information
on size (total documents, total unique terms, total tokens),
(2) lexical richness, (3) distribution information,
(4) central tendency, and (5) character-level information.
Usage
dtm_stats(
dtm,
richness = TRUE,
distribution = TRUE,
central = TRUE,
character = TRUE,
simplify = FALSE
)
Arguments
- dtm
Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function.
- richness
Logical (default = TRUE), whether to include statistics about lexical richness, i.e. the percentage of terms that occur once, twice, and three times (hapax, dis, tris), and the total type-token ratio.
- distribution
Logical (default = TRUE), whether to include statistics about the distribution, i.e. min, max, st. dev., skewness, and kurtosis.
- central
Logical (default = TRUE), whether to include statistics about the central tendencies, i.e. mean and median for types and tokens.
- character
Logical (default = TRUE), whether to include statistics about the character lengths of terms, i.e. min, max, and mean.
- simplify
Logical (default = FALSE), whether to return statistics as a data frame where each statistic is a column. Default returns a list of small data frames.
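To make the five groups of statistics concrete, the following base R sketch computes comparable quantities by hand on a toy matrix. It is not the package's implementation: the toy data, the variable names, and the exact formulas (for example, hapax as a share of types) are assumptions based on the descriptions above. Skewness and kurtosis are left out because the estimator used is not stated here.

```r
# Toy document-term matrix: 3 documents (rows) by 5 terms (columns).
# The data and all variable names below are illustrative assumptions.
dtm <- matrix(
  c(1, 0, 2, 0, 0,
    0, 1, 1, 0, 0,
    0, 0, 1, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "bb", "ccc", "dd", "e"))
)

# (1) Basic size information
n_docs    <- nrow(dtm)
term_freq <- colSums(dtm)         # corpus-wide count for each term
n_types   <- sum(term_freq > 0)   # unique terms that actually occur
n_tokens  <- sum(dtm)             # total tokens
sparsity  <- mean(dtm == 0)       # share of empty cells

# (2) Lexical richness, assumed here to be shares of all types
hapax <- sum(term_freq == 1) / n_types
dis   <- sum(term_freq == 2) / n_types
tris  <- sum(term_freq == 3) / n_types
ttr   <- n_types / n_tokens       # type-token ratio

# (3) and (4) Distribution and central tendency over documents
doc_types  <- rowSums(dtm > 0)    # unique terms per document
doc_tokens <- rowSums(dtm)        # tokens per document
dist_stats <- c(min = min(doc_tokens), max = max(doc_tokens),
                sd = sd(doc_tokens))
cent_stats <- c(mean = mean(doc_tokens), median = median(doc_tokens))

# (5) Character lengths of the terms (column names)
term_lengths <- nchar(colnames(dtm))
char_stats <- c(min = min(term_lengths), max = max(term_lengths),
                mean = mean(term_lengths))
```

The same expressions apply to any dense matrix with terms as columns; a sparse DTM would need the equivalent functions from the Matrix package.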
Value
A list of one to five data frames with summary statistics (if
simplify=FALSE), otherwise a single data frame where each
statistic is a column.
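As a rough illustration of working with the list form, the sketch below builds a stand-in for one element of the return value by hand and looks up measures by name. The stand-in is hypothetical: its values are copied from the Examples output, and storing Value as character is an assumption based on how that output prints (percentages alongside counts).

```r
# Hand-built stand-in for one element of the returned list (hypothetical;
# values copied from the Examples output, not produced by dtm_stats()).
stats <- list(
  `Basic Information` = data.frame(
    Measure = c("Total Docs", "Percent Sparse", "Total Types", "Total Tokens"),
    Value   = c("84", "97.20%", "771", "2232"),
    stringsAsFactors = FALSE
  )
)

# Pull one data frame out of the list by name, then look up rows by Measure.
basic <- stats[["Basic Information"]]
total_types  <- as.numeric(basic$Value[basic$Measure == "Total Types"])
total_tokens <- as.numeric(basic$Value[basic$Measure == "Total Tokens"])

# Types divided by tokens, rounded to two places.
ttr <- round(total_types / total_tokens, 2)
```

With simplify = TRUE the same figures arrive as columns of a single data frame, which is easier to bind across several corpora.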
Examples
# \donttest{
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
dtm_stats(dtm)
#> $`Basic Information`
#> Measure Value
#> 1 Total Docs 84
#> 2 Percent Sparse 97.20%
#> 3 Total Types 771
#> 4 Total Tokens 2232
#> 5 Object Size 81.1 Kb
#>
#> $`Lexical Richness`
#> Measure Value
#> 1 Percent Hapax 65.00%
#> 2 Percent Dis 15.00%
#> 3 Percent Tris 6.00%
#> 4 Type-Token Ratio 0.35
#>
#> $`Term Distribution`
#> Measure Value
#> 1 Min Types 2
#> 2 Min Tokens 2
#> 3 Max Types 115
#> 4 Max Tokens 160
#> 5 St. Dev. Types 15.95
#> 6 St. Dev. Tokens 22.61
#> 7 Kurtosis Types 12.11
#> 8 Kurtosis Tokens 12.43
#> 9 Skew Types 2.63
#> 10 Skew Tokens 2.75
#>
#> $`Central Tendency`
#> Measure Value
#> 1 Mean Types 21.37
#> 2 Mean Tokens 26.57
#> 3 Median Types 19
#> 4 Median Tokens 21
#>
#> $`Term Lengths`
#> Measure Value
#> 1 Min Characters 1
#> 2 Max Characters 14
#> 3 Mean Characters 5.98
#>
dtm_stats(dtm, simplify = TRUE)
#> n_docs sparsity n_types n_tokens size hapax dis tris ttr min_types
#> 1 84 0.972 771 2232 81.1 Kb 0.65 0.15 0.06 0.35 2
#> min_tokens max_types max_tokens sd_types sd_tokens kr_types kr_tokens
#> 1 2 115 160 15.95 22.61 12.11 12.43
#> sk_types sk_tokens mu_types mu_tokens md_types md_tokens min_length
#> 1 2.63 2.75 21.37 26.57 19 21 1
#> max_length mu_length
#> 1 14 5.98
# }
