
dtm_stats() provides a summary of corpus-level statistics using any document-term matrix. These include (1) basic information on size (total documents, total unique terms, total tokens), (2) lexical richness, (3) distribution information, (4) central tendency, and (5) character-level information.

Usage

dtm_stats(
  dtm,
  richness = TRUE,
  distribution = TRUE,
  central = TRUE,
  character = TRUE,
  simplify = FALSE
)

Arguments

dtm

Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or with one built by the dtm_builder() function.

richness

Logical (default = TRUE), whether to include statistics about lexical richness, i.e. terms that occur once, twice, and three times (hapax, dis, tris), and the total type-token ratio.
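To make these measures concrete, here is a minimal base R sketch on a small made-up matrix. It assumes that hapax, dis, and tris are the shares of unique terms occurring exactly once, twice, and three times in the whole corpus, and that the type-token ratio is total types divided by total tokens. The object toy_dtm is hypothetical, and this is an illustration rather than the function's actual implementation.

```r
# Hypothetical toy DTM: 3 documents (rows) by 5 terms (columns)
toy_dtm <- matrix(
  c(1, 0, 2, 0, 0,
    0, 1, 1, 0, 0,
    0, 0, 1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d", "e"))
)

term_freq <- colSums(toy_dtm) # corpus-wide count of each term
n_types <- ncol(toy_dtm)      # number of unique terms
n_tokens <- sum(toy_dtm)      # total number of term occurrences

# Assumed definitions: share of types occurring exactly 1, 2, or 3 times
hapax <- sum(term_freq == 1) / n_types
dis <- sum(term_freq == 2) / n_types
tris <- sum(term_freq == 3) / n_types

# Type-token ratio: unique terms relative to total tokens
ttr <- n_types / n_tokens
```

In the toy matrix, three of the five terms occur once and one occurs twice, so hapax is 0.6 and dis is 0.2.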

distribution

Logical (default = TRUE), whether to include statistics about the distribution of types and tokens, i.e. min, max, st. dev., skewness, and kurtosis.

central

Logical (default = TRUE), whether to include statistics about the central tendencies, i.e. mean and median for types and tokens.
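The distribution and central tendency measures can be read as summaries of two per-document vectors: the number of tokens and the number of unique terms (types) in each document. The following base R sketch shows that reading on a hypothetical toy matrix. It is an assumption about how the figures are derived, not the package's code, and it leaves out skewness and kurtosis because base R has no built-in functions for them.

```r
# Hypothetical toy DTM: 3 documents (rows) by 5 terms (columns)
toy_dtm <- matrix(
  c(1, 0, 2, 0, 0,
    0, 1, 1, 0, 0,
    0, 0, 1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d", "e"))
)

tokens_per_doc <- rowSums(toy_dtm)    # total term occurrences in each document
types_per_doc <- rowSums(toy_dtm > 0) # distinct terms present in each document

# Distribution: min, max, and standard deviation across documents
dist_tokens <- c(min = min(tokens_per_doc), max = max(tokens_per_doc),
                 sd = sd(tokens_per_doc))
dist_types <- c(min = min(types_per_doc), max = max(types_per_doc),
                sd = sd(types_per_doc))

# Central tendency: mean and median across documents
central_tokens <- c(mean = mean(tokens_per_doc), median = median(tokens_per_doc))
central_types <- c(mean = mean(types_per_doc), median = median(types_per_doc))
```

Under this reading, the mean number of tokens per document is the total token count divided by the number of documents.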

character

Logical (default = TRUE), whether to include statistics about the character lengths of terms, i.e. min, max, and mean.
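As a rough illustration, term lengths can be computed from the column names of a DTM with nchar(). The sketch below uses a hypothetical vector of terms in place of colnames(dtm) and assumes each unique term counts once, regardless of how often it occurs. Whether dtm_stats() weights lengths by term frequency is not stated here, so treat the mean in particular as an assumption.

```r
# Hypothetical vocabulary, standing in for colnames(dtm)
terms <- c("ask", "not", "what", "your", "country")

term_lengths <- nchar(terms) # number of characters in each term

min_length <- min(term_lengths)
max_length <- max(term_lengths)
mu_length <- mean(term_lengths) # unweighted mean over unique terms
```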

simplify

Logical (default = FALSE), whether to return statistics as a data frame where each statistic is a column. Default returns a list of small data frames.

Value

A list of one to five data frames with summary statistics (if simplify=FALSE), otherwise a single data frame where each statistic is a column.

Author

Dustin Stoltz

Examples

# \donttest{
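library(text2map) # assumption: the package providing jfk_speech, dtm_builder(), and dtm_stats()
data(jfk_speech)
# Lowercase the text and replace punctuation with spaces
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# Build a DTM with one row per sentence_id
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)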

dtm_stats(dtm)
#> $`Basic Information`
#>          Measure   Value
#> 1     Total Docs      84
#> 2 Percent Sparse  97.20%
#> 3    Total Types     771
#> 4   Total Tokens    2232
#> 5    Object Size 81.1 Kb
#> 
#> $`Lexical Richness`
#>            Measure  Value
#> 1    Percent Hapax 65.00%
#> 2      Percent Dis 15.00%
#> 3     Percent Tris  6.00%
#> 4 Type-Token Ratio   0.35
#> 
#> $`Term Distribution`
#>            Measure Value
#> 1        Min Types     2
#> 2       Min Tokens     2
#> 3        Max Types   115
#> 4       Max Tokens   160
#> 5   St. Dev. Types 15.95
#> 6  St. Dev. Tokens 22.61
#> 7   Kurtosis Types 12.11
#> 8  Kurtosis Tokens 12.43
#> 9       Skew Types  2.63
#> 10     Skew Tokens  2.75
#> 
#> $`Central Tendency`
#>         Measure Value
#> 1    Mean Types 21.37
#> 2   Mean Tokens 26.57
#> 3  Median Types    19
#> 4 Median Tokens    21
#> 
#> $`Term Lengths`
#>           Measure Value
#> 1  Min Characters     1
#> 2  Max Characters    14
#> 3 Mean Characters  5.98
#> 
dtm_stats(dtm, simplify = TRUE)
#>   n_docs sparsity n_types n_tokens    size hapax  dis tris  ttr min_types
#> 1     84    0.972     771     2232 81.1 Kb  0.65 0.15 0.06 0.35         2
#>   min_tokens max_types max_tokens sd_types sd_tokens kr_types kr_tokens
#> 1          2       115        160    15.95     22.61    12.11     12.43
#>   sk_types sk_tokens mu_types mu_tokens md_types md_tokens min_length
#> 1     2.63      2.75    21.37     26.57       19        21          1
#>   max_length mu_length
#> 1         14      5.98
# }