A streamlined function that takes raw texts from a column of a data.frame and
returns all of the unique tokens. Texts are tokenized on a fixed, single
whitespace character, and the unique tokens are then extracted. The result can
be passed to dtm_builder() to standardize the vocabulary (i.e. the columns)
across multiple DTMs. Before building the vocabulary, the texts should be
preprocessed as desired: whitespace trimmed, punctuation removed, and terms lowercased.
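The effect of splitting on a fixed, single whitespace can be approximated in base R. The sketch below is not the package's implementation, and `texts` is a made-up input used only for illustration. It shows why trimming matters: leading or doubled spaces produce empty-string tokens that end up in the vocabulary.

```r
# Hypothetical input, not taken from the package
texts <- c(" the cat  sat", "the dog sat")

# Split on one literal space (fixed = TRUE disables regex matching)
raw_vocab <- unique(unlist(strsplit(texts, " ", fixed = TRUE)))

# Collapse runs of spaces and trim the ends before splitting
clean <- trimws(gsub(" +", " ", texts))
clean_vocab <- unique(unlist(strsplit(clean, " ", fixed = TRUE)))

print(raw_vocab)    # includes an empty string "" as a token
print(clean_vocab)  # only real words remain
```

With the untrimmed input, the empty string counts as a term; after cleaning, the vocabulary contains only the four words.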
Usage
vocab_builder(data, text)
Arguments
- data
Data.frame containing a column of texts
- text
Name of the column with documents' text
Value
Returns a character vector of the unique terms in the corpus
Examples
# \donttest{
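# Load the example data
data(jfk_speech)
# Lowercase the text and replace punctuation with spaces before tokenizing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# Build the vocabulary from the sentence column
vocab <- vocab_builder(jfk_speech, sentence)
head(vocab)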
#> [1] "president" "pitzer" "mr" "vice" "governor"
#> [6] "congressman"
# }
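To illustrate what standardizing the columns across multiple DTMs means, here is a minimal base R sketch. It does not use dtm_builder(); the helpers `count_terms()` and `build_dtm()` and the toy documents are hypothetical, and the package may handle this differently. The idea is that counting every document against the same vocabulary gives matrices with identical columns, even when a term never occurs in one corpus.

```r
# Hypothetical shared vocabulary
vocab <- c("cat", "dog", "sat", "the")

# Count one document's tokens against a fixed vocabulary.
# Tokens outside the vocabulary become NA in the factor and are not counted.
count_terms <- function(doc, vocab) {
  tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
  as.vector(table(factor(tokens, levels = vocab)))
}

# Stack per-document counts into a documents-by-terms matrix
build_dtm <- function(docs, vocab) {
  m <- t(vapply(docs, count_terms, integer(length(vocab)), vocab = vocab))
  dimnames(m) <- list(NULL, vocab)
  m
}

dtm_a <- build_dtm(c("the cat sat", "the cat"), vocab)
dtm_b <- build_dtm("the dog sat", vocab)

print(dtm_a)
print(dtm_b)
```

Both matrices share the same four columns, so they can be compared or stacked directly; terms absent from a corpus simply have zero counts.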