A streamlined function that takes raw texts from a column of a data.frame and returns the unique tokens they contain. Texts are tokenized by splitting on a fixed, single whitespace character, and the unique tokens are then extracted. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Before building the vocabulary, preprocess the texts as desired: trim whitespace, remove punctuation, and lowercase the terms.
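The split-then-deduplicate logic described above can be approximated in base R. This is a minimal sketch, not the package's actual implementation: `simple_vocab()` and the `docs` data.frame are hypothetical names introduced only for illustration, and dropping empty tokens is a choice made here, not documented behavior of vocab_builder().

```r
# Hypothetical helper for illustration; NOT the text2map implementation.
simple_vocab <- function(texts) {
  # Split every text on a literal single space (fixed = TRUE disables regex)
  tokens <- unlist(strsplit(texts, " ", fixed = TRUE), use.names = FALSE)
  # Leading or repeated spaces yield empty strings; this sketch drops them
  tokens <- tokens[nzchar(tokens)]
  # Keep each token once, in order of first appearance
  unique(tokens)
}

# Toy corpus (made up for this sketch)
docs <- data.frame(
  text = c("the cat sat", "the dog sat down"),
  stringsAsFactors = FALSE
)

vocab <- simple_vocab(docs$text)
print(vocab)
```

Because the split is on one literal space, untrimmed or doubled whitespace would otherwise produce empty tokens, which is one reason to clean the texts before building the vocabulary.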

Usage

vocab_builder(data, text)

Arguments

data

Data.frame with one column of texts

text

Name of the column with documents' text

Value

Returns a character vector of the unique terms in the corpus

Author

Dustin Stoltz

Examples

# \donttest{
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
vocab <- vocab_builder(jfk_speech, sentence)
head(vocab)
#> [1] "president"   "pitzer"      "mr"          "vice"        "governor"   
#> [6] "congressman"
# }