A streamlined function that takes raw texts from a column of a data.frame and produces a list of all the unique tokens. It tokenizes on a fixed, single whitespace character and then extracts the unique tokens. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Because tokenization splits on single spaces, texts should have extra whitespace trimmed before building the vocabulary and, if desired, punctuation removed and terms lowercased.
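
The behavior described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the data frame `df`, its columns, and the helper `sketch_vocab()` are hypothetical, the helper takes the column name as a string, and it returns the unique tokens as a character vector.

```r
# Hypothetical example data: one column of raw texts
df <- data.frame(
  text = c("  The cat sat. ", "the dog sat", "A cat ran!"),
  stringsAsFactors = FALSE
)

# Optional preprocessing: lowercase, drop punctuation,
# collapse runs of whitespace, and trim the ends
clean <- tolower(df$text)
clean <- gsub("[[:punct:]]", "", clean)
clean <- trimws(gsub("\\s+", " ", clean))
df$clean <- clean

# Hypothetical helper mimicking the described behavior:
# split on a fixed single space, then keep the unique tokens
sketch_vocab <- function(data, text) {
  tokens <- unlist(strsplit(data[[text]], " ", fixed = TRUE))
  unique(tokens)
}

vocab <- sketch_vocab(df, "clean")
vocab
```

In this base R sketch, skipping the whitespace step would leave empty-string tokens in the vocabulary wherever a text has leading or repeated spaces, which is why trimming comes first.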

vocab_builder(data, text)

Arguments

data

Data.frame with one column of texts

text

Name of the column with documents' text

Value

Returns a list of the unique terms in the corpus.
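
To illustrate why a shared vocabulary standardizes columns across DTMs, here is a minimal base R sketch. It does not call dtm_builder(); the helper `count_with_vocab()`, the vocabulary, and the example texts are all hypothetical and stand in for whatever the package does internally.

```r
# Shared vocabulary (hypothetical), as a character vector
vocab <- c("the", "cat", "sat", "dog")

# Hypothetical helper: count tokens per document against a fixed vocabulary
count_with_vocab <- function(texts, vocab) {
  rows <- lapply(strsplit(texts, " ", fixed = TRUE), function(tokens) {
    # factor() with fixed levels keeps a slot for every vocabulary term
    # and turns out-of-vocabulary tokens into NA, which table() ignores
    as.vector(table(factor(tokens, levels = vocab)))
  })
  m <- do.call(rbind, rows)
  colnames(m) <- vocab
  m
}

# Two separate corpora counted against the same vocabulary
dtm_a <- count_with_vocab(c("the cat sat", "the cat"), vocab)
dtm_b <- count_with_vocab(c("the dog sat on the mat"), vocab)

# Both matrices share the same columns in the same order
identical(colnames(dtm_a), colnames(dtm_b))
```

Because every matrix is built against the same term list, terms missing from a corpus still get a zero-filled column and terms outside the vocabulary are dropped, so the resulting matrices can be compared or stacked column by column.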

Author

Dustin Stoltz