A fast unigram vocabulary builder — vocab

A streamlined function that takes raw texts from a column of a data.frame and produces a list of all the unique tokens. It tokenizes on a fixed, single whitespace character and then extracts the unique tokens. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Before building the vocabulary, preprocess the texts as desired: trim whitespace, remove punctuation, and lowercase terms.
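The split-then-unique step and the effect of preprocessing can be sketched in base R. This is only an approximation for illustration, not the package's implementation: `vocab_sketch` is a hypothetical helper, it takes the column name as a string for simplicity, and dropping empty tokens is an assumption made here.

```r
# Hypothetical helper, not part of the package: approximates splitting on a
# fixed single space and keeping the unique tokens.
vocab_sketch <- function(data, text) {
  docs <- as.character(data[[text]])
  tokens <- unlist(strsplit(docs, " ", fixed = TRUE), use.names = FALSE)
  # Assumption: empty strings left by leading or repeated spaces are dropped
  unique(tokens[nzchar(tokens)])
}

df <- data.frame(
  doc_id = 1:2,
  text = c("  The cat sat. ", "the dog SAT"),
  stringsAsFactors = FALSE
)

# Preprocess: lowercase, strip punctuation, trim surrounding whitespace
df$clean <- trimws(gsub("[[:punct:]]", "", tolower(df$text)))

vocab_raw <- vocab_sketch(df, "text")    # case and punctuation variants stay distinct
vocab_clean <- vocab_sketch(df, "clean") # variants collapse into single terms
```

Without preprocessing, "The", "the", "sat." and "SAT" all count as separate terms, which is why cleaning the texts first usually gives a smaller, more useful vocabulary.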

vocab_builder(data, text)

Arguments

data: Data.frame with one column of texts
text: Name of the column with documents' text

Value

Returns a list of the unique terms in the corpus.
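To illustrate why a shared vocabulary standardizes columns across DTMs, here is a minimal base-R sketch. It does not call dtm_builder(); `count_against_vocab` is a hypothetical helper that builds a dense count matrix, and the two small corpora are made up for the example.

```r
# Hypothetical helper: count terms per document against a fixed vocabulary.
count_against_vocab <- function(docs, vocab) {
  rows <- lapply(docs, function(doc) {
    tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
    # Fixed factor levels keep a column for every vocabulary term,
    # including terms that do not occur in this document
    as.vector(table(factor(tokens, levels = vocab)))
  })
  m <- do.call(rbind, rows)
  colnames(m) <- vocab
  m
}

corpus_a <- c("the cat sat", "the cat ran")
corpus_b <- c("the dog sat")

# One vocabulary built from both corpora
shared_vocab <- unique(unlist(strsplit(c(corpus_a, corpus_b), " ", fixed = TRUE)))

dtm_a <- count_against_vocab(corpus_a, shared_vocab)
dtm_b <- count_against_vocab(corpus_b, shared_vocab)
```

Because both matrices are built against the same vocabulary, they have identical columns in the same order, so they can be stacked or compared directly even though neither corpus contains every term.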

Author

Dustin Stoltz