rancor_builder() generates a random corpus (rancor) based on user-defined
term probabilities and vocabulary. Users can set the number of documents,
as well as the mean, standard deviation, minimum, and maximum of the parent
normal distribution from which the document lengths (i.e., number of tokens)
are randomly sampled. The output is a single document-term matrix. To
produce multiple random corpora, use rancors_builder() (note the plural).
Term probabilities and vocabulary can come from a user's own corpus or from
a pre-compiled frequency list, such as the one derived from the Google Book
N-grams corpus.
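The sampling idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: sketch_rancor() and its len_sd argument are hypothetical names, lengths are clamped to the bounds rather than drawn from a properly truncated normal, and the result is a dense matrix.

```r
# Hypothetical helper, NOT part of text2map: a base-R sketch of the idea.
sketch_rancor <- function(vocab, probs, n_docs = 100L, len_mean = 500,
                          len_sd = 10, len_min = 20L, len_max = 1000L,
                          seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Draw document lengths from a normal distribution, then clamp to bounds
  # (clamping is an assumption; the package may truncate differently).
  lens <- round(rnorm(n_docs, mean = len_mean, sd = len_sd))
  lens <- pmin(pmax(lens, len_min), len_max)
  # For each document, draw term counts from a multinomial over the vocabulary.
  counts <- vapply(
    lens,
    function(n) rmultinom(1, size = n, prob = probs)[, 1],
    integer(length(vocab))
  )
  # vapply() returns terms x documents; transpose to documents x terms.
  dtm <- t(counts)
  dimnames(dtm) <- list(paste0("doc_", seq_len(n_docs)), vocab)
  dtm
}

toy <- sketch_rancor(
  vocab = c("a", "wonderful", "world"),
  probs = c(0.5, 0.3, 0.2),
  n_docs = 4L, len_mean = 50, len_sd = 5, len_min = 10L, len_max = 100L,
  seed = 1
)
dim(toy)      # documents by terms
rowSums(toy)  # document lengths, each within [len_min, len_max]
```

Each row sum equals that document's sampled length, so the length parameters control row totals while the probabilities control how those totals are spread across terms.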
Usage
rancor_builder(
data,
vocab,
probs,
n_docs = 100L,
len_mean = 500,
len_var = 10L,
len_min = 20L,
len_max = 1000L,
seed = NULL
)
Arguments
- data
Data.frame containing vocabulary and probabilities
- vocab
Name of the column containing vocabulary
- probs
Name of the column containing probabilities
- n_docs
Integer indicating the number of documents to be returned
- len_mean
Integer indicating the mean of the document lengths in the parent normal sampling distribution
- len_var
Integer indicating the standard deviation of the document lengths in the parent normal sampling distribution
- len_min
Integer indicating the minimum of the document lengths in the parent normal sampling distribution
- len_max
Integer indicating the maximum of the document lengths in the parent normal sampling distribution
- seed
Optional seed for reproducibility
Examples
# dtm_builder() and rancor_builder() come from the text2map package
library(text2map)

# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)

## some text preprocessing: drop apostrophes, then lowercase
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# use colSums to get term frequencies
df <- data.frame(
  terms = colnames(dtm),
  freqs = colSums(dtm)
)

# convert to probabilities
df$probs <- df$freqs / sum(df$freqs)

# create random DTM
rDTM <- df |>
  rancor_builder(terms, probs)
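
The example above relies on the defaults for the document count and lengths. A variant that sets every argument from the Usage section explicitly might look like the sketch below. The vocabulary, probabilities, and argument values are illustrative assumptions, and the call is guarded so the snippet does nothing when text2map is not installed.

```r
# Hand-made vocabulary and probabilities (illustrative values only)
df_small <- data.frame(
  terms = c("wonderful", "world", "think", "myself"),
  probs = c(0.4, 0.3, 0.2, 0.1)
)

# Guarded so the snippet is a no-op when text2map is not installed
if (requireNamespace("text2map", quietly = TRUE)) {
  rDTM_small <- text2map::rancor_builder(
    data = df_small,
    vocab = terms,
    probs = probs,
    n_docs = 10L,    # number of random documents
    len_mean = 50,   # mean document length
    len_var = 5L,    # documented as the standard deviation of lengths
    len_min = 20L,   # shortest allowed document
    len_max = 100L,  # longest allowed document
    seed = 42        # fixed seed for reproducibility
  )
  dim(rDTM_small)
}
```

Setting seed to a fixed value should make repeated calls return the same random DTM, which is useful when the random corpus serves as a baseline in a comparison.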
