Given a document-term matrix (DTM), this function returns the similarities between documents using a specified method (see Details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.
doc_similarity(x, y = NULL, method, wv = NULL)
x: Document-term matrix with terms as columns.
y: Optional second matrix (default = NULL).
method: Character vector indicating similarity method, including projection, cosine, jaccard, wmd, and centroid (see Details).
wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as words. Required for "wmd" and "centroid" similarities.
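The relationship between a DTM (terms as columns) and an embedding matrix (words as rows) can be sketched in base R. This is a minimal illustration with made-up toy values, not the function's internal code: `dtm_toy`, `wv_toy`, and the choice to drop terms missing from either object are assumptions made for the example.

```r
# Hypothetical toy inputs; all names and values are illustrative only
dtm_toy <- matrix(
  c(2, 1, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("peace", "war", "zzz"))
)
wv_toy <- matrix(
  c(1.0, 0.0,
    0.0, 1.0,
    0.5, 0.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("peace", "war", "nation"), c("dim1", "dim2"))
)

# keep only terms found in both the DTM columns and the embedding rows
# (assumption: out-of-vocabulary terms are simply dropped)
vocab <- intersect(colnames(dtm_toy), rownames(wv_toy))
dtm_sub <- dtm_toy[, vocab, drop = FALSE]
wv_sub <- wv_toy[vocab, , drop = FALSE]

# count-weighted average word vector for each document
doc_vecs <- (dtm_sub %*% wv_sub) / rowSums(dtm_sub)
doc_vecs
```

The matrix product only works because the DTM columns and the embedding rows are put in the same order first, which is why both objects are subset by the same `vocab` vector.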
Document similarity methods include:
projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures the shared vocabulary overlap
cosine: compares row vectors using cosine similarity
jaccard: compares proportion of common words to unique words in both documents
wmd: word mover's distance to compare documents (requires word embedding vectors), using linear-complexity relaxed word mover's distance
centroid: represents each document as the centroid of the word vectors in its vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
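The three count-based methods in the list above can be sketched on a small dense matrix in base R. This is a hedged illustration with a hypothetical `dtm_toy`, not the package's implementation; in particular, treating the DTM as presence/absence for the Jaccard step is an assumption.

```r
# Toy DTM: 3 documents (rows) by 4 terms (columns); counts are made up
dtm_toy <- matrix(
  c(1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d"))
)

# projection: document-by-document cross-product of the DTM
prj <- tcrossprod(dtm_toy)

# cosine: cross-product of rows rescaled to unit length
unit_rows <- dtm_toy / sqrt(rowSums(dtm_toy^2))
cos_sim <- tcrossprod(unit_rows)

# jaccard: shared terms divided by distinct terms across both documents,
# computed on presence/absence (assumption)
bin <- (dtm_toy > 0) * 1
shared <- tcrossprod(bin)
n_terms <- rowSums(bin)
jac <- shared / (outer(n_terms, n_terms, "+") - shared)
```

Each result is a 3-by-3 symmetric matrix with documents on both margins. The projection depends on raw counts and document length, while cosine and Jaccard are bounded between 0 and 1 for non-negative counts.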
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
# similarities based on the DTM alone
dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
# similarities that also require word embeddings
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)