Find similarities between documents — doc

Given a document-term matrix (DTM), this function returns the similarities between documents using a specified method (see Details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.

doc_similarity(x, y = NULL, method, wv = NULL)

Arguments

x: Document-term matrix with terms as columns.
y: Optional second matrix (default = NULL).
method: Character vector indicating the similarity method: projection, cosine, jaccard, wmd, or centroid (see Details).
wv: Matrix of word embedding vectors (a.k.a. an embedding model) with rows as words. Required for "wmd" and "centroid" similarities.

Details

Document similarity methods include:

projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures shared vocabulary overlap
cosine: compares row vectors using cosine similarity
jaccard: compares proportion of common words to unique words in both documents
wmd: compares documents using word mover's distance (requires word embedding vectors), specifically the linear-complexity relaxed word mover's distance
centroid: represents each document as the centroid of the word vectors for its vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
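
The count-based methods above can be illustrated in a few lines of base R. This is a minimal sketch on a made-up matrix, not the function's actual implementation: the object names (toy_dtm, prj, cos_sim, jac) are hypothetical, and the Jaccard formula shown (shared terms divided by all terms used by either document) is one common definition that the function may or may not follow exactly.

```r
# Hypothetical toy DTM: 3 documents (rows) by 4 terms (columns)
toy_dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("apple", "pear", "plum", "fig"))
)

# projection: one-mode projection of the two-mode DTM.
# Each cell sums, over terms, the product of the two documents' counts.
prj <- tcrossprod(toy_dtm)

# cosine: dot product of two row vectors divided by the product of their norms
norms <- sqrt(rowSums(toy_dtm^2))
cos_sim <- tcrossprod(toy_dtm) / tcrossprod(norms)

# jaccard (assumed definition): terms shared / terms used by either document
bin <- (toy_dtm > 0) * 1                      # presence/absence of each term
shared <- tcrossprod(bin)                     # number of shared terms
n_terms <- rowSums(bin)                       # number of distinct terms per document
jac <- shared / (outer(n_terms, n_terms, "+") - shared)
```

Here d1 and d3 share no terms, so their projection, cosine, and Jaccard values are all zero, while d1 and d2 share two terms and score above zero on every measure.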

Author

Dustin Stoltz

Examples


# load example word embeddings
data(ft_wv_sample)

# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# compute a document similarity matrix with each method
dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)
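
To make the centroid method described in Details more concrete, the following base R sketch computes centroid similarities by hand. It rests on assumptions: the toy matrices and all object names (toy_dtm, toy_wv, centroids, cen_sim) are hypothetical, and weighting each word vector by its term count is a guess about how the centroid is formed, not a confirmed detail of doc_similarity().

```r
# Hypothetical toy embeddings: 4 words (rows) in 2 dimensions
toy_wv <- matrix(
  c(1, 0,
    1, 0,
    0, 1,
    0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("apple", "pear", "plum", "fig"), NULL)
)

# Hypothetical toy DTM: 3 documents (rows) by 4 terms (columns)
toy_dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("apple", "pear", "plum", "fig"))
)

# keep only terms found in both the DTM and the embeddings
vocab <- intersect(colnames(toy_dtm), rownames(toy_wv))
counts <- toy_dtm[, vocab, drop = FALSE]

# centroid (assumed): count-weighted average of each document's word vectors
centroids <- (counts %*% toy_wv[vocab, , drop = FALSE]) / rowSums(counts)

# cosine similarity between the document centroids
norms <- sqrt(rowSums(centroids^2))
cen_sim <- tcrossprod(centroids) / tcrossprod(norms)
```

Because the comparison happens in embedding space, two documents can be similar without sharing any exact terms, which the count-based methods cannot capture.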