Given a document-term matrix (DTM), this function returns the similarities between documents using a specified method (see Details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.

doc_similarity(x, y = NULL, method, wv = NULL)

Arguments

x

Document-term matrix with terms as columns.

y

Optional second matrix (default = NULL).

method

Character vector indicating the similarity method: one of "projection", "cosine", "jaccard", "wmd", or "centroid" (see Details).

wv

Matrix of word embedding vectors (a.k.a. an embedding model), with rows as words. Required for the "wmd" and "centroid" methods.

Details

Document similarity methods include:

  • projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures shared vocabulary overlap

  • cosine: compares row vectors using cosine similarity

  • jaccard: compares the proportion of words common to both documents to the unique words across both documents

  • wmd: compares documents using word mover's distance, computed with the linear-complexity relaxed word mover's distance (requires word embedding vectors)

  • centroid: represents each document as the centroid of its vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
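
The count-based methods above can be sketched in base R on a toy matrix. This is only an illustration of the ideas, not the package's internal code: the toy DTM and all object names are made up here, and the Jaccard version assumes presence/absence of terms (counts are ignored), which may differ from how doc_similarity() computes it.

```r
# Toy DTM: 3 documents (rows) by 4 terms (columns); values are term counts.
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d"))
)

# projection: one-mode projection of the two-mode matrix, i.e. X %*% t(X)
prj <- tcrossprod(dtm)

# cosine: dot products between rows, scaled by the row norms
norms <- sqrt(rowSums(dtm^2))
cos_sim <- prj / outer(norms, norms)

# jaccard (assumed presence/absence form): shared terms / all distinct terms
bin <- (dtm > 0) * 1
shared <- tcrossprod(bin)
n_terms <- rowSums(bin)
jac <- shared / (outer(n_terms, n_terms, "+") - shared)
```

Each result is a square, symmetric document-by-document matrix. In this toy case d1 and d3 share no terms, so all three measures give 0 for that pair.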

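The centroid method can be sketched the same way. The two-dimensional embeddings below are hypothetical, and the sketch assumes each centroid is the count-weighted mean of a document's word vectors; the package may weight or normalize differently.

```r
# Hypothetical 2-dimensional embeddings; rows are words, as the wv argument expects.
wv <- matrix(
  c( 1, 0,
     1, 1,
     0, 1,
    -1, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), NULL)
)

# Toy DTM with terms as columns.
dtm <- matrix(
  c(2, 1, 0, 0,
    0, 0, 1, 1,
    0, 0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d"))
)

# Keep only terms that have an embedding, in matching order.
vocab <- intersect(colnames(dtm), rownames(wv))
x <- dtm[, vocab, drop = FALSE]
e <- wv[vocab, , drop = FALSE]

# Centroid of each document: count-weighted mean of its word vectors.
cen <- (x %*% e) / rowSums(x)

# Cosine similarity between the centroid vectors.
norms <- sqrt(rowSums(cen^2))
sim <- tcrossprod(cen) / outer(norms, norms)
```

Unlike the count-based measures, centroid similarities can be negative, because embedding dimensions can take negative values.
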
Author

Dustin Stoltz

Examples


# load example word embeddings
data(ft_wv_sample)

# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)