Calculate Concept Mover's Distance

Concept Mover's Distance classifies documents of any length along a continuous measure of engagement with a given concept of interest using word embeddings.

CMDist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)

cmdist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)

Arguments

dtm: Document-term matrix with words as columns. Works with DTMs produced by any popular text analysis package, or using the dtm_builder() function.
cw: Vector with concept word(s) (e.g., c("love", "money"), c("critical thinking"))
cv: Concept vector(s) as output from get_direction(), get_centroid(), or get_regions()
wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
missing: Indicates what action to take if words are not in the embeddings. If missing = "stop" (default), the function stops and an error message states which words are missing. If missing = "remove", missing words, or rows with missing words, are removed and the function proceeds. The missing words are printed as a message.
scale: Logical (default = TRUE), uses scale() on output. This sets zero to the mean of the estimates and scales by the standard deviation of the estimates. Document estimates will, therefore, be relative to other documents within that specific run, but not necessarily across discrete runs.
sens_interval: Logical (default = FALSE). If TRUE, several CMDs are estimated on N resampled DTMs, and sensitivity intervals are produced by returning the 2.5 and 97.5 percentiles of the estimated CMDs for a given concept word or concept vector.
alpha: If sens_interval = TRUE, a number indicating the proportion of the document length to be resampled for sensitivity intervals. Default is 1, or 100 percent of each document's length.
n_iters: If sens_interval = TRUE, integer (default = 20L) indicating the number of resampled DTMs to produce for sensitivity intervals.
parallel: Logical (default = FALSE), whether to parallelize estimation.
threads: If parallel = TRUE, an integer (default = 2L) indicating the number of threads to use.
setup_timeout: If parallel = TRUE, maximum number of seconds a worker attempts to connect to master before failing.
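
To make the scale and sens_interval arguments more concrete, the following base R sketch shows what standardizing with scale() and taking the 2.5 and 97.5 percentiles amount to. The objects raw_est and resampled are made-up numbers used only for illustration. They are not output from CMDist(), and the sketch does not reproduce the function's internal resampling.

```r
# Hypothetical raw engagement estimates for five documents (made-up values)
raw_est <- c(doc1 = 0.42, doc2 = 0.55, doc3 = 0.38, doc4 = 0.61, doc5 = 0.47)

# scale() centers on the mean and divides by the standard deviation,
# so estimates are relative to the other documents in the same run
scaled_est <- as.numeric(scale(raw_est))
names(scaled_est) <- names(raw_est)

# Hypothetical estimates for one document across 20 resampled DTMs
set.seed(1)
resampled <- rnorm(20, mean = 0.42, sd = 0.03)

# 2.5 and 97.5 percentiles, as described for sens_interval = TRUE
bounds <- quantile(resampled, probs = c(0.025, 0.975))
```

Because the scaled values depend on the mean and standard deviation of a particular set of documents, the same document can receive a different scaled estimate when it is analyzed alongside a different corpus.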

Value

Returns a data frame with the first column as document ids and each subsequent column as the CMD engagement corresponding to each concept word or concept vector. The upper and lower bound estimates will follow each unique CMD if sens_interval = TRUE.

Details

CMDist() requires three things: (1) a document-term matrix (DTM), (2) a matrix of word embedding vectors, and (3) concept words or concept vectors. The function uses word counts from the DTM and word similarities based on the cosine similarity of the corresponding word vectors in a word embedding model. The "cost" of transporting all the words in a document to a single vector or a few vectors (denoting a concept of interest) is the measure of engagement, with higher costs indicating less engagement. To make the output more intuitive, CMDist() inverts it so that higher numbers indicate more engagement with the concept of interest.
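
The idea can be sketched in base R for the simplest case of a single concept vector, where all of a document's words must be moved to one point. The objects wv_toy, concept, and dtm_toy are made-up toy data, and cos_sim() is a hypothetical helper written for this illustration. This is a sketch of the intuition only, not the algorithm CMDist() uses internally.

```r
# Toy embedding matrix: rows are words, columns are dimensions (made-up values)
wv_toy <- rbind(
  rocket = c(0.9, 0.1, 0.0),
  moon   = c(0.8, 0.3, 0.1),
  tax    = c(0.0, 0.2, 0.9)
)

# Hypothetical concept vector in the same space
concept <- c(1.0, 0.2, 0.0)

# Hypothetical helper: cosine similarity of each row of m with vector v
cos_sim <- function(m, v) {
  as.numeric(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Cosine distance from each word to the concept
dist_to_concept <- 1 - cos_sim(wv_toy, concept)

# Toy DTM: documents as rows, word counts as columns
dtm_toy <- rbind(
  doc_space = c(rocket = 3, moon = 2, tax = 0),
  doc_money = c(rocket = 0, moon = 1, tax = 4)
)

# Relative word frequencies within each document
weights <- dtm_toy / rowSums(dtm_toy)

# Cost of moving every word in a document to the single concept vector
cost <- as.numeric(weights %*% dist_to_concept)
names(cost) <- rownames(dtm_toy)

# Invert so that higher values indicate more engagement
engagement <- -cost
```

In this toy setup, doc_space ends up with a higher engagement value than doc_money because its words lie closer to the concept vector.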

The concept vector, or vectors, can be specified in several ways. The simplest involves selecting a single word from the word embeddings. The analyst can also specify the concept by indicating a few words. The algorithm then splits the overall flow among the concept words, depending (roughly) on which word in the document is nearest. The words need not be in the DTM, but they must be in the word embeddings (the function will either stop or remove words not in the embeddings).

Instead of selecting a word already in the embedding space, the function can also take a vector extracted from the embedding space in the form of a centroid (which averages the vectors of several words), a direction (which uses the offset of several juxtaposing words), or a region (which is built by clustering words into $k$ regions). The get_centroid(), get_direction(), and get_regions() functions will extract these.
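
As a rough illustration of the first two options, a centroid and a direction can be computed by hand in base R. The matrix wv_toy holds made-up values, and the direction below uses a single word pair and a plain difference, which is only the simplest reading of "offset". get_centroid() and get_direction() may differ in their details (for example, in how multiple pairs are combined or normalized).

```r
# Toy embedding matrix: rows are words, columns are dimensions (made-up values)
wv_toy <- rbind(
  spacecraft = c(0.9, 0.1, 0.0),
  rocket     = c(0.8, 0.2, 0.1),
  moon       = c(0.7, 0.3, 0.2),
  rich       = c(0.1, 0.9, 0.3),
  poor       = c(0.1, 0.1, 0.8)
)

# Centroid: average the vectors of several related words
anchors <- c("spacecraft", "rocket", "moon")
centroid <- colMeans(wv_toy[anchors, ])

# Direction: offset between two juxtaposed words (one pair, plain difference)
direction <- wv_toy["rich", ] - wv_toy["poor", ]
```

Either result is a single vector with the same number of dimensions as the embeddings.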

References

Stoltz, Dustin S., and Marshall A. Taylor. (2019) 'Concept Mover's Distance.' Journal of Computational Social Science 2(2):293-313. doi:10.1007/s42001-019-00048-6.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic directions with concept mover's distance to measure binary concept engagement.' Journal of Computational Social Science 1-12. doi:10.1007/s42001-020-00075-8.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Concept Class Analysis: A Method for Identifying Cultural Schemas in Texts.' Sociological Science 7:544-569. doi:10.15195/v7.a23.

Author

Dustin Stoltz and Marshall Taylor

Examples



# load example word embeddings
data(ft_wv_sample)

# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# example 1: engagement with a single concept word
cm.dists <- CMDist(dtm,
  cw = "space",
  wv = ft_wv_sample
)

# example 2: engagement with a centroid built from several anchor words
space <- c("spacecraft", "rocket", "moon")
cen <- get_centroid(anchors = space, wv = ft_wv_sample)

cm.dists <- CMDist(dtm,
  cv = cen,
  wv = ft_wv_sample
)