Concept Mover's Distance classifies documents of any length along a continuous measure of engagement with a given concept of interest using word embeddings.
Usage
CMDist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)

cmdist(
  dtm,
  cw = NULL,
  cv = NULL,
  wv,
  missing = "stop",
  scale = TRUE,
  sens_interval = FALSE,
  alpha = 1,
  n_iters = 20L,
  parallel = FALSE,
  threads = 2L,
  setup_timeout = 120L
)
Arguments
- dtm
Document-term matrix with words as columns. Works with DTMs produced by any popular text analysis package, or with the dtm_builder() function.
- cw
Vector with concept word(s) (e.g., c("love", "money"), c("critical thinking")).
- cv
Concept vector(s) as output from get_direction(), get_centroid(), or get_regions().
- wv
Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
- missing
Indicates what action to take if words are not in the embeddings. If missing = "stop" (default), the function stops and an error message states which words are missing. If missing = "remove", missing words, or rows with missing words, are removed and the missing words are printed as a message.
- scale
Logical (default = TRUE); uses scale() on the output. This sets zero to the mean of the estimates and scales by the standard deviation of the estimates. Document estimates are therefore relative to other documents within that specific run, but not necessarily across discrete runs.
- sens_interval
Logical (default = FALSE). If TRUE, several CMDs are estimated on N resampled DTMs, and sensitivity intervals are produced by returning the 2.5 and 97.5 percentiles of the estimated CMDs for a given concept word or concept vector.
- alpha
If sens_interval = TRUE, a number indicating the proportion of the document length to be resampled for sensitivity intervals. Default is 1, or 100 percent of each document's length.
- n_iters
If sens_interval = TRUE, integer (default = 20L) indicating the number of resampled DTMs to produce for sensitivity intervals.
- parallel
Logical (default = FALSE), whether to parallelize the estimation.
- threads
If parallel = TRUE, an integer indicating the number of threads to use.
- setup_timeout
If parallel = TRUE, maximum number of seconds a worker attempts to connect to master before failing.
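To make the scale and sens_interval arguments more concrete, the following base R sketch standardizes a vector of made-up estimates with scale() and takes the 2.5 and 97.5 percentiles of made-up resampled estimates with quantile(). The numbers and object names (raw_est, resampled) are illustrative assumptions, not CMDist() output, and the sketch does not reproduce the function's internal resampling.

```r
# Hypothetical raw engagement estimates for five documents (made-up numbers)
raw_est <- c(doc1 = 0.42, doc2 = 0.55, doc3 = 0.31, doc4 = 0.60, doc5 = 0.47)

# scale() centers on the mean and divides by the standard deviation,
# so standardized estimates are relative to this particular set of documents
scaled_est <- as.numeric(scale(raw_est))
names(scaled_est) <- names(raw_est)

# Hypothetical estimates for one document across 20 resampled DTMs
set.seed(1)
resampled <- rnorm(20, mean = 0.42, sd = 0.03)

# Sensitivity interval: 2.5 and 97.5 percentiles of the resampled estimates
sens_int <- quantile(resampled, probs = c(0.025, 0.975))

print(round(scaled_est, 2))
print(sens_int)
```

Because the standardization depends on the documents included, the same document can receive a different scaled value in a run with a different corpus.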
Value
Returns a data frame with the first column as document ids and each
subsequent column as the CMD engagement corresponding to each
concept word or concept vector. The upper and lower bound
estimates will follow each unique CMD if sens_interval = TRUE.
Details
CMDist() requires three things: (1) a document-term matrix (DTM), (2) a
matrix of word embedding vectors, and (3) concept words or concept vectors.
The function uses word counts from the DTM and word similarities
from the cosine similarity of their respective word vectors in a
word embedding model. The "cost" of transporting all the words in a
document to a single vector or a few vectors (denoting a
concept of interest) is the measure of engagement, with higher costs
indicating less engagement. For intuitiveness, the output of CMDist()
is inverted such that higher numbers indicate more engagement
with a concept of interest.
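The intuition can be sketched with a toy example in base R. The code below is a simplified illustration, not the package's implementation: the embeddings are made-up two-dimensional vectors, and the hypothetical helper toy_cost() simply weights each word's cosine distance to its nearest concept vector by the word's relative frequency, which is one relaxed way to approximate a transport cost. Inverting by negation is likewise an assumption made for the sketch.

```r
# Made-up 2-d embeddings (rows are words), for illustration only
wv_toy <- rbind(
  rocket = c(0.9, 0.1),
  moon   = c(0.8, 0.3),
  tax    = c(0.1, 0.9),
  space  = c(1.0, 0.2)
)

# Cosine distance between one vector x and each row of a matrix m
cos_dist <- function(x, m) {
  sims <- (m %*% x) / (sqrt(rowSums(m^2)) * sqrt(sum(x^2)))
  1 - as.numeric(sims)
}

# Hypothetical helper: frequency-weighted distance from each document word
# to its nearest concept vector (a relaxed transport cost)
toy_cost <- function(counts, wv, concept) {
  weights <- counts / sum(counts)
  words <- wv[names(counts), , drop = FALSE]
  d <- sapply(seq_len(nrow(concept)), function(i) cos_dist(concept[i, ], words))
  d <- matrix(d, nrow = nrow(words))
  sum(weights * apply(d, 1, min))
}

concept <- wv_toy["space", , drop = FALSE]
doc_a <- c(rocket = 3, moon = 2)  # mostly space-related words
doc_b <- c(tax = 4, moon = 1)     # mostly unrelated words

cost_a <- toy_cost(doc_a, wv_toy, concept)
cost_b <- toy_cost(doc_b, wv_toy, concept)

# Invert (here by negating) so that higher numbers indicate more engagement
engagement <- c(doc_a = -cost_a, doc_b = -cost_b)
print(engagement)
```

In this toy case the document dominated by "rocket" and "moon" has the lower cost and therefore the higher engagement with "space".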
The vector, or vectors, of the concept can be specified in several ways. The simplest is to select a single word from the word embeddings. The analyst can also specify the concept by indicating a few words. The algorithm then splits the overall flow among the concept words, (roughly) depending on which word in the document is nearest. The words need not be in the DTM, but they must be in the word embeddings (the function will either stop or remove words not in the embeddings).
Instead of selecting a word already in the embedding space, the function can
also take a vector extracted from the embedding space in the form of a
centroid (which averages the vectors of several words), a direction (which
uses the offset of several juxtaposing words), or a region (which is built
by clustering words into k regions). The get_centroid(),
get_direction(), and get_regions() functions will extract these.
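As a rough base R illustration of the first two options, the sketch below averages the vectors of several words to form a centroid and averages the offsets between juxtaposed word pairs to form a direction. The toy embeddings and word pairs are made up, and get_centroid() and get_direction() may differ in details such as normalization or the methods they offer.

```r
# Made-up 3-d embeddings (rows are words), for illustration only
wv_toy <- rbind(
  rich    = c(0.9, 0.2, 0.1),
  wealthy = c(0.8, 0.3, 0.2),
  poor    = c(0.1, 0.2, 0.9),
  needy   = c(0.2, 0.1, 0.8)
)

# Centroid: the average of the anchor words' vectors
anchors <- c("rich", "wealthy")
centroid <- colMeans(wv_toy[anchors, , drop = FALSE])

# Direction: the average offset between juxtaposed word pairs
# (rich - poor, wealthy - needy)
pos_words <- c("rich", "wealthy")
neg_words <- c("poor", "needy")
offsets <- wv_toy[pos_words, , drop = FALSE] - wv_toy[neg_words, , drop = FALSE]
direction <- colMeans(offsets)

print(centroid)
print(direction)
```

Either resulting vector has the same number of dimensions as the embeddings, which is what allows it to stand in for a concept word.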
References
Stoltz, Dustin S., and Marshall A. Taylor. (2019)
'Concept Mover's Distance.' Journal of Computational
Social Science 2(2):293-313.
doi:10.1007/s42001-019-00048-6.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic
directions with concept mover's distance to measure binary concept
engagement.' Journal of Computational Social Science 1-12.
doi:10.1007/s42001-020-00075-8.
Taylor, Marshall A., and Dustin S. Stoltz.
(2020) 'Concept Class Analysis: A Method for Identifying Cultural
Schemas in Texts.' Sociological Science 7:544-569.
doi:10.15195/v7.a23.
Examples
# load the package that provides CMDist() and the example data
library(text2map)
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
# example 1: a single concept word
cm.dists <- CMDist(dtm,
  cw = "space",
  wv = ft_wv_sample
)
# example 2: a concept vector (centroid of several words)
space <- c("spacecraft", "rocket", "moon")
cen <- get_centroid(anchors = space, wv = ft_wv_sample)
cm.dists <- CMDist(dtm,
  cv = cen,
  wv = ft_wv_sample
)
