Concept Mover's Distance classifies documents of any length along a continuous measure of engagement with a given concept of interest using word embeddings.
CMDist(
dtm,
cw = NULL,
cv = NULL,
wv,
missing = "stop",
scale = TRUE,
sens_interval = FALSE,
alpha = 1,
n_iters = 20L,
parallel = FALSE,
threads = 2L,
setup_timeout = 120L
)
cmdist(
dtm,
cw = NULL,
cv = NULL,
wv,
missing = "stop",
scale = TRUE,
sens_interval = FALSE,
alpha = 1,
n_iters = 20L,
parallel = FALSE,
threads = 2L,
setup_timeout = 120L
)
dtm: Document-term matrix with words as columns. Works with DTMs produced by any popular text analysis package, or using the dtm_builder() function.
cw: Vector with concept word(s) (e.g., c("love", "money"), c("critical thinking")).
cv: Concept vector(s) as output from get_direction(), get_centroid(), or get_regions().
wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
missing: Indicates what action to take if words are not in the embeddings. If missing = "stop" (default), the function stops and an error message states which words are missing. If missing = "remove", the function proceeds with the missing words, or rows with missing words, removed. Missing words will be printed as a message.
scale: Logical (default = TRUE); applies scale() to the output. This sets zero to the mean of the estimates and scales by the standard deviation of the estimates. Document estimates are, therefore, relative to other documents within that specific run, but not necessarily across discrete runs.
sens_interval: Logical (default = FALSE). If TRUE, several CMDs will be estimated on N resampled DTMs, and sensitivity intervals are produced by returning the 2.5 and 97.5 percentiles of the estimated CMDs for a given concept word or concept vector.
alpha: If sens_interval = TRUE, a number indicating the proportion of the document length to be resampled for sensitivity intervals. Default is 1, or 100 percent of each document's length.
n_iters: If sens_interval = TRUE, integer (default = 20L) indicating the number of resampled DTMs to produce for sensitivity intervals.
parallel: Logical (default = FALSE), whether to parallelize estimation.
threads: If parallel = TRUE, an integer (default = 2L) indicating the number of threads to use.
setup_timeout: If parallel = TRUE, maximum number of seconds (default = 120L) a worker attempts to connect to master before failing.
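To make the effect of the scale argument concrete, the following base R sketch standardizes a made-up vector of estimates with scale(). The numbers are hypothetical and are not output from CMDist(); the point is only that the scaled values are centered on the mean of this particular set of documents.

```r
# Hypothetical raw engagement estimates for four documents (made-up numbers)
raw <- c(doc1 = 0.2, doc2 = 0.4, doc3 = 0.6, doc4 = 0.8)

# scale() subtracts the mean and divides by the standard deviation
scaled <- as.numeric(scale(raw))
names(scaled) <- names(raw)

# After scaling, the estimates have mean 0 and standard deviation 1
mean(scaled)
sd(scaled)
```

Because the mean and standard deviation come from the documents in a single run, adding or dropping documents changes every scaled value, which is why estimates are not directly comparable across runs.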
Returns a data frame with the first column as document ids and each subsequent column as the CMD engagement corresponding to each concept word or concept vector. The upper and lower bound estimates will follow each unique CMD if sens_interval = TRUE.
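The idea behind the upper and lower bounds can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: toy_estimate() is a hypothetical stand-in for a per-document CMD estimate, and sampling tokens with replacement is an assumption about how the DTM is resampled.

```r
set.seed(1)

# Hypothetical toy document expressed as word counts
counts <- c(space = 3, moon = 2, nation = 5)

# Hypothetical stand-in for a per-document estimate (NOT the CMD itself):
# the share of tokens equal to "space"
toy_estimate <- function(cnt) cnt[["space"]] / sum(cnt)

alpha   <- 1    # proportion of the document length to resample
n_iters <- 20L  # number of resampled versions of the document

# Expand counts into a vector of tokens, then resample it n_iters times
tokens <- rep(names(counts), times = counts)
n_draw <- round(alpha * length(tokens))

estimates <- vapply(seq_len(n_iters), function(i) {
  # Assumption: tokens are drawn with replacement
  draw <- sample(tokens, size = n_draw, replace = TRUE)
  cnt <- tabulate(match(draw, names(counts)), nbins = length(counts))
  names(cnt) <- names(counts)
  toy_estimate(cnt)
}, numeric(1))

# Sensitivity interval: 2.5 and 97.5 percentiles of the resampled estimates
bounds <- quantile(estimates, probs = c(0.025, 0.975))
bounds
```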
CMDist() requires three things: (1) a document-term matrix (DTM), (2) a
matrix of word embedding vectors, and (3) concept words or concept vectors.
The function uses word counts from the DTM and word similarities
from the cosine similarity of their respective word vectors in a
word embedding model. The "cost" of transporting all the words in a
document to a single vector or a few vectors (denoting a
concept of interest) is the measure of engagement, with higher costs
indicating less engagement. For intuitiveness the output of CMDist()
is inverted such that higher numbers will indicate more engagement
with a concept of interest.
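For intuition, the single-concept case can be sketched in base R. When there is only one target vector, all of a document's mass moves to the same point, so the transport cost reduces to a frequency-weighted sum of distances. The embeddings and DTM below are made-up toy values, cosine distance is taken as one minus cosine similarity, and the inversion is shown as a simple negation; all of these are assumptions for illustration, and CMDist() may normalize, solve, and invert differently.

```r
# Hypothetical 2-dimensional word embeddings (rows are words); made-up values
wv <- rbind(
  space  = c(1.0, 0.0),
  rocket = c(0.8, 0.6),
  tax    = c(0.0, 1.0)
)

# Toy DTM: two documents, words as columns
dtm <- rbind(
  doc1 = c(rocket = 3, tax = 1),
  doc2 = c(rocket = 1, tax = 3)
)

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# The concept of interest is a single word vector
concept <- wv["space", ]

# Cosine distance from each word in the DTM to the concept vector
dists <- vapply(colnames(dtm),
                function(w) 1 - cosine_sim(wv[w, ], concept),
                numeric(1))

# Relative word frequencies: each row sums to 1
weights <- dtm / rowSums(dtm)

# Cost of moving every word in a document to the single concept vector
cost <- as.numeric(weights %*% dists)
names(cost) <- rownames(dtm)

# Invert so that higher numbers indicate more engagement (assumed: negation)
engagement <- -cost
engagement
```

In this toy case doc1, which is mostly "rocket", ends up with a lower cost and therefore higher engagement with "space" than doc2, which is mostly "tax".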
The vector, or vectors, of the concept can be specified in several ways. The simplest is to select a single word from the word embeddings. The analyst can also specify the concept with a few words; the algorithm then splits the overall flow among the concept words, (roughly) depending on which words in the document are nearest to each. The words need not be in the DTM, but they must be in the word embeddings (the function will either stop or remove words not in the embeddings).
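A rough base R sketch of the two points above: checking that concept words exist in the embeddings, and finding which concept word each document word is nearest to. The embeddings are made-up toy values, and assigning each word wholly to its nearest concept word is a simplification for illustration, not a description of the solver CMDist() uses.

```r
# Hypothetical 2-dimensional embeddings (rows are words); made-up values
wv <- rbind(
  love   = c(1.0, 0.0),
  money  = c(0.0, 1.0),
  kiss   = c(0.9, 0.1),
  salary = c(0.2, 0.8)
)

cw        <- c("love", "money")   # concept words
doc_words <- c("kiss", "salary")  # words appearing in a document

# Concept words must be present in the embeddings
missing_words <- setdiff(cw, rownames(wv))

# Normalize rows to unit length so dot products equal cosine similarities
unit <- wv / sqrt(rowSums(wv^2))
sims <- unit[doc_words, , drop = FALSE] %*% t(unit[cw, , drop = FALSE])

# For each document word, the concept word with the highest similarity
nearest <- cw[apply(sims, 1, which.max)]
names(nearest) <- doc_words
nearest
```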
Instead of selecting a word already in the embedding space, the function can
also take a vector extracted from the embedding space in the form of a
centroid (which averages the vectors of several words), a direction (which
uses the offset of several juxtaposed words), or a region (which is built
by clustering words into k regions). The get_centroid(), get_direction(),
and get_regions() functions will extract these.
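The arithmetic behind a centroid and a direction can be sketched in base R with made-up toy embeddings. This only illustrates the averaging and offset ideas described above; the direction is computed here as the mean of paired differences, which is one common choice and an assumption, and get_centroid() and get_direction() may differ in their details (regions are not shown).

```r
# Hypothetical 3-dimensional embeddings (rows are words); made-up values
wv <- rbind(
  spacecraft = c(0.9,  0.1, 0.0),
  rocket     = c(0.8,  0.2, 0.1),
  moon       = c(0.7,  0.0, 0.3),
  rich       = c(0.1,  0.9, 0.2),
  wealthy    = c(0.2,  0.8, 0.1),
  poor       = c(0.1, -0.7, 0.2),
  broke      = c(0.2, -0.8, 0.1)
)

# Centroid: the average of the vectors of several anchor words
anchors  <- c("spacecraft", "rocket", "moon")
centroid <- colMeans(wv[anchors, , drop = FALSE])

# Direction: the average offset between juxtaposed word pairs
# (rich - poor, wealthy - broke)
additions <- wv[c("rich", "wealthy"), , drop = FALSE]
subtracts <- wv[c("poor", "broke"), , drop = FALSE]
direction <- colMeans(additions - subtracts)

centroid
direction
```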
Stoltz, Dustin S., and Marshall A. Taylor. (2019) 'Concept Mover's Distance.' Journal of Computational Social Science 2(2):293-313. doi:10.1007/s42001-019-00048-6.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic directions with concept mover's distance to measure binary concept engagement.' Journal of Computational Social Science 1-12. doi:10.1007/s42001-020-00075-8.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Concept Class Analysis: A Method for Identifying Cultural Schemas in Texts.' Sociological Science 7:544-569. doi:10.15195/v7.a23.
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
# example 1
cm.dists <- CMDist(dtm,
  cw = "space",
  wv = ft_wv_sample
)
# example 2
space <- c("spacecraft", "rocket", "moon")
cen <- get_centroid(anchors = space, wv = ft_wv_sample)
cm.dists <- CMDist(dtm,
  cv = cen,
  wv = ft_wv_sample
)