vignettes/semantic-directions-CMD.Rmd
semantic-directions-CMD.Rmd
pacman::p_load(
tidyverse, gutenbergr,
text2vec, MASS,
ggplot2, stringi,
textclean, ggpubr,
ggrepel, text2map,
install = TRUE
)
# if you'd like to match the plot aesthetics load
text2map.theme::set_theme()
We are interested in the relationship between how much – i.e., the magnitude – one of Shakespeare’s plays engages with the concept of death and the number of actual people who die in a play (i.e. body counts). First, we need to get the text of the plays.
Go ahead and download all of Shakespeare’s plays from Project
Gutenberg using the gutenbergr
package. The output will be
a data.frame with each row being one line, but we want to collapse all
of the lines for each play. We will first use group_by
to
group each play together, and then summarize
the text
column within each group.
data("meta_shakespeare")
# Grab the text from Project GUTENBERG
df_plays <- meta_shakespeare |>
dplyr::select(gutenberg_id) |>
gutenberg_download() |>
group_by(gutenberg_id) |>
summarize(text = paste(text, collapse = ", "))
Next, we will take care of some preprocessing. Make sure there are no special characters, remove punctuation, take care of curly quotes (which can be a pain), and then smooshing contractions. Finally, we will remove any digits that might be floating around (this is downloaded from the Internet after all), and then make sure there is only a single space between words.
df_plays <- df_plays |> mutate(
## transliterate and lowercase
text = replace_curly_quote(text),
text = stri_trans_general(text, id = "Any-Latin; Latin-ASCII"),
text = tolower(text),
## punctuation
text = gsub("(\\w+[_'-]+\\w+)|[[:punct:]]+", "\\1", text),
text = replace_contraction(text),
## numbers and spaces
text = gsub("[[:digit:]]+", " ", text),
text = gsub("[[:space:]]+", " ", text),
text = trimws(text)
)
Finally, we will need to convert the text of the plays into a
Document-Term Matrix. The text2map
package has an efficient
DTM builder called dtm_builder()
– it has very few options,
but the straightforward unigram DTM is typically all we need.
dtm_plays <- df_plays |>
dtm_builder(text, gutenberg_id)
# text2map's handy DTM stats function
tbl <- dtm_stats(dtm_plays)
knitr::kable(tbl, row.names = TRUE)
You will need to load pre-trained word embeddings. We use the fastText
English embeddings. You can load it using
text2map.pretrained
:
You can also download the file directly into your R session using the following:
# install if necessary
remotes::install_gitlab("culturalcartography/text2map.pretrained")
library(text2map.pretrained)
# download once per machine
download_pretrained("vecs_fasttext300_wiki_news")
# load with data() once per session
data("vecs_fasttext300_wiki_news")
Next we will want to extract a “semantic direction” from the
embedding space using the get_direction()
function.
adds <- c("death", "casualty", "demise", "dying", "fatality")
subs <- c("life", "survivor", "birth", "living", "endure")
terms <- cbind(adds, subs)
sem_dir <- get_direction(terms, vecs_fasttext300_wiki_news)
The next step is to use CMDist
to measure how “close”
each of the words in the plays (defined by the DTM) engages with the
location of the semantic direction. We also want to compare engagement
with semantic direction that points away from life toward death to
engagement with the individual terms, “life” and “death.”
# Run CMD for concept words and semantic direction
cmd_plays <- dtm_plays |>
CMDist(
cw = c("life", "death"),
cv = sem_dir,
wv = vecs_fasttext300_wiki_news,
scale = TRUE
)
# combine with metadata
cmd_plays <- cmd_plays |>
mutate(doc_id = as.numeric(doc_id)) |>
left_join(meta_plays,
by = c("doc_id" = "gutenberg_id")
)
Below we show that when we only measure a document’s distance to life or death (as opposed to the semantic direction), there is a positive correlation between the two measures. And, with other juxtaposing concepts, the correlation is considerably higher.
Below we compare how well engagement with life, death, or the life-to-death direction predicts the actual number of people who die in a given play. To decide on an estimator, we first look at the distribution of our outcome variable, body_count.
# Decide on estimator
ggplot(cmd_plays, aes(body_count)) +
geom_histogram(binwidth = 1) # Should do a count model
This tells us we should use a count model.
m1 <- glm.nb(body_count ~ death_pole,
data = cmd_plays
)
m2 <- glm(body_count ~ death_pole,
data = cmd_plays, family = "poisson"
)
m3 <- glm.nb(body_count ~ life,
data = cmd_plays
)
m4 <- glm(body_count ~ life,
data = cmd_plays, family = "poisson"
)
m5 <- glm.nb(body_count ~ death,
data = cmd_plays
)
m6 <- glm(body_count ~ death,
data = cmd_plays, family = "poisson"
)
# Specifically an nbreg
pchisq(2 * (logLik(m1) - logLik(m2)), df = 1, lower.tail = FALSE)
pchisq(2 * (logLik(m3) - logLik(m4)), df = 1, lower.tail = FALSE)
pchisq(2 * (logLik(m5) - logLik(m6)), df = 1, lower.tail = FALSE)
# Models
glm.nb(body_count ~ death_pole, data = cmd_plays) |> summary()
glm.nb(body_count ~ poly(death_pole, 2), data = cmd_plays) |> summary()
glm.nb(body_count ~ life, data = cmd_plays) |> summary()
glm.nb(body_count ~ death, data = cmd_plays) |> summary()
# Predicted Body Counts
new_data <- data.frame(death_pole = -1:1) # death pole
new_data$phat <- predict(m1, new_data, type = "response")
new_data
new_data <- data.frame(death = -1:1) # death
new_data$phat <- predict(m5, new_data, type = "response")
new_data