vignettes/articles/testing-anchor-sets-semantic-relations.Rmd
Word embeddings are commonly used to measure the extent to which a set of target terms is “biased” along a unidimensional semantic relation – a.k.a. dimension, axis, or direction – for example, one ranging from “masculine” to “feminine.” Generalizing from this “gender relation,” analysts now use the same basic procedure to measure all sorts of relations: old to young, big to small, liberal to conservative, rich to poor, and so on.
While there are several ways one could derive a “dimension,” all procedures involve selecting terms to “anchor” the “poles” of the juxtaposition. For example, get_anchors() provides several anchor sets as starting points for defining relations:
get_anchors(relation = "purity")
add subtract
1 pure impure
2 purity impurity
3 cleanliness uncleanliness
4 clean dirty
5 pureness impureness
6 stainless stain
7 untainted tainted
8 immaculate filthy
9 purity dirt
10 fresh stale
11 sanitation stain
Boutyline and Johnston (2023) demonstrate a few methods to determine how well each juxtaposing pair of anchor terms in a given set defines a relation. We implement one such method which they call “PairDir”:
“We find that PairDir – a measure of parallelism between the offset vectors (and thus of the internal reliability of the estimated relation) – consistently outperforms other reliability metrics in explaining axis accuracy.”
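To build intuition for what PairDir captures, here is a rough sketch (our own reading of the metric, not the paper’s or the package’s exact implementation): treat each anchor pair as an offset (difference) vector, and score a pair by the average cosine similarity between its offset and the offsets of the other pairs.
# a sketch of the PairDir intuition (assumed, not the exact implementation):
# each anchor pair yields an offset vector; a pair's score is its mean
# cosine similarity with the other pairs' offset vectors
pair_dir_sketch <- function(anchors, wv) {
  # offset vectors, one row per anchor pair
  offsets <- wv[anchors[[1]], , drop = FALSE] - wv[anchors[[2]], , drop = FALSE]
  # normalize rows so that dot products are cosine similarities
  offsets <- offsets / sqrt(rowSums(offsets^2))
  cos_mat <- tcrossprod(offsets)
  # mean similarity of each offset with the *other* offsets
  diag(cos_mat) <- NA
  rowMeans(cos_mat, na.rm = TRUE)
}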
Below, we walk through how to replicate a portion of Boutyline and Johnston (2023), namely Table 4.
We will need text2map (version 0.1.9).
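If it is not already installed, it is available from CRAN and can be loaded in the usual way (a standard setup step, shown here for completeness):
# install.packages("text2map")
library(text2map)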
We will also need the well-known Google News word2vec embeddings, which we can obtain with text2map.pretrained:
# remotes::install_gitlab("culturalcartography/text2map.pretrained")
library(text2map.pretrained)
After loading the package, we need to download the model (once per machine) and then load it into the session (it is rather large, so it will take a minute or so to load).
download_pretrained("vecs_cbow300_googlenews")
data("vecs_cbow300_googlenews")
Boutyline and Johnston (2023) take anchors used by Kozlowski et al. (2019) to define a soft-to-hard relation. We can load our anchors into R using this:
df_anchors <- data.frame(
  a = c("soft", "supple", "delicate", "pliable", "fluffy", "mushy", "softer", "softest"),
  z = c("hard", "tough", "dense", "rigid", "firm", "solid", "harder", "hardest")
)
Then, we use the following function to test the quality of these pairs using the PairDir method:
test_anchors(df_anchors, vecs_cbow300_googlenews)
anchor_pair pair_dir
1 AVERAGE 0.168506001
2 soft-hard 0.296905021
3 supple-tough 0.192454715
4 delicate-dense -0.003922452
5 pliable-rigid 0.123870932
6 fluffy-firm 0.143672458
7 mushy-solid 0.110253793
8 softer-harder 0.229002074
9 softest-hardest 0.255811469
Boutyline and Johnston (2023, 26) use these results to guide the selection of new anchor pairs:
After identifying “delicate” as the term we want to replace, we iteratively substitute it with each of the roughly 100,000 words in our embedding’s vocabulary (but not already in this anchor set) and calculate the resulting PairDir score for each substitution. We then take the 100 terms that yielded the highest PairDir scores and manually examine them as candidate replacements, looking for a term that, when contrasted with “dense”, best conceptually describes the latent cultural dimension this axis is meant to measure.
We have 3 million words in our embeddings – that’s too many for our demonstration! First, let’s remove words already in our anchor set. Second, let’s remove any term longer than a unigram or that includes punctuation, as well as any with capital letters, as they tend to be proper nouns or acronyms.
candidates <- rownames(vecs_cbow300_googlenews)
candidates <- candidates[!candidates %in% unlist(df_anchors)]
candidates <- candidates[!grepl("_", candidates, fixed = TRUE)]
candidates <- candidates[!grepl("[[:punct:]]", candidates)]
candidates <- candidates[!grepl("[[:upper:]]", candidates)]
length(candidates)
158882
That is still a lot of words. Normally, we would select a set of candidate terms to test, but just as a demonstration, let’s randomly sample a manageable number from our vocabulary. We will put these in a data.frame, all juxtaposed against “dense.”
# randomly sample 100
set.seed(61761)
idx_samp <- sample(length(candidates), 100)
# create data.frame
df_alts <- data.frame(
  a = candidates[idx_samp],
  z = "dense"
)
Now, we’ll use a for-loop to add each candidate pair, one at a time, to our previous anchor set and grab the candidate’s PairDir score (this takes about 3-4 minutes with 100 candidate pairs):
ls_res <- list()
ptm <- proc.time()
for (i in seq_len(nrow(df_alts))) {
  # add the candidate pair to the original anchor set, score the set, and
  # keep row 10: the candidate's own PairDir (row 1 is AVERAGE, rows 2-9
  # are the original eight pairs)
  ls_res[[i]] <- test_anchors(
    rbind(df_anchors, df_alts[i, ]),
    vecs_cbow300_googlenews
  )[10, ]
}
proc.time() - ptm
Now, we can combine these results and check which candidate pairs have the highest PairDir scores:
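One way to do this (a minimal sketch of this step) is to bind the per-candidate rows into a single data.frame, sort by pair_dir, and look at the top 10:
df_res <- do.call(rbind, ls_res)
df_res <- df_res[order(df_res$pair_dir, decreasing = TRUE), ]
rownames(df_res) <- NULL
head(df_res, 10)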
anchor_pair pair_dir
1 vomitous-dense 0.05675224
2 doosras-dense 0.05661879
3 lopper-dense 0.04916034
4 shinier-dense 0.04835953
5 tranquil-dense 0.04774737
6 gummi-dense 0.04327697
7 ponderous-dense 0.04104144
8 telephoto-dense 0.03877452
9 womanhood-dense 0.03816549
10 dowels-dense 0.03704921
None of these randomly constructed pairs are very good! But we get a sense of how we could iterate through possible candidate pairs and test them using the PairDir method.
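Once a satisfactory anchor set is settled on, it can be put to use. As a final sketch (not part of Boutyline and Johnston’s replication, and with purely illustrative target terms), we could build the soft-to-hard direction with text2map’s get_direction() and project terms onto it with cosine similarity:
# build the direction from the anchor set; we assume here it points toward
# the first column's pole, so higher cosine values lean "soft"
dir_sh <- as.numeric(get_direction(df_anchors, vecs_cbow300_googlenews))
# a few illustrative target terms
targets <- c("pillow", "velvet", "granite", "steel")
wv_t <- vecs_cbow300_googlenews[targets, , drop = FALSE]
# cosine similarity between each target vector and the direction
sims <- as.vector(wv_t %*% dir_sh) / (sqrt(rowSums(wv_t^2)) * sqrt(sum(dir_sh^2)))
sort(setNames(sims, targets), decreasing = TRUE)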