Word embedding semantic direction extractor: get_direction

get_direction() outputs a vector corresponding to one pole of a "semantic direction" built from sets of antonyms or juxtaposed terms. The output can be used as an input to CMDist() and CoCA(). Anchors must be a two-column data.frame or a list of length == 2.
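
The two accepted shapes for anchors can be sketched in base R as follows. The column and element names (add, subtract) follow the example at the end of this page; the helper is_valid_anchors() is hypothetical, written only to illustrate the stated requirement, and is not part of the package.

```r
# Two-column data.frame: each row is one juxtaposed pair.
anchors_df <- data.frame(
  add = c("woman", "she"),
  subtract = c("man", "he"),
  stringsAsFactors = FALSE
)

# List of length 2: each element holds the terms for one side.
anchors_list <- list(
  add = c("woman", "she", "her"),
  subtract = c("man", "he")
)

# Hypothetical check mirroring the requirement stated above.
is_valid_anchors <- function(x) {
  (is.data.frame(x) && ncol(x) == 2L) ||
    (is.list(x) && !is.data.frame(x) && length(x) == 2L)
}
```

A list lets the two sides have different lengths, which only makes sense for methods that do not require one-to-one pairs (see Details).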

get_direction(anchors, wv, method = "paired", missing = "stop", n_dirs = 1L)

Arguments

anchors: A data frame or list of juxtaposed 'anchor' terms
wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as terms.
method: Indicates the method used to generate the vector offset. Default is 'paired'. See details.
missing: What action to take if terms are not in the embeddings. If missing = "stop" (default), the function stops and an error message states which terms are missing. If missing = "remove", missing terms or rows with missing terms are removed. Missing terms will be printed as a message.
n_dirs: If method = "PCA", an integer indicating how many directions to return. Default = 1L, indicating a single, bipolar, direction.
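
The behavior of the missing argument can be approximated in base R as below. This is a minimal sketch under assumptions, not the package's implementation: check_anchors() is a hypothetical helper, the toy matrix wv stands in for a real embedding model, and only the data.frame form of anchors is handled.

```r
# Hypothetical helper, not part of the package: mimics the two 'missing' options.
check_anchors <- function(anchors, wv, missing = c("stop", "remove")) {
  missing <- match.arg(missing)
  # Anchor terms that have no row in the embedding matrix.
  absent <- setdiff(unlist(anchors, use.names = FALSE), rownames(wv))
  if (length(absent) == 0L) {
    return(anchors)
  }
  if (missing == "stop") {
    stop("Terms not in embeddings: ", paste(absent, collapse = ", "))
  }
  message("Removing rows with missing terms: ", paste(absent, collapse = ", "))
  keep <- anchors[[1]] %in% rownames(wv) & anchors[[2]] %in% rownames(wv)
  anchors[keep, , drop = FALSE]
}

# Toy embedding matrix with terms as row names (values are placeholders).
wv <- matrix(0, nrow = 4, ncol = 2,
             dimnames = list(c("woman", "man", "she", "he"), NULL))
anchors <- data.frame(
  add = c("woman", "she", "queen"),
  subtract = c("man", "he", "king"),
  stringsAsFactors = FALSE
)

# "queen" and "king" are absent, so the third pair is dropped.
kept <- check_anchors(anchors, wv, missing = "remove")
```

Dropping the whole row, rather than only the absent term, keeps the remaining pairs aligned for methods that rely on one-to-one pairing.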

Value

Returns a one-row matrix.

Details

Semantic directions can be estimated using a few methods:

'paired' (default): each individual term is subtracted from exactly one other paired term. There must be the same number of terms for each side of the direction (although one word may be used more than once).
'pooled': terms corresponding to one side of a direction are first averaged, and then these averaged vectors are subtracted. A different number of terms can be used for each side of the direction.
'L2': the vector is calculated the same as with 'pooled' but is then divided by its L2 (Euclidean) norm.
'PCA': vector offsets are calculated for each pair of terms, as with 'paired', and if n_dirs = 1L (the default) then the direction is the first principal component. Users can return more than one direction by increasing the n_dirs parameter.
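
The four methods can be illustrated with a toy embedding matrix in base R. This is a sketch of the arithmetic described above, not the package's code: the random wv matrix is a placeholder, averaging the paired offsets into one vector is an assumption, and so is running the PCA without centering.

```r
# Toy embedding matrix: rows are terms, columns are dimensions.
set.seed(1)
terms <- c("woman", "man", "she", "he", "her", "him")
wv <- matrix(rnorm(length(terms) * 5), nrow = length(terms),
             dimnames = list(terms, NULL))

add <- c("woman", "she", "her")
subtract <- c("man", "he", "him")

# 'paired': subtract each term from its partner, then average the offsets
# (averaging is an assumption of this sketch).
offsets <- wv[add, , drop = FALSE] - wv[subtract, , drop = FALSE]
dir_paired <- matrix(colMeans(offsets), nrow = 1)

# 'pooled': average each side first, then subtract the averages.
dir_pooled <- matrix(
  colMeans(wv[add, , drop = FALSE]) - colMeans(wv[subtract, , drop = FALSE]),
  nrow = 1
)

# 'L2': the pooled vector divided by its Euclidean norm.
dir_l2 <- dir_pooled / sqrt(sum(dir_pooled^2))

# 'PCA': first principal component of the pairwise offsets
# (no centering here, which is an assumption of this sketch).
pca <- prcomp(offsets, center = FALSE)
dir_pca <- t(pca$rotation[, 1, drop = FALSE])
```

When both sides have the same number of terms, the mean of the paired offsets equals the difference of the pooled means, so under this sketch 'paired' and 'pooled' differ only when the sides are unequal in length, a case 'paired' does not allow.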

References

Bolukbasi, T., Chang, K. W., Zou, J., Saligrama, V., and Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv preprint https://arxiv.org/abs/1606.06121v1.
Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai (2016). 'Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.' Proceedings of the 30th International Conference on Neural Information Processing Systems. 4356-4364. https://dl.acm.org/doi/10.5555/3157382.3157584.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Concept Class Analysis: A Method for Identifying Cultural Schemas in Texts.' Sociological Science 7:544-569. doi:10.15195/v7.a23 .
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic directions with concept mover's distance to measure binary concept engagement.' Journal of Computational Social Science 1-12. doi:10.1007/s42001-020-00075-8 .
Kozlowski, Austin C., Matt Taddy, and James A. Evans. (2019). 'The geometry of culture: Analyzing the meanings of class through word embeddings.' American Sociological Review 84(5):905-949. doi:10.1177/0003122419877135 .
Arseniev-Koehler, Alina, and Jacob G. Foster. (2020). 'Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat.' arXiv preprint https://arxiv.org/abs/2003.12133v2.

Author

Dustin Stoltz

Examples


# load example word embeddings
data(ft_wv_sample)

# create anchor list
gen <- data.frame(
  add = c("woman"),
  subtract = c("man")
)

# default method ("paired")
dir <- get_direction(anchors = gen, wv = ft_wv_sample)

# PCA method, returning a single direction
dir <- get_direction(
  anchors = gen, wv = ft_wv_sample,
  method = "PCA", n_dirs = 1L
)