Word embedding semantic region extractor — get

Given a set of word embeddings with \(d\) dimensions and a vocabulary of \(v\) words, get_regions() finds \(k\) semantic regions in the same \(d\)-dimensional space. This, in effect, learns latent topics from an embedding space (a.k.a. topic modeling). The resulting regions are directly comparable to both terms (with cosine similarity) and documents (with Concept Mover's Distance using CMDist()).

get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)

Arguments

wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
k_regions: Integer indicating the number of regions, \(k\), to return.
max_iter: Integer indicating the maximum number of iterations before k-means terminates.
seed: Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)', so the seed is taken from the current time and results can differ between runs.

Value

Returns a matrix of class "dgCMatrix" with \(k\) rows (one per region) and \(d\) columns (one per embedding dimension).

Details

To group words into more encompassing "semantic regions" we use \(k\)-means clustering. We choose \(k\)-means primarily for its ubiquity and the wide range of diagnostic tools available for \(k\)-means clustering.

A word embedding matrix of \(d\) dimensions and \(v\) vocabulary is "clustered" into \(k\) semantic regions, each of which has \(d\) dimensions. Each region is represented by a single point defined by a \(d\)-dimensional vector. The process discretely assigns every word vector to a single region so as to minimize some error function. However, because the resulting regions are in the same dimensions as the word embeddings, we can measure each term's similarity to each region. This, in effect, is a mixed membership topic model similar to topic modeling by Latent Dirichlet Allocation.
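
As a sketch of the two quantities involved, assume the error function is the standard \(k\)-means within-cluster sum of squares (the text above does not specify it). The symbols \(w_i\) (the vector for word \(i\)), \(S_j\) (the set of word vectors assigned to region \(j\)), and \(c_j\) (the centroid representing region \(j\)) are notation introduced here for illustration:

\[
\min_{S_1, \dots, S_k} \; \sum_{j=1}^{k} \sum_{w_i \in S_j} \lVert w_i - c_j \rVert^2,
\qquad
c_j = \frac{1}{|S_j|} \sum_{w_i \in S_j} w_i
\]

Because each \(c_j\) lies in the same \(d\)-dimensional space as the word vectors, the graded similarity of any term to any region can then be measured with cosine similarity:

\[
\operatorname{sim}(w_i, c_j) = \frac{w_i \cdot c_j}{\lVert w_i \rVert \, \lVert c_j \rVert}
\]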

We use the KMeans_arma function from the ClusterR package, which uses the Armadillo library.

References

Butnaru, Andrei M., and Radu Tudor Ionescu. (2017). 'From image to text classification: A novel approach based on clustering word embeddings.' Procedia Computer Science. 112:1783-1792. doi:10.1016/j.procs.2017.08.211
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter, Hongshu Chen, and Guangquan Zhang. (2018). 'Does Deep Learning Help Topic Extraction? A Kernel K-Means Clustering Method with Word Embedding.' Journal of Informetrics. 12(4):1099-1117. doi:10.1016/j.joi.2018.09.004
Arseniev-Koehler, Alina, Susan D. Cochran, Vickie M. Mays, Kai-Wei Chang, and Jacob Gates Foster. (2021). 'Integrating topic modeling and word embedding to characterize violent deaths.' doi:10.31235/osf.io/nkyaq

Author

Dustin Stoltz

Examples


# load example word embeddings
data(ft_wv_sample)

my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 01984
)
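
# A minimal, self-contained sketch (not part of the package) of how
# terms can be compared to regions with cosine similarity.
# Assumptions: the toy matrix `toy_wv` is random data invented for
# illustration, and base R's stats::kmeans() is swapped in for
# KMeans_arma so the sketch runs without extra packages.
set.seed(1984)
toy_wv <- matrix(
  rnorm(40 * 5),
  nrow = 40, ncol = 5,
  dimnames = list(paste0("word", 1:40), NULL)
)

# cluster the 40 toy "word vectors" into 3 regions (centroids are 3 x 5)
toy_km <- kmeans(toy_wv, centers = 3, iter.max = 20)
toy_regions <- toy_km$centers

# scale each row to unit length so a dot product equals cosine similarity
l2_norm <- function(m) m / sqrt(rowSums(m^2))

# 40 x 3 matrix: cosine similarity of every term to every region
term_by_region <- l2_norm(toy_wv) %*% t(l2_norm(toy_regions))

# the five terms most similar to each region (5 x 3 character matrix)
top_terms <- apply(
  term_by_region, 2,
  function(s) names(sort(s, decreasing = TRUE))[1:5]
)
top_terms

# The same steps should carry over to ft_wv_sample and my.regions,
# assuming my.regions is first converted to a dense matrix.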