Given a set of word embeddings of \(d\) dimensions and \(v\) vocabulary, get_regions() finds \(k\) semantic regions in \(d\) dimensions. This, in effect, learns latent topics from an embedding space (a.k.a. topic modeling), which are directly comparable to both terms (with cosine similarity) and documents (with Concept Mover's Distance using CMDist()).
get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)
wv: Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.
k_regions: Integer indicating the number of regions, \(k\), to return.
max_iter: Integer indicating the maximum number of iterations before k-means terminates.
seed: Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)'.
Returns a matrix of class "dgCMatrix" with \(k\) rows (one per region) and \(d\) columns (the embedding dimensions).
To group words into more encompassing "semantic regions" we use \(k\)-means clustering. We choose \(k\)-means primarily for its ubiquity and the wide range of diagnostic tools available for \(k\)-means clustering.
A word embedding matrix of \(d\) dimensions and \(v\) vocabulary is "clustered" into \(k\) semantic regions, each of which has \(d\) dimensions. Each region is represented by a single point defined by a \(d\)-dimensional vector. The process discretely assigns every word vector to one region so as to minimize some error function. However, because the resulting regions are in the same dimensions as the word embeddings, we can also measure each term's similarity to each region. This, in effect, is a mixed membership topic model similar to topic modeling by Latent Dirichlet Allocation.
We use the KMeans_arma function from the ClusterR package, which uses the Armadillo library.
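The two steps described above (a hard assignment of words to regions, followed by a soft comparison of every word to every region) can be sketched in base R. This is only an illustration under stated assumptions: stats::kmeans() stands in for the KMeans_arma clustering step, the toy matrix and its values are invented, and the helper unit_rows() is a hypothetical name, not part of the package.

```r
# Toy embedding matrix: v = 6 words (rows), d = 3 dimensions (columns).
# The values are invented for illustration only.
set.seed(1984)
wv <- rbind(
  cat    = c(0.9, 0.1, 0.0),
  dog    = c(0.8, 0.2, 0.1),
  kitten = c(0.9, 0.2, 0.0),
  car    = c(0.1, 0.9, 0.1),
  truck  = c(0.0, 0.8, 0.2),
  bus    = c(0.1, 0.9, 0.2)
)

# Hard assignment: stats::kmeans() is an assumed stand-in for the
# clustering routine the package actually calls.
km <- kmeans(wv, centers = 2, iter.max = 20, nstart = 5)
regions <- km$centers  # k x d matrix, one centroid per region
rownames(regions) <- paste0("region_", seq_len(nrow(regions)))

# Soft comparison: cosine similarity of every word to every region centroid.
# unit_rows() (hypothetical helper) scales each row to unit length.
unit_rows <- function(m) m / sqrt(rowSums(m^2))
sims <- unit_rows(wv) %*% t(unit_rows(regions))  # v x k matrix

print(km$cluster)      # discrete region label for each word
print(round(sims, 2))  # graded similarity of each word to each region
```

Each word receives exactly one label in km$cluster, but a full row of similarities in sims, which is what makes the regions usable in a mixed-membership way.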
Butnaru, Andrei M., and Radu Tudor Ionescu. (2017). 'From image to text classification: A novel approach based on clustering word embeddings.' Procedia Computer Science. 112:1783-1792. doi:10.1016/j.procs.2017.08.211
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter, Hongshu Chen, and Guangquan Zhang. (2018). 'Does Deep Learning Help Topic Extraction? A Kernel K-Means Clustering Method with Word Embedding.' Journal of Informetrics. 12(4):1099-1117. doi:10.1016/j.joi.2018.09.004
Arseniev-Koehler, Alina, Susan D. Cochran, Vickie M. Mays, Kai-Wei Chang, and Jacob Gates Foster. (2021). 'Integrating topic modeling and word embedding to characterize violent deaths.' doi:10.31235/osf.io/nkyaq
# load example word embeddings
data(ft_wv_sample)
my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 1984
)
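One way to interpret the returned regions is to list the terms closest to each one by cosine similarity. The sketch below is a minimal illustration, not part of the package: closest_terms() is a hypothetical helper, and the toy matrices are invented so that the block runs on its own. The final commented line shows how it might be applied to the objects above, assuming the Matrix package is available so that as.matrix() can convert the "dgCMatrix".

```r
# Hypothetical helper (not part of the package): list the n terms whose
# vectors have the highest cosine similarity to each region.
closest_terms <- function(wv, regions, n = 3L) {
  unit_rows <- function(m) m / sqrt(rowSums(m^2))
  sims <- unit_rows(wv) %*% t(unit_rows(regions))  # terms x regions
  out <- lapply(seq_len(ncol(sims)), function(j) {
    names(sort(sims[, j], decreasing = TRUE))[seq_len(n)]
  })
  names(out) <- rownames(regions)
  out
}

# Toy stand-ins for an embedding matrix and a region matrix (invented values).
toy_wv <- rbind(
  apple = c(1.0, 0.1),
  pear  = c(0.9, 0.2),
  train = c(0.1, 1.0),
  tram  = c(0.2, 0.9)
)
toy_regions <- rbind(fruit = c(1, 0), transit = c(0, 1))

top <- closest_terms(toy_wv, toy_regions, n = 2L)
print(top)

# With the objects from the example above this might look like:
# closest_terms(ft_wv_sample, as.matrix(my.regions), n = 10L)
```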