get_regions: Word embedding semantic region extractor

View source: R/utils-embedding.R

get_regions R Documentation

Word embedding semantic region extractor

Description

Given a set of word embeddings of d dimensions and v vocabulary, get_regions() finds k semantic regions in d dimensions. This, in effect, learns latent topics from an embedding space (a.k.a. topic modeling), which are directly comparable to both terms (with cosine similarity) and documents (with Concept Mover's distance using CMDist()).

Usage

get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)

Arguments

wv

Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.

k_regions

Integer indicating the number of regions, k, to return.

max_iter

Integer indicating the maximum number of iterations before k-means terminates.

seed

Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)', so the seed is taken from the current time and results can differ between runs.

Details

To group words into more encompassing "semantic regions" we use k-means clustering. We choose k-means primarily for its ubiquity and the wide range of diagnostic tools available for k-means clustering.

A word embedding matrix of d dimensions and v vocabulary is "clustered" into k semantic regions, each of which has d dimensions. Each region is represented by a single point defined by a d-dimensional vector. The process discretely assigns every word vector to one region so as to minimize some error function. However, because the resulting regions are in the same dimensions as the word embeddings, we can measure each term's similarity to each region. This, in effect, is a mixed membership topic model similar to topic modeling by Latent Dirichlet Allocation.

We use the KMeans_arma function from the ClusterR package which uses the Armadillo library.
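
The idea described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code get_regions() runs: it uses stats::kmeans() as a stand-in for KMeans_arma, a random matrix in place of real embeddings, and a hypothetical helper named l2_normalize.

```r
# Toy stand-in for an embedding matrix: v = 60 words, d = 8 dimensions.
# (Random values; a real embedding model would be used in practice.)
set.seed(1984)
v <- 60
d <- 8
k <- 3
wv <- matrix(rnorm(v * d), nrow = v, ncol = d,
             dimnames = list(paste0("word", seq_len(v)), NULL))

# Hard assignment: stats::kmeans() places every word in exactly one cluster.
# (Assumption: base R k-means substitutes for ClusterR::KMeans_arma here.)
km <- kmeans(wv, centers = k, iter.max = 20L)
regions <- km$centers                       # k x d matrix, one row per region
rownames(regions) <- paste0("region_", seq_len(k))

# Hypothetical helper: scale each row to unit length.
l2_normalize <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity between every word and every region (v x k matrix).
sims <- l2_normalize(wv) %*% t(l2_normalize(regions))

# Graded membership: each word has a similarity to all k regions.
head(round(sims, 2))

# Five closest words per region, a rough way to label a region.
top_terms <- apply(sims, 2, function(s) names(sort(s, decreasing = TRUE))[1:5])
top_terms
```

The cluster labels in km$cluster are the discrete assignments, while the rows of sims give every word a graded similarity to every region, which is what makes the result read like a mixed membership model.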

Value

Returns a matrix of class "dgCMatrix" with k rows (one per region) and d dimensions.

Author(s)

Dustin Stoltz

References

Butnaru, Andrei M., and Radu Tudor Ionescu. (2017). 'From image to text classification: A novel approach based on clustering word embeddings.' Procedia Computer Science. 112:1783-1792. doi:10.1016/j.procs.2017.08.211.
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter, Hongshu Chen, and Guangquan Zhang. (2018). 'Does Deep Learning Help Topic Extraction? A Kernel K-Means Clustering Method with Word Embedding.' Journal of Informetrics. 12(4):1099-1117. doi:10.1016/j.joi.2018.09.004.
Arseniev-Koehler, Alina, Susan D. Cochran, Vickie M. Mays, Kai-Wei Chang, and Jacob Gates Foster. (2021). 'Integrating topic modeling and word embedding to characterize violent deaths.' doi:10.31235/osf.io/nkyaq.

Examples


# load example word embeddings
data(ft_wv_sample)

my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 01984
)

text2map documentation built on July 9, 2023, 6:35 p.m.