```r
knitr::opts_chunk$set(collapse = FALSE, comment = "##", tidy = TRUE)
```

```r
devtools::install_github("prodriguezsosa/labelProp")

library(labelProp)
# other libraries used in this guide
library(dplyr)
library(quanteda)
library(conText)
```
`labelProp` offers a semi-supervised approach to labeling text data, leveraging vector representations of text and network methods. The general problem `labelProp` solves is the following: given a collection of nodes --these can be words, short texts, etc.-- for which we have vector representations --importantly, on the same space--, including a small subset for which we have labels --which we call seeds--, how do we determine the labels of the remaining nodes? Currently `labelProp` offers two methods: a random-walk (`rw`) algorithm and a nearest-neighbors (`nns`) algorithm. The advantage of `rw` over `nns` is that...

`labelProp` comes with three datasets used in the examples below: `anes2016_glove`, a subset of pre-trained Stanford GloVe (300-dimensional) word embeddings; `anes2016`, a collection of short open-ended responses from the 2016 American National Election Study (ANES); and `khodakA`, the corresponding transformation matrix estimated by Khodak et al.
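To get a quick feel for these objects, one can inspect them directly; the comments below describe the structure implied by the examples in this guide, not documented specifications:

```r
# inspect the bundled data (structure inferred from the examples below)
dim(anes2016_glove)      # matrix of pre-trained word embeddings, one row per word
head(anes2016$response)  # open-ended survey responses (character)
dim(khodakA)             # transformation matrix passed to conText::dem()
```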
Dictionary-based approaches are popular in the social sciences: they are transparent and easy to implement. However, while there are plenty of off-the-shelf dictionaries available --e.g. LIWC--, these are often ill-suited to the corpus and/or concept of interest. Building a dictionary from scratch, on the other hand, is a labor-intensive task. Given a small set of seed dictionary terms, `labelProp` scores unlabeled terms, giving users a principled way of "expanding" their locally-tailored dictionaries.
To use the `rw` algorithm we first build a transition matrix, using our set of pre-trained vector representations as input. This can be done with the `labelProp::build_transition_matrix()` function, a wrapper around `parallelDist` that allows for parallelization. `beta` is the only other parameter in `labelProp` specific to the `rw` algorithm: it specifies the extent to which the algorithm favors local (similar labels for neighbors) vs. global (correct labels on seed nodes) consistency -- lower (higher) values emphasize local (global) consistency. In practice we've found results to be fairly robust to the specification of `beta`.
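To build intuition for this trade-off, below is a minimal sketch of a Zhou et al. (2004)-style label-spreading update consistent with the description above. This is an illustration only, not `labelProp`'s internal implementation; the exact update rule and the role of `beta` in the package may differ:

```r
# Illustrative label-spreading update (NOT labelProp internals).
# T_mat: n x n transition matrix; Y: n x k seed-label indicator matrix
# (Y[i, j] = 1 if node i is a seed for class j, 0 otherwise).
propagate_sketch <- function(T_mat, Y, beta = 0.5, n_iter = 100) {
  F_mat <- Y
  for (i in seq_len(n_iter)) {
    # lower beta: the neighbor term dominates (local consistency);
    # higher beta: seed labels dominate (global consistency)
    F_mat <- (1 - beta) * (T_mat %*% F_mat) + beta * Y
  }
  F_mat  # row i holds node i's propagated score for each class
}
```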
To use the `nns` algorithm, users must provide a set of vector representations, one for each node. For either method, the user must then provide the set of labeled nodes --the `seeds`-- in the form of a named list.
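As a rough illustration of nearest-neighbors scoring, one natural approach is to score every node by its cosine similarity to the average seed vector of each class; `labelProp`'s `nns` method may differ in its details:

```r
# Illustrative nearest-neighbors scoring (not necessarily labelProp's exact method).
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

nns_sketch <- function(node_vs, seeds) {
  sapply(seeds, function(seed_nodes) {
    centroid <- colMeans(node_vs[seed_nodes, , drop = FALSE])  # average seed vector
    apply(node_vs, 1, cos_sim, y = centroid)  # similarity of every node to the centroid
  })
}
```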
Users can further choose to compute standard errors for the assigned scores using a bootstrapping procedure: the algorithm is replicated `num_bootstrap` times, each time using a sample --a proportion `prop_seeds`-- of the seeds. A permutation-based approach is used to compute an empirical p-value: if `permute = TRUE`, seeds are sampled at random `num_permutations` times, and an empirical null distribution is computed against which to compare the estimated score of each node. Users can also choose to normalize scores by the sum of the scores across the various classes using the `softmax` parameter -- this is useful if, for example, we are building a sentiment dictionary and want to rank words according to their polarity score. Note that the unlabeled nodes with the highest scores will likely differ depending on the algorithm; it will often make sense to use both -- for example, when combined with human selection of words.
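Putting these options together, a call might look as follows. The specific values chosen here are illustrative assumptions, not package defaults (see `?labelProp` for the exact interface); `num_bootstrap`, `prop_seeds`, `permute`, `num_permutations` and `softmax` are the parameters described above:

```r
# hypothetical call illustrating the inference options described above;
# the values used here are assumptions, not package defaults
nns_labels <- labelProp(x = anes2016_glove, seeds = seeds, method = "nns",
                        num_bootstrap = 100, prop_seeds = 0.8,
                        permute = TRUE, num_permutations = 100,
                        softmax = TRUE)
```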
```r
# to use the random-walk algorithm we first build a transition matrix
transition_matrix <- build_transition_matrix(x = anes2016_glove, threads = 6L)

# define seeds (labeled nodes); if the list is unnamed, "class1", "class2" etc. will be used as labels
seeds <- list("immigration" = c("immigration", "immigrants", "immigrant"),
              "economy" = c("jobs", "unemployment", "wages"))

# propagate labels using rw
rw_labels <- labelProp(x = transition_matrix, seeds = seeds, method = "rw", beta = 0.5)

# propagate labels using nns; notice the main input, x, is the set of vector representations
nns_labels <- labelProp(x = anes2016_glove, seeds = seeds, method = "nns")

# check output for economy
rw_labels[["economy"]] %>% arrange(-score) %>% head()
nns_labels[["economy"]] %>% arrange(-score) %>% head()

# check output for immigration
rw_labels[["immigration"]] %>% arrange(-score) %>% head()
nns_labels[["immigration"]] %>% arrange(-score) %>% head()
```
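Having scored the unlabeled terms, one natural next step --a sketch, assuming each element of the output is a data frame with `node` and `score` columns, as the calls above suggest-- is to vet the top-scoring terms and assemble them into a `quanteda` dictionary for downstream use:

```r
# take the top 20 terms per class (in practice, after human review) and
# build a quanteda dictionary from them
top_terms <- lapply(rw_labels, function(df) df %>% slice_max(score, n = 20) %>% pull(node))
expanded_dict <- quanteda::dictionary(top_terms)
```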
`labelProp` is not limited to words; indeed, it can be applied to any set of inputs for which we have, or can estimate, vector representations. In the two examples that follow we use `conText`, together with a subset of Stanford GloVe (300-dimensional) embeddings and their corresponding transformation matrix --estimated by Khodak et al.--, to embed our short ANES open-ended responses. Since these document embeddings lie on the same space as our set of pre-trained word embeddings, we can use our seed terms to score documents according to their relevance to each class.
```r
# tokenize documents
anes_toks <- unique(anes2016$response) %>% tokens()

# embed documents using conText
doc_vs <- anes_toks %>%
  dfm() %>%
  dem(pre_trained = anes2016_glove, transform = TRUE, transform_matrix = khodakA, verbose = FALSE) %>%
  as.matrix()

# seed word vectors (from pre-trained embeddings)
word_vs <- anes2016_glove[unlist(seeds), ]

# combine word and document vectors into a single matrix of node vectors
node_vs <- rbind(doc_vs, word_vs)

# transition matrix
transition_matrix <- build_transition_matrix(x = node_vs, threads = 6L)

# label propagate
rw_labels <- labelProp(x = transition_matrix, seeds = seeds, method = "rw", beta = 0.5)

# output (filtering to just document nodes)
economy_docs <- rw_labels[["economy"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)
immigration_docs <- rw_labels[["immigration"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)

# combine with texts
tokens_subset(anes_toks, docid(anes_toks) %in% economy_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
tokens_subset(anes_toks, docid(anes_toks) %in% immigration_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
```
We can also use embedded documents --whose labels we know-- as seeds.
```r
# documents as seeds
doc_seeds <- list("immigration" = c("text4771", "text4024"),
                  "economy" = c("text2010", "text5925"))

# transition matrix
transition_matrix <- build_transition_matrix(x = doc_vs, threads = 6L)

# label propagate
rw_labels <- labelProp(x = transition_matrix, seeds = doc_seeds, method = "rw", beta = 0.5)

# output (filtering to just document nodes)
economy_docs <- rw_labels[["economy"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)
immigration_docs <- rw_labels[["immigration"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)

# combine with texts
tokens_subset(anes_toks, docid(anes_toks) %in% economy_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
tokens_subset(anes_toks, docid(anes_toks) %in% immigration_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
```