```r
knitr::opts_chunk$set(collapse = FALSE, comment = "##", tidy = TRUE)
```

```r
devtools::install_github("prodriguezsosa/labelProp")

library(labelProp)
# other libraries used in this guide
library(dplyr)
library(quanteda)
library(conText)
```
`labelProp` offers a semi-supervised approach to labeling text data, leveraging vector representations of text and network methods. The general problem `labelProp` solves is the following: given a collection of nodes --these can be words, short texts, etc.-- for which we have vector representations --importantly, on the same space--, including a small subset for which we have labels --which we call seeds--, how do we determine the labels of the remaining nodes? Currently `labelProp` offers two methods: a random-walk (`rw`) algorithm and a nearest-neighbors (`nns`) algorithm. The advantage of `rw` over `nns` is that...

`labelProp` comes with three datasets used in the examples below: `anes2016_glove`, a subset of pre-trained Stanford GloVe (300-dimensional) word embeddings; `anes2016`, a collection of short open-ended responses from the 2016 American National Election Study (ANES); and `khodakA`, the corresponding transformation matrix estimated by Khodak et al.
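To get a quick feel for these objects, one can inspect them directly; the comments below describe the structure implied by the examples in this guide, not documented specifications:

```r
# inspect the bundled data (structure inferred from the examples below)
dim(anes2016_glove)      # matrix of pre-trained word embeddings, one row per word
head(anes2016$response)  # open-ended survey responses (character)
dim(khodakA)             # transformation matrix passed to conText::dem()
```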
Dictionary-based approaches are popular in the social sciences: they are transparent and easy to implement. However, while there are plenty of off-the-shelf dictionaries available --e.g. LIWC--, these are often ill-suited to the corpus and/or concept of interest. Building a dictionary from scratch, on the other hand, is a labor-intensive task. Given a small set of seed dictionary terms, `labelProp` scores unlabeled terms, giving users a principled way of "expanding" their locally-tailored dictionaries.
To use the `rw` algorithm we first build a transition matrix, using our set of pre-trained vector representations as input. This can be done with the `labelProp::build_transition_matrix()` function, a wrapper around `parallelDist` that allows for parallelization. `beta` is the only other parameter in `labelProp` specific to the `rw` algorithm: it specifies the extent to which the algorithm favors local (similar labels for neighbors) vs. global (correct labels on seed nodes) consistency -- lower (higher) values emphasize local (global) consistency. In practice we've found results to be fairly robust to the specification of `beta`.
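To build intuition for this trade-off, below is a minimal sketch of a Zhou et al. (2004)-style label-spreading update consistent with the description above. This is an illustration only, not `labelProp`'s internal implementation; the exact update rule and the role of `beta` in the package may differ:

```r
# Illustrative label-spreading update (NOT labelProp internals).
# T_mat: n x n transition matrix; Y: n x k seed-label indicator matrix
# (Y[i, j] = 1 if node i is a seed for class j, 0 otherwise).
propagate_sketch <- function(T_mat, Y, beta = 0.5, n_iter = 100) {
  F_mat <- Y
  for (i in seq_len(n_iter)) {
    # lower beta: the neighbor term dominates (local consistency);
    # higher beta: seed labels dominate (global consistency)
    F_mat <- (1 - beta) * (T_mat %*% F_mat) + beta * Y
  }
  F_mat  # row i holds node i's propagated score for each class
}
```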
To use the `nns` algorithm, users must provide a set of vector representations, one for each node. For either method, the user must then provide the set of labeled nodes --the `seeds`-- in the form of a named list.
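As a rough illustration of nearest-neighbors scoring, one natural approach is to score every node by its cosine similarity to the average seed vector of each class; `labelProp`'s `nns` method may differ in its details:

```r
# Illustrative nearest-neighbors scoring (not necessarily labelProp's exact method).
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

nns_sketch <- function(node_vs, seeds) {
  sapply(seeds, function(seed_nodes) {
    centroid <- colMeans(node_vs[seed_nodes, , drop = FALSE])  # average seed vector
    apply(node_vs, 1, cos_sim, y = centroid)  # similarity of every node to the centroid
  })
}
```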
Users can further choose to compute standard errors for the assigned scores using a bootstrapping procedure: the algorithm is replicated `num_bootstrap` times, each time using a sample --a proportion `prop_seeds`-- of the seeds. A permutation-based approach is used to compute an empirical p-value: if `permute = TRUE`, seeds are sampled at random `num_permutations` times, and an empirical null distribution is computed against which to compare the estimated score of each node. Users can also choose to normalize scores by the sum of the scores across the various classes using the `softmax` parameter -- this is useful if, for example, we are building a sentiment dictionary and want to rank words according to their polarity score. Note that the unlabeled nodes with the highest scores will likely differ depending on the algorithm; it will often make sense to use both -- for example, when combined with human selection of words.
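Putting these options together, a call might look as follows. The specific values chosen here are illustrative assumptions, not package defaults (see `?labelProp` for the exact interface); `num_bootstrap`, `prop_seeds`, `permute`, `num_permutations` and `softmax` are the parameters described above:

```r
# hypothetical call illustrating the inference options described above;
# the values used here are assumptions, not package defaults
nns_labels <- labelProp(x = anes2016_glove, seeds = seeds, method = "nns",
                        num_bootstrap = 100, prop_seeds = 0.8,
                        permute = TRUE, num_permutations = 100,
                        softmax = TRUE)
```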
```r
# to use the random-walk algorithm we first build a transition matrix
transition_matrix <- build_transition_matrix(x = anes2016_glove, threads = 6L)

# define seeds (labeled nodes); if the list is unnamed, "class1", "class2" etc. will be used as labels
seeds <- list("immigration" = c("immigration", "immigrants", "immigrant"),
              "economy" = c("jobs", "unemployment", "wages"))

# propagate labels using rw
rw_labels <- labelProp(x = transition_matrix, seeds = seeds, method = "rw", beta = 0.5)

# propagate labels using nns; notice the main input, x, is the set of vector representations
nns_labels <- labelProp(x = anes2016_glove, seeds = seeds, method = "nns")

# check output for economy
rw_labels[["economy"]] %>% arrange(-score) %>% head()
nns_labels[["economy"]] %>% arrange(-score) %>% head()

# check output for immigration
rw_labels[["immigration"]] %>% arrange(-score) %>% head()
nns_labels[["immigration"]] %>% arrange(-score) %>% head()
```
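Having scored the unlabeled terms, one natural next step --a sketch, assuming each element of the output is a data frame with `node` and `score` columns, as the calls above suggest-- is to vet the top-scoring terms and assemble them into a `quanteda` dictionary for downstream use:

```r
# take the top 20 terms per class (in practice, after human review) and
# build a quanteda dictionary from them
top_terms <- lapply(rw_labels, function(df) df %>% slice_max(score, n = 20) %>% pull(node))
expanded_dict <- quanteda::dictionary(top_terms)
```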
`labelProp` is not limited to words; indeed, it can be applied to any set of inputs for which we have, or can estimate, vector representations. In the two examples that follow we use `conText`, together with a subset of Stanford GloVe (300-dimensional) embeddings and their corresponding transformation matrix --estimated by Khodak et al.--, to embed our short ANES open-ended responses. Since these document embeddings lie on the same space as our set of pre-trained word embeddings, we can use our seed terms to score documents according to their relevance to each class.
```r
# tokenize documents
anes_toks <- unique(anes2016$response) %>% tokens()

# embed documents using conText
doc_vs <- anes_toks %>%
  dfm() %>%
  dem(pre_trained = anes2016_glove, transform = TRUE, transform_matrix = khodakA, verbose = FALSE) %>%
  as.matrix()

# seed word vectors (from pre-trained embeddings)
word_vs <- anes2016_glove[unlist(seeds), ]

# combine word and document vectors into a single matrix of node vectors
node_vs <- rbind(doc_vs, word_vs)

# transition matrix
transition_matrix <- build_transition_matrix(x = node_vs, threads = 6L)

# label propagate
rw_labels <- labelProp(x = transition_matrix, seeds = seeds, method = "rw", beta = 0.5)

# output (filtering to just document nodes)
economy_docs <- rw_labels[["economy"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)
immigration_docs <- rw_labels[["immigration"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)

# combine with texts
tokens_subset(anes_toks, docid(anes_toks) %in% economy_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
tokens_subset(anes_toks, docid(anes_toks) %in% immigration_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
```
We can also use embedded documents --whose labels we know-- as seeds.
```r
# documents as seeds
doc_seeds <- list("immigration" = c("text4771", "text4024"),
                  "economy" = c("text2010", "text5925"))

# transition matrix
transition_matrix <- build_transition_matrix(x = doc_vs, threads = 6L)

# label propagate
rw_labels <- labelProp(x = transition_matrix, seeds = doc_seeds, method = "rw", beta = 0.5)

# output (filtering to just document nodes)
economy_docs <- rw_labels[["economy"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)
immigration_docs <- rw_labels[["immigration"]] %>% filter(grepl("text", node)) %>% arrange(-score) %>% slice(1:10)

# combine with texts
tokens_subset(anes_toks, docid(anes_toks) %in% economy_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
tokens_subset(anes_toks, docid(anes_toks) %in% immigration_docs$node) %>%
  sapply(function(x) paste(x, collapse = " "))
```