bootstrap_nns: Bootstrap nearest neighbors
In conText: 'a la Carte' on Text (ConText) Embedding Regression

Description

Uses bootstrapping (sampling of texts with replacement) to identify the top N nearest neighbors based on cosine or inner product similarity.
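
The resampling idea can be sketched in base R. This is a toy illustration under stated assumptions, not the conText implementation: the embedding matrix `emb`, the tokenized `contexts`, and the helpers `embed_contexts()` and `cosine()` are all hypothetical, and the a la carte transformation is omitted.

```r
# Toy illustration only: NOT the conText implementation.
set.seed(1)

# Hypothetical pre-trained embeddings: 5 features x 3 dimensions
emb <- matrix(rnorm(15), nrow = 5,
              dimnames = list(c("border", "visa", "tax", "reform", "law"), NULL))

# Hypothetical tokenized contexts (one character vector per text)
contexts <- list(c("border", "visa"), c("law", "reform"),
                 c("visa", "law"), c("tax", "border"))

# Average the token embeddings within each text, then across texts
embed_contexts <- function(ctx) {
  per_text <- vapply(ctx, function(tok) colMeans(emb[tok, , drop = FALSE]),
                     numeric(ncol(emb)))
  rowMeans(per_text)
}

# Cosine similarity between a vector v and every row of a matrix m
cosine <- function(v, m) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Resample texts with replacement and recompute similarities each time
num_bootstraps <- 200
boot_sims <- replicate(num_bootstraps, {
  resampled <- contexts[sample(length(contexts), replace = TRUE)]
  cosine(embed_contexts(resampled), emb)
})
rownames(boot_sims) <- rownames(emb)

# Rank candidates by their mean similarity across bootstrap draws
mean_sim <- sort(rowMeans(boot_sims), decreasing = TRUE)
top_n <- head(mean_sim, 2)
```

Each column of `boot_sims` holds the similarities from one resampled set of texts, so the spread across columns reflects how sensitive each neighbor's similarity is to the particular texts sampled.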

Usage

bootstrap_nns(
  context = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  candidates = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  N = 50,
  norm = "l2"
)

Arguments

`context`	(character) vector of texts - `context` variable in get_context output
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
`transform`	(logical) - if TRUE (default), apply the a la carte transformation; if FALSE, output the untransformed averaged embedding.
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`candidates`	(character) vector defining the candidates for nearest neighbors - e.g. output from `get_local_vocab`.
`bootstrap`	(logical) if TRUE, bootstrap similarity values - sample from texts with replacement. Required to get std. errors.
`num_bootstraps`	(numeric) - number of bootstraps to use.
`confidence_level`	(numeric in (0,1)) confidence level e.g. 0.95
`N`	(numeric) number of nearest neighbors to return.
`norm`	(character) - how to compute the similarity (see ?text2vec::sim2): `"l2"` for cosine similarity, `"none"` for inner product.
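
As a rough illustration of what `transform` and `norm` control, the base R sketch below uses small hypothetical values: `avg_embedding` stands in for an averaged context embedding, `A` for a D x D transformation matrix, and `cand` for candidate embeddings. It does not call conText or text2vec and is only meant to show the arithmetic.

```r
# Hypothetical averaged context embedding (D = 2)
avg_embedding <- c(3, 4)

# Hypothetical D x D transformation matrix (stand-in for transform_matrix)
A <- rbind(c(0, 1),
           c(1, 0))

# transform = TRUE: multiply the averaged embedding by the matrix
target <- as.vector(A %*% avg_embedding)   # 4, 3

# Hypothetical candidate embeddings, one row per feature
cand <- rbind(a = c(8, 6),
              b = c(3, -4))

# Analogous to norm = "none": raw inner product
inner <- (cand %*% target)[, 1]            # a = 50, b = 0

# Analogous to norm = "l2": divide by the vector lengths (cosine similarity)
cosine <- inner / (sqrt(rowSums(cand^2)) * sqrt(sum(target^2)))  # a = 1, b = 0
```

The inner product grows with the length of the vectors, whereas cosine similarity depends only on their direction, which is why candidate `a` scores 50 under one measure and 1 under the other.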

Value

a data.frame with the following columns:

feature: (character) vector of feature terms corresponding to the nearest neighbors.
value: (numeric) cosine/inner product similarity between texts and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
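
One way such columns could be derived from bootstrap replicates is sketched below. The replicate matrix `sims` is simulated, and the use of percentile intervals is an assumption made for illustration; the documentation above does not state how conText constructs its confidence intervals.

```r
set.seed(2)

# Hypothetical bootstrap replicates: rows = candidate features, columns = draws
sims <- rbind(visa = rnorm(100, mean = 0.6, sd = 0.05),
              law  = rnorm(100, mean = 0.4, sd = 0.08))

confidence_level <- 0.95
alpha <- 1 - confidence_level

out <- data.frame(
  feature   = rownames(sims),
  value     = rowMeans(sims),                               # mean over draws
  std.error = apply(sims, 1, sd),                           # spread over draws
  lower.ci  = apply(sims, 1, quantile, probs = alpha / 2),  # assumed percentile CI
  upper.ci  = apply(sims, 1, quantile, probs = 1 - alpha / 2),
  row.names = NULL
)

# Order by similarity, highest first
out <- out[order(out$value, decreasing = TRUE), ]
```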

Examples


# load the package (provides the functions and sample data used below)
library(conText)

# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus,
                                   target = 'immigration',
                                   window = 6,
                                   valuetype = "fixed",
                                   case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)

# find local vocab (use it to define the candidate of nearest neighbors)
local_vocab <- get_local_vocab(context_immigration$context, pre_trained = cr_glove_subset)

set.seed(42L)
nns_immigration <- bootstrap_nns(context = context_immigration$context,
                                 pre_trained = cr_glove_subset,
                                 transform_matrix = cr_transform,
                                 transform = TRUE,
                                 candidates = local_vocab,
                                 bootstrap = TRUE,
                                 num_bootstraps = 100,
                                 confidence_level = 0.95,
                                 N = 50,
                                 norm = "l2")

head(nns_immigration)

conText documentation built on Feb. 16, 2023, 7:32 p.m.