get_nns_ratio: Given a corpus and a binary grouping variable, computes the...
In conText: 'a la Carte' on Text (ConText) Embedding Regression

get_nns_ratio

R Documentation

Given a corpus and a binary grouping variable, computes the ratio of cosine similarities over the union of their respective N nearest neighbors.

Description

This is a wrapper function for nns_ratio() that allows users to go from a tokenized corpus to results, with the option to (1) bootstrap cosine similarity ratios and get the corresponding standard errors, and (2) use a permutation test to get empirical p-values for inference.
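
To make the quantity concrete, the sketch below computes a cosine similarity ratio for a single candidate feature in base R. The helper `cos_sim` and the vectors `emb_R`, `emb_D`, and `feat` are hypothetical toy values, not package objects, and the package's internal computation may differ in detail.

```r
# Hypothetical helper: cosine similarity between two numeric vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 3-dimensional embeddings (made-up values, for illustration only)
emb_R <- c(1, 0, 0)   # averaged context embedding, numerator group
emb_D <- c(0, 1, 0)   # averaged context embedding, denominator group
feat  <- c(2, 1, 0)   # embedding of one candidate feature

# A ratio above 1 means the feature is closer to the numerator group
ratio <- cos_sim(emb_R, feat) / cos_sim(emb_D, feat)
```

Under this reading, a ratio near 1 indicates a feature that is about equally similar to both groups' embeddings.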

Usage

get_nns_ratio(
  x,
  N = 10,
  groups,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)

Arguments

`x`	a (quanteda) tokens object
`N`	(numeric) number of nearest neighbors to return. Nearest neighbors consist of the union of the top N nearest neighbors of the embeddings in `x`. If these overlap, the resulting number of neighbors will be smaller than 2*N.
`groups`	a character or factor variable equal in length to the number of documents
`numerator`	(character) defines which group is the numerator in the ratio.
`candidates`	(character) vector of features to consider as candidates to be nearest neighbor. You may, for example, want to only consider features that meet a certain count threshold, or exclude stop words. To do so, identify the set of features you want to consider and supply these as a character vector in the `candidates` argument.
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
`transform`	(logical) if TRUE (default) apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`bootstrap`	(logical) if TRUE, use bootstrapping: sample from texts with replacement and re-estimate cosine similarity ratios for each sample. Required to get standard errors. If `groups` is defined, sampling is automatically stratified.
`num_bootstraps`	(integer) number of bootstraps to use.
`confidence_level`	(numeric in (0,1)) confidence level e.g. 0.95
`permute`	(logical) if TRUE, compute empirical p-values using permutation test
`num_permutations`	(numeric) number of permutations to use.
`stem`	(logical) whether to stem candidates when evaluating nearest neighbors. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend removing misspelled words from the candidate set `candidates`, as these can significantly influence the average.
`language`	the name of a recognized language, as returned by `getStemLanguages`, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).
`verbose`	(logical) if TRUE, provide information on which group is the numerator.
`show_language`	(logical) if TRUE print out message with language used for stemming.
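
The interaction between `N`, `candidates`, and the union of neighbors can be sketched in base R. All values and names below (`sims_R`, `sims_D`, the feature names) are made up for illustration; this is not the package's code, only a minimal picture of why the returned set can hold fewer than 2*N features.

```r
# Toy cosine similarities of five features to each group's embedding
# (hypothetical values)
sims_R <- c(border = 0.9, illegal = 0.8, reform = 0.4, visa = 0.3, law = 0.2)
sims_D <- c(border = 0.5, illegal = 0.2, reform = 0.9, visa = 0.8, law = 0.1)

# Restrict to a candidate set, analogous to the `candidates` argument
candidates <- c("border", "illegal", "reform", "visa")
sims_R <- sims_R[names(sims_R) %in% candidates]
sims_D <- sims_D[names(sims_D) %in% candidates]

# Top N nearest neighbors for each group
N <- 3
top_R <- names(sort(sims_R, decreasing = TRUE))[seq_len(N)]
top_D <- names(sort(sims_D, decreasing = TRUE))[seq_len(N)]

# Union of both top-N sets; overlap makes it shorter than 2 * N
nns_union <- union(top_R, top_D)

# Label each feature by the group(s) in whose top N it appears
in_R <- nns_union %in% top_R
in_D <- nns_union %in% top_D
group <- ifelse(in_R & in_D, "shared", ifelse(in_R, "R", "D"))
```

Here two features appear in both top-3 lists, so the union holds four features rather than six.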

Value

a data.frame with the following columns:

feature: (character) features in candidates (or all features if candidates not defined), one instance for each embedding in x.
value: (numeric) cosine similarity ratio between x and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value: (numeric) empirical p-value of the ratio of cosine similarities, computed if permute = TRUE. Column is dropped if permute = FALSE.
group: (character) group in groups for which the feature belongs to the top N nearest neighbors. If "shared", the feature appeared among the top nearest neighbors for both groups.
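
The std.error, lower.ci, upper.ci, and p.value columns come from resampling. The base R sketch below shows the general idea of a stratified bootstrap and a permutation test on a simplified stand-in statistic (a ratio of group means of made-up per-document scores). The names `score`, `ratio_stat`, the percentile interval, and the two-sided comparison against 1 are all assumptions for illustration; the package's exact procedure may differ.

```r
set.seed(1)

# Toy data: one made-up score per document and a binary group label.
# The statistic is a simplified stand-in for the cosine similarity ratio.
score  <- c(rnorm(30, mean = 1.2, sd = 0.2), rnorm(30, mean = 1.0, sd = 0.2))
groups <- rep(c("R", "D"), each = 30)
ratio_stat <- function(s, g) mean(s[g == "R"]) / mean(s[g == "D"])
observed <- ratio_stat(score, groups)

# Stratified bootstrap: resample documents with replacement within each group
boot <- replicate(100, {
  idx <- unlist(lapply(split(seq_along(score), groups),
                       function(i) sample(i, replace = TRUE)))
  ratio_stat(score[idx], groups[idx])
})
std_error <- sd(boot)
ci <- quantile(boot, probs = c(0.025, 0.975))  # 95% percentile interval

# Permutation test: shuffle group labels to break any group difference,
# then count how often a permuted ratio is as far from 1 as the observed one
perm <- replicate(100, ratio_stat(score, sample(groups)))
p_value <- mean(abs(perm - 1) >= abs(observed - 1))
```

With only 100 permutations, the smallest nonzero p-value this scheme can report is 0.01, which is one reason to raise `num_permutations` when finer resolution matters.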

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 50 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 50, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compute ratio
set.seed(2021L)
immig_nns_ratio <- get_nns_ratio(x = immig_toks,
                                 N = 10,
                                 groups = docvars(immig_toks, 'party'),
                                 numerator = "R",
                                 candidates = feats,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE,
                                 transform_matrix = cr_transform,
                                 bootstrap = FALSE,
                                 # if bootstrap = TRUE, num_bootstraps should be at least 100,
                                 permute = FALSE,
                                 num_permutations = 5,
                                 verbose = FALSE)

head(immig_nns_ratio)

conText documentation built on April 12, 2026, 9:06 a.m.