contrast_nns: Contrast nearest neighbors

contrast_nns    R Documentation

Contrast nearest neighbors

Description

Computes the ratio of cosine similarities between group embeddings and features. That is, for any given feature, it first computes the cosine similarity between that feature and each of the two group embeddings, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group.
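The ratio described above can be illustrated with a minimal base R sketch. The embeddings are made-up toy vectors, `cos_sim` is a hypothetical helper (not a conText function), and the choice of which group goes in the numerator is an assumption made for illustration only.

```r
# Hypothetical helper: cosine similarity between two numeric vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 3-dimensional embeddings (invented numbers, for illustration only)
group1_embedding  <- c(1, 0, 0)
group2_embedding  <- c(0, 1, 0)
feature_embedding <- c(3, 4, 0)

# Similarity of the feature to each group embedding
sim1 <- cos_sim(feature_embedding, group1_embedding)
sim2 <- cos_sim(feature_embedding, group2_embedding)

# Ratio of the two similarities (group 1 in the numerator is an assumption)
ratio <- sim1 / sim2
ratio
```

Under this convention, a ratio near 1 would mean the feature is about equally close to both group embeddings, while values further from 1 would mean it is closer to one group than the other.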

Usage

contrast_nns(
  x,
  groups = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  candidates = NULL,
  N = 20,
  verbose = TRUE
)

Arguments

x

(quanteda) tokens-class object

groups

(numeric, factor, character) a binary variable of the same length as x

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping: sample from texts with replacement and re-estimate cosine ratios for each sample. Required to get standard errors.

num_bootstraps

(numeric) number of bootstraps to use

confidence_level

(numeric in (0,1)) confidence level, e.g. 0.95

permute

(logical) if TRUE, compute empirical p-values using a permutation test

num_permutations

(numeric) number of permutations to use

candidates

(character) vector of candidate features for nearest neighbors

N

(numeric) nearest neighbors are subset to the union of the N nearest neighbors of each group (if NULL, the ratio is computed for all features)

verbose

(logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
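To make the bootstrap and permute arguments more concrete, here is a rough base R sketch of the general techniques on toy data. It is not the package's implementation. The helpers `cos_sim` and `cos_ratio`, the simulated embeddings, the within-group resampling, the percentile interval, and the definition of "extreme" as distance from 1 are all assumptions made for illustration.

```r
set.seed(1)

# Toy data: one 2-d "document embedding" per row and a binary group label
n <- 40
groups <- rep(c("A", "B"), each = n / 2)
doc_emb <- rbind(
  matrix(rnorm(n, mean = rep(c(2, 1), each = n / 2), sd = 0.5), ncol = 2),
  matrix(rnorm(n, mean = rep(c(1, 2), each = n / 2), sd = 0.5), ncol = 2)
)
feature <- c(1, 1)  # toy feature embedding

# Hypothetical helpers (not conText functions)
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cos_ratio <- function(emb, grp) {
  g1 <- colMeans(emb[grp == "A", , drop = FALSE])
  g2 <- colMeans(emb[grp == "B", , drop = FALSE])
  cos_sim(feature, g1) / cos_sim(feature, g2)
}

observed <- cos_ratio(doc_emb, groups)

# Bootstrap: resample documents with replacement (within group, an assumption)
# and re-estimate the ratio for each resample
boot_stats <- replicate(100, {
  idx <- unlist(lapply(split(seq_len(n), groups),
                       function(i) sample(i, replace = TRUE)))
  cos_ratio(doc_emb[idx, , drop = FALSE], groups[idx])
})
std_error <- sd(boot_stats)
ci <- quantile(boot_stats, probs = c(0.025, 0.975))  # percentile interval

# Permutation test: shuffle the group labels and recompute the ratio;
# the empirical p-value is the share of permuted ratios at least as far from 1
perm_stats <- replicate(100, cos_ratio(doc_emb, sample(groups)))
p_value <- mean(abs(perm_stats - 1) >= abs(observed - 1))
```

With num_bootstraps and num_permutations both set to 100 as in the defaults, a sketch like this recomputes the statistic 200 times, which is why smaller values are convenient in quick examples.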

Value

a data.frame with the following columns:

feature

(character) vector of feature terms corresponding to the nearest neighbors.

value

(numeric) ratio of cosine similarities. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the ratio of cosine similarities. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

p.value

(numeric) empirical p-value. Column is dropped if permute = FALSE.
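As a sketch of how a result with these columns might be inspected, the snippet below builds a mock data.frame (all feature names and numbers are invented, not output from the package) and keeps the features whose confidence interval excludes 1 and whose empirical p-value is below 0.05. The 0.05 threshold and the filtering rule are illustrative assumptions.

```r
# Mock result with the documented columns (values invented for illustration)
nns <- data.frame(
  feature   = c("alpha", "beta", "gamma", "delta"),
  value     = c(1.30, 0.60, 1.02, 1.15),
  std.error = c(0.05, 0.05, 0.04, 0.10),
  lower.ci  = c(1.20, 0.50, 0.94, 0.95),
  upper.ci  = c(1.40, 0.70, 1.10, 1.35),
  p.value   = c(0.01, 0.02, 0.60, 0.20),
  stringsAsFactors = FALSE
)

# Keep features whose interval excludes 1 and whose p-value is below 0.05
keep <- (nns$lower.ci > 1 | nns$upper.ci < 1) & nns$p.value < 0.05
discriminant <- nns[keep, ]

# Order by how far the ratio is from 1, largest distance first
discriminant <- discriminant[order(abs(discriminant$value - 1),
                                   decreasing = TRUE), ]
discriminant$feature
```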

Examples


library(quanteda)
library(conText)  # provides tokens_context(), contrast_nns() and the cr_* example data

# tokenize the example corpus
cr_toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts around the target term
immig_toks <- tokens_context(x = cr_toks,
                             pattern = "immigration", window = 6L,
                             hard_cut = FALSE, verbose = TRUE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compute the ratio of cosine similarities for the nearest neighbors of each party
set.seed(42L)
party_nns <- contrast_nns(x = immig_toks,
                          groups = docvars(immig_toks, 'party'),
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          bootstrap = TRUE,
                          num_bootstraps = 100,
                          confidence_level = 0.95,
                          permute = TRUE, num_permutations = 10,
                          candidates = NULL, N = 20,
                          verbose = FALSE)

head(party_nns)

conText documentation built on Feb. 16, 2023, 7:32 p.m.