get_ncs: Given a set of tokenized contexts, find the top N nearest...
In conText: 'a la Carte' on Text (ConText) Embedding Regression

get_ncs

R Documentation

Given a set of tokenized contexts, find the top N nearest contexts.

Description

This is a wrapper function for ncs() that takes users from a tokenized corpus to results, with the option to bootstrap cosine similarities and obtain the corresponding standard errors.
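The idea of ranking contexts by cosine similarity and bootstrapping the similarities can be sketched in base R. This is a toy illustration under stated assumptions, not the package's implementation: random vectors stand in for context embeddings, the target embedding is taken to be the plain average of the contexts, and percentile confidence intervals are assumed. The names `ctx`, `cos_sim` and `result` are hypothetical.

```r
# Toy data: 20 random "context embeddings" of dimension 5 (assumption, not real embeddings)
set.seed(1)
n_contexts <- 20
D <- 5
ctx <- matrix(rnorm(n_contexts * D), nrow = n_contexts,
              dimnames = list(paste0("context", seq_len(n_contexts)), NULL))

# cosine similarity between each row of `m` and the vector `v`
cos_sim <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# point estimate: similarity of each context to the average embedding
target <- colMeans(ctx)
sims <- cos_sim(ctx, target)

# bootstrap: resample contexts with replacement, re-estimate the target
# embedding, and recompute the similarity of every context to it
num_bootstraps <- 100
boot <- replicate(num_bootstraps, {
  idx <- sample(n_contexts, replace = TRUE)
  cos_sim(ctx, colMeans(ctx[idx, , drop = FALSE]))
})

# one row per context; percentile intervals at the 95% level (assumed)
result <- data.frame(
  context   = rownames(ctx),
  value     = sims,
  std.error = apply(boot, 1, sd),
  lower.ci  = apply(boot, 1, quantile, probs = 0.025),
  upper.ci  = apply(boot, 1, quantile, probs = 0.975)
)

# rank contexts from most to least similar and keep the top N = 5
result <- result[order(-result$value), ]
result$rank <- seq_len(nrow(result))
head(result, 5)
```

With `bootstrap = FALSE` only the point estimates would be available, which is why the standard error and interval columns are tied to bootstrapping.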

Usage

get_ncs(
  x,
  N = 5,
  groups = NULL,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  as_list = TRUE
)

Arguments

`x`	a (quanteda) `tokens-class` object
`N`	(numeric) number of nearest contexts to return
`groups`	a character or factor variable equal in length to the number of documents
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
`transform`	(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`bootstrap`	(logical) if TRUE, use bootstrapping: sample from `x` with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If `groups` is defined, sampling is automatically stratified.
`num_bootstraps`	(integer) number of bootstraps to use.
`confidence_level`	(numeric in (0,1)) confidence level e.g. 0.95
`as_list`	(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned, with one data.frame per embedding.
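How `pre_trained`, `transform` and `transform_matrix` fit together can be sketched in base R. This is a minimal sketch, assuming that a context is embedded by averaging the pre-trained vectors of its tokens, that tokens without a pre-trained embedding are skipped, and that the transformation is a matrix product with the averaged vector. All values below are hypothetical toy inputs, not data shipped with the package.

```r
# Toy F x D matrix of pre-trained embeddings (F = 4 features, D = 3 dimensions)
D <- 3
pre_trained <- matrix(c(1, 0, 0,
                        0, 1, 0,
                        0, 0, 1,
                        1, 1, 0),
                      ncol = D, byrow = TRUE,
                      dimnames = list(c("border", "policy", "reform", "law"), NULL))

# Hypothetical D x D transformation matrix (a real one is estimated from a corpus)
transform_matrix <- diag(2, D)

# One tokenized context; "unknownword" has no pre-trained embedding
context <- c("border", "policy", "unknownword", "law")

# keep only tokens that appear in rownames(pre_trained) (assumed behaviour)
known <- context[context %in% rownames(pre_trained)]

# transform = FALSE: untransformed averaged embedding
avg <- colMeans(pre_trained[known, , drop = FALSE])

# transform = TRUE: apply the D x D matrix to the averaged embedding
alc <- as.vector(transform_matrix %*% avg)

avg
alc
```

The dimensions must agree: the number of columns of `pre_trained` and both dimensions of `transform_matrix` are all D.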

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target: (character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
context: (character) contexts collapsed into single documents (i.e. untokenized).
rank: (character) rank of context in terms of similarity with x.
value: (numeric) cosine similarity between x and context.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration",
                             window = 6L, rm_keyword = FALSE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compare nearest contexts between groups
set.seed(2021L)
immig_party_ncs <- get_ncs(x = immig_toks,
                           N = 10,
                           groups = docvars(immig_toks, 'party'),
                           pre_trained = cr_glove_subset,
                           transform = TRUE,
                           transform_matrix = cr_transform,
                           bootstrap = TRUE,
                           num_bootstraps = 100,
                           confidence_level = 0.95,
                           as_list = TRUE)

# nearest contexts of "immigration" for the Democratic party (group "D")
immig_party_ncs[["D"]]

conText documentation built on Feb. 16, 2023, 7:32 p.m.