get_ncs | R Documentation |
This is a wrapper function for ncs() that allows users to go from a tokenized corpus to results, with the option to bootstrap cosine similarities and get the corresponding standard errors.
get_ncs(
  x,
  N = 5,
  groups = NULL,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  as_list = TRUE
)
x |
a (quanteda) tokens object |
N |
(numeric) number of nearest contexts to return |
groups |
a character or factor variable equal in length to the number of documents |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
bootstrap |
(logical) if TRUE, use bootstrapping: sample from x with replacement. Required to obtain std. errors and confidence intervals. |
num_bootstraps |
(integer) number of bootstraps to use. |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
as_list |
(logical) if TRUE, a list of data.frames is returned, with one data.frame per embedding. If FALSE, all results are combined into a single data.frame. |
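To illustrate what the pre_trained, transform, and transform_matrix arguments describe, the base R sketch below averages the pre-trained embeddings of one context's tokens and optionally applies a D x D matrix to the average. Everything in it (toy_pre_trained, toy_transform, alc_embed, and all numeric values) is hypothetical and made up for illustration; it is not part of the package, and the package's internal implementation may differ.

```r
# Toy F x D matrix of "pre-trained" embeddings (F = 4 features, D = 3 dimensions).
# Values are invented for illustration only.
toy_pre_trained <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("border", "reform", "law", "policy"), NULL)
)

# Toy D x D transformation matrix (hypothetical: 2 * identity).
toy_transform <- diag(2, 3)

# Hypothetical helper: average the embeddings of the context tokens that have
# a pre-trained embedding, then optionally apply the transformation matrix.
alc_embed <- function(context_tokens, pre_trained,
                      transform = TRUE, transform_matrix = NULL) {
  known <- context_tokens[context_tokens %in% rownames(pre_trained)]
  avg <- colMeans(pre_trained[known, , drop = FALSE])
  if (transform) as.vector(transform_matrix %*% avg) else as.vector(avg)
}

context <- c("border", "policy", "unknownword")

# Untransformed averaged embedding (transform = FALSE).
emb_raw <- alc_embed(context, toy_pre_trained, transform = FALSE)

# Transformed embedding (transform = TRUE).
emb_alc <- alc_embed(context, toy_pre_trained,
                     transform = TRUE, transform_matrix = toy_transform)
```

Tokens without a row in the embedding matrix (here "unknownword") are simply skipped in this sketch, which mirrors the note that rownames(pre_trained) defines the set of features with a pre-trained embedding.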
a data.frame or list of data.frames (one for each target) with the following columns:
target
(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
context
(character) contexts collapsed into single documents (i.e. untokenized).
rank
(character) rank of context in terms of similarity with x.
value
(numeric) cosine similarity between x and context.
std.error
(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
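As a rough illustration of how the value, std.error, lower.ci, and upper.ci columns relate to bootstrapping, the base R sketch below resamples instance-level embeddings with replacement, recomputes the averaged embedding each time, and summarizes the resulting cosine similarities. The data, the helper cosine_sim, the use of the bootstrap mean as value, and the percentile interval are all assumptions made for this example; the package may compute these quantities differently.

```r
# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

set.seed(42L)

# Made-up data: 50 instance-level embeddings (rows) in 5 dimensions.
instance_embeddings <- matrix(rnorm(50 * 5, mean = 1), nrow = 50)

# Made-up embedding of a single candidate context.
context_embedding <- rep(1, 5)

num_bootstraps <- 100
confidence_level <- 0.95

# For each bootstrap replicate: resample rows with replacement, average them,
# and compute the cosine similarity with the context embedding.
boot_values <- vapply(seq_len(num_bootstraps), function(i) {
  idx <- sample(nrow(instance_embeddings), replace = TRUE)
  cosine_sim(colMeans(instance_embeddings[idx, , drop = FALSE]), context_embedding)
}, numeric(1))

# Summarize: mean, standard deviation across replicates, percentile interval.
alpha <- 1 - confidence_level
result <- data.frame(
  value = mean(boot_values),
  std.error = sd(boot_values),
  lower.ci = unname(quantile(boot_values, alpha / 2)),
  upper.ci = unname(quantile(boot_values, 1 - alpha / 2))
)
result
```

With bootstrap = FALSE there are no replicates to summarize, which is consistent with the std.error and confidence interval columns only being available when bootstrap = TRUE.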
library(quanteda)
library(conText) # provides get_ncs(), tokens_context() and the cr_* example data

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L, rm_keyword = FALSE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compare nearest contexts between groups
set.seed(2021L)
immig_party_ncs <- get_ncs(x = immig_toks,
                           N = 10,
                           groups = docvars(immig_toks, 'party'),
                           pre_trained = cr_glove_subset,
                           transform = TRUE,
                           transform_matrix = cr_transform,
                           bootstrap = TRUE,
                           num_bootstraps = 100,
                           confidence_level = 0.95,
                           as_list = TRUE)

# nearest contexts of "immigration" for the Democratic party (group "D")
immig_party_ncs[["D"]]