nns_ratio | R Documentation |
Computes the ratio of cosine similarities between group embeddings and features. That is, for any given feature, it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger than 1 mean the feature is more discriminant of the group in the numerator; values smaller than 1 mean it is more discriminant of the group in the denominator.
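The ratio described above can be illustrated with a minimal base R sketch. Everything here is made up for illustration: the helper cos_sim, the two toy group embeddings and the three feature vectors are assumptions, not objects from the package, and the sketch does not reproduce how nns_ratio selects nearest neighbors internally.

```r
# Cosine similarity between two numeric vectors (hypothetical helper).
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 2-dimensional embeddings (made-up numbers, not real data).
group_R <- c(1, 0)  # hypothetical embedding for group "R"
group_D <- c(0, 1)  # hypothetical embedding for group "D"
features <- rbind(
  enforcement = c(0.8, 0.6),
  reform      = c(0.6, 0.8),
  border      = c(1.0, 1.0)
)

# Similarity of every feature (row) to each group embedding.
sim_R <- apply(features, 1, cos_sim, b = group_R)
sim_D <- apply(features, 1, cos_sim, b = group_D)

# Ratio with "R" in the numerator: values above 1 lean towards "R",
# values below 1 lean towards "D", and values near 1 are shared.
ratio <- sim_R / sim_D
print(round(ratio, 3))
```

In this toy setup, "enforcement" sits closer to the numerator group, "reform" closer to the denominator group, and "border" is equally similar to both.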
nns_ratio(
  x,
  N = 10,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)
x |
a (quanteda) |
N |
(numeric) number of nearest neighbors to return. Nearest neighbors
consist of the union of the top N nearest neighbors of the embeddings in |
numerator |
(character) defines which group is the numerator in the ratio |
candidates |
(character) vector of features to consider as candidates to be nearest neighbors.
You may, for example, want to only consider features that meet a certain count threshold
or exclude stop words etc. To do so you can simply identify the set of features you
want to consider and supply these as a character vector in the |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
stem |
(logical) - whether to stem candidates when evaluating nns. Default is FALSE.
If TRUE, candidate stems are ranked by their average cosine similarity to the target.
We recommend you remove misspelled words from candidate set |
language |
the name of a recognized language, as returned by
|
verbose |
(logical) if TRUE, report which group is the numerator and which group is the denominator. |
show_language |
(logical) if TRUE print out message with language used for stemming. |
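As a rough illustration of the candidates argument described above, the following base R sketch builds a candidate set by applying a count threshold and dropping stop words. The feature counts, the stop-word list and the threshold are all hypothetical values chosen for the example; in practice they would come from your own corpus.

```r
# Hypothetical feature counts (made-up numbers, for illustration only).
feature_counts <- c(the = 120, immigration = 45, reform = 12,
                    of = 98, visa = 3, border = 30)

# A tiny illustrative stop-word list; a real one would be much longer.
stop_words <- c("the", "of", "and", "a")

# Assumed minimum count a feature must reach to be considered.
min_count <- 5

# Keep features that meet the threshold and are not stop words.
keep <- feature_counts >= min_count & !(names(feature_counts) %in% stop_words)
candidates <- names(feature_counts)[keep]
print(candidates)
```

The resulting character vector could then be supplied as candidates = candidates in the call to nns_ratio.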
a data.frame with the following columns:
feature
(character) features in candidates (or all features if candidates is not defined), one instance for each embedding in x.
value
(numeric) ratio of cosine similarities.
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# compute the cosine similarity between each party's
# embedding and a specific set of features
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, verbose = FALSE)

# with stemming
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, stem = TRUE, verbose = FALSE)