In prodriguezsosa/conText: 'a la Carte' on Text (ConText) Embedding Regression

get_grouped_similarity

R Documentation

Get averaged similarity scores between target word(s) and one or two vectors of candidate words.

Description

Get similarity scores between one or more target words and a comparison vector of one or more candidate words. When two vectors of candidate words are provided (second_vec is not NULL), the function instead computes a composite index of the two vectors. This is operationalized as the mean similarity of the target word to the first vector of terms plus negative one multiplied by the mean similarity to the second vector of terms, i.e. the first mean minus the second.
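The composite index described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy embedding matrix `emb`, its values, and the helpers `cosine_sim` and `mean_sim` are hypothetical and exist only to show the arithmetic. In the real function the target vector would be an embedding estimated per group rather than a fixed row.

```r
# Toy embeddings (made-up values): rows are words, columns are dimensions.
emb <- rbind(
  target      = c(1, 0),
  left        = c(1, 0),
  lefty       = c(1, 1),
  right       = c(0, 1),
  rightwinger = c(-1, 0)
)

# Cosine similarity between two numeric vectors (hypothetical helper).
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Mean cosine similarity between a vector and a set of candidate words
# (hypothetical helper).
mean_sim <- function(vec, candidates, emb) {
  mean(vapply(candidates, function(w) cosine_sim(vec, emb[w, ]), numeric(1)))
}

first_mean  <- mean_sim(emb["target", ], c("left", "lefty"), emb)
second_mean <- mean_sim(emb["target", ], c("right", "rightwinger"), emb)

# Composite index: mean similarity to the first vector of terms plus
# negative one times the mean similarity to the second vector of terms.
composite <- first_mean + (-1) * second_mean
```

A positive `composite` means the target sits closer, on average, to the first set of terms than to the second; a negative value means the reverse.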

Usage

get_grouped_similarity(
  x,
  target,
  first_vec,
  second_vec,
  pre_trained,
  transform_matrix,
  group_var,
  window = window,
  norm = "l2",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_separators = FALSE,
  valuetype = "fixed",
  hard_cut = FALSE,
  case_insensitive = TRUE
)

Arguments

`x`	a (quanteda) `corpus` object
`target`	(character) vector of words
`first_vec`	(character) vector of words
`second_vec`	(character) vector of words
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for `x`. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`group_var`	(character) variable name in corpus object defining grouping variable
`window`	(numeric) - defines the size of a context (words around the target)
`norm`	(character) - "l2" for l2 normalized cosine similarity and "none" for dot product
`remove_punct`	(logical) - if `TRUE` remove all characters in the Unicode "Punctuation" `⁠[P]⁠` class
`remove_symbols`	(logical) - if `TRUE` remove all characters in the Unicode "Symbol" `⁠[S]⁠` class
`remove_numbers`	(logical) - if `TRUE` remove tokens that consist only of numbers, but not words that start with digits, e.g. `⁠2day⁠`
`remove_separators`	(logical) - if `TRUE` remove separators and separator characters (Unicode "Separator" `⁠[Z]⁠` and "Control" `⁠[C]⁠` categories)
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching
`hard_cut`	(logical) - if TRUE then a context must have `window` x 2 tokens, if FALSE it can have `window` x 2 or fewer (e.g. if a doc begins with a target word, then context will have `window` tokens rather than `window` x 2)
`case_insensitive`	(logical) - if `TRUE`, ignore case when matching a target pattern
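To make the interaction between `window` and `hard_cut` concrete, here is a small base R sketch that follows the argument descriptions above. The helper `get_context` and the token vector are hypothetical; the package does its own context extraction, so treat this only as an illustration of the rule, assuming exact ("fixed") matching on a single tokenized document.

```r
# Hypothetical helper: collect up to `window` tokens on each side of every
# occurrence of `target`. With hard_cut = TRUE, keep only contexts that have
# the full 2 * window tokens.
get_context <- function(tokens, target, window, hard_cut = FALSE) {
  hits <- which(tokens == target)
  contexts <- lapply(hits, function(i) {
    left  <- tail(tokens[seq_len(i - 1)], window)  # tokens before the target
    right <- head(tokens[-seq_len(i)], window)     # tokens after the target
    c(left, right)
  })
  if (hard_cut) {
    contexts <- Filter(function(ctx) length(ctx) == 2 * window, contexts)
  }
  contexts
}

# Made-up document in which the target is also the first token.
tokens <- c("immigration", "reform", "is", "a", "priority", "and",
            "immigration", "policy", "matters", "today")

soft <- get_context(tokens, "immigration", window = 2, hard_cut = FALSE)
hard <- get_context(tokens, "immigration", window = 2, hard_cut = TRUE)
```

In this sketch the first occurrence has no tokens to its left, so its context holds only `window` tokens; it is kept when `hard_cut = FALSE` and dropped when `hard_cut = TRUE`.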

Value

a data.frame with the following columns:

group: the grouping variable specified for the analysis
val: (numeric) cosine similarity scores

Examples

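library(conText)

# add a grouping variable to the sample corpus: one value per document,
# 50 documents for each year
quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)

# similarity of "immigration" to the two sets of candidate words, by year
cos_simsdf <- get_grouped_similarity(cr_sample_corpus,
                                    group_var = "year",
                                    target = "immigration",
                                    first_vec = c("left", "lefty"),
                                    second_vec = c("right", "rightwinger"),
                                    pre_trained = cr_glove_subset,
                                    transform_matrix = cr_transform,
                                    window = 12L,
                                    norm = "l2")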

prodriguezsosa/conText documentation built on April 23, 2024, 7:04 p.m.