get_grouped_similarity: Get averaged similarity scores between target word(s) and one...

View source: R/get_grouped_similarity.R

get_grouped_similarity (R Documentation)

Get averaged similarity scores between target word(s) and one or two vectors of candidate words.

Description

Get similarity scores between one or more target words and a comparison vector of one or more candidate words. When two vectors of candidate words are provided (second_vec is not NULL), the function computes the similarity between the target and a composite index of the two vectors. This is operationalized as the mean similarity of the target word to the first vector of terms plus negative one multiplied by the mean similarity to the second vector of terms.
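The composite index described above can be sketched in base R. This is only an illustration of the arithmetic, not the package's internal code: the helper cosine_sim and the three-dimensional toy vectors are made up for the example, and any additional scaling the package may apply is not stated on this page.

```r
# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 3-dimensional embeddings (made-up numbers, for illustration only).
target_vec  <- c(1, 0, 0)
first_vecs  <- rbind(left = c(1, 0, 0), lefty = c(0, 1, 0))
second_vecs <- rbind(right = c(-1, 0, 0), rightwinger = c(0, 0, 1))

# Mean similarity of the target to each set of candidate words.
mean_first  <- mean(apply(first_vecs, 1, cosine_sim, b = target_vec))
mean_second <- mean(apply(second_vecs, 1, cosine_sim, b = target_vec))

# Composite index: first mean plus negative one times the second mean.
composite <- mean_first + (-1) * mean_second
```

A positive composite value indicates that the target sits closer to the first set of terms than to the second; a negative value indicates the reverse.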

Usage

get_grouped_similarity(
  x,
  target,
  first_vec,
  second_vec,
  pre_trained,
  transform_matrix,
  group_var,
  window = window,
  norm = "l2",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_separators = FALSE,
  valuetype = "fixed",
  hard_cut = FALSE,
  case_insensitive = TRUE
)

Arguments

x

a (quanteda) corpus object

target

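(character) vector of target words

first_vec

(character) vector of candidate words forming the first comparison set

second_vec

(character) vector of candidate words forming the second comparison set; if NULL, similarity is computed against first_vec only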

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

group_var

(character) variable name in corpus object defining grouping variable

window

(numeric) - defines the size of a context (words around the target)

norm

(character) - "l2" for l2 normalized cosine similarity and "none" for dot product

remove_punct

(logical) - if TRUE remove all characters in the Unicode "Punctuation" [P] class

remove_symbols

(logical) - if TRUE remove all characters in the Unicode "Symbol" [S] class

remove_numbers

(logical) - if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day

remove_separators

(logical) - if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories)

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

case_insensitive

(logical) - if TRUE, ignore case when matching a target pattern
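Two of the arguments above, norm and hard_cut, can be illustrated with a small base R sketch. The helper context_windows is hypothetical and written only for this example; it does not reproduce the package's tokenization or context extraction, and simply assumes that a context is the window tokens on each side of a target occurrence.

```r
# norm: "l2" corresponds to cosine similarity, "none" to the raw dot product.
a <- c(3, 4)
b <- c(4, 3)
dot_product <- sum(a * b)
cosine <- dot_product / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# hard_cut: hypothetical helper collecting up to `window` tokens on each
# side of every occurrence of `target` in a character vector of tokens.
context_windows <- function(tokens, target, window, hard_cut = FALSE) {
  hits <- which(tokens == target)
  contexts <- lapply(hits, function(i) {
    left  <- tokens[seq_len(i - 1)]
    right <- tokens[seq_len(length(tokens) - i) + i]
    c(tail(left, window), head(right, window))
  })
  if (hard_cut) {
    # keep only contexts that have the full window x 2 tokens
    contexts <- contexts[lengths(contexts) == 2 * window]
  }
  contexts
}

tokens <- c("immigration", "reform", "was", "debated",
            "and", "immigration", "policy", "changed")
soft <- context_windows(tokens, "immigration", window = 2)
hard <- context_windows(tokens, "immigration", window = 2, hard_cut = TRUE)
```

In this toy document the first occurrence of the target opens the text, so its context has only window tokens. With hard_cut = FALSE both contexts are kept; with hard_cut = TRUE only the second, full-length context remains.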

Value

a data.frame with the following columns:

group

the grouping variable specified for the analysis

val

(numeric) cosine similarity scores

Examples

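library(conText)

# add a 'year' document variable to group by (50 consecutive documents per year)
quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)

# similarity of "immigration" to the first set of terms relative to the second, by year
cos_simsdf <- get_grouped_similarity(cr_sample_corpus,
                                     group_var = "year",
                                     target = "immigration",
                                     first_vec = c("left", "lefty"),
                                     second_vec = c("right", "rightwinger"),
                                     pre_trained = cr_glove_subset,
                                     transform_matrix = cr_transform,
                                     window = 12L,
                                     norm = "l2")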

prodriguezsosa/conText documentation built on April 23, 2024, 7:04 p.m.