get_seq_cos_sim: Calculate cosine similarities between target word and...
In conText: 'a la Carte' on Text (ConText) Embedding Regression

get_seq_cos_sim

R Documentation

Calculate cosine similarities between target word and candidates words over sequenced variable using ALC embedding approach

Description

Calculate cosine similarities between target word and candidates words over sequenced variable using ALC embedding approach

Usage

get_seq_cos_sim(
  x,
  seqvar,
  target,
  candidates,
  pre_trained,
  transform_matrix,
  window = 6,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)

Arguments

`x`	(character) vector - this is the set of documents (corpus) of interest
`seqvar`	ordered variable such as list of dates or ordered iseology scores
`target`	(character) vector - target word
`candidates`	(character) vector of features of interest
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`window`	(numeric) - defines the size of a context (words around the target).
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
`case_insensitive`	logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`hard_cut`	(logical) - if TRUE then a context must have `window` x 2 tokens, if FALSE it can have `window` x 2 or fewer (e.g. if a doc begins with a target word, then context will have `window` tokens rather than `window` x 2)
`verbose`	(logical) - if TRUE, report the total number of target instances found.

Value

a data.frame with one column for each candidate term with corresponding cosine similarity values and one column for seqvar.

Examples


library(quanteda)

# gen sequence var (here: year)
docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)
cos_simsdf <- get_seq_cos_sim(x = cr_sample_corpus,
seqvar = docvars(cr_sample_corpus, 'year'),
target = "equal",
candidates = c("immigration", "immigrants"),
pre_trained = cr_glove_subset,
transform_matrix = cr_transform)

conText documentation built on Feb. 16, 2023, 7:32 p.m.