conText: Embedding regression
In conText: 'a la Carte' on Text (ConText) Embedding Regression

Description

Estimates an embedding regression model, with options to use bootstrapping to estimate confidence intervals and a permutation test for inference (see https://github.com/prodriguezsosa/conText for details).

Usage

conText(
  formula,
  data,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stratify = FALSE,
  permute = TRUE,
  num_permutations = 100,
  window = 6L,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)

Arguments

`formula`	a symbolic description of the model to be fitted, with a target word as the DV, e.g. `immigrant ~ party + gender`. To use a phrase as the DV, place it in quotation marks, e.g. `"immigrant refugees" ~ party + gender`. To use all covariates included in the data, use `.` on the RHS, e.g. `immigrant ~ .`. To treat the full document as your DV, rather than a single target word, use `.` on the LHS, e.g. `. ~ party + gender`. Any `character` or `factor` covariates will automatically be converted to a set of binary (`0/1`) indicator variables, one per group, leaving the first level out of the regression.
`data`	a quanteda `tokens-class` object with the necessary document variables. Covariates must be either binary indicator variables or "transformable" into binary indicator variables. conText will automatically transform any non-indicator variables into binary indicator variables (multiple if more than 2 classes), leaving out a "base" category.
`pre_trained`	(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
`transform`	(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.
`transform_matrix`	(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
`bootstrap`	(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-run regression on each sample. Required to get std. errors.
`num_bootstraps`	(numeric) number of bootstraps to use (at least 100)
`confidence_level`	(numeric in (0,1)) confidence level e.g. 0.95
`stratify`	(logical) if TRUE, stratify by discrete covariates when bootstrapping.
`permute`	(logical) if TRUE, compute empirical p-values using permutation test
`num_permutations`	(numeric) number of permutations to use
`window`	the number of context words to include on each side of the target word
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
`case_insensitive`	logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`hard_cut`	(logical) if TRUE, a context must have exactly `window` x 2 tokens; if FALSE, it can have `window` x 2 or fewer (e.g. if a document begins with the target word, the context will have `window` tokens rather than `window` x 2).
`verbose`	(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
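
The automatic conversion of `character` and `factor` covariates described under `formula` and `data` resembles standard treatment (dummy) coding. The sketch below uses base R's `model.matrix()` to illustrate the idea; it is not conText's internal code, and the data frame `docvars_df` and its values are hypothetical.

```r
# Hypothetical document variables; not taken from the conText sample data.
docvars_df <- data.frame(
  party  = factor(c("D", "R", "R", "D", "I")),
  gender = c("F", "M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# model.matrix() expands character/factor columns into 0/1 indicator
# columns and drops the first level of each variable as the base category.
X <- model.matrix(~ party + gender, data = docvars_df)

# "D" (party) and "F" (gender) are the omitted base categories.
colnames(X)
```

Each remaining indicator column then corresponds to one coefficient in the regression, measured relative to the omitted base category.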

Value

a conText-class object: an M x D matrix of estimated regression coefficients, with M = number of covariates including the intercept (one per row, as `rownames()` in the example shows) and D = dimensions of the pre-trained feature embeddings provided. The coefficients can be combined to compute ALC embeddings for different combinations of covariates. The object also includes various informative attributes, importantly a data.frame with the following columns:

coefficient: (character) name of (covariate) coefficient.
value: (numeric) norm of the corresponding beta coefficient.
std.error: (numeric) (if bootstrap = TRUE) std. error of the norm of the beta coefficient.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value: (numeric) (if permute = TRUE) empirical p.value of the norm of the coefficient.
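
To make the columns above concrete, the following base R sketch shows one way a bootstrapped standard error, a confidence interval and a permutation p-value for the norm of a coefficient could be computed. It runs on simulated data and is an illustration only: the helper `coef_norm()`, the simulated `Y` and `x`, and the normal-approximation interval are all assumptions, and conText's own implementation may differ.

```r
set.seed(42L)

# Simulated stand-ins: n contexts, D-dimensional embeddings, one binary covariate.
n <- 200L
D <- 5L
x <- rbinom(n, 1, 0.5)
Y <- matrix(rnorm(n * D), nrow = n, ncol = D)
Y[x == 1, 1] <- Y[x == 1, 1] + 1  # shift dimension 1 for the x == 1 group

# Hypothetical helper: Euclidean norm of the covariate's D-dimensional OLS coefficient.
coef_norm <- function(Y, x) {
  beta <- coef(lm(Y ~ x))["x", ]
  sqrt(sum(beta^2))
}

observed <- coef_norm(Y, x)

# Bootstrap: resample contexts with replacement and re-estimate the norm.
boot_norms <- replicate(100, {
  idx <- sample.int(n, replace = TRUE)
  coef_norm(Y[idx, , drop = FALSE], x[idx])
})
std_error <- sd(boot_norms)

# Normal-approximation 95% interval (an assumption; other interval types exist).
ci <- observed + c(-1, 1) * qnorm(0.975) * std_error

# Permutation test: shuffle the covariate to break its link with the embeddings.
perm_norms <- replicate(100, coef_norm(Y, sample(x)))
p_value <- mean(perm_norms >= observed)
```

The permutation p-value is the share of shuffled datasets whose coefficient norm is at least as large as the observed one, so its resolution is limited by the number of permutations.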

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

## given the target word "immigration"
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                 data = toks,
                 pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 bootstrap = TRUE,
                 num_bootstraps = 100,
                 confidence_level = 0.95,
                 stratify = FALSE,
                 permute = TRUE, num_permutations = 10,
                 window = 6, case_insensitive = TRUE,
                 verbose = FALSE)

# notice, character/factor covariates are automatically "dummified"
rownames(model1)

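# the beta coefficient 'partyR' in this case captures how the ALC embedding
# of "immigration" in Republican party speeches differs from the base category;
# adding it to the intercept gives the Republican ALC embedding
# (for the base gender category)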

# (normed) coefficient table
model1@normed_coefficients

conText documentation built on Feb. 16, 2023, 7:32 p.m.