conText: Embedding regression

conText R Documentation

Description

Estimates an embedding regression model, with options to use bootstrapping to estimate confidence intervals and a permutation test for inference (see https://github.com/prodriguezsosa/conText for details).

Usage

conText(
  formula,
  data,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stratify = FALSE,
  permute = TRUE,
  num_permutations = 100,
  window = 6L,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)

Arguments

formula

a symbolic description of the model to be fitted, with a target word as the DV, e.g. immigrant ~ party + gender. To use a phrase as the DV, place it in quotation marks, e.g. "immigrant refugees" ~ party + gender. To use all covariates included in the data, use . on the RHS, e.g. immigrant ~ . (a single dot). To treat the full document as your DV, rather than a single target word, use . on the LHS, e.g. . ~ party + gender. Any character or factor covariates will automatically be converted to a set of binary (0/1) indicator variables, one for each group, leaving the first level out of the regression.

data

a quanteda tokens-class object with the necessary document variables. Covariates must be either binary indicator variables or "transformable" into binary indicator variables. conText will automatically transform any non-indicator variables into binary indicator variables (multiple if there are more than 2 classes), leaving out a "base" category.
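The conversion of character or factor covariates into indicator variables can be illustrated with base R alone. The sketch below uses model.matrix() as a stand-in for what conText does internally. The toy data frame and its values are assumptions made for illustration, and the column names conText itself produces may differ.

```r
# Toy document variables (hypothetical values, not taken from cr_sample_corpus)
docvars_toy <- data.frame(
  party  = c("D", "R", "R", "D"),
  gender = c("F", "M", "F", "M"),
  stringsAsFactors = TRUE
)

# Treatment (dummy) coding: one 0/1 column per non-baseline level.
# The first level of each factor ("D" and "F") is the omitted base category.
X <- model.matrix(~ party + gender, data = docvars_toy)

colnames(X)
```

Each row of X describes one document. A row with zeros in every indicator column belongs to the base category of every covariate.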

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
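As a rough illustration of how pre_trained, transform and transform_matrix fit together, the base R sketch below averages the pre-trained embeddings of the tokens in one context and then multiplies the average by a D x D matrix. All objects (pre_trained_toy, A_toy, context) and their values are made up for this example. A real transformation matrix is estimated from a corpus, and conText's internal implementation may differ in its details.

```r
# Toy pre-trained embeddings: F = 4 features, D = 3 dimensions (made-up values)
pre_trained_toy <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("border", "policy", "reform", "law"), NULL)
)

# Toy D x D transformation matrix (hypothetical)
A_toy <- diag(c(2, 0.5, 1))

# Tokens surrounding one instance of the target word
context <- c("border", "policy", "law", "unknownword")

# Keep only tokens that have a pre-trained embedding
in_vocab <- context[context %in% rownames(pre_trained_toy)]

# Untransformed averaged embedding (what transform = FALSE would correspond to)
avg_embedding <- colMeans(pre_trained_toy[in_vocab, , drop = FALSE])

# Transformed embedding (what transform = TRUE would correspond to)
alc_embedding <- as.vector(A_toy %*% avg_embedding)
```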

bootstrap

(logical) if TRUE, use bootstrapping: sample from texts with replacement and re-run the regression on each sample. Required to obtain standard errors.

num_bootstraps

(numeric) number of bootstraps to use (at least 100)

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

stratify

(logical) if TRUE, stratify by discrete covariates when bootstrapping.
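The bootstrap options can be pictured with a small base R simulation: resample the rows with replacement (here within each covariate group, as stratify = TRUE suggests), refit the regression, and summarise the norms of the coefficients across resamples. The data are simulated, the helper coef_norms() is hypothetical, and the use of the Euclidean norm and of percentile intervals is an assumption of this sketch rather than a statement about conText's internals.

```r
set.seed(42)

# Simulated data (hypothetical): N contexts, D-dimensional embeddings, one binary covariate
N <- 60
D <- 5
group <- rep(c(0, 1), each = N / 2)
X <- cbind(intercept = 1, group = group)
Y <- matrix(rnorm(N * D), nrow = N) + outer(group, rep(0.5, D))

# Multivariate OLS; returns the Euclidean norm of each coefficient vector
coef_norms <- function(X, Y) {
  beta <- solve(crossprod(X), crossprod(X, Y))  # M x D matrix of coefficients
  sqrt(rowSums(beta^2))
}

num_bootstraps <- 100
confidence_level <- 0.95

boot_norms <- t(replicate(num_bootstraps, {
  # Stratified resampling: draw with replacement within each covariate group
  idx <- unlist(lapply(split(seq_len(N), group), function(i) sample(i, replace = TRUE)))
  coef_norms(X[idx, , drop = FALSE], Y[idx, , drop = FALSE])
}))

# Bootstrap standard errors and percentile confidence intervals
alpha <- 1 - confidence_level
std_error <- apply(boot_norms, 2, sd)
ci <- apply(boot_norms, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2))
```

Resampling within groups keeps every group represented in each bootstrap sample, which matters when some groups are small.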

permute

(logical) if TRUE, compute empirical p-values using permutation test

num_permutations

(numeric) number of permutations to use
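A permutation test of this kind can be sketched in base R as follows: shuffle the outcome rows to break any link with the covariates, recompute the norm of the coefficient, and report the share of permuted norms at least as large as the observed one. The simulated data, the helper coef_norms(), and the exact form of the p-value are assumptions of this sketch. conText's own computation may differ.

```r
set.seed(7)

# Simulated data (hypothetical) with a real group difference in every dimension
N <- 60
D <- 5
group <- rep(c(0, 1), each = N / 2)
X <- cbind(intercept = 1, group = group)
Y <- matrix(rnorm(N * D), nrow = N) + outer(group, rep(1, D))

# Multivariate OLS; returns the Euclidean norm of each coefficient vector
coef_norms <- function(X, Y) {
  beta <- solve(crossprod(X), crossprod(X, Y))
  sqrt(rowSums(beta^2))
}

observed <- coef_norms(X, Y)["group"]

num_permutations <- 100
permuted <- replicate(num_permutations, {
  Y_perm <- Y[sample(N), , drop = FALSE]  # shuffle embeddings across documents
  coef_norms(X, Y_perm)["group"]
})

# Empirical p-value: share of permuted norms at least as large as the observed norm
p_value <- mean(permuted >= observed)
```

With only 100 permutations the smallest non-zero p-value that can be observed is 0.01, so a larger num_permutations gives finer resolution.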

window

the number of context words to include on each side of the target word

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) if TRUE, a context must have exactly window x 2 tokens; if FALSE, it can have window x 2 or fewer (e.g. if a document begins with a target word, the context will have window tokens rather than window x 2).
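The interaction between window and hard_cut can be illustrated with a small base R function. get_contexts() is a hypothetical helper written for this sketch, not part of conText or quanteda, and it only handles a single-word target with case-insensitive exact matching.

```r
# Return the tokens within `window` positions on each side of every target occurrence
get_contexts <- function(tokens, target, window = 6L, hard_cut = FALSE) {
  hits <- which(tolower(tokens) == tolower(target))
  contexts <- lapply(hits, function(i) {
    # Positions inside the window, clipped to the document and excluding the target itself
    idx <- setdiff(seq(max(1, i - window), min(length(tokens), i + window)), i)
    tokens[idx]
  })
  if (hard_cut) {
    # Keep only contexts that have the full window on both sides
    contexts <- Filter(function(x) length(x) == 2 * window, contexts)
  }
  contexts
}

toks_toy <- c("immigration", "reform", "is", "a", "priority",
              "and", "immigration", "law", "matters")

soft <- get_contexts(toks_toy, "immigration", window = 2L, hard_cut = FALSE)
hard <- get_contexts(toks_toy, "immigration", window = 2L, hard_cut = TRUE)
```

Here the first occurrence opens the document, so its context holds only the two tokens to its right. It is kept when hard_cut = FALSE and dropped when hard_cut = TRUE.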

verbose

(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.

Value

a conText-class object - a D x M matrix with D = dimensions of the pre-trained feature embeddings provided and M = number of covariates including the intercept. These represent the estimated regression coefficients. These can be combined to compute ALC embeddings for different combinations of covariates. The object also includes various informative attributes, importantly a data.frame with the following columns:

coefficient

(character) name of (covariate) coefficient.

value

(numeric) norm of the corresponding beta coefficient.

std.error

(numeric) (if bootstrap = TRUE) std. error of the norm of the beta coefficient.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

p.value

(numeric) (if permute = TRUE) empirical p-value of the norm of the coefficient.
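To make the idea of combining coefficients concrete, the base R sketch below builds a toy coefficient matrix and adds rows together to obtain group-specific embeddings. The numbers are made up. The layout with one named row per coefficient is an assumption based on the rownames(model1) call in the Examples, and the Euclidean norm is likewise an assumption about how the value column is computed.

```r
# Toy coefficient matrix (made-up numbers): one row per coefficient, D = 3 columns
beta <- matrix(
  c( 0.2,  0.1, 0.4,
     0.3, -0.1, 0.0,
    -0.2,  0.2, 0.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("(Intercept)", "partyR", "genderM"), NULL)
)

# Embedding for the baseline group (all indicator variables equal to 0)
baseline <- beta["(Intercept)", ]

# Embedding for Republican speakers in the baseline gender group
republican <- beta["(Intercept)", ] + beta["partyR", ]

# Norm of each coefficient vector, one value per coefficient
norms <- sqrt(rowSums(beta^2))
```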

Examples


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

## given the target word "immigration"
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                 data = toks,
                 pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 bootstrap = TRUE,
                 num_bootstraps = 100,
                 confidence_level = 0.95,
                 stratify = FALSE,
                 permute = TRUE, num_permutations = 10,
                 window = 6, case_insensitive = TRUE,
                 verbose = FALSE)

# notice, character/factor covariates are automatically "dummified"
rownames(model1)

# the beta coefficient 'partyR' in this case captures how the alc embedding
# of "immigration" shifts for Republican party speeches relative to the base category

# (normed) coefficient table
model1@normed_coefficients


conText documentation built on Feb. 16, 2023, 7:32 p.m.