```r
knitr::opts_chunk$set(collapse = FALSE, comment = "##", tidy = FALSE)
```
We describe this all in more detail below, but the key point for applied researchers is that, for now, we recommend the jackknife bias correction, which is now built into our main conText() function.
We are hugely thankful to Will Hobbs and Breanna Green for bringing to our attention clear examples where the bias was larger than we had anticipated. We are actively collaborating with them to propose additional fixes.
conText's regression framework allows users to estimate the "semantic" shift in the usage of a term between two or more groups, while controlling for other group covariates. The resulting coefficients will be D-dimensional, where D equals the dimensionality of the pre-trained embeddings used (say, 300). In Rodriguez et al. (2023) we propose using the Euclidean norm to summarize the magnitude of these coefficients, noting however that:
"...estimates of the norms have a finite-sample bias for rare words. Thus care is needed when comparing words or groups with substantially different amounts of available data."
We now give a more precise discussion of the potential problems---and a way to mitigate them.
While the D-dimensional OLS coefficients resulting from a conText regression are approximately unbiased, functions of these coefficients, such as the norm, are not. The magnitude of the bias, and hence the potential for mistaken inferences, increases with the variance of the embedding position; the magnitude of the bias is therefore a function of sample size.
As one might expect, the smaller the sample---for any of the slices of the data resulting from the set of covariates---the higher the variance, and hence the larger the bias. In the case of conText, a multivariate multiple regression, the problem is more acute the larger the dimensionality of the embedding space: higher dimensionality (say, vectors of length 500 instead of 100) results in higher variance in small samples, all else equal.
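To see the mechanism at work, consider estimating a coefficient vector whose true value is zero in every dimension. Each entry of the estimate is unbiased, mean-zero noise, but squaring and summing turns that noise into a strictly positive norm, of order sqrt(D/n). The following sketch uses simulated noise (not conText output) to show the inflation growing as the sample shrinks and the dimensionality grows:

```r
set.seed(2023)
# True vector is zero in all D dimensions; each entry of the estimate is an
# unbiased sample mean, yet the norm of the estimate is roughly sqrt(D / n).
for (n in c(50, 500)) {
  for (D in c(10, 100, 500)) {
    est <- colMeans(matrix(rnorm(n * D), nrow = n, ncol = D))
    cat(sprintf("n = %4d, D = %3d: estimated norm = %.2f (true norm = 0)\n",
                n, D, sqrt(sum(est^2))))
  }
}
```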
Re-sampling methods offer one way to quantify and correct for (or at least mitigate) this bias. These include:

- permutation
- bootstrapping
- jackknife cross-validation
Re-sampling approaches to bias correction all work by leveraging sub-samples of the observed data (contexts, in our case) to estimate the magnitude of the bias; the methods differ in how these sub-samples are generated. Ultimately, in light of the experimental evaluations we report below, we suggest and implement the jackknife in our software.
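To fix ideas, the jackknife recomputes the statistic once per observation, leaving that observation out, and then combines the full-sample estimate with the average of the leave-one-out estimates. A generic sketch (jackknife_bc is an illustrative helper, not part of conText):

```r
# Generic jackknife bias correction for a statistic `stat` of a sample x.
jackknife_bc <- function(x, stat) {
  n <- length(x)
  theta_hat <- stat(x)                                             # full-sample estimate
  loo <- vapply(seq_len(n), function(i) stat(x[-i]), numeric(1))   # leave-one-out estimates
  n * theta_hat - (n - 1) * mean(loo)                              # bias-corrected estimate
}

# Example: |mean| is biased upward when the true mean is 0; the jackknife
# shrinks (though need not fully remove) that bias.
set.seed(1)
x <- rnorm(30)
c(plugin = abs(mean(x)), jackknife = jackknife_bc(x, function(v) abs(mean(v))))
```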
We emphasize, though, that with small samples and highly imbalanced groups, non-trivial amounts of bias will remain.
For our evaluations we employ simulated data representative of the type of data---numeric vectors, i.e. embeddings---we feed into conText. In particular, we proceed as follows:
- sample D-dimensional (D = 100) vectors from two multivariate normal distributions, each representing a hypothetical population of interest (covariate groups in the data, say Republicans and Democrats); our interest is in estimating the norm of the difference in the (row) means of the two populations.
- within this sampling arrangement, we then vary: the true difference in means (θ = 0 or θ = 1), the total sample size (150, 250, 500, 1000, or 2500), and the class ratio between the two groups (1:100, 1:10, 1:2, or 1:1).
For more details on this scheme, see the code at the end of this vignette.
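As a condensed preview of one simulation draw (the full code, including the correction methods, appears at the end):

```r
# One draw: two groups of D-dimensional vectors (columns), plug-in estimate of
# the root-mean-square norm of the difference in their row means.
set.seed(2023)
D <- 100
theta <- 1
x <- replicate(100, rnorm(D, mean = 0))      # group 1: D x 100 matrix
y <- replicate(50, rnorm(D, mean = theta))   # group 2: D x 50 matrix
sqrt(mean((rowMeans(x) - rowMeans(y))^2))    # should be close to theta = 1
```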
Figure 1 plots our results. The plots on the left show results for when θ, the true difference in means, equals 0; if the estimator were unbiased, the point estimates would lie on the red "zero" line. The plots on the right show results for when θ equals 1; if the estimator were unbiased, the point estimates would lie on the red "one" line.
There are several takeaways:
- The bias with no correction (top row) is particularly problematic in the case of no difference (θ = 0), and it is worse given small, highly unbalanced samples (e.g. n = 150, class ratio 1:100). Specifically, we are more likely to incorrectly reject the null (commit a Type 1 error): the point estimate is sometimes far from zero, and the confidence interval does not cover zero. As expected, the bias decreases as the sample size increases and/or the two groups become more balanced.
- Comparing the re-sampling-based corrections, bootstrapping performs poorly (relative to permutation and the jackknife) in the absence of a true difference between the two population means: given a true θ of 0, bootstrapping retains a strong positive bias after correction in small, highly unbalanced samples.
- The permutation-based correction, on the other hand, performs poorly in the presence of a true difference between the two population means (θ = 1): it shows a strong negative bias in small, highly unbalanced samples, and, surprisingly, this persists even in larger, more balanced samples. It is, however, the best performer in the case of no true difference (θ = 0). And although the permutation-based correction does not perform well as an estimator in general, it still correctly controls the Type 1 error when used as a test (see the sketch after this list).
- The jackknife performs relatively well in both scenarios, although it does not entirely eliminate the bias in the case of θ = 0 (underperforming the permutation-based correction).
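To make the correction-versus-test distinction concrete, the sketch below computes an empirical p-value by reshuffling group labels, mirroring the permute() helper in the simulation code at the end of this vignette (perm_test is an illustrative name, not conText's implementation):

```r
# Permutation *test*: empirical p-value for the null of no difference in means.
perm_test <- function(x, y, B = 200) {
  rms <- function(v) sqrt(mean(v^2))
  obs <- rms(rowMeans(x) - rowMeans(y))      # observed statistic
  z <- cbind(x, y)
  perm <- replicate(B, {
    ind <- sample(ncol(z), ncol(x))          # reshuffle group labels
    rms(rowMeans(z[, ind]) - rowMeans(z[, -ind]))
  })
  mean(perm >= obs)                          # share of permuted draws >= observed
}

# Example: with no true difference, the p-value should not be systematically small.
set.seed(1)
perm_test(x = replicate(100, rnorm(50)), y = replicate(50, rnorm(50)))
```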
Figure 1: Mean estimates of θ̂ (with 2.5th and 97.5th percentile bands across simulations), by sample size, class ratio, and correction method, for θ = 0 (left) and θ = 1 (right). Note: red dashed line = θ.
Given the above results, we updated conText to allow users to implement the jackknife bias correction (see the Quick Start Guide for examples, and the sketch after the list below). Specifically:
- the permutation argument in conText remains unchanged, implementing a permutation test to estimate empirical p-values;
- we recommend users combine the jackknife bias correction with this permutation test when evaluating the significance of the norm of the coefficients;
- we further recommend that users avoid relying on results based on small samples (for any of the slices of the data implicit in the regression model) and always follow up regression-based results with a qualitative comparison of nearest neighbors.
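A hypothetical call combining these recommendations is sketched below. The formula and data objects (toks, cr_glove_subset, cr_transform) follow the package's Quick Start Guide, and the jackknife argument name is our assumption here; consult ?conText for the exact interface in your installed version.

```r
# Hypothetical sketch -- argument names and data objects are assumptions;
# see ?conText and the Quick Start Guide for the authoritative interface.
library(conText)
model <- conText(formula = immigration ~ party + gender,
                 data = toks,                     # assumed: pre-processed quanteda tokens
                 pre_trained = cr_glove_subset,   # assumed: pre-trained embeddings
                 transform = TRUE,
                 transform_matrix = cr_transform, # assumed: feature transformation matrix
                 jackknife = TRUE,                # assumed: jackknife bias correction
                 permute = TRUE,                  # permutation test for empirical p-values
                 num_permutations = 100)
```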
Note: the current version of conText continues to allow for bootstrapping; this argument, however, does not implement any bias correction and will eventually be deprecated.
```r
# load libraries
library(dplyr)
library(progress)
library(ggplot2)
library(gridExtra)
library(furrr)

# ---------------------
# Functions
# ---------------------
# root-mean-square of a vector
rmse <- function(x) { sqrt(mean(x^2)) }

# plug-in estimate: RMS norm of the difference in group means
plugin <- function(x, y) { rmse(rowMeans(x) - rowMeans(y)) }

# one permutation draw: reshuffle group labels, recompute the statistic
permute <- function(z, nx, ny) {
  ind <- sample(1:(nx + ny), nx)
  xtilde <- z[, ind]
  ytilde <- z[, -ind]
  plugin(xtilde, ytilde)
}

# permutation-based bias correction: subtract the mean permuted statistic
permutation <- function(x, y) {
  z <- cbind(x, y)
  plugin(x, y) - mean(replicate(100, permute(z, ncol(x), ncol(y))))
}

# jackknife bias correction: n * thetahat - (n - 1) * mean leave-one-out estimate
jackknife <- function(x, y) {
  cov <- c(rep(1, ncol(x)), rep(0, ncol(y)))
  z <- t(cbind(x, y))
  xbar <- rowMeans(x)
  ybar <- rowMeans(y)
  jhat <- mean(sapply(1:nrow(z), jk1, z = z, cov = cov, xbar = xbar, ybar = ybar,
                      nx = ncol(x), ny = ncol(y)))
  thetahat <- plugin(x, y)
  return(nrow(z) * thetahat - (nrow(z) - 1) * jhat)
}

# leave-one-out estimate, dropping observation i from its group
jk1 <- function(z, cov, i, xbar, ybar, nx, ny) {
  if (cov[i] == 0) { # y case
    ytilde <- (ybar * ny - z[i, ]) / (ny - 1)
    return(rmse(xbar - ytilde))
  } else { # x case
    xtilde <- (xbar * nx - z[i, ]) / (nx - 1)
    return(rmse(xtilde - ybar))
  }
}

# one bootstrap draw (stratified by group, or pooled)
boot <- function(x, y, strat = TRUE) {
  if (strat) {
    indx <- sample(1:ncol(x), ncol(x), replace = TRUE)
    indy <- sample(1:ncol(y), ncol(y), replace = TRUE)
    return(rmse(rowMeans(x[, indx]) - rowMeans(y[, indy])))
  } else {
    cov <- c(rep(1, ncol(x)), rep(0, ncol(y)))
    z <- cbind(x, y)
    covb <- c(1, 1)
    while (length(unique(covb)) == 1) {
      ind <- sample(1:ncol(z), ncol(z), replace = TRUE)
      covb <- cov[ind]
    }
    zb <- z[, ind]
    return(rmse(rowMeans(zb[, covb == 1, drop = FALSE]) -
                rowMeans(zb[, covb == 0, drop = FALSE])))
  }
}

# bootstrap bias correction: 2 * thetahat - mean bootstrap statistic
bootstrap <- function(x, y, strat = TRUE) {
  2 * plugin(x, y) - mean(replicate(500, boot(x, y, strat = strat)))
}

# sampling function (kept for reference; the simulations below use sim_function)
samp_function <- function(mux, muy, samp_size, class_ratio, D) {
  # set sample size
  nx <- ceiling(class_ratio * samp_size / (1 + class_ratio))
  ny <- samp_size - nx
  # sample
  x <- replicate(nx, rnorm(D, mean = mux))
  y <- replicate(ny, rnorm(D, mean = muy))
  out <- data.frame(
    samp_size = samp_size,
    class_ratio = class_ratio,
    method = c("none", "Jackknife", "permutation", "bootstrap"),
    value = c(
      plugin(x, y),
      jackknife(x, y),
      permutation(x, y),
      bootstrap(x, y)
    )
  )
  return(out)
}

# ---------------------
# Simulations
# ---------------------
sim_function <- function(theta, D, samp_size, class_ratio) {
  mux <- rep(0, D)
  muy <- rep(theta, D)
  # set sample size
  nx <- ceiling(class_ratio * samp_size / (1 + class_ratio))
  ny <- samp_size - nx
  # sample
  x <- replicate(nx, rnorm(D, mean = mux))
  y <- replicate(ny, rnorm(D, mean = muy))
  out <- data.frame(
    D = D,
    theta = theta,
    samp_size = samp_size,
    class_ratio = class_ratio,
    method = c("none", "Jackknife", "permutation", "bootstrap"),
    value = c(
      plugin(x, y),
      jackknife(x, y),
      permutation(x, y),
      bootstrap(x, y)
    )
  )
  return(out)
}

params <- expand.grid(
  samp_size = c(150, 250, 500, 1000, 2500),
  class_ratio = c(0.01, 0.1, 0.5, 1),
  theta = c(0, 1),
  D = c(10, 100)
)

plan(multisession, workers = 12)
sim_out_df <- 1:100 %>%
  future_map_dfr(
    ~ params %>% pmap(sim_function),
    .progress = TRUE,
    .options = furrr_options(seed = 2023L)
  )

# ---------------------
# Plot
# ---------------------
plot_df <- sim_out_df %>%
  mutate(method = factor(method, levels = c("none", "bootstrap", "permutation", "Jackknife"))) %>%
  group_by(D, theta, samp_size, class_ratio, method) %>%
  summarize(avg = mean(value),
            p025 = quantile(value, 0.025),
            p975 = quantile(value, 0.975),
            .groups = "drop_last")

# new facet label names
ratio.labs <- c("class ratio\n1:100", "class ratio\n1:10", "class ratio\n1:2", "class ratio\n1:1")
names(ratio.labs) <- c("0.01", "0.1", "0.5", "1")

# plot w. theta = 0
plot1 <- ggplot(subset(plot_df, theta == 0 & D == 100),
                aes(x = factor(samp_size), y = avg, group = 1)) +
  geom_point() +
  geom_errorbar(aes(ymin = p025, ymax = p975), width = 0.2) +
  geom_hline(aes(yintercept = 0), color = 'red', linetype = 'dashed') +
  facet_grid(method ~ class_ratio, labeller = labeller(class_ratio = ratio.labs)) +
  labs(title = paste0("\u03b8", " = 0\n"),
       y = expression(hat(theta)),
       x = "Sample Size",
       caption = paste0("Note: red dashed line = ", "\u03b8.")) +
  theme(strip.text.x = element_text(size = 16, face = 'bold'),
        strip.text.y = element_text(size = 16, face = 'bold'),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.y = element_text(size = 16, margin = margin(t = 0, r = 15, b = 0, l = 15), face = 'bold'),
        axis.title.x = element_text(size = 16, margin = margin(t = 15, r = 0, b = 15, l = 0), face = 'bold'),
        plot.caption = element_text(hjust = 0, size = 16),
        plot.title = element_text(face = "bold", hjust = 0.5, size = 18))

# plot w. theta = 1
plot2 <- ggplot(subset(plot_df, theta == 1 & D == 100),
                aes(x = factor(samp_size), y = avg, group = 1)) +
  geom_point() +
  geom_errorbar(aes(ymin = p025, ymax = p975), width = 0.2) +
  geom_hline(aes(yintercept = 1), color = 'red', linetype = 'dashed') +
  facet_grid(method ~ class_ratio, labeller = labeller(class_ratio = ratio.labs)) +
  labs(title = paste0("\u03b8", " = 1"),
       y = expression(hat(theta)),
       x = "Sample Size") +
  theme(strip.text.x = element_text(size = 16, face = 'bold'),
        strip.text.y = element_text(size = 16, face = 'bold'),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 14),
        axis.text.y = element_text(size = 14),
        axis.title.y = element_text(size = 16, margin = margin(t = 0, r = 15, b = 0, l = 15), face = 'bold'),
        axis.title.x = element_text(size = 16, margin = margin(t = 15, r = 0, b = 15, l = 0), face = 'bold'),
        plot.title = element_text(face = "bold", hjust = 0.5, size = 18))

grid.arrange(plot1, plot2, ncol = 2)
```