kbal: Kernel Balancing

kbal	R Documentation

Kernel Balancing

Description

Kernel balancing (kbal) is a non-parametric weighting tool that makes two groups have a similar distribution of covariates, not only in terms of means or marginal distributions but also on (i) general smooth functions of the covariates and (ii) a smoothing estimator of the joint distribution of the covariates. It was originally designed (Hazlett, 2017) to make control and treated groups look alike, as desired when estimating causal effects under conditional ignorability. This package also facilitates use of this approach for more general distribution-alignment tasks, such as making a sampled group have a distribution of covariates similar to that of a target population, as in survey reweighting. The examples below provide an introduction to both settings.

To proceed in the causal effect setting, kbal assumes that the expectation of the non-treatment potential outcome conditional on the covariates falls in a large, flexible space of functions associated with a kernel. It then constructs linear bases for this function space and achieves approximate balance on these bases. The approximation is one that minimizes the worst-case bias that could persist due to remaining imbalances.
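
One way to write this assumption and see why balance on the kernel bases helps is sketched below. The notation is illustrative and not taken from the package: `K_i` is the row of the kernel matrix for unit `i`, `c` is an unknown coefficient vector, `D_i` is the treatment indicator, `N_t` is the number of treated units, and `w_i` are control weights summing to one.

```latex
\begin{align*}
\mathbb{E}[Y_i(0) \mid X_i] &= \sum_{j=1}^{N} c_j \, k(X_i, X_j) = K_i c, \\
\frac{1}{N_t} \sum_{i: D_i = 1} K_i c \;-\; \sum_{i: D_i = 0} w_i K_i c
  &= \Big( \frac{1}{N_t} \sum_{i: D_i = 1} K_i \;-\; \sum_{i: D_i = 0} w_i K_i \Big) c .
\end{align*}
```

Under this form, weights that make the term in parentheses small bring the weighted control mean of the non-treatment outcome function close to its treated mean, whatever the value of `c` within a bounded set.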

The kbal function implements kernel balancing using a Gaussian kernel to expand the features of X_i to infinite dimensions. It finds approximate mean balance between the control or sample group and the treated group or target population in this expanded feature space by using the first numdims dimensions of the singular value decomposition (SVD) of the Gaussian kernel matrix. It employs entropy balancing to find the weights for each unit which produce this approximate balance. When numdims is not user-specified, it searches through increasing dimensions of the SVD of the kernel matrix to find the number of dimensions which produces weights that minimize the worst-case bias bound with a given hilbertnorm. It then returns these optimal weights, along with the minimized bias, the kernel matrix, a record of the number of dimensions used and the corresponding bias, as well as an original bias using naive group size weights for comparison.

Note that while kernel balancing goes far beyond simple mean balancing, it may not result in perfect mean balance. Users who wish to require mean balancing can specify meanfirst = TRUE to require mean balance on as many dimensions of the data as optimally feasible. Alternatively, users can manually specify constraint to append additional vector constraints to the kernel matrix in the bias bound optimization, requiring mean balance on these columns.

Note further that kbal supports three types of input data: fully categorical, fully continuous, or mixed. When data is only categorical, as is common with demographic variables for survey reweighting, users should use argument cat_data = TRUE and can input their data as factors, numerics, or characters; kbal will internally transform the data to a more appropriate one-hot encoding and search for the value of b, the denominator of the exponent in the Gaussian, which maximizes the variance of the kernel matrix. When data is fully continuous, users should use the default settings (cat_data = FALSE and mixed_data = FALSE), which will scale all columns and again conduct an internal search for the value of b which maximizes the variance of K. Note that with continuous data, this search may take considerably more computational time than in the categorical case. When data is a mix of continuous and categorical data, users should use argument mixed_data = TRUE, specify by name which columns are categorical with cat_columns, and also set the scaling of the continuous variables with cont_scale. This will result in a one-hot encoding of the categorical columns concatenated with the continuous columns scaled in accordance with cont_scale, and again an internal search for the value of b which maximizes the variance of the kernel matrix. Again note that, compared to the categorical case, this search will take more computational time.
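
The pipeline described above (Gaussian kernel, SVD, entropy balancing on the first numdims left singular vectors) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the fixed value of b, the choice numdims = 2, and the use of optim() on the entropy-balancing dual are all choices made here for illustration, and the sketch does not call kbal() or ebalance_custom().

```r
# Toy data (hypothetical): one covariate, 30 control and 10 treated units
x_control <- seq(-2, 2, length.out = 30)
x_treated <- seq(-1, 2, length.out = 10)
X     <- matrix(c(x_control, x_treated), ncol = 1)
treat <- c(rep(0, 30), rep(1, 10))

# Scale the covariate, then build K_ij = exp(-||x_i - x_j||^2 / b),
# where b plays the role of the whole denominator of the exponent
Xs <- scale(X)
b  <- 2 * ncol(Xs)                      # arbitrary value for this sketch
K  <- exp(-as.matrix(dist(Xs))^2 / b)

# Use the first numdims left singular vectors of K as balancing bases
numdims <- 2
U <- svd(K)$u[, seq_len(numdims), drop = FALSE]

# Entropy-balancing dual: control weights w_i proportional to exp(z_i' lambda),
# with lambda chosen so the weighted control mean of U equals the treated mean
target <- colMeans(U[treat == 1, , drop = FALSE])
Zc     <- sweep(U[treat == 0, , drop = FALSE], 2, target)

dual <- function(lambda) log(sum(exp(Zc %*% lambda)))
grad <- function(lambda) {
  w <- exp(Zc %*% lambda)
  as.vector(crossprod(Zc, w / sum(w)))
}
fit <- optim(rep(0, numdims), dual, grad, method = "BFGS",
             control = list(reltol = 1e-12, maxit = 500))

# Normalized control weights implied by the fitted lambda
w <- as.vector(exp(Zc %*% fit$par))
w <- w / sum(w)

# Remaining imbalance on the chosen bases (should be near zero)
imbalance <- colSums(w * U[treat == 0, , drop = FALSE]) - target
print(round(imbalance, 6))
```

In the package itself, numdims is chosen by the bias bound search rather than fixed in advance, and b is searched for unless supplied.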

Usage

kbal(
  allx,
  useasbases = NULL,
  b = NULL,
  sampled = NULL,
  sampledinpop = NULL,
  treatment = NULL,
  population.w = NULL,
  K = NULL,
  K.svd = NULL,
  cat_data = FALSE,
  mixed_data = FALSE,
  cat_columns = NULL,
  cont_scale = NULL,
  scale_data = NULL,
  drop_MC = NULL,
  linkernel = FALSE,
  meanfirst = FALSE,
  mf_columns = NULL,
  constraint = NULL,
  scale_constraint = TRUE,
  numdims = NULL,
  minnumdims = NULL,
  maxnumdims = NULL,
  fullSVD = FALSE,
  incrementby = 1,
  ebal.maxit = 500,
  ebal.tol = 1e-06,
  ebal.convergence = NULL,
  maxsearch_b = 2000,
  early.stopping = TRUE,
  printprogress = TRUE
)

Arguments

`allx`	a data matrix containing all observations, where rows are units and columns are covariates. When using only continuous covariates (`cat_data = FALSE` and `mixed_data = FALSE`), all columns must be numeric. When using categorical data (either `cat_data = TRUE` or `mixed_data = TRUE`), categorical columns can be characters or numerics, which will be treated as factors. Users should not one-hot encode categorical covariates, as this transformation occurs internally.
`useasbases`	optional binary vector to specify what observations are to be used in forming bases (columns) of the kernel matrix to get balance on. If the number of observations is under 4000, the default is to use all observations. When the number of observations is over 4000, the default is to use the sampled (control) units only.
`b`	scaling factor in the calculation of Gaussian kernel distance equivalent to the entire denominator `2\sigma^2` of the exponent. Default is to search for the value which maximizes the variance of the kernel matrix.
`sampled`	a numeric vector of length equal to the total number of units where sampled units take a value of 1 and population units take a value of 0.
`sampledinpop`	a logical to be used in combination with input `sampled` that, when `TRUE`, indicates that sampled units should also be included in the target population when searching for optimal weights.
`treatment`	an alternative input to `sampled` and `sampledinpop` that is a numeric vector of length equal to the total number of units. Current version supports the ATT estimand. Accordingly, the treated units are the target population, and the control are equivalent to the sampled. Weights play the role of making the control groups (sampled) look like the target population (treated). When specified, `sampledinpop` is forced to be `FALSE`.
`population.w`	optional vector of population weights with length equal to the number of population units. Must sum to either 1 or the number of population units.
`K`	optional matrix input that takes a user-specified kernel matrix and performs SVD on it internally in the search for weights which minimize the bias bound.
`K.svd`	optional list input that takes a user-specified singular value decomposition of the kernel matrix. This list must include three objects: `K.svd$u`, a matrix of left-singular vectors; `K.svd$v`, a matrix of right-singular vectors; and `K.svd$d`, their corresponding singular values.
`cat_data`	logical argument that when true indicates `allx` contains only categorical data. When true, the internal construction of the kernel matrix uses a one-hot encoding of `allx` (multiplied by a factor of `\sqrt{0.5}` to compensate for double counting) and the value of `b` which maximizes the variance of this kernel matrix. When true, `mixed_data`, `scale_data`, `linkernel`, and `drop_MC` should be `FALSE`. Default is `FALSE`.
`mixed_data`	logical argument that when true indicates `allx` contains a combination of both continuous and categorical data. When true, the internal construction of the kernel matrix uses a one-hot encoding of the categorical variables in `allx` as specified by `cat_columns` (multiplied by a factor of `\sqrt{0.5}` to compensate for double counting) concatenated with the remaining continuous variables scaled to have a default standard deviation of 1 or that specified in `cont_scale`. When both `cat_data` and `mixed_data` are `FALSE`, the kernel matrix assumes all continuous data and does not one-hot encode any part of `allx`, but still uses the value of `b` which produces maximal variance in `K`. Default is `FALSE`.
`cat_columns`	optional character argument that must be specified when `mixed_data` is `TRUE` and that indicates what columns of `allx` contain categorical variables.
`cont_scale`	optional numeric argument used when `mixed_data` is `TRUE` which specifies how to scale the standard deviation of continuous variables in `allx`. Can be either a single value or a vector with length equal to the number of continuous variables in `allx` (columns not specified in `cat_columns`) and ordered accordingly.
`scale_data`	logical when true scales the columns of `allx` (demeans and scales variance to 1) before building the kernel matrix internally. This is appropriate when `allx` contains only continuous variables with different scales, but is not recommended when `allx` contains any categorical data. Default is `TRUE` when both `cat_data` and `mixed_data` are `FALSE` and `FALSE` otherwise.
`drop_MC`	logical for whether or not to drop multicollinear columns in `allx` before building `K`. When either `cat_data` or `mixed_data` is `TRUE`, forced to be `FALSE`. Otherwise, with continuous data only, default is `TRUE`.
`linkernel`	logical if true, uses the linear kernel `K=XX'` which achieves balance on the first moments of `X` (mean balance). Note that for computational ease, the code employs `K=X` and adjusts singular values accordingly. Default is `FALSE`.
`meanfirst`	logical if true, internally searches for the optimal number of dimensions of the SVD of `allx` to append to `K` as additional constraints. This will produce mean balance on as many dimensions of `allx` as optimally feasible with specified ebalance convergence and a minimal bias bound on the remaining unbalanced columns of the left singular vectors of `K`. Note that any scaling specified on `allx` will also be applied in the meanfirst routine. Default is `FALSE`.
`mf_columns`	either character or numeric vector to specify what columns of `allx` to perform meanfirst with. If left unspecified, all columns will be used.
`constraint`	optional matrix argument of additional constraints which are appended to the front of the left singular vectors of `K`. When specified, the code conducts a constrained optimization requiring mean balance on the columns of this matrix throughout the search for the minimum bias bound over the dimensions of the left singular vectors of `K`.
`scale_constraint`	logical for whether constraints in `constraint` should be scaled before they are appended to the svd of `K`. Default is `TRUE`.
`numdims`	optional numeric argument specifying the number of dimensions of the left singular vectors of the kernel matrix on which to find balance, bypassing the optimization search for the number of dimensions which minimizes the bias bound.
`minnumdims`	numeric argument to specify the minimum number of the left singular vectors of the kernel matrix to seek balance on in the search for the number of dimensions which minimize the bias. Default minimum is 1.
`maxnumdims`	numeric argument to specify the maximum number of the left singular vectors of the kernel matrix to seek balance on in the search for the number of dimensions which minimize the bias. For a Gaussian kernel, the default is the minimum between 500 and the number of bases given by `useasbases`. With a linear kernel, the default is the minimum between 500 and the number of columns in `allx`.
`fullSVD`	logical argument for whether the full SVD should be conducted internally. When `FALSE`, the code uses truncated SVD methods from the `RSpectra` package in the interest of improving run time, computing the SVD only up to either 80 percent of the columns of `K` or `maxnumdims` singular vectors, whichever is larger. When the number of columns is less than 80 percent of the number of rows, defaults to full SVD. Default is `FALSE`.
`incrementby`	numeric argument to specify the number of dimensions to increase by from `minnumdims` to `maxnumdims` in each iteration of the search for the number of dimensions which minimizes the bias. Default is 1.
`ebal.maxit`	maximum number of iterations used by `ebalance_custom()` in optimization in the search for weights `w`. Default is `500`.
`ebal.tol`	tolerance level used by `ebalance_custom()`. Default is `1e-6`.
`ebal.convergence`	logical to require ebalance convergence when selecting the optimal `numdims` dimensions of `K` that minimize the bias bound. When constraints are appended to the left singular vectors of `K` via `meanfirst=TRUE` or `constraint`, forced to be `TRUE`, and otherwise `FALSE`.
`maxsearch_b`	optional argument to specify the maximum b in search for maximum variance of `K` in `b_maxvarK()`. Default is `2000`.
`early.stopping`	logical argument indicating whether bias balance optimization should stop twenty rounds after finding a minimum. Default is `TRUE`.
`printprogress`	logical argument to print updates throughout. Default is `TRUE`.
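
The interaction between `cat_data`, the `\sqrt{0.5}` scaling, `b`, and `maxsearch_b` described above can be illustrated with a small base R sketch. It is an assumption-laden illustration, not the package's `b_maxvarK()`: the toy data are hypothetical, and "variance of `K`" is taken here to mean the variance of the off-diagonal entries, which may differ from what the package computes. The toy data contain every combination of two three-level variables so that the maximizer can be checked by hand.

```r
# Toy categorical data (hypothetical): all 9 combinations of two 3-level factors
dat <- expand.grid(party = c("dem", "ind", "rep"),
                   race  = c("a", "b", "c"),
                   stringsAsFactors = TRUE)

# One-hot encode every column: one indicator column per level
one_hot <- do.call(cbind, lapply(names(dat), function(v) {
  f <- factor(dat[[v]])
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- paste0(v, "_", levels(f))
  m
}))

# A mismatch on one variable changes two indicator columns, so the unscaled
# squared distance would count it twice; scaling by sqrt(0.5) makes the squared
# distance equal to the number of variables on which two units differ
Z  <- sqrt(0.5) * one_hot
D2 <- as.matrix(dist(Z))^2

# Variance of the off-diagonal Gaussian kernel entries as a function of b
# (assumption: the diagonal is excluded)
off_diag <- D2[row(D2) != col(D2)]
var_K <- function(b) var(exp(-off_diag / b))

# Search for the b that maximizes this variance, up to maxsearch_b
maxsearch_b <- 2000
opt   <- optimize(var_K, interval = c(1e-4, maxsearch_b), maximum = TRUE)
b_hat <- opt$maximum

# In this balanced toy design the off-diagonal entries are a = exp(-1/b) and
# a^2 in equal numbers, so the variance is proportional to (a - a^2)^2 and
# peaks at a = 1/2, i.e. b = 1 / log(2)
print(c(b_hat = b_hat, analytic = 1 / log(2)))
```

Supplying `b` directly to `kbal()` bypasses this kind of search, which is why the examples that fix `b` run faster than those that leave it `NULL`.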

Value

`w`	a vector of the weights found using entropy balancing on `numdims` dimensions of the SVD of the kernel matrix.
`biasbound_opt`	a numeric giving the minimal bias bound found using `numdims` as the number of dimensions of the SVD of the kernel matrix. When `numdims` is user-specified, the bias bound using this number of dimensions of the kernel matrix.
`biasbound_orig`	a numeric giving the bias bound found when all sampled (control) units have a weight equal to one over the number of sampled (control) units and all target units have a weight equal to one over the number of target units.
`biasbound_ratio`	a numeric giving the ratio of `biasbound_orig` to `biasbound_opt`. Can be informative when comparing the performance of different `b` values.
`dist_record`	a matrix recording the bias bound corresponding to balance on increasing dimensions of the SVD of the kernel matrix starting from `minnumdims` increasing by `incrementby` to `maxnumdims` or until the bias grows to be 1.25 times the minimal bias found.
`numdims`	a numeric giving the optimal number of dimensions of the SVD of the kernel matrix which minimizes the bias bound.
`L1_orig`	a numeric giving the L1 distance found when all sampled (control) units have a weight equal to one over the number of sampled (control) units and all target units have a weight equal to one over the number of target units.
`L1_opt`	a numeric giving the L1 distance at the minimum bias bound found using `numdims` as the number of dimensions of the SVD of the kernel matrix. When `numdims` is user-specified, the L1 distance using this number of dimensions of the kernel matrix.
`K`	the kernel matrix
`onehot_dat`	when categorical data is specified, the resulting one-hot encoded categorical data used in the construction of `K`. When mixed data is specified, returns concatenated one-hot encoded categorical data and scaled continuous data used to construct `K`.
`linkernel`	logical for whether linear kernel was used
`svdK`	a list giving the SVD of the kernel matrix with left singular vectors `svdK$u`, right singular vectors `svdK$v`, and singular values `svdK$d`
`b`	numeric scaling factor used in the calculation of the Gaussian kernel, equivalent to the denominator `2\sigma^2` of the exponent.
`maxvar_K`	the resulting variance of the kernel matrix when `b` is determined internally as the argmax of the variance of `K`
`bases`	numeric vector indicating what bases (rows in `allx`) were used to construct the kernel matrix (columns of `K`)
`truncatedSVD.var`	when truncated SVD methods are used on symmetric kernel matrices, a numeric which gives the proportion of the total variance of `K` captured by the first `maxnumdims` singular values found by the truncated SVD. When the kernel matrix is non-symmetric, this is a worst case approximation of the percent variance explained, assuming the remaining unknown singular values are the same magnitude as the last calculated in the truncated SVD.
`dropped_covariates`	provides a vector of character column names for covariates dropped due to multicollinearity.
`meanfirst_dims`	when `meanfirst=TRUE` the optimal number of the singular vectors of `allx` selected and appended to the front of the left singular vectors of `K`
`meanfirst_cols`	when `meanfirst=TRUE`, the first `meanfirst_dims` left singular vectors of `allx`, which are appended to the front of the left singular vectors of `K` and balanced on
`ebal_error`	when ebalance is unable to find convergent weights, the associated error message it reports

References

Hazlett, C. (2017), "Kernel Balancing: A flexible non-parametric weighting procedure for estimating causal effects." Forthcoming in Statistica Sinica. https://doi.org/10.5705/ss.202017.0555

Examples

#----------------------------------------------------------------
# Example 1: Reweight a control group to a treated to estimate ATT. 
# Benchmark using Lalonde et al.
#----------------------------------------------------------------
#1. Rerun Lalonde example with settings as in Hazlett, C (2017). Statistica Sinica paper:
set.seed(123)
data("lalonde")
# Select a random subset of 500 rows
lalonde_sample <- sample(1:nrow(lalonde), 500, replace = FALSE)
lalonde <- lalonde[lalonde_sample, ]

xvars=c("age","black","educ","hisp","married","re74","re75","nodegr","u74","u75")
 

kbalout.full= kbal(allx=lalonde[,xvars],
                   b=length(xvars),
                   treatment=lalonde$nsw, 
                   fullSVD = TRUE)
summary(lm(re78~nsw, weights=kbalout.full$w, data = lalonde))
 
 
 #2. Lalonde with categorical data only: u74, u75, nodegree, race, married
 cat_vars=c("race_ethnicity","married","nodegr","u74","u75")
 
 kbalout_cat_only = kbal(allx=lalonde[,cat_vars],
                         cat_data = TRUE,
                         treatment=lalonde$nsw,
                         fullSVD = TRUE)
 kbalout_cat_only$b
 summary(lm(re78~nsw, weights=kbalout_cat_only$w, data = lalonde))
 

 #3. Lalonde with mixed categorical and continuous data
 cat_vars=c("race_ethnicity", "married")
 all_vars= c("age","educ","re74","re75","married", "race_ethnicity")
 
 kbalout_mixed = kbal(allx=lalonde[,all_vars],
                      mixed_data = TRUE, 
                      cat_columns = cat_vars,
                      treatment=lalonde$nsw,
                      fullSVD = TRUE)
 kbalout_mixed$b
 summary(lm(re78~nsw, weights=kbalout_mixed$w, data = lalonde))
 
 
#----------------------------------------------------------------
# Example 1B: Reweight a control group to a treated to estimate ATT.
# Benchmark using Lalonde et al. -- but just mean balancing now 
# via "linkernel".
#----------------------------------------------------------------

# Rerun Lalonde example with settings as in Hazlett, C (2017). Statistica Sinica paper:
kbalout.lin= kbal(allx=lalonde[,xvars],
                 b=length(xvars),
                 treatment=lalonde$nsw, 
                 linkernel=TRUE,
                 fullSVD=TRUE)

# Check balance with and without these weights:
dimw(X=lalonde[,xvars], w=kbalout.lin$w, target=lalonde$nsw)

summary(lm(re78~nsw, weights=kbalout.lin$w, data = lalonde))
 
#----------------------------------------------------------------
# Example 2: Reweight a sample to a target population.
#----------------------------------------------------------------
# Suppose a population consists of four groups in equal shares: 
# white republican, non-white republican, white non-republicans, 
# and non-white non-republicans. A given policy happens to be supported 
# by all white republicans, and nobody else. Thus the mean level of 
# support in the population should be 25%. 
#
# Further, the sample is surveyed in such a way that was careful 
# to quota on party and race, obtaining 50% republican and 50% white.
# However, among republicans three-quarters are white and among non-republicans,
# three quarters are non-white. This biases the average level of support
# despite having a sample that matches the population on its marginal distributions.
# We'd like to reweight the sample so it resembles the population not 
# just on the margins, but in the joint distribution of characteristics.

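# Rows are ordered so that the 200 supporters are the white republicans,
# matching the description above.
pop <- data.frame(
republican =  c(rep(1,400), rep(0,400)),
white = c(rep(1,200), rep(0,200), rep(1,200), rep(0,200)),
support = c(rep(1,200), rep(0,600)))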
  
mean(pop$support)  # Target value
 
# Survey sample: correct margins/means, but wrong joint distribution
samp <- data.frame( republican = c(rep(1, 40), rep(0,40)),
   white = c(rep(1,30), rep(0,10), rep(1,10), rep(0,30)),
   support = c(rep(1,30), rep(0,50)))
  
mean(samp$support)  # Appears that support is 37.5% instead of 25%.
 
# Mean Balancing -----------------------------------------
# Sample is already mean-balanced to the population on each 
# characteristic. However, for illustrative purposes, use ebalance_custom().
dat <- rbind(pop,samp)

# Indicate which units are sampled (1) and which are population units(0)
sampled <- c(rep(0,800), rep(1,80))
 
# Run ebal (treatment = population units = 1-sampled)
ebal_out <- ebalance_custom(Treatment = 1-sampled, 
                            X=dat[,1:2],
                            constraint.tolerance=1e-6, 
                            print.level=-1)
 
# We can see everything gets even weights, since already mean balanced.
length(unique(ebal_out$w))

# And we end up with the same estimate we started with
weighted.mean(samp[,3], w = ebal_out$w)
 
# We see that, because the margins are correct, all weights are equal
unique(cbind(samp, e_bal_weight = ebal_out$w))

# Kernel balancing for weighting to a population (i.e. kpop) -------
kbalout = kbal(allx=dat[,1:2],
                useasbases=rep(1,nrow(dat)), 
                sampled = sampled, 
                b = 1,
                sampledinpop = FALSE)
                
# The weights now vary:
plot(kbalout$w[sampled ==1], pch=16)

# And produce correct estimate:
weighted.mean(samp$support, w = kbalout$w[sampled==1])    
 
# kbal correctly downweights white republicans and non-white non-republicans
# and upweights the non-white republicans and white non-republicans
unique(round(cbind(samp[,-3], k_bal_weight = kbalout$w[sampled==1]),6))
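
As a by-hand cross-check of the arithmetic in Example 2, the sketch below computes the weights that would match the joint distribution exactly when the four cells are known in advance: the ratio of each cell's population share to its sample share. This uses base R only and does not call kbal(); the `cells` data frame and its column names are introduced here for illustration, with counts copied from the example above. Kernel balancing aims to recover a similar adjustment without the cells being specified.

```r
# Cell counts from Example 2 (republican x white), with support as described:
# only white republicans support the policy
cells <- data.frame(
  republican = c(1, 1, 0, 0),
  white      = c(1, 0, 1, 0),
  n_pop      = c(200, 200, 200, 200),
  n_samp     = c(30, 10, 10, 30),
  support    = c(1, 0, 0, 0)
)

# Weight per sampled unit: population share of the cell / sample share of the cell
pop_share  <- cells$n_pop / sum(cells$n_pop)
samp_share <- cells$n_samp / sum(cells$n_samp)
cells$w    <- pop_share / samp_share

# Unweighted and reweighted estimates of mean support in the sample
est_raw      <- sum(cells$n_samp * cells$support) / sum(cells$n_samp)
est_weighted <- sum(cells$w * cells$n_samp * cells$support) /
                sum(cells$w * cells$n_samp)

print(cells)
print(c(raw = est_raw, weighted = est_weighted))
```

The over-represented cells (white republicans and non-white non-republicans) receive weights below one and the under-represented cells receive weights above one, the same direction of adjustment noted for the kbal weights above.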

chadhazlett/KBAL documentation built on Sept. 23, 2024, 11:48 a.m.