cem: Coarsened Exact Matching

cem    R Documentation

Coarsened Exact Matching


Implementation of Coarsened Exact Matching


cem(treatment = NULL, data = NULL, datalist = NULL, cutpoints = NULL,
    grouping = NULL, drop = NULL, eval.imbalance = FALSE, k2k = FALSE,
    method = NULL, mpower = 2, L1.breaks = NULL, L1.grouping = NULL,
    verbose = 0, baseline.group = "1", keep.all = FALSE)



treatment: character, name of the treatment variable

baseline.group: character, name of the baseline level of the treatment. See Details.

data: a data.frame

datalist: a list of optional multiply imputed data.frames

cutpoints: named list, each element describing the cutpoints for a numerical variable (the names are variable names). Each list element is either a vector of cutpoints, a number of cutpoints, or a method for automatic bin construction. See Details.

grouping: named list, each element of which is a list of groupings for a single categorical variable. See Details.

drop: a vector of variable names in the data frame to ignore during matching

eval.imbalance: boolean. See Details.

k2k: boolean, restrict to k-to-k matching? Default = FALSE

method: distance method to use in k2k matching. See Details.

mpower: power of the Minkowski distance. See Details.

L1.breaks: list of cutpoints for the calculation of the L1 measure.

L1.grouping: as grouping, but used only in the calculation of the L1 measure, not in matching.

verbose: controls level of verbosity. Default = 0.

keep.all: if FALSE the coarsened dataset is not returned. Default = FALSE


For multilevel (and binary) treatment variables, the cem weights are calculated with respect to the baseline group. Matched units whose treatment variable equals the baseline level receive weight 1; the others receive the usual cem weights. Unless specified otherwise, the baseline is set to "1". If this level is not one of the values taken by the treatment variable, the baseline is set to the first level of the treatment variable.
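The weighting scheme can be illustrated with a small base R sketch. It assumes a binary treatment with baseline "1" and uses the standard CEM weight from the references for the non-baseline group in stratum s, namely (m_C / m_T) * (m_T^s / m_C^s), where m_T and m_C count matched baseline and non-baseline units. The toy vectors strata and treated are hypothetical, and this is not the package's internal code.

```r
# Toy data: stratum label and treatment status for eight units (hypothetical).
strata  <- c("a", "a", "a", "b", "b", "b", "b", "c")
treated <- c(1, 0, 0, 1, 1, 0, 0, 1)

# Counts per stratum and treatment group; columns are named "0" and "1".
tab <- table(strata, treated)

# A stratum is matched only if it contains units from both groups.
matched_strata <- rownames(tab)[tab[, "0"] > 0 & tab[, "1"] > 0]
matched <- strata %in% matched_strata

m_T <- sum(treated[matched] == 1)  # matched baseline ("1") units
m_C <- sum(treated[matched] == 0)  # matched non-baseline units

w <- numeric(length(strata))       # unmatched units keep weight 0
w[matched & treated == 1] <- 1     # baseline units get weight 1
ctl <- matched & treated == 0
w[ctl] <- (m_C / m_T) * tab[strata[ctl], "1"] / tab[strata[ctl], "0"]
```

With these weights the non-baseline weights sum to m_C, so each stratum is reweighted to the baseline group's distribution across strata.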

When specifying cutpoints, several automatic methods may be chosen, including “sturges” (Sturges' rule, the default), “fd” (Freedman-Diaconis' rule), “scott” (Scott's rule) and “ss” (Shimazaki-Shinomoto's rule). See references for a description of each rule.
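To get a feel for how the rules differ, base R exposes three of them as bin-count functions in grDevices (nclass.Sturges, nclass.FD, nclass.scott); base R has no counterpart for "ss". The sketch below applies them to a simulated covariate x, which is hypothetical. Whether cem derives its breaks in exactly this way is an assumption.

```r
# Illustration only: base R bin-count rules applied to a simulated covariate.
set.seed(1)
x <- rnorm(200, mean = 40, sd = 10)  # hypothetical numeric covariate

n_bins <- c(
  sturges = nclass.Sturges(x),  # ceiling(log2(n) + 1)
  fd      = nclass.FD(x),       # bin width based on the interquartile range
  scott   = nclass.scott(x)     # bin width based on the standard deviation
)

# Turn a bin count into equally spaced cutpoints over the range of x.
sturges_cuts <- seq(min(x), max(x), length.out = n_bins[["sturges"]] + 1)
```

A vector such as sturges_cuts has the same form as a user-supplied element of cutpoints.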

The grouping option is a list in which each element is itself a list. For example, suppose variable quest1 has the levels "no answer", NA, "negative", "neutral", "positive" and you want to collect ("no answer", NA, "neutral") into a single group. The grouping argument should then contain list(quest1=list(c("no answer", NA, "neutral"))). Similarly, for a discrete variable elements with values 1:10 that you want to collect into the groups “1:3,NA”, “4”, “5:9”, “10”, specify grouping as list(elements=list(c(1:3,NA), 5:9)). Values not mentioned in the grouping are left as they are. If cutpoints and groupings are defined for the same variable, the groupings take precedence and the corresponding cutpoints are set to NULL.
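The effect of a grouping on a single variable can be sketched in base R. The helper collapse_groups below is hypothetical and is not part of cem; it only mimics the described behaviour of merging listed values (including NA) into one category while leaving other values untouched.

```r
# Hypothetical survey variable with the levels described above.
quest1 <- c("no answer", NA, "negative", "neutral", "positive", "neutral")

# Illustrative helper (not part of cem): relabel every value in each group
# with one shared label and leave all other values unchanged.
collapse_groups <- function(x, groups) {
  out <- as.character(x)
  for (g in groups) {
    label <- paste(ifelse(is.na(g), "NA", g), collapse = "/")
    out[x %in% g] <- label  # %in% also matches NA when NA is listed in g
  }
  out
}

grouped <- collapse_groups(quest1, list(c("no answer", NA, "neutral")))
```

Here "negative" and "positive" stay as they are, and the other three values share a single label.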

verbose: a number greater than or equal to 0. Higher values print more information while the algorithm runs.

If eval.imbalance = TRUE, cem$imbalance contains the imbalance measure by absolute difference in means for numerical variables and chi-square distance for categorical variables. If FALSE (the default) then cem$imbalance is set to NULL. If data contains missing data, the imbalance measures are not calculated.

If L1.breaks is missing, cutpoints are calculated with Scott's rule by default.
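As a rough illustration of what the L1 measure captures, the sketch below computes a univariate version in base R: half the sum of absolute differences between the treated and control relative frequencies over common bins, with the number of bins taken from Scott's rule. The simulated vectors x_t and x_c are hypothetical, and the package's own computation (which is multivariate) may differ in detail.

```r
set.seed(7)
x_t <- rnorm(100, mean = 1)  # hypothetical treated covariate values
x_c <- rnorm(150, mean = 0)  # hypothetical control covariate values

# Common cutpoints over the pooled range; bin count from Scott's rule.
pooled <- c(x_t, x_c)
breaks <- seq(min(pooled), max(pooled),
              length.out = nclass.scott(pooled) + 1)

# Relative frequencies per bin in each group (empty bins are kept as zeros).
f <- table(cut(x_t, breaks, include.lowest = TRUE)) / length(x_t)
g <- table(cut(x_c, breaks, include.lowest = TRUE)) / length(x_c)

# 0 means identical histograms, 1 means no overlap at all.
L1 <- sum(abs(f - g)) / 2
```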

If k2k is set to TRUE, the algorithm returns strata with the same number of treated and control units per stratum; otherwise all matched units are returned (the default). When k2k = TRUE, the user can choose a method ('euclidean', 'maximum', 'manhattan', 'canberra', 'binary' or 'minkowski') for nearest neighbor matching inside each cem stratum. By default method is NULL, which means random matching inside cem strata. For the Minkowski distance, the power can be specified via the argument mpower. For more information on method != NULL, refer to the dist help page. If k2k is set to TRUE, keep.all is also set to TRUE.
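The random variant (method = NULL) can be sketched in base R as follows: within each stratum, keep as many units from each group as the smaller group contains, chosen at random. The toy vectors are hypothetical, and this is only an outline of the idea, not the package's implementation.

```r
set.seed(42)
strata  <- c("a", "a", "a", "b", "b", "b", "b", "b")  # hypothetical strata
treated <- c(1, 0, 0, 1, 1, 1, 0, 0)

keep <- logical(length(strata))
for (s in unique(strata)) {
  idx_t <- which(strata == s & treated == 1)
  idx_c <- which(strata == s & treated == 0)
  k <- min(length(idx_t), length(idx_c))
  # Sample positions rather than values so length-one index vectors are safe.
  keep[idx_t[sample.int(length(idx_t), k)]] <- TRUE
  keep[idx_c[sample.int(length(idx_c), k)]] <- TRUE
}

# Each stratum now has equal numbers of treated and control units.
k2k_tab <- table(strata[keep], treated[keep])
```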

By default, cem treats missing values as a distinct category: observations with a missing value in the same variable are matched in the same stratum, provided that all the remaining (coarsened) covariates match.
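A minimal base R sketch of this behaviour: coarsen each covariate with cut, give missing values their own label, and define a stratum as the combination of coarsened values. The variables age and educ, their cutpoints, and the "<missing>" label are all hypothetical.

```r
age  <- c(23, 27, NA, 41, 44, NA)   # hypothetical covariates
educ <- c(12, 12, 12, 16, 16, 12)

# Coarsen into bins, then treat NA as a category of its own.
age_c  <- as.character(cut(age,  breaks = c(20, 30, 40, 50)))
educ_c <- as.character(cut(educ, breaks = c(0, 12, 20)))
age_c[is.na(age_c)]   <- "<missing>"
educ_c[is.na(educ_c)] <- "<missing>"

# A stratum is one combination of coarsened covariate values.
stratum <- paste(age_c, educ_c, sep = " | ")
```

Units 3 and 6 both lack age and share the same coarsened educ, so they land in the same stratum.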

If argument data is non-NULL and datalist is NULL, CEM is applied to the single data set in data.

Argument datalist is a list of (multiply imputed) data frames (i.e., with missing cell values imputed). If data is NULL, the function cem is applied independently to each element of the list, resulting in separately matched data sets with different numbers of treated and control units.

When data and datalist are both non-NULL, each multiply imputed observation is assigned to the stratum in which it has been matched most frequently. In this case, the algorithm outputs the same matching solution for each multiply imputed data set (i.e., an observation, and the number of treated and control units matched, has the same meaning in every data set and is the same for all of them).
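The "most frequent stratum" rule amounts to taking a row-wise mode over the imputed data sets. The sketch below uses a hypothetical matrix of stratum labels; how the package breaks ties is not stated above, so the tie handling here (first label in sorted order) is an assumption.

```r
# Rows are observations, columns are imputed data sets (hypothetical labels).
strata_by_imp <- cbind(
  imp1 = c("s1", "s2", "s3"),
  imp2 = c("s1", "s2", "s4"),
  imp3 = c("s1", "s5", "s4")
)

# Most frequent label in a vector; ties go to the first label in sorted order.
most_frequent <- function(v) names(which.max(table(v)))

# One stratum per observation, shared by all imputed data sets.
final <- apply(strata_by_imp, 1, most_frequent)
```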


Returns an object of class cem.match when only data is non-NULL. Otherwise returns an object of class cem.match.list: a list of cem.match objects plus a field called unique, which is TRUE only if data and datalist are not both NULL. A cem.match object is a list with the following slots:


the call


vector giving the number of the stratum to which each observation belongs; NA if the observation has not been matched


number of strata generated


names of the variables used for the match


variables removed from the match


the coarsened dataset or NULL if keep.all=FALSE


named list of cutpoints, possibly NULL


name of the treatment variable


factor, each observation belongs to one group generated by the treatment variable


number of groups identified by the treatment variable


named list, index of observations belonging to each group


sizes of groups


summary table of matched units by group


NULL or a vector of imbalances. See Details.


Stefano Iacus, Gary King, and Giuseppe Porro


Iacus, King, Porro (2011) doi: 10.1198/jasa.2011.tm09599

Iacus, King, Porro (2012) doi: 10.1093/pan/mpr013

Iacus, King, Porro (2019) doi: 10.1017/pan.2018.29

Shimazaki, Shinomoto (2007) doi: 10.1162/neco.2007.19.6.1503



# load the package and the LL example data set
library(cem)
data(LL)

todrop <- c("treated","re78")
imbalance(LL$treated, LL, drop=todrop)

# cem match: automatic bin choice
mat <- cem(treatment="treated", data=LL, drop="re78")

# cem match: user-chosen coarsening
re74cut <- hist(LL$re74, breaks=seq(0, max(LL$re74)+1000, by=1000), plot=FALSE)$breaks
re75cut <- hist(LL$re75, breaks=seq(0, max(LL$re75)+1000, by=1000), plot=FALSE)$breaks
agecut <- hist(LL$age, breaks=seq(15, 55, length.out=14), plot=FALSE)$breaks
mycp <- list(re75=re75cut, re74=re74cut, age=agecut)
mat <- cem(treatment="treated", data=LL, drop="re78", cutpoints=mycp)

# cem match: user-chosen coarsening, k-to-k matching
mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp,k2k=TRUE)

# Mahalanobis matching: we use MatchIt
library(MatchIt)
mah <- matchit(treated~age+education+re74+re75+black+hispanic+nodegree+married+u74+u75,
   distance="mahalanobis", data=LL)
imbalance(LL$treated, LL, drop=todrop, weights=mah$weights)

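# Multiply imputed data
# making use of Amelia for multiple imputation
 library(Amelia)
 n <- dim(LL)[1]
 k <- dim(LL)[2]

# set one randomly chosen cell (never column 1) to NA in 30% of the rows
 LL1 <- LL
 idx <- sample(1:n, .3*n)
 for(i in idx){
  LL1[i,sample(2:k,1)] <- NA
 }

# the noms vector is cut off in the source; only the names visible there are listed
 imputed <- amelia(LL1,noms=c("black","hispanic","treated","married"))
 imputed <- imputed$imputations[1:5]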
# without information on which observation has missing values
 mat1 <- cem("treated", datalist=imputed, drop="re78")

# ATT estimation
 out <- att(mat1, re78 ~ treated, data=imputed)

# with information about missingness
 mat2 <- cem("treated", datalist=imputed, drop="re78", data=LL1)

# ATT estimation
 out <- att(mat2, re78 ~ treated, data=imputed)

cem documentation built on Sept. 8, 2022, 5:09 p.m.