Nothing
#' Polymatching
#'
#' \code{polymatch} generates matched samples in designs with up to 10 groups.
#'
#' @param formulaMatch Formula with form \code{group ~ x_1 + ... + x_p}, where \code{group} is the name of the variable
#' identifying the treatment groups/exposures and \code{x_1},...,\code{x_p} are the matching variables.
#' @param start An object specifying the starting point of the iterative algorithm. Three types of input are accepted:
#' \itemize{
#' \item \code{start="small.to.large"} (default): the starting matched set is generated by matching groups
#' from the smallest to the largest.
#' \item Users can specify the order to be used to match groups for the starting sample.
#' For example, if there are four groups with labels "A","B","C" and "D", \code{start="D-B-A-C"} generates the starting sample
#' by matching groups "D" and "B", then units from "A" to the "D"-"B"pairs, then units from "C" to the "D"-"B"-"A" triplets.
#' \item Users can provide the starting matched set and the algorithm will explore possible reductions in the total
#' distance. In this case, \code{start} must be a vector with the IDs of the matched sets, i.e.,
#' a vector with length equal to the number of rows of \code{data} where
#' matched subjects are flagged with the same value and non-matched subjects have value \code{NA}.
#' }
#' @param data The \code{data.frame} object with the data.
#' @param distance String specifying whether the distance between pairs of observations should be computed with the
#' Euclidean (\code{"euclidean"}, default) or Mahalanobis (\code{"mahalanobis"}) distance. See section 'Details' for further information.
#' @param exactMatch Formula with form \code{~ z_1 + ... + z_k}, where \code{z_1},...,\code{z_k} must
#' be factor variables. Subjects are exactly matched on \code{z_1},...,\code{z_k}, i.e., matched
#' within levels of these variables.
#' @param vectorK A named vector with the number of subjects from each group in each matched set. The names of the vector must be
#' the labels of the groups, i.e., the levels of the variable identifying the treatment groups/exposures.
#' For example, in case of four groups with labels "A","B","C" and "D" and assuming that the desired design is 1:2:3:3
#' (1 subject from A, 2 from B, 3 from C and 3 from D in each matched set), the parameter should be set to
#' \code{vectorK = c("A" = 1, "B" = 2, "C" = 3, "D" = 3)}. By default, the generated matched design includes 1 subject per group in each
#' matched set, i.e, a 1:1: ... :1 matched design.
#' @param iterate Boolean specifying whether iterations should be done (\code{iterate=TRUE}, default) or not (\code{iterate=FALSE}).
#' @param niter_max Maximum number of iterations. Default is 50.
#' @param withinGroupDist Boolean specifying whether the distances within the same treatment/exposure group should be considered in the
#' total distance. For example, in a 1:2:3 matched design among the groups A, B and C, the parameters controls whether the distance
#' between the two subjects in B and the three pairwise distances among the subjects in C should be counted in the total distance.
#' The default value is \code{TRUE}.
#' @param verbose Boolean: should text be printed in the console? Default is \code{TRUE}.
#' @return A list containing the following components:
#' \describe{
#' \item{match_id}{A numeric vector identifying the matched sets---matched units have the same identifier.}
#' \item{total_distance}{Total distance of the returned matched sample.}
#' \item{total_distance_start}{Total distance at the starting point.}
#' }
#'
#' @details The function implements the conditionally optimal matching algorithm, which iteratively uses
#' two-group optimal matching steps to generate matched samples with small total distance. In the current implementation,
#' it is possible to generate matched samples with multiple subjects per group, with the matching ratio being
#' specified by the \code{vectorK} parameter.
#'
#' The steps of the algorithm are described with the following example. Consider a 4-group design with
#' groups labels "A", "B", "C" and "D" and a 1:1:1:1 matching ratio. The algorithm requires a set of quadruplets as starting point.
#' The argument \code{start} defines the approach to be used to
#' generate such a starting point. \code{polymatch} generates the starting point by sequentially using optimal two-group matching.
#' In the default setting (\code{start="small.to.large"}), the steps are:
#' \enumerate{
#' \item optimally match the two smallest groups;
#' \item optimally match the third smallest group to the pairs generated in the first step;
#' \item optimally match the last group to the triplets generated in the second step.
#' }
#' Notably, we can use the optimal two-group algorithm in steps 2) and 3) because they are
#' two-dimensional problems: the elements of one group on one hand, fixed matched sets on the other hand. The order of the
#' groups to be considered when generating the starting point can be user-specified (e.g., \code{start="D-B-A-C"}).
#' In alternative, the user can provide a matched set that will be used as starting point.
#'
#' Given the starting matched set, the algorithm iteratively explores possible reductions in the total distance (if \code{iterate="TRUE"}),
#' by sequentially relaxing the connection to each group and rematching units of that group. In our example:
#' \enumerate{
#' \item rematch "B-C-D" triplets within the starting quadruplets to units in group "A";
#' \item rematch "A-C-D" triplets within the starting quadruplets to units in group "B";
#' \item rematch "A-B-D" triplets within the starting quadruplets to units in group "C";
#' \item rematch "A-B-C" triplets within the starting quadruplets to units in group "D".
#' }
#' If none of the sets of quadruplets generated in 1)-4) has smaller total distance than the starting point, the algorihm stops.
#' Otherwise, the set of quadruplets with smallest distance is seleceted and the process iterated, until no reduction in the total
#' distance is found or the number of maximum iterations is reached (\code{niter_max=50} by default).
#'
#' The total distance is defined as the sum of all the within-matched-set distances. The within-matched-set distance is defined as the
#' sum of the pairwise distances between pairs of units in the matched set. The type of distance is specified with the \code{distance}
#' argument. The current implementation supports Euclidean (\code{distance="euclidean"}) and Mahalanobis (\code{distance="mahalanobis"})
#' distances. In particular, for the Mahalanobis distance, the covariance matrix is defined only once on the full dataset.
#'
#' @seealso \code{\link{balance}} and \code{\link{plotBalance}} to summarize the
#' balance in the covariates.
#'
#' @examples
#' #Generate a datasets with group indicator and four variables:
#' #- var1, continuous, sampled from normal distributions;
#' #- var2, continuous, sampled from beta distributions;
#' #- var3, categorical with 4 levels;
#' #- var4, binary.
#' set.seed(1234567)
#' dat <- data.frame(group = c(rep("A",10),rep("B",20),rep("C",30)),
#' var1 = c(rnorm(10,mean=0,sd=1),
#' rnorm(20,mean=1,sd=2),
#' rnorm(30,mean=-1,sd=2)),
#' var2 = c(rbeta(10,shape1=1,shape2=1),
#' rbeta(20,shape1=2,shape2=1),
#' rbeta(30,shape1=1,shape2=2)),
#' var3 = factor(c(rbinom(10,size=3,prob=.4),
#' rbinom(20,size=3,prob=.5),
#' rbinom(30,size=3,prob=.3))),
#' var4 = factor(c(rbinom(10,size=1,prob=.5),
#' rbinom(20,size=1,prob=.3),
#' rbinom(30,size=1,prob=.7))))
#'
#' #Match on propensity score
#' #-------------------------
#'
#' #With multiple groups, need a multinomial model for the PS
#' library(VGAM)
#' psModel <- vglm(group ~ var1 + var2 + var3 + var4,
#' family=multinomial, data=dat)
#' #Estimated logits - 2 for each unit: log(P(group=A)/P(group=C)), log(P(group=B)/P(group=C))
#' logitPS <- predict(psModel, type = "link")
#' dat$logit_AvsC <- logitPS[,1]
#' dat$logit_BvsC <- logitPS[,2]
#'
#' #Match on logits of PS
#' resultPs <- polymatch(group ~ logit_AvsC + logit_BvsC, data = dat,
#' distance = "euclidean")
#' dat$match_id_ps <- resultPs$match_id
#'
#'
#' #Match on covariates
#' #--------------------
#'
#'
#' #Match on continuous covariates with exact match on categorical/binary variables
#' resultCov <- polymatch(group ~ var1 + var2, data = dat,
#' distance = "mahalanobis",
#' exactMatch = ~var3+var4)
#' dat$match_id_cov <- resultCov$match_id
#'
#' @export
polymatch <- function(formulaMatch, start = "small.to.large", data, distance = "euclidean", exactMatch = NULL, vectorK = NULL, iterate = TRUE, niter_max = 50, withinGroupDist = TRUE, verbose = TRUE) {
# #Debug/devel:
# #------------
# source("develFunction_generateData.R")
# formulaMatch <- (group~variable)
# data <- generateData(c(50,10,150,15,12,50,150))
# distance = "euclidean"
# start = "small.to.large"
# iterate = TRUE
# niter_max = 50
# verbose = TRUE
# exactMatch = NULL
# vectorK = c('1' = 2, '2' = 1, '3' = 4, '4' = 1, '5' = 1, '6' = 2, '7' = 4)
#Check types of inputs
checkInputs(formulaMatch, start, data, distance, exactMatch, iterate, niter_max, verbose, vectorK)
#Check coherence of data
resultCheckData <- checkData(formulaMatch, start, data, exactMatch, vectorK)
varGroup <- resultCheckData$varGroup
varsMatch <- resultCheckData$varsMatch
vectorSchemeStart <- resultCheckData$vectorSchemeStart
varsExactMatch <- resultCheckData$varsExactMatch
vectorK <- resultCheckData$vectorK
#Select variables of interest
data <- data[,c(varGroup,varsMatch,varsExactMatch)]
#Make grouping variable as character
data[,varGroup] <- as.character(data[,varGroup])
#Number of groups
numGroups <- length(table(data[,varGroup]))
#Factors to standardize between-group distances (if matching 1:K in any group)
#For example: if matching groups A-B-C with ratios 1:1:K,
# there are K distances between A and C and only 1 distance between A and B,
# so when computing the total distance, the distance between A and B would
# end up being less important. So, we divide each distance between two groups by the
# number of possible distances between those two groups.
dat_stdzDistances <- stdzDistances(vectorK, withinGroupDist)
#Define an id to sort observations in order provided
data$idUnits <- 1:nrow(data)
if(distance == "mahalanobis") {
Sigma <- stats::cov(as.matrix(data[,varsMatch]))
} else {
Sigma <- NULL
}
# Say some stuff
if(verbose==TRUE) {
cat("Conditional optimal matching algorithm\n")
cat("Number of observations: ",nrow(data),"\n")
cat("Number of groups: ",numGroups,"\n")
#cat("\n")
}
#First: generate and evaluate starting point
#-------------------------------------------
#IF: starting matched dataset provided
if( is.null(vectorSchemeStart)) {
data$match_id <- start
#Evaluate total distance at starting condition
resultEvaluation <- evaluateMatching(data, "match_id", varsMatch,
distance, Sigma)
total_distance_start <- resultEvaluation$total_distance
tabGroup <- table(data[,varGroup])
vectorSchemeIter <- names(sort(tabGroup))
#ELSE: starting matched dataset must be constructed
} else {
#First step: select units in first group
data1 <- data[data[,varGroup] %in% vectorSchemeStart[1], ]
#Each unit is a matched set with 1 element
data1$indexMatch1 <- NA
data1$indexMatch1[data1[,varGroup] %in% vectorSchemeStart[1]] <- 1:nrow(data1)
for(i in 2:numGroups) {
#Next step: select units from the next group
data2 <- data[data[,varGroup] %in% vectorSchemeStart[i], ]
#Each unit is a matched set with 1 element
data2$indexMatch2 <- NA
data2$indexMatch2[data2[,varGroup] %in% vectorSchemeStart[i]] <- 1:nrow(data2)
#Before combining the dataset, variables need to be the same
data2$indexMatch1 <- NA
data1$indexMatch2 <- NA
#Append data of new group to previous data
dataStep <- rbind(data1, data2)
#Optimally match unit of data2 to matched sets in data1
resultStep <- condOptMatching(data = dataStep,
varIndexMatch1 = "indexMatch1",
varIndexMatch2 = "indexMatch2",
varsMatch = varsMatch,
varGroup = varGroup,
distance = distance,
Sigma = Sigma,
varsExactMatch = varsExactMatch,
k = as.numeric(vectorK[vectorSchemeStart[i]]),
dat_stdzDistances = dat_stdzDistances)
#Erase ids of previous matching step
dataStep$indexMatch1 <- dataStep$indexMatch2 <- NULL
#Id of new matching step.
dataStep$indexMatch1 <- resultStep$match_id
data1 <- dataStep
}
data <- dataStep
data$match_id <- data$indexMatch1
data$indexMatch1 <- NULL
total_distance_start <- resultStep$total_distance
vectorSchemeIter <- vectorSchemeStart
}
# Say some stuff
if(verbose==TRUE) {
cat("Total distance of starting matched sample: ", sprintf("%.3f",total_distance_start),"\n")
#cat("\n")
}
#Second: iterations
#-------------------
best_total_distance <- total_distance_start
best_match_id <- data$match_id
if(iterate) {
for(niter in 1:niter_max) {
data$match_id <- best_match_id
improvementInIteration <- F
for(iGroupStepIter2 in 1:length(vectorSchemeIter)) {
groupStepIter2 <- vectorSchemeIter[iGroupStepIter2]
groupsStepIter1 <- setdiff(vectorSchemeIter, groupStepIter2)
#Relax connection to groupStepIter2
data$indexMatchIter1 <- data$match_id
data$indexMatchIter1[data[,varGroup] %in% groupStepIter2] <- NA
data$indexMatchIter2 <- NA
data$indexMatchIter2[data[,varGroup] %in% groupStepIter2] <- 1:sum(data[,varGroup] %in% groupStepIter2)
#Rematch groupsStepIter1 to groupStepIter2
resultIter <- condOptMatching(data = data,
varIndexMatch1 = "indexMatchIter1",
varIndexMatch2 = "indexMatchIter2",
varsMatch = varsMatch,
varGroup = varGroup,
distance = distance,
Sigma = Sigma,
varsExactMatch = varsExactMatch,
k = as.numeric(vectorK[groupStepIter2]),
dat_stdzDistances = dat_stdzDistances)
if(resultIter$total_distance < best_total_distance) {
best_match_id <- resultIter$match_id
best_total_distance <- resultIter$total_distance
improvementInIteration <- TRUE
}
}
if(improvementInIteration == F) {
break;
}
# Say some stuff
if(verbose==TRUE) {
cat("Ended iteration ", niter, " - total distance:", sprintf("%.3f",best_total_distance),"\n")
#cat("\n")
}
}
} else {
total_distance <- total_distance_start
niter <- 1
}
if(verbose == TRUE) {
cat("End \n")
cat(paste0("Number of iterations: ", niter,", total distance:", sprintf("%.3f",best_total_distance),"\n"))
cat("Number of matched sets: ", length(unique(best_match_id[!is.na(best_match_id)])),"\n")
}
if(iterate == TRUE & niter>=niter_max) {
warning("The algorithm reached the maximum number of iterations--you can increase it with the argument 'niter_max'")
}
dataToOutput <- data[order(data$idUnits), c(varGroup,varsMatch)]
return(list(match_id = best_match_id[order(data$idUnits)],
total_distance = best_total_distance,
niter = niter,
total_distance_start = total_distance_start))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.