R/polymatch.R
In polymatching: A Matching Algorithm for Designs with Multiple Groups

Documented in polymatch

#' Polymatching
#'
#' \code{polymatch} generates matched samples in designs with up to 10 groups.
#'
#' @param formulaMatch Formula with form \code{group ~ x_1 + ... + x_p}, where \code{group} is the name of the variable
#' identifying the treatment groups/exposures and \code{x_1},...,\code{x_p} are the matching variables.
#' @param start An object specifying the starting point of the iterative algorithm. Three types of input are accepted:
#' \itemize{
#'   \item \code{start="small.to.large"} (default): the starting matched set is generated by matching groups
#' from the smallest to the largest.
#'   \item Users can specify the order to be used to match groups for the starting sample.
#' For example, if there are four groups with labels "A","B","C" and "D", \code{start="D-B-A-C"} generates the starting sample
#' by matching groups "D" and "B", then units from "A" to the "D"-"B"pairs, then units from "C" to the "D"-"B"-"A" triplets.
#'   \item Users can provide the starting matched set and the algorithm will explore possible reductions in the total
#' distance. In this case, \code{start} must be a vector with the IDs of the matched sets, i.e.,
#' a vector with length equal to the number of rows of \code{data} where
#' matched subjects are flagged with the same value and non-matched subjects have value \code{NA}.
#' }
#' @param data The \code{data.frame} object with the data.
#' @param distance String specifying whether the distance between pairs of observations should be computed with the
#' Euclidean (\code{"euclidean"}, default) or Mahalanobis (\code{"mahalanobis"}) distance. See section 'Details' for further information.
#' @param exactMatch Formula with form \code{~ z_1 + ... + z_k}, where \code{z_1},...,\code{z_k} must
#' be factor variables. Subjects are exactly matched on \code{z_1},...,\code{z_k}, i.e., matched
#' within levels of these variables.
#' @param vectorK A named vector with the number of subjects from each group in each matched set. The names of the vector must be 
#' the labels of the groups, i.e., the levels of the variable identifying the treatment groups/exposures. 
#' For example, in case of four groups with labels "A","B","C" and "D" and assuming that the desired design is 1:2:3:3 
#' (1 subject from A, 2 from B, 3 from C and 3 from D in each matched set), the parameter should be set to
#' \code{vectorK =  c("A" = 1, "B" = 2, "C" = 3, "D" = 3)}. By default, the generated matched design includes 1 subject per group in each
#' matched set, i.e, a 1:1: ... :1 matched design.
#' @param iterate Boolean specifying whether iterations should be done (\code{iterate=TRUE}, default) or not (\code{iterate=FALSE}).
#' @param niter_max Maximum number of iterations. Default is 50.
#' @param withinGroupDist Boolean specifying whether the distances within the same treatment/exposure group should be considered in the 
#' total distance. For example, in a 1:2:3 matched design among the groups A, B and C, the parameters controls whether the distance 
#' between the two subjects in B and the three pairwise distances among the subjects in C should be counted in the total distance. 
#' The default value is \code{TRUE}.
#' @param verbose Boolean: should text be printed in the console? Default is \code{TRUE}.

#' @return A list containing the following components:
#' \describe{
#'   \item{match_id}{A numeric vector identifying the matched sets---matched units have the same identifier.}
#'   \item{total_distance}{Total distance of the returned matched sample.}
#'   \item{total_distance_start}{Total distance at the starting point.}
#' }
#'
#' @details The function implements the conditionally optimal matching algorithm, which iteratively uses
#' two-group optimal matching steps to generate matched samples with small total distance. In the current implementation,
#' it is possible to generate matched samples with multiple subjects per group, with the matching ratio being 
#' specified by the \code{vectorK} parameter.
#'
#' The steps of the algorithm are described with the following example. Consider a 4-group design with
#' groups labels "A", "B", "C" and "D" and a 1:1:1:1 matching ratio. The algorithm requires a set of quadruplets as starting point. 
#' The argument \code{start} defines the approach to be used to
#' generate such a starting point. \code{polymatch} generates the starting point by sequentially using optimal two-group matching.
#' In the default setting (\code{start="small.to.large"}), the steps are:
#' \enumerate{
#'  \item optimally match the two smallest groups;
#'  \item optimally match the third smallest group to the pairs generated in the first step;
#'  \item optimally match the last group to the triplets generated in the second step.
#' }
#' Notably, we can use the optimal two-group algorithm in steps 2) and 3) because they are
#' two-dimensional problems: the elements of one group on one hand, fixed matched sets on the other hand. The order of the
#' groups to be considered when generating the starting point can be user-specified (e.g., \code{start="D-B-A-C"}).
#' In alternative, the user can provide a matched set that will be used as starting point.
#'
#' Given the starting matched set, the algorithm iteratively explores possible reductions in the total distance (if \code{iterate="TRUE"}),
#' by sequentially relaxing the connection to each group and rematching units of that group. In our example:
#' \enumerate{
#'  \item rematch "B-C-D" triplets within the starting quadruplets to units in group "A";
#'  \item rematch "A-C-D" triplets within the starting quadruplets to units in group "B";
#'  \item rematch "A-B-D" triplets within the starting quadruplets to units in group "C";
#'  \item rematch "A-B-C" triplets within the starting quadruplets to units in group "D".
#' }
#' If none of the sets of quadruplets generated in 1)-4) has smaller total distance than the starting point, the algorihm stops.
#' Otherwise, the set of quadruplets with smallest distance is seleceted and the process iterated, until no reduction in the total
#' distance is found or the number of maximum iterations is reached (\code{niter_max=50} by default).
#'
#' The total distance is defined as the sum of all the within-matched-set distances. The within-matched-set distance is defined as the
#' sum of the pairwise distances between pairs of units in the matched set. The type of distance is specified with the \code{distance}
#' argument. The current implementation supports Euclidean (\code{distance="euclidean"}) and Mahalanobis (\code{distance="mahalanobis"})
#' distances. In particular, for the Mahalanobis distance, the covariance matrix is defined only once on the full dataset.
#'
#' @seealso \code{\link{balance}} and \code{\link{plotBalance}} to summarize the
#' balance in the covariates.
#'
#' @examples
#' #Generate a datasets with group indicator and four variables:
#' #- var1, continuous, sampled from normal distributions;
#' #- var2, continuous, sampled from beta distributions;
#' #- var3, categorical with 4 levels;
#' #- var4, binary.
#' set.seed(1234567)
#' dat <- data.frame(group = c(rep("A",10),rep("B",20),rep("C",30)),
#'                var1 = c(rnorm(10,mean=0,sd=1),
#'                         rnorm(20,mean=1,sd=2),
#'                         rnorm(30,mean=-1,sd=2)),
#'                var2 = c(rbeta(10,shape1=1,shape2=1),
#'                         rbeta(20,shape1=2,shape2=1),
#'                         rbeta(30,shape1=1,shape2=2)),
#'                var3 = factor(c(rbinom(10,size=3,prob=.4),
#'                                rbinom(20,size=3,prob=.5),
#'                                rbinom(30,size=3,prob=.3))),
#'                var4 = factor(c(rbinom(10,size=1,prob=.5),
#'                                rbinom(20,size=1,prob=.3),
#'                                rbinom(30,size=1,prob=.7))))
#'
#' #Match on propensity score
#' #-------------------------
#'
#' #With multiple groups, need a multinomial model for the PS
#' library(VGAM)
#' psModel <- vglm(group ~ var1 + var2 + var3 + var4,
#'                 family=multinomial, data=dat)
#' #Estimated logits - 2 for each unit: log(P(group=A)/P(group=C)), log(P(group=B)/P(group=C))
#' logitPS <- predict(psModel, type = "link")
#' dat$logit_AvsC <- logitPS[,1]
#' dat$logit_BvsC <- logitPS[,2]
#'
#' #Match on logits of PS
#' resultPs <- polymatch(group ~ logit_AvsC + logit_BvsC, data = dat,
#'                     distance = "euclidean")
#' dat$match_id_ps <- resultPs$match_id
#'
#'
#' #Match on covariates
#' #--------------------
#'
#'
#' #Match on continuous covariates with exact match on categorical/binary variables
#' resultCov <- polymatch(group ~ var1 + var2, data = dat,
#'                         distance = "mahalanobis",
#'                         exactMatch = ~var3+var4)
#' dat$match_id_cov <- resultCov$match_id
#'
#' @export
polymatch <- function(formulaMatch, start = "small.to.large", data, distance = "euclidean", exactMatch = NULL, vectorK = NULL, iterate = TRUE, niter_max = 50, withinGroupDist = TRUE, verbose = TRUE) {

  # #Debug/devel:
  # #------------
  # source("develFunction_generateData.R")
  # formulaMatch <- (group~variable)
  # data <- generateData(c(50,10,150,15,12,50,150))
  # distance = "euclidean"
  # start = "small.to.large"
  # iterate = TRUE
  # niter_max = 50
  # verbose = TRUE
  # exactMatch = NULL
  # vectorK = c('1' = 2, '2' = 1, '3' = 4, '4' = 1, '5' = 1, '6' = 2, '7' = 4)

  #Check types of inputs
  checkInputs(formulaMatch, start, data, distance, exactMatch, iterate, niter_max, verbose, vectorK)

  #Check coherence of data
  resultCheckData <- checkData(formulaMatch, start, data, exactMatch, vectorK)
  varGroup <- resultCheckData$varGroup
  varsMatch <- resultCheckData$varsMatch
  vectorSchemeStart <- resultCheckData$vectorSchemeStart
  varsExactMatch <- resultCheckData$varsExactMatch
  vectorK <- resultCheckData$vectorK

  #Select variables of interest
  data <- data[,c(varGroup,varsMatch,varsExactMatch)]

  #Make grouping variable as character
  data[,varGroup] <- as.character(data[,varGroup])

  #Number of groups
  numGroups <- length(table(data[,varGroup]))
  
  #Factors to standardize between-group distances (if matching 1:K in any group)
  #For example: if matching groups A-B-C with ratios 1:1:K, 
  # there are K distances between A and C and only 1 distance between A and B,
  # so when computing the total distance, the distance between A and B would
  # end up being less important. So, we divide each distance between two groups by the 
  # number of possible distances between those two groups. 
  dat_stdzDistances <- stdzDistances(vectorK, withinGroupDist)

  #Define an id to sort observations in order provided
  data$idUnits <- 1:nrow(data)

  if(distance == "mahalanobis") {
    Sigma <- stats::cov(as.matrix(data[,varsMatch]))
  } else {
    Sigma <- NULL
  }

  # Say some stuff
  if(verbose==TRUE) {

    cat("Conditional optimal matching algorithm\n")
    cat("Number of observations: ",nrow(data),"\n")
    cat("Number of groups: ",numGroups,"\n")
    #cat("\n")
  }

  #First: generate and evaluate starting point
  #-------------------------------------------

  #IF: starting matched dataset provided
  if( is.null(vectorSchemeStart)) {

    data$match_id <- start

    #Evaluate total distance at starting condition
    resultEvaluation <- evaluateMatching(data, "match_id", varsMatch,
                                         distance, Sigma)
    total_distance_start <- resultEvaluation$total_distance


    tabGroup <- table(data[,varGroup])
    vectorSchemeIter <- names(sort(tabGroup))

  #ELSE: starting matched dataset must be constructed
  } else {

    #First step: select units in first group
    data1 <- data[data[,varGroup] %in% vectorSchemeStart[1], ]

    #Each unit is a matched set with 1 element
    data1$indexMatch1 <- NA
    data1$indexMatch1[data1[,varGroup] %in% vectorSchemeStart[1]] <- 1:nrow(data1)

    for(i in 2:numGroups) {

      #Next step: select units from the next group
      data2 <- data[data[,varGroup] %in% vectorSchemeStart[i], ]

      #Each unit is a matched set with 1 element
      data2$indexMatch2 <- NA
      data2$indexMatch2[data2[,varGroup] %in% vectorSchemeStart[i]] <- 1:nrow(data2)

      #Before combining the dataset, variables need to be the same
      data2$indexMatch1 <- NA
      data1$indexMatch2 <- NA

      #Append data of new group to previous data
      dataStep <- rbind(data1, data2)

      #Optimally match unit of data2 to matched sets in data1
      resultStep <- condOptMatching(data = dataStep,
                                    varIndexMatch1 = "indexMatch1",
                                    varIndexMatch2 = "indexMatch2",
                                    varsMatch = varsMatch,
                                    varGroup = varGroup,
                                    distance = distance,
                                    Sigma = Sigma,
                                    varsExactMatch = varsExactMatch,
                                    k = as.numeric(vectorK[vectorSchemeStart[i]]),
                                    dat_stdzDistances = dat_stdzDistances)

      #Erase ids of previous matching step
      dataStep$indexMatch1 <- dataStep$indexMatch2 <- NULL

      #Id of new matching step.
      dataStep$indexMatch1 <- resultStep$match_id
      data1 <- dataStep

    }

    data <- dataStep
    data$match_id <- data$indexMatch1
    data$indexMatch1 <- NULL

    total_distance_start <- resultStep$total_distance
    vectorSchemeIter <- vectorSchemeStart

  }

  # Say some stuff
  if(verbose==TRUE) {
    cat("Total distance of starting matched sample: ", sprintf("%.3f",total_distance_start),"\n")
    #cat("\n")
  }

  #Second: iterations
  #-------------------

  best_total_distance <- total_distance_start
  best_match_id <- data$match_id

  if(iterate) {

    for(niter in 1:niter_max) {

      data$match_id <- best_match_id

      improvementInIteration <- F

      for(iGroupStepIter2 in 1:length(vectorSchemeIter)) {

        groupStepIter2 <- vectorSchemeIter[iGroupStepIter2]
        groupsStepIter1 <- setdiff(vectorSchemeIter, groupStepIter2)

        #Relax connection to groupStepIter2
        data$indexMatchIter1 <- data$match_id
        data$indexMatchIter1[data[,varGroup] %in% groupStepIter2] <- NA

        data$indexMatchIter2 <- NA
        data$indexMatchIter2[data[,varGroup] %in% groupStepIter2] <- 1:sum(data[,varGroup] %in% groupStepIter2)

        #Rematch groupsStepIter1 to groupStepIter2
        resultIter <- condOptMatching(data = data,
                                      varIndexMatch1 = "indexMatchIter1",
                                      varIndexMatch2 = "indexMatchIter2",
                                      varsMatch = varsMatch,
                                      varGroup = varGroup,
                                      distance = distance,
                                      Sigma = Sigma,
                                      varsExactMatch = varsExactMatch,
                                      k = as.numeric(vectorK[groupStepIter2]),
                                      dat_stdzDistances = dat_stdzDistances)

        if(resultIter$total_distance < best_total_distance) {

          best_match_id <- resultIter$match_id
          best_total_distance <- resultIter$total_distance
          improvementInIteration <- TRUE
        }

      }

      if(improvementInIteration == F) {
        break;
      }

      # Say some stuff
      if(verbose==TRUE) {
        cat("Ended iteration ", niter, " - total distance:", sprintf("%.3f",best_total_distance),"\n")
        #cat("\n")
      }


    }

  } else {

    total_distance <- total_distance_start
    niter <- 1

  }

  if(verbose == TRUE) {
    cat("End \n")
    cat(paste0("Number of iterations: ", niter,", total distance:", sprintf("%.3f",best_total_distance),"\n"))
    cat("Number of matched sets: ", length(unique(best_match_id[!is.na(best_match_id)])),"\n")
  }

  if(iterate == TRUE & niter>=niter_max) {
    warning("The algorithm reached the maximum number of iterations--you can increase it with the argument 'niter_max'")
  }

  dataToOutput <- data[order(data$idUnits), c(varGroup,varsMatch)]

  return(list(match_id = best_match_id[order(data$idUnits)],
              total_distance = best_total_distance,
              niter = niter,
              total_distance_start = total_distance_start))
}