getMatchedSets: Case-Control and Nearest-Neighbor Matching
In CGEN: An R package for analysis of case-control studies in genetic epidemiology


Obtain a matching of subjects based on a set of covariates (e.g., principal components of population stratification markers). Two types of matching are supported and may be requested together: 1) case-control (CC) matching and 2) nearest-neighbor (NN) matching.

getMatchedSets(x, CC, NN, ccs.var=NULL, dist.vars=NULL, strata.var=NULL, size=2, ratio=1, fixed=FALSE)

`x`	Either a data frame containing variables to be used for matching, or an object returned by `dist` or `daisy` or a matrix coercible to class dist. No default.
`CC`	Logical. TRUE if case-control matching should be computed, FALSE otherwise. No default.
`NN`	Logical. TRUE if nearest-neighbor matching should be computed, FALSE otherwise. No default. At least one of CC and NN should be TRUE.
`ccs.var`	Variable name, variable number, or a vector for the case-control status. If `x` is a dist object, this must be a vector whose length equals the number of subjects in `x`. This must be specified if CC=TRUE. The default is NULL.
`dist.vars`	Variable numbers or names used to compute the distance matrix on which matching is based. Must be specified if `x` is a data frame. Ignored if `x` is a distance. The default is NULL.
`strata.var`	Optional stratification variable (such as study center) for matching within strata. A vector of mode integer or factor if `x` is a distance. If `x` is a data frame, a variable name or number is allowed. The default is NULL.
`size`	Exact size or maximum allowable size of a matched set. This can be an integer greater than 1, or a vector of such integers that is constant within each level of `strata.var`. The default is 2.
`ratio`	Ratio of cases to controls for CC matching. Currently ignored if fixed = FALSE. This can be a positive number, or a numeric vector that is constant within each level of `strata.var`. The default is 1.
`fixed`	Logical. TRUE if "size" should be interpreted as "exact size" and FALSE if it gives "maximal size" of matched sets. The default is FALSE.
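Because `size` and `ratio` may be given as vectors that must be constant within each level of `strata.var`, it can help to check this before calling the function. The following is a minimal sketch in base R; the helper `is_constant` and the toy vectors are hypothetical and are not part of CGEN.

```r
# Hypothetical stratification variable and per-subject size vector
strata <- c(1, 1, 2, 2, 3, 3)
size   <- ifelse(strata == 3, 2, 4)   # 4 in strata 1 and 2, 2 in stratum 3

# Hypothetical helper: TRUE for each stratum whose subjects share one value
is_constant <- function(v, strata) {
  tapply(v, strata, function(z) length(unique(z)) == 1)
}

const_size <- is_constant(size, strata)
all(const_size)   # TRUE when size is constant within every stratum

# A vector that varies inside stratum 1 fails the check
bad_size  <- c(2, 3, 4, 4, 2, 2)
const_bad <- is_constant(bad_size, strata)
```

The same check applies unchanged to a `ratio` vector.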

If x is a data frame and dist.vars is provided, dist with the Euclidean metric is used to compute distances, which assumes continuous variables. For categorical, ordinal, or mixed variables, a custom distance matrix such as one produced by daisy is recommended. If strata.var is provided, both case-control (CC) and nearest-neighbor (NN) matching are performed within strata. size can be any integer greater than 1, but the resulting matching is currently usable in snp.matched only if size is 8 or smaller, due to memory and speed limitations.
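As an illustration of the data-frame case, the sketch below computes Euclidean distances on two continuous columns with `stats::dist` and checks one entry by hand. The data frame `df` and its columns are made up for the example, and the call shown is only assumed to resemble what the function does internally.

```r
# Hypothetical data frame: two continuous matching covariates plus one
# categorical column that is left out of the distance computation
set.seed(1)
df <- data.frame(PC1 = rnorm(5), PC2 = rnorm(5),
                 grp = c("a", "b", "a", "b", "a"))
dist.vars <- c("PC1", "PC2")

# Euclidean distances between all pairs of subjects
d <- dist(df[, dist.vars], method = "euclidean")

# Manual check of the distance between subjects 1 and 2
x1 <- as.numeric(df[1, dist.vars])
x2 <- as.numeric(df[2, dist.vars])
manual_12 <- sqrt(sum((x1 - x2)^2))
same <- isTRUE(all.equal(as.matrix(d)[1, 2], manual_12))
```

An object such as `d` can also be passed directly as `x`, in which case `ccs.var` and `strata.var` must be supplied as vectors rather than column names.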

When fixed=FALSE, NN matching is computed using a modified version of hclust, where clusters are not allowed to grow beyond the specified size. CC matching is computed similarly with the further constraint that each cluster must have at least one case and one control. Clusters are then split up into 1:k or k:1 matched sets, where k is at most size - 1 (known as full matching). For exactly optimal full matching use package optmatch.
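To make the 1:k or k:1 structure concrete, the following sketch tabulates cases and controls per matched set and checks that every set has exactly one subject on the minority side and that k does not exceed size - 1. The labels and status vector are hypothetical toy data, not output of getMatchedSets.

```r
# Hypothetical matched-set labels and case-control status (1 = case, 0 = control)
set_id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
cc     <- c(1, 0, 0, 1, 0, 0, 1, 1, 1)
size   <- 4   # maximum allowed matched-set size

# Rows are matched sets, columns are controls ("0") and cases ("1")
comp <- table(set = set_id, status = factor(cc, levels = c(0, 1)))
n_controls <- comp[, "0"]
n_cases    <- comp[, "1"]

# A set is 1:k or k:1 when the smaller side has exactly one subject
is_full <- pmin(n_cases, n_controls) == 1
k       <- pmax(n_cases, n_controls)

all(is_full & k <= size - 1)
```

Here set 1 is 1:2, set 2 is 1:1, and set 3 is 3:1, so all three satisfy the constraint for size = 4.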

When fixed=TRUE, both CC and NN use heuristic fixed-size clustering algorithms. These algorithms start with matches in the periphery of the data space and proceed inward, so prior removal of outliers is recommended. For CC matching, the number of cases in each matched set is obtained by rounding size * [ratio/(1+ratio)] to the nearest integer. The matching algorithms for fixed=TRUE are faster, but for CC matching a large number of cases or controls may be discarded with this option.
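The rounding rule can be worked through for a few size and ratio combinations. This is a sketch of the arithmetic only: the helper `n_cases_per_set` is hypothetical, taking the number of controls as size minus the number of cases is an assumption, and base R's `round` may treat exact ties differently from the package.

```r
# Hypothetical helper implementing size * ratio / (1 + ratio), rounded
n_cases_per_set <- function(size, ratio) {
  round(size * ratio / (1 + ratio))
}

size  <- c(4, 4, 2)
ratio <- c(1/2, 2, 1)

cases    <- n_cases_per_set(size, ratio)
controls <- size - cases   # assumption: the remaining slots are controls

data.frame(size, ratio, cases, controls)
```

For size = 4 and ratio = 1/2 the unrounded value is 4/3, giving 1 case and 3 controls; for size = 4 and ratio = 2 it is 8/3, giving 3 cases and 1 control; for size = 2 and ratio = 1 it is exactly 1.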

A list with names "CC", "tblCC", "NN", and "tblNN". "CC" and "NN" are vectors of integer labels defining the matched sets. "tblCC" and "tblNN" are matrices summarizing the size distribution of matched sets across strata: the i-th row corresponds to matched sets of size i, and the columns represent the different strata. The order of strata in the columns may differ from that in strata.var if strata.var was not coded as successive integers starting from 1.
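The layout of "tblCC" and "tblNN" can be mimicked from a label vector with base R. The sketch below is an approximation under stated assumptions: the labels and strata are toy data, labels are assumed to be unique across strata, and the coding of unmatched individuals is not addressed because the text above does not specify it.

```r
# Hypothetical matched-set labels and stratum membership for 11 subjects
labels <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4)
strata <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Size of each matched set, and the stratum each set belongs to
set_size    <- as.vector(tapply(labels, labels, length))
set_stratum <- as.vector(tapply(strata, labels, function(s) s[1]))

# Rows: matched-set size 1, 2, ...; columns: strata
tbl <- table(size  = factor(set_size, levels = 1:max(set_size)),
             strat = set_stratum)
tbl
```

In this toy example stratum 1 contains one set of size 2 and one of size 3, while stratum 2 contains one set of size 2 and one of size 4.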

Luca et al. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. American Journal of Human Genetics, 2008, 82(2):453-463.

Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N. Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies. American Journal of Human Genetics, 2010, 86(3):331-342.

snp.matched

 # Use the ovarian cancer data
  data(Xdata, package="CGEN")

 # Add fake principal component columns.
  set.seed(123)
  Xdata <- cbind(Xdata, PC1 = rnorm(nrow(Xdata)), PC2 = rnorm(nrow(Xdata)))

 # Assign matched set size and case/control ratio stratifying by ethnic group
  size <- ifelse(Xdata$ethnic.group == 3, 2, 4)
  ratio <- sapply(Xdata$ethnic.group, switch, 1/2 , 2 , 1)
  mx <- getMatchedSets(Xdata, CC=TRUE, NN=TRUE, ccs.var="case.control", 
                       dist.vars=c("PC1","PC2"), strata.var="ethnic.group", 
                       size = size, ratio = ratio, fixed=TRUE)
  mx$NN[1:10]
  mx$tblNN
  
  # Example of using a dissimilarity matrix computed from categorical covariates
  #  with Gower's distance
  library("cluster")
  d <- daisy(Xdata[, c("age.group","BRCA.history","gynSurgery.history")] , 
             metric = "gower")
  # Specify size = 4 as maximum matched set size in all strata
  mx <- getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control, 
                       strata.var = Xdata$ethnic.group, size = 4, 
                       fixed = FALSE)
  mx$CC[1:10]
  mx$tblCC

Loading required package: survival
Loading required package: mvtnorm
Warning messages:
1: In getMatchedSets(Xdata, CC = TRUE, NN = TRUE, ccs.var = "case.control",  :
  There were  3  unmatched individual(s)
2: In getMatchedSets(Xdata, CC = TRUE, NN = TRUE, ccs.var = "case.control",  :
  There were  555  unmatched individual(s)
 [1] 418 309 100 118 356 112 355 158 284  84
      strat
         1  2  3
  [1,]   2  0  1
  [2,]   0  0 50
  [3,]   0  0  0
  [4,] 275 94  0
Warning message:
In daisy(Xdata[, c("age.group", "BRCA.history", "gynSurgery.history")],  :
  binary variable(s) 2, 3 treated as interval scaled
Warning message:
In getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control,  :
  There were  10  unmatched individual(s)
 [1] 1 3 3 5 7 7 4 4 9 9
       strat
          1   2  3
   [1,]   2   3  5
   [2,] 485 179 44
   [3,]  21   1  0
   [4,]   5   3  2
   [5,]   0   0  0
   [6,]   0   0  0
   [7,]   0   0  0
   [8,]   1   0  0
   [9,]   1   0  0
  [10,]   0   0  0
  [11,]   1   0  0
  [12,]   0   0  0
  [13,]   0   0  0
  [14,]   0   0  0
  [15,]   0   0  0
  [16,]   0   0  0
  [17,]   0   0  0
  [18,]   0   0  0
  [19,]   1   0  0