View source: R/getMatchedSets.R
Obtain matching of subjects based on a set of covariates (e.g., principal components of population stratification markers). Two types of matching are allowed: 1) case-control (CC) matching and/or 2) nearest-neighbor (NN) matching.
getMatchedSets(x, CC, NN, ccs.var=NULL, dist.vars=NULL, strata.var=NULL,
               size=2, ratio=1, fixed=FALSE)
x |
Either a data frame containing variables to be used for matching, or an object returned by |
CC |
Logical. TRUE if case-control matching should be computed, FALSE otherwise. No default. |
NN |
Logical. TRUE if nearest-neighbor matching should be computed, FALSE otherwise. No default. At least one of CC and NN should be TRUE. |
ccs.var |
Variable name, variable number, or a vector for the case-control status. If |
dist.vars |
Variable numbers or names for computing a distance matrix based on which matching will be performed. Must be
specified if |
strata.var |
Optional stratification variable (such as study center) for matching within strata. A vector of mode integer or factor
if |
size |
Exact size or maximum allowable size of a matched set. This can be an integer greater than 1,
or a vector of such integers that is constant within each level of |
ratio |
Ratio of cases to controls for CC matching. Currently ignored if fixed = FALSE. This can be a positive number,
or a numeric vector that is constant within each level of |
fixed |
Logical. TRUE if "size" should be interpreted as "exact size" and FALSE if it gives "maximal size" of matched sets. The default is FALSE. |
If a data frame and dist.vars are provided, dist along with the Euclidean metric is used to compute distances, assuming continuous variables. For categorical, ordinal, or mixed variables, using a custom distance matrix such as that from daisy is recommended. If strata.var is provided, both case-control (CC) and nearest-neighbor (NN) matching are performed within strata.
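As a minimal sketch of the distance computation described above, the following uses base R's dist with the Euclidean metric on two continuous covariates. The toy data frame pcs and its values are hypothetical and are not part of the package; the snippet only illustrates the kind of distance matrix involved, not the internals of getMatchedSets.

```r
# Hypothetical toy data: three subjects with two principal components.
pcs <- data.frame(PC1 = c(0, 3, 0), PC2 = c(0, 4, 1))

# Euclidean distances between all pairs of subjects (base R 'dist').
d <- dist(pcs, method = "euclidean")

# Convert to a full symmetric matrix for easier inspection.
dmat <- as.matrix(d)
print(round(dmat, 3))
```

Subjects 1 and 3 are closest here, so a nearest-neighbor matching based on this matrix would tend to pair them first.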
size can be any integer greater than 1, but currently the matching obtained is usable in snp.matched only if size is 8 or smaller, due to memory and speed limitations.
When fixed=FALSE, NN matching is computed using a modified version of hclust, where clusters are not allowed to grow beyond the specified size.
CC matching is computed similarly, with the further constraint that each cluster must have at least one case and one control. Clusters are then split up into 1:k or k:1 matched sets, where k is at most size - 1 (known as full matching). For exactly optimal full matching, use package optmatch.
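The idea of clusters that are not allowed to grow beyond a maximum size can be sketched with a toy greedy agglomerative procedure. This is not the modified hclust used by the package: the function capped_clusters, its complete-linkage merge rule, and the toy data are all assumptions made purely for illustration.

```r
# Toy size-capped agglomerative clustering (complete linkage).
# Hypothetical helper, NOT the package's modified hclust.
capped_clusters <- function(d, max_size) {
  d <- as.matrix(d)
  n <- nrow(d)
  clusters <- as.list(seq_len(n))  # every subject starts as its own cluster
  repeat {
    k <- length(clusters)
    if (k < 2) break
    best <- NULL
    best_dist <- Inf
    for (i in seq_len(k - 1)) {
      for (j in (i + 1):k) {
        # Skip merges that would exceed the size cap.
        if (length(clusters[[i]]) + length(clusters[[j]]) > max_size) next
        # Complete linkage: largest pairwise distance between the clusters.
        link <- max(d[clusters[[i]], clusters[[j]]])
        if (link < best_dist) {
          best_dist <- link
          best <- c(i, j)
        }
      }
    }
    if (is.null(best)) break  # no admissible merge remains
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  # Turn the list of clusters into a vector of integer labels.
  labels <- integer(n)
  for (g in seq_along(clusters)) labels[clusters[[g]]] <- g
  labels
}

# Hypothetical one-dimensional covariate for six subjects.
x <- c(0, 1, 3, 10, 11.5, 30)
labels <- capped_clusters(dist(x), max_size = 2)
print(labels)
print(table(labels))
```

Because this toy version has no distance threshold, distant leftover subjects can still be merged with each other; it only demonstrates the size cap.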
When fixed=TRUE, both CC and NN use heuristic fixed-size clustering algorithms. These algorithms start with matches in the periphery of the data space and
proceed inward. Hence prior removal of outliers is recommended.
For CC matching, the number of cases in each matched set is obtained by rounding size * [ratio/(1+ratio)] to the nearest integer.
The matching algorithms for fixed=TRUE are faster, but in the case of CC matching a large number of cases or controls may be discarded with this option.
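The case count for fixed-size CC matching stated above can be worked through with a few lines of R. The helper cases_per_set is a hypothetical name introduced here for illustration, and the size and ratio values are the ones used per stratum in the example on this page. Note that base R's round resolves exact .5 ties to the even digit; how the package treats such ties is not stated in the text, so the values below deliberately avoid ties.

```r
# Hypothetical helper: cases per matched set = round(size * ratio / (1 + ratio)).
cases_per_set <- function(size, ratio) {
  round(size * ratio / (1 + ratio))
}

# One (size, ratio) pair per stratum, as in the stratified example.
size  <- c(4, 4, 2)
ratio <- c(1/2, 2, 1)

n_cases    <- cases_per_set(size, ratio)
n_controls <- size - n_cases
print(data.frame(size, ratio, n_cases, n_controls))
```

With size = 4, a ratio of 1/2 gives 1 case and 3 controls per set, while a ratio of 2 gives 3 cases and 1 control.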
A list with names "CC", "tblCC", "NN", and "tblNN". "CC" and "NN" are vectors of integer labels defining the matched sets; "tblCC" and "tblNN" are matrices summarizing the size distribution of matched sets across strata. The i'th row corresponds to a matched set size of i, and the columns represent different strata. The order of strata in the columns may differ from that in strata.var if strata.var was not coded as successive integers starting from 1.
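To make the relationship between a label vector and a size-by-stratum summary concrete, here is a small base R sketch that builds such a table by hand. The labels and strata vectors are hypothetical toy data, and the sketch assumes each label belongs to a single stratum; it is meant to show how a matrix shaped like "tblNN" can be read, not how the package computes it.

```r
# Hypothetical matched-set labels and strata for 11 subjects.
labels <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4)
strata <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

# Size of each matched set and the stratum it belongs to.
set_size    <- as.vector(tapply(labels, labels, length))
set_stratum <- as.vector(tapply(strata, labels, function(s) s[1]))

# Rows: matched set size (1 to 4); columns: strata.
tbl <- table(size = factor(set_size, levels = 1:4), stratum = set_stratum)
print(tbl)
```

Here row 2 counts the matched sets of size 2 in each stratum, in the same way the i'th row of "tblCC" or "tblNN" counts sets of size i.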
Luca et al. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. American Journal of Human Genetics, 2008, 82(2):453-463.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N. Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies. American Journal of Human Genetics, 2010, 86(3):331-342.
# Load the package that provides getMatchedSets
library(CGEN)

# Use the ovarian cancer data
data(Xdata, package="CGEN")

# Add fake principal component columns.
set.seed(123)
Xdata <- cbind(Xdata, PC1 = rnorm(nrow(Xdata)), PC2 = rnorm(nrow(Xdata)))

# Assign matched set size and case/control ratio stratifying by ethnic group
size <- ifelse(Xdata$ethnic.group == 3, 2, 4)
ratio <- sapply(Xdata$ethnic.group, switch, 1/2 , 2 , 1)
mx <- getMatchedSets(Xdata, CC=TRUE, NN=TRUE, ccs.var="case.control",
                     dist.vars=c("PC1","PC2") , strata.var="ethnic.group",
                     size = size, ratio = ratio, fixed=TRUE)
mx$NN[1:10]
mx$tblNN

# Example of using a dissimilarity matrix using categorical covariates with
# Gower's distance
library("cluster")
d <- daisy(Xdata[, c("age.group","BRCA.history","gynSurgery.history")] ,
           metric = "gower")

# Specify size = 4 as maximum matched set size in all strata
mx <- getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control,
                     strata.var = Xdata$ethnic.group, size = 4,
                     fixed = FALSE)
mx$CC[1:10]
mx$tblCC
Loading required package: survival
Loading required package: mvtnorm
Warning messages:
1: In getMatchedSets(Xdata, CC = TRUE, NN = TRUE, ccs.var = "case.control", :
There were 3 unmatched individual(s)
2: In getMatchedSets(Xdata, CC = TRUE, NN = TRUE, ccs.var = "case.control", :
There were 555 unmatched individual(s)
[1] 418 309 100 118 356 112 355 158 284 84
strat
1 2 3
[1,] 2 0 1
[2,] 0 0 50
[3,] 0 0 0
[4,] 275 94 0
Warning message:
In daisy(Xdata[, c("age.group", "BRCA.history", "gynSurgery.history")], :
binary variable(s) 2, 3 treated as interval scaled
Warning message:
In getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control, :
There were 10 unmatched individual(s)
[1] 1 3 3 5 7 7 4 4 9 9
strat
1 2 3
[1,] 2 3 5
[2,] 485 179 44
[3,] 21 1 0
[4,] 5 3 2
[5,] 0 0 0
[6,] 0 0 0
[7,] 0 0 0
[8,] 1 0 0
[9,] 1 0 0
[10,] 0 0 0
[11,] 1 0 0
[12,] 0 0 0
[13,] 0 0 0
[14,] 0 0 0
[15,] 0 0 0
[16,] 0 0 0
[17,] 0 0 0
[18,] 0 0 0
[19,] 1 0 0