CASI: Canonical Analysis of Set Interactions



Assuming a case-control type study design, two sets of features are measured in both cases and controls. The objective of CASI hypothesis testing is to evaluate evidence of statistical interactions between these sets in relation to outcome status.

CASI(G1_cs, G2_cs, G1_cn, G2_cn, lst.perm, fs = FALSE)

`G1_cs`	variable set 1 measured in cases. Variables are organized in columns and individuals in rows. Therefore, the number of rows equals the number of cases.
`G2_cs`	variable set 2 measured in cases. Variables are organized in columns and individuals in rows. Therefore, the number of rows equals the number of cases.
`G1_cn`	variable set 1 measured in controls. Variables are organized in columns and individuals in rows. Therefore, the number of rows equals the number of controls.
`G2_cn`	variable set 2 measured in controls. Variables are organized in columns and individuals in rows. Therefore, the number of rows equals the number of controls.
`lst.perm`	a list with length equal to the number of permutations to be conducted. Each element is a vector of integers of length equal to the total sample size (n = n.cases + n.controls). Each vector contains indices specifying a random permutation of the case-control labels, where the observed order is represented by the vector 1:n. To capture possible dependencies among tests for use in fdrci, the same lst.perm should be used across multiple tests. For example: lst.perm[[1]] <- sample(1:n), lst.perm[[2]] <- sample(1:n), ...
`fs`	logical. By default (FALSE), a larger CASI statistic is more extreme. If fs is set to TRUE, the CASI statistic is flipped so that a smaller value is more extreme, like a p-value. This facilitates the use of the fdrci R package for computing FDR estimates and corresponding confidence intervals.
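The construction of lst.perm described above can be sketched as follows. The sample sizes, the number of permutations, and the seed are assumptions chosen for illustration only.

```r
# Illustrative values only; substitute the sizes of your own study.
set.seed(42)            # assumed seed, so the same permutations can be reused
n.cases <- 100
n.controls <- 100
n <- n.cases + n.controls
n.perm <- 1000

# One random permutation of 1:n per list element.
lst.perm <- lapply(seq_len(n.perm), function(i) sample.int(n))
```

Building the list once and passing the same object to every CASI call keeps the permutations aligned across tests, which is what the note about fdrci asks for.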

The two feature sets could be sets of SNPs underlying two genes, a set of environmental exposures and a set of SNPs, a set of SNPs and a set of DNA methylation probes, etc. The null hypothesis is that there are no two linear combinations, one from each set, whose correlation differs between cases and controls. Such a difference in correlation would imply a statistical interaction. The distribution of the CASI statistic is unknown, so rather than provide a p-value, the CASI function returns the test statistic computed under both the observed and the null conditions. The null conditions are imposed by randomly permuting case-control labels and repeating the analysis n.perm times, once for each element of lst.perm. The developers anticipate that an FDR approach will be applied to CASI results generated from multiple, perhaps thousands, of pairs of feature sets. However, a permutation-based p-value could be computed if the number of permutations is sufficiently large.
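The idea behind the null hypothesis can be illustrated with ordinary canonical correlation. The sketch below is not the CASI statistic, whose exact form is not given on this page. It only uses stats::cancor on simulated continuous data (all object names and values are assumptions) to show a pair of linear combinations whose correlation is present in cases and absent in controls.

```r
# Illustration only: NOT the CASI statistic, just the underlying intuition.
set.seed(1)
n <- 200

# "Cases": the first variable of set 2 depends on the first variable of set 1.
x_cs <- matrix(rnorm(n * 3), n, 3)
y_cs <- matrix(rnorm(n * 3), n, 3)
y_cs[, 1] <- y_cs[, 1] + x_cs[, 1]

# "Controls": the two sets are independent.
x_cn <- matrix(rnorm(n * 3), n, 3)
y_cn <- matrix(rnorm(n * 3), n, 3)

# Largest canonical correlation between the two sets within each group.
r_cs <- cancor(x_cs, y_cs)$cor[1]
r_cn <- cancor(x_cn, y_cn)$cor[1]
c(cases = r_cs, controls = r_cn)
```

A clear gap between the two values is the kind of case-control difference in correlation that the null hypothesis rules out.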

A vector of length n.perm + 1, where the first element of the vector is the CASI statistic from the observed data and the following n.perm elements are CASI statistics computed using the permuted data.
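A permutation-based p-value can be derived from a vector of this shape. The helper below is a minimal sketch, not part of the CASI package: perm_pvalue is a hypothetical name, the input numbers are made up rather than real CASI output, and the add-one correction is one common convention, assumed here.

```r
# Hypothetical helper (not in the CASI package).
# stats: first element is the observed statistic, the rest are permuted ones.
# fs: set to TRUE if the statistics were computed with fs = TRUE,
#     so that smaller values are more extreme.
perm_pvalue <- function(stats, fs = FALSE) {
  obs <- stats[1]
  perm <- stats[-1]
  n.extreme <- if (fs) sum(perm <= obs) else sum(perm >= obs)
  # Add-one correction keeps the estimate strictly above zero.
  (n.extreme + 1) / (length(perm) + 1)
}

# Made-up numbers standing in for a returned vector with n.perm = 4.
toy <- c(5, 1, 2, 6, 3)
p_default <- perm_pvalue(toy)
p_flipped <- perm_pvalue(toy, fs = TRUE)
```

With only a handful of permutations the smallest attainable p-value is 1 / (n.perm + 1), which is why a sufficiently large number of permutations is needed.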

Joshua Millstein, joshua.millstein@usc.edu, Vladimir Kogan

Vladimir Kogan and Joshua Millstein. 2018. Genetic-Epigenetic Interactions in Asthma Revealed by a Genome-Wide Gene-Centric Search. Human Heredity (in review)

n.case = 100
n.control = 100
m.set1 = 10
m.set2 = 10
n.effects = 5
beta.vec = runif(n.effects, 1, 2)
set1.nms = paste("V1.", 1:m.set1, sep="")
set2.nms = paste("V2.", 1:m.set2, sep="")
nms = c("case", set1.nms, set2.nms)
mydat = as.data.frame(matrix(NA, nrow=0, ncol=length(nms)))
names(mydat) = nms
logit = function(p) log(p / (1-p))
logistic = function(a) exp(a) / (exp(a)+1)
risk.base = .05

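# simulate individuals until n.case cases and n.control controls are collected
rowno = 0
ind.case = 0
ind.control = 0
while(rowno < (n.case+n.control)){
	# feature values in 0/1/2, drawn from Binomial(2, 0.2), for each set
	vec1 = rbinom(m.set1, 2, .2)
	vec2 = rbinom(m.set2, 2, .2)
	# log-odds of being a case: baseline risk plus pairwise interaction effects
	lc = logit(risk.base)
	for(i in 1:n.effects) lc = lc + beta.vec[i]*vec1[i]*vec2[i]
	myrisk = logistic(lc)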
	if( runif(1) < myrisk ){ 
		if(ind.case < n.case){ 
			ind.case = ind.case + 1
			rowno = ind.case + ind.control
			mydat[ rowno,"case"] = 1
			mydat[ rowno, c(set1.nms, set2.nms) ] = c(vec1, vec2)
		}
	} else { 
		if(ind.control < n.control){ 
			ind.control = ind.control + 1
			rowno = ind.case + ind.control
			mydat[ rowno,"case"] = 0
			mydat[ rowno, c(set1.nms, set2.nms) ] = c(vec1, vec2)
		}
	}
} # end while()

case.status = mydat[, "case"]
G1_cs = mydat[ is.element(case.status, 1), set1.nms ]
G2_cs = mydat[ is.element(case.status, 1), set2.nms ]
G1_cn = mydat[ is.element(case.status, 0), set1.nms ]
G2_cn = mydat[ is.element(case.status, 0), set2.nms ]
n.perm = 10
lst.perm = vector('list', n.perm)
for(p in 1:n.perm) lst.perm[[p]] = sample(1:(n.case+n.control))
CASI(G1_cs, G2_cs, G1_cn, G2_cn, lst.perm)