Gene Sets Net Correlations Analysis
Description
Performs Gene Sets Net Correlation Analysis (GSNCA) test to detect differentially coexpressed gene sets.
Usage
1 2 
Arguments
object 
a numeric matrix with columns and rows respectively corresponding to samples and features. 
group 
a numeric vector indicating group associations for samples. Possible values are 1 and 2. 
nperm 
number of permutations used to estimate the null distribution of the test statistic. If not given, a default value 1000 is used. 
cor.method 
a character string indicating which correlation
coefficient is to be computed. Possible values are “ 
check.sd 
logical. Should the standard deviations of features checked for small values before the intergene correlations are computed? Default is TRUE (recommended). 
min.sd 
the minimum allowed standard deviation for any feature. If
any feature has a standard deviation smaller than 
max.skip 
maximum number of skipped random permutations which yield
any feature with a standard deviation less than 
pvalue.only 
logical. If 
Details
This function performs the Gene Sets Net Correlations Analysis (GSNCA), a twosample nonparametric multivariate differential coexpression test that accounts for the correlation structure between features (genes). The test assigns weight factors to features under one condition and adjust these weights simultaneously such that equality is achieved between each feature's weight and the sum of its weighted absolute correlations with other features in the feature set. The problem is solved as an eigenvector problem with a unique solution (see Rahmatallah et. al. 2014 for details). The test statistic w_{GSNCA} is given by the first norm between the scaled weight vectors w^{(1)} and w^{(2)} (each vector is multiplied by its norm) between two conditions
w = ∑_{i=1}^{p}  w_{i}^{(1)}  w_{i}^{(2)} 
This test statistic tests the null hypothesis that w=0 against
the alternative that w does not equal to zero.
The performance of this test was thoroughly examind in
Rahmatallah et. al. (2014). The null distribution of the test statistic is
estimated by permuting sample labels nperm
times and calculating the
test statistic for each. Pvalue is calculated as
p.value = \frac{∑_{k=1}^{nperm} I ≤ft[ W_{k} ≥q W_{obs} \right] + 1}{nperm+1}
where W_{k} is the test statistic for permutation k
, W_{obs} is the
observed test statistic, and I
is the indicator function.
In the case of RNAseq count data, some nonexpressed genes may have zero
counts across the samples under one or two conditions. Such situation
results in zero or tiny standard deviation for one or more features.
Such case produces an error in command cor
used to compute the
correlation coefficients between features.
To avoid this situation, standard deviations are checked in advance when
check.sd
is TRUE
(default) and if any is found below the minimum
limit min.sd
(default is 1e3
), the execution stops and an error
message is returned indicating the number of feature causing the problem
(if only one the index of that feature is given too).
If a feature has nearly a constant level for some samples under both
conditions, permuting sample labels may group such samples under one condition
and produce a standard deviation smaller than min.sd
. To allow the test
to skip such permutations without causing excessive delay, we set an upper
limit for the number of allowed skips by the argument max.skip
(default is 10). If the upper limit is exceeded, an error message is
returned. Allowing this skipping may or may not solve the issue depending on
the proportion of samples causing the problem in the feature set.
If the user is certain that the tested feature sets contain no feature with
nearly equal levels over many samples (such as the case with microarrays),
the checking stage for tiny standard deviations can be skipped by setting
check.sd
to FALSE
in order to reduce the execution time.
Value
When pvalue.only=TRUE
(default), function GSNCAtest
returns
the pvalue indicating the attained significance level. When
pvalue.only=FALSE
, function GSNCAtest
produces a list of
length 3 with the following components:
statistic 
the value of the observed test statistic. 
perm.stat 
numeric vector of the resulting test statistic for

p.value 
pvalue indicating the attained significance level. 
Author(s)
Yasir Rahmatallah and Galina Glazko
References
Rahmatallah Y., EmmertStreib F. and Glazko G. (2014) Gene sets net correlations analysis (GSNCA): a multivariate differential coexpression test for gene sets. Bioinformatics 30, 360–368.
See Also
findMST2
, plotMST2.pathway
.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  ## generate a feature set of length 20 in two conditions
## each condition has 20 samples
## use multivariate normal distribution with different covariance matrices
library(MASS)
ngenes < 20
nsamples < 40
zero_vector < array(0,c(1,ngenes))
## create a covariance matrix with low offdiagonal elements
cov_mtrx1 < diag(ngenes)
cov_mtrx1[!diag(ngenes)] < 0.1
## create a covariance matrix with high offdiagonal elements
## for the first 5 features and low for the rest 15 features
cov_mtrx2 < diag(ngenes)
cov_mtrx2[!diag(ngenes)] < 0.1
mask < diag(ngenes/4)
mask[!diag(ngenes/4)] < 0.6
cov_mtrx2[1:(ngenes/4),1:(ngenes/4)] < mask
gp1 < mvrnorm((nsamples/2), zero_vector, cov_mtrx1)
gp2 < mvrnorm((nsamples/2), zero_vector, cov_mtrx2)
gp < rbind(gp1,gp2)
dataset < aperm(gp, c(2,1))
## first 20 samples belong to group 1
## second 20 samples belong to group 2
pvalue < GSNCAtest(object=dataset, group=c(rep(1,20),rep(2,20)))
