Multivariate Kolmogorov-Smirnov Test of Means
Description
Performs a two-sample nonparametric multivariate test of means based on the minimum spanning tree (MST) and the Kolmogorov-Smirnov statistic. It tests the null hypothesis that a set of features has the same mean in two conditions against the alternative that the means differ.
Usage
KStest(object, group, nperm=1000, pvalue.only=TRUE)
Arguments
object 
a numeric matrix with columns and rows respectively corresponding to samples and features. 
group 
a numeric vector indicating group associations for samples. Possible values are 1 and 2. 
nperm 
number of permutations used to estimate the null distribution of the test statistic. If not given, a default value of 1000 is used. 
pvalue.only 
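logical. If TRUE (default), only the p-value is returned. If FALSE, a list containing the observed statistic, the permutation statistics, and the p-value is returned. 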
Details
This function tests the null hypothesis that a set of features has
no shift between two conditions. It performs a two-sample nonparametric
multivariate test based on the minimum spanning tree (MST) and the
Kolmogorov-Smirnov statistic, as proposed by Friedman and Rafsky (1979).
The MST of the weighted undirected graph created from the samples is found.
The nodes of the MST are then ranked according to their position in the tree: the MST
is rooted at the node with the largest geodesic distance (rank 1), and the nodes
are ranked in the High Directed Preorder (HDP) traversal of the tree
(Rahmatallah et al. 2012). The quantity d_i = (r_i / n_1) - (s_i / n_2)
is calculated, where r_i (s_i) is the number of nodes (samples)
from condition 1 (2) ranked lower than i, 1 ≤ i ≤ N, n_1 (n_2) is the
number of samples in condition 1 (2), and
N is the total number of samples. The Kolmogorov-Smirnov statistic is
given by the scaled maximum absolute difference
D = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \max_i |d_i|. The performance of this
test under different alternative hypotheses was thoroughly examined
in Rahmatallah et al. (2012). The null distribution of the test statistic
is estimated by permuting sample labels nperm times and calculating
the test statistic for each permutation. The p-value is calculated as
p.value = \frac{\sum_{k=1}^{nperm} I\left[ D_k \geq D_{obs} \right] + 1}{nperm + 1}
where D_k is the test statistic for permutation k, D_{obs} is the
observed test statistic, and I is the indicator function.
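The statistic and the permutation p-value described above can be sketched in base R. This is an illustration only, not the package's implementation: ks_from_ranking and perm_pvalue are hypothetical helper names, and the ranking is assumed to be supplied already (here a toy ranking stands in for the HDP ranking of the MST). The ranking is held fixed across permutations on the assumption that it depends only on the pooled samples and not on their labels.

```r
## Sketch only: both helpers are hypothetical, not part of the package.
## 'ranking' is assumed to hold the rank (1..N) assigned to each sample.
ks_from_ranking <- function(ranking, group) {
  n1 <- sum(group == 1)
  n2 <- sum(group == 2)
  ord <- order(ranking)            # samples sorted by increasing rank
  ## cumulative counts up to and including position i; counting strictly
  ## lower ranks instead only shifts the index and gives the same maximum
  r <- cumsum(group[ord] == 1)
  s <- cumsum(group[ord] == 2)
  d <- r / n1 - s / n2
  sqrt(n1 * n2 / (n1 + n2)) * max(abs(d))
}

perm_pvalue <- function(ranking, group, nperm = 1000) {
  d_obs <- ks_from_ranking(ranking, group)
  ## shuffle the labels, keep the ranking, recompute the statistic
  d_perm <- replicate(nperm, ks_from_ranking(ranking, sample(group)))
  (sum(d_perm >= d_obs) + 1) / (nperm + 1)
}

## toy input: condition 1 occupies ranks 1..20, condition 2 ranks 21..40
set.seed(1)
group <- rep(1:2, each = 20)
ranking <- 1:40
D <- ks_from_ranking(ranking, group)
p <- perm_pvalue(ranking, group, nperm = 200)
```

Adding 1 to both the numerator and the denominator means the estimated p-value can never be exactly zero; its smallest possible value is 1 / (nperm + 1).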
Value
When pvalue.only=TRUE (default), function KStest returns
the p-value indicating the attained significance level. When
pvalue.only=FALSE, function KStest produces a list of
length 3 with the following components:
statistic 
the value of the observed test statistic. 
perm.stat 
numeric vector of the resulting test statistic for nperm random permutations of sample labels. 
p.value 
p-value indicating the attained significance level. 
Note
This function invokes function HDP.ranking, which does not
work properly if any node in the MST has more than 26 links.
However, this situation is almost impossible for a dataset composed of a few
hundred samples or fewer.
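One way to check this condition before running the test is to build the MST and count the links at each node. The sketch below uses Prim's algorithm in base R. mst_degrees is a hypothetical helper, not a package function, and Euclidean distance between samples is an assumed edge weight that may differ from the weights the package uses.

```r
## Sketch only: 'mst_degrees' is hypothetical, and Euclidean distance
## between samples is an assumption about the edge weights.
mst_degrees <- function(object) {
  w <- as.matrix(dist(t(object)))        # samples are the columns of 'object'
  n <- ncol(w)
  in_tree <- c(TRUE, rep(FALSE, n - 1))  # grow the tree from sample 1
  best <- w[1, ]                         # cheapest link from the tree to each sample
  parent <- rep(1L, n)                   # tree node providing that cheapest link
  degree <- integer(n)
  for (step in seq_len(n - 1)) {
    best[in_tree] <- Inf                 # never pick a node already in the tree
    j <- which.min(best)                 # closest sample outside the tree
    degree[j] <- degree[j] + 1L
    degree[parent[j]] <- degree[parent[j]] + 1L
    in_tree[j] <- TRUE
    closer <- !in_tree & w[j, ] < best   # nodes now reachable more cheaply via j
    best[closer] <- w[j, closer]
    parent[closer] <- j
  }
  degree
}

set.seed(1)
object <- matrix(rnorm(20 * 40), nrow = 20)  # 20 features x 40 samples
deg <- mst_degrees(object)
any(deg > 26)   # TRUE would signal the problematic case described above
```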
Author(s)
Yasir Rahmatallah and Galina Glazko
References
Rahmatallah Y., Emmert-Streib F. and Glazko G. (2012) Gene set analysis for self-contained tests: complex null and specific alternative hypotheses. Bioinformatics 28, 3073–3080.
Friedman J. and Rafsky L. (1979) Multivariate generalization of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Stat. 7, 697–717.
See Also
MDtest, WWtest, RKStest, RMDtest, HDP.ranking.
Examples
## generate a feature set of length 20 in two conditions
## each condition has 20 samples
## use multivariate normal distribution
library(MASS)
ngenes <- 20
nsamples <- 40
## let the mean vector have zeros of length 20 in both conditions
zero_vector <- array(0,c(1,ngenes))
## set the covariance matrix to be an identity matrix for both conditions
cov_mtrx <- diag(ngenes)
gp <- mvrnorm(nsamples, zero_vector, cov_mtrx)
## apply a mean shift of 3 to half of the features under condition 1
gp[1:20,1:10] <- gp[1:20,1:10] + 3
## transpose so that rows are features and columns are samples
dataset <- aperm(gp, c(2,1))
## first 20 samples belong to condition 1
## second 20 samples belong to condition 2
pvalue <- KStest(object=dataset, group=c(rep(1,20),rep(2,20)))
