This is the main function for performing snowball analysis. It requires only minimal input, because most operating parameters have default values.
y: a factor variable for mutation status.
X: a data.frame containing gene expression data.
ncore: number of processors to use for parallel computation.
d: the size of the gene subset for gene level resampling. See references on d in X_d^x.
B: bootstrap size, which is B in J_n(x), defining the total number of gene subsets used to estimate J_n.
bootstrap size deployed on each child job in parallel mode.
sample.n: number of samples drawn from the subject level resampling, denoted as K in J_n(x). It is ignored under some resampling settings.
resample.method: this defines how the subject level resampling is performed. The possible values include "sample" and "combn".
this specifies how the subjects are counted for subject level leave-k-out random sampling, and whether the stratification by group is applied. The possible input values include "count.class" and "percent.class".
k.resample: a numerical value specifying the number of subjects left out during the subject level resampling. It is an integer when it gives an exact count of subjects.
A data.frame containing two variables: weights, the J_n(x) values for all genes, and positives, indicators of whether a specific J_n(x) is above or below the median of all J_n(x) values.
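To make the shape of the returned value concrete, here is a minimal sketch in base R. The gene names and J_n(x) values are invented for illustration and are not output from snowball(). The sketch assumes that a value above the median is flagged as positive, which the description above does not state explicitly.

```r
# Hypothetical J_n(x) values for six genes; not real snowball() output.
weights <- c(geneA = 0.12, geneB = 0.45, geneC = 0.30,
             geneD = 0.08, geneE = 0.51, geneF = 0.27)

# Assumption: flag values above the median of all J_n(x) values.
positives <- weights > median(weights)

# Same two-variable layout as described for the return value.
result <- data.frame(weights = weights, positives = positives)
print(result)
```

With an even number of genes and no ties, exactly half of the genes fall on each side of the median.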
The resampling is applied on two dimensions (see references): gene level resampling and subject level resampling. The gene level resampling is straightforward: each time, it draws d genes at random from all the genes in X. The subject level resampling is specified by the combination of values given in the resampling arguments, including resample.method and k.resample. The flat resampling on all subjects regardless of grouping, selected with the "flat" setting, is simply a leave-k-out random sampling, where k is given by k.resample. In more complex cases, the subject level resampling can be stratified based on the groups defined on y, in which case resample.method takes the value of either "sample" or "combn". With resample.method = "sample", it applies a leave-k-out random sampling within each group and finally only sample.n samples are generated from the resampling. With resample.method = "combn", all possible combinations that satisfy the restrictions given by k.resample are included. In this case, the total number of resampled samples varies depending on the sample size of the study.

"count.class" and "percent.class" define two ways to calculate the number of subjects to be left out in the random sampling. "count.class" indicates the exact number to be left out, and "percent.class" indicates the percentage of total subjects to be left out. In all cases, k.resample specifies the number of subjects left out in the leave-k-out sampling. If k.resample is a single integer, the subjects are sampled with k.resample subjects left out, either across all the subjects in the case of flat sampling, or within each group in the case of resampling stratified by group.
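The two resampling dimensions described above can be sketched in base R as follows. This is an illustration of the sampling idea only, not the package's internal code. The toy data, the variable names (gene.idx, keep.flat, keep.strat, all.flat) and the layout of X with genes in rows and subjects in columns are assumptions made for the example.

```r
set.seed(1)

# Toy data (hypothetical): 20 genes in rows, 8 subjects in columns.
X <- as.data.frame(matrix(rnorm(20 * 8), nrow = 20,
                          dimnames = list(paste0("gene", 1:20),
                                          paste0("s", 1:8))))
y <- factor(rep(c("mut", "wt"), each = 4))

d <- 5  # gene subset size
k <- 1  # number of subjects left out

# Gene level: draw d genes at random from all genes in X.
gene.idx <- sample(nrow(X), d)

# Flat subject level: leave k subjects out regardless of group.
keep.flat <- sort(sample(length(y), length(y) - k))

# Stratified subject level: leave k subjects out within each group of y.
keep.strat <- sort(unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), length(idx) - k)]
}), use.names = FALSE))

# Exhaustive alternative: every flat leave-k-out subset, one per column.
all.flat <- combn(length(y), length(y) - k)

X.flat <- X[gene.idx, keep.flat]
X.strat <- X[gene.idx, keep.strat]
dim(X.flat)     # d genes by (n - k) subjects
dim(X.strat)    # d genes by (n - 2k) subjects
ncol(all.flat)  # choose(n, k) possible subsets
```

Random draws correspond to the idea behind resample.method = "sample", while the combn() enumeration mirrors the idea behind resample.method = "combn", where the number of resampled sets is fixed by the sample size rather than chosen by the user.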
If k.resample is a vector of two integers, the sampling leaves out the given number of subjects from each of the two groups: the first number is assigned to the first factor level and the second number to the second factor level. Check factor(y) to see how the two numbers in k.resample would be assigned to the two groups. A vector with two values for k.resample produces an error under the "flat" setting. This flexible way of defining the sampling scheme allows easy specification of balanced sample sizes between groups. See references for more details.
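The assignment of a two-number k.resample to the groups can be checked before running the analysis. The sketch below uses only base R; the mutation labels and the values c(2, 1) are hypothetical, and left.out and kept are illustrative names rather than objects created by the package.

```r
# Hypothetical mutation status for nine subjects.
y <- factor(c("wt", "mut", "mut", "wt", "mut", "wt", "wt", "mut", "mut"))
levels(y)  # level order decides which number goes to which group

# Hypothetical two-number specification: first value for the first level.
k.resample <- c(2, 1)
left.out <- setNames(k.resample, levels(y))

# Subjects remaining in each group after leaving out k.resample subjects.
group.sizes <- table(y)
kept <- setNames(as.vector(group.sizes) - k.resample, levels(y))
print(left.out)
print(kept)
```

Here the larger group gives up two subjects and the smaller group one, so both groups keep three subjects, which is the kind of balanced design the two-number form is meant to support.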
Xu, Y., Guo, X., Sun, J. and Zhao, Z. Snowball: resampling combined with distance-based regression to discover transcriptional consequences of driver mutation, manuscript.
require(DESnowball)
data(snowball.demoData)
# check the demo dataset
print(sb.mutation)
head(sb.expression)
## A test run
Bn <- 10000
ncore <- 4
# call Snowball
## Not run: 
sb <- snowball(y=sb.mutation, X=sb.expression,
               ncore=ncore, d=100, B=Bn,
               sample.n=1)
# process the gene ranking and selection
sb.sel <- select.features(sb)
# plot the Jn values
plotJn(sb, sb.sel)
# get the significant gene list
top.genes <- toplist(sb.sel)
## End(Not run)