main function for Snowball analysis

Description

This is the main function for performing snowball analysis. It requires only minimal input; most operating parameters have default values.

Usage

snowball(y, X, ncore = 1, d = 300, B = 10000, B.i = 2000,
  sample.n = 100, resample.method = c("sample", "none", "combn"),
  mode.resample = c("count.class", "flat", "percent.class"), k.resample = 1)

Arguments

y

a factor variable for mutation status

X

a data.frame containing gene expression data. The columns of X correspond to samples and should be in the same order as the entries of y

ncore

number of processors to use for parallel computation. Set ncore = 1 or NULL for non-parallel computation mode

d

the size of the gene subset for gene-level resampling. See References for the role of d in X_d^x

B

bootstrap size, which is B in J_n(x). It defines the total number of gene subsets used to estimate J_n:

J_n(x) = \frac{1}{B} \sum_{i=1}^{B} \left( \frac{1}{K} \sum_{j=1}^{K} \phi_n(g(X_{i,j}), \kappa) \right)

B.i

bootstrap size deployed on each child job in parallel mode

sample.n

number of samples drawn from the subject level resampling, denoted as K in J_n(x). It is ignored if resample.method="none" or "combn"

resample.method

defines how the subject-level resampling is performed. The possible values are "sample", "none" and "combn". Use "sample" for random sampling with replacement, "none" for no resampling on the subject dimension, and "combn" for all combinations obtained by permuting the subjects in each group. See Note for more information.

mode.resample

specifies how the subjects are counted for subject-level leave-k-out random sampling, and whether stratification by group is applied. The possible values are "count.class", "flat" and "percent.class". "flat" means that no stratification is applied and the resampling is performed on all subjects pooled together from both groups. "count.class" means that the resampling leaves out a subset of subjects based on the number provided, and "percent.class" means that the number of subjects left out is calculated as a percentage of the total subjects in each group. See Note for more information.

k.resample

a numeric value specifying the number of subjects left out during the subject-level resampling. It is an integer if mode.resample = "count.class" and a number between 0 and 1 if mode.resample = "percent.class". See Note for more information.
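
The double average in J_n(x) given under B can be illustrated with a small numeric sketch. The matrix `phi` below is a hypothetical stand-in for the values φ_n(g(X_{i,j}), κ): it is filled with random numbers and is not produced by DESnowball. The sizes of B and K are kept small purely for illustration.

```r
set.seed(1)
B <- 10   # number of gene subsets (illustrative; the default is 10000)
K <- 4    # number of subject-level resamples, i.e. sample.n

# Hypothetical phi values: row i = gene subset, column j = subject resample
phi <- matrix(runif(B * K), nrow = B, ncol = K)

# Inner average over the K subject resamples for each gene subset
inner <- rowMeans(phi)

# Outer average over the B gene subsets gives the J_n value
Jn <- mean(inner)

# With the same K for every subset, this equals the grand mean of phi
print(all.equal(Jn, mean(phi)))
```

Because every gene subset uses the same K here, the nested average and the grand mean coincide; they would differ if the number of subject resamples varied between subsets.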

Value

A data.frame containing two variables: weights and positives. weights holds the J_n(x) values for all genes, and positives indicates whether each J_n(x) is above or below the median of all J_n(x) values.
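
As a rough illustration of this structure, the sketch below builds a data.frame with the same two columns from made-up weights. The gene names, the values, and the choice to code `positives` as TRUE for weights strictly above the median are all assumptions; the package may encode the indicator differently.

```r
# Made-up J_n(x) values for five hypothetical genes
weights <- c(geneA = 0.12, geneB = 0.45, geneC = 0.30, geneD = 0.80, geneE = 0.05)

# Assumed coding: TRUE when the weight is strictly above the median
positives <- weights > median(weights)

res <- data.frame(weights = weights, positives = positives)
print(res)
```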

Note

The resampling is applied on two dimensions (see References): gene-level resampling and subject-level resampling. The gene-level resampling is straightforward: each time, it draws d genes at random from all the genes in X. The subject-level resampling is specified by the combination of values given in sample.n, resample.method, mode.resample and k.resample.

The flat resampling on all subjects regardless of grouping, specified by letting mode.resample = "flat", is simply a leave-k-out random sampling, where k is given by k.resample. In more complex cases, the subject-level resampling can be stratified by the groups defined in y, with resample.method set to either "sample" or "combn". When resample.method = "sample", a leave-k-out random sampling is applied within each group, and only sample.n samples are generated from the resampling. When resample.method = "combn", all possible combinations that satisfy the restrictions given by mode.resample and k.resample are included. In this case, the total number of resampled samples depends on the sample size of the study.

mode.resample = "count.class" and "percent.class" define two ways to calculate the number of subjects left out in the random sampling: "count.class" indicates the exact number to be left out, and "percent.class" indicates the percentage of total subjects to be left out. In all cases, k.resample specifies the number of subjects left out in the leave-k-out sampling. If k.resample is a single integer, exactly k.resample subjects are left out, either across all subjects in the case of flat sampling or within each group in the case of resampling stratified by group. If k.resample is instead a vector of two integers, the sampling leaves out the corresponding number of subjects from each of the two groups. The first number is assigned to the first factor level and the second number to the second factor level of factor(y). Check factor(y) to see how the two numbers in k.resample are assigned to the two groups. A two-value vector for k.resample produces an error if mode.resample = "flat". This flexible way of defining the sampling scheme makes it easy to specify balanced sample sizes between groups. See References for more details.
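
The two resampling dimensions described in this Note can be sketched in base R. This is an illustration under assumptions, not the package's internal code: the toy data, the helper `leave_k_out()` and the variable names are hypothetical, and `k` plays the role of a two-element k.resample in "count.class" mode.

```r
set.seed(42)
# Toy data: 20 genes (rows) by 8 subjects (columns), two groups in y
X <- as.data.frame(matrix(rnorm(20 * 8), nrow = 20,
                          dimnames = list(paste0("g", 1:20), paste0("s", 1:8))))
y <- factor(c("mut", "mut", "mut", "wt", "wt", "wt", "wt", "wt"))

# Gene-level resampling: draw d genes at random from all genes in X
d <- 5
gene.subset <- sample(rownames(X), d)

# Hypothetical helper: one stratified leave-k-out draw.
# k[1] subjects are left out of the first level of factor(y), k[2] out of the second.
leave_k_out <- function(y, k) {
  lv <- levels(y)
  keep <- lapply(seq_along(lv), function(i) {
    idx <- which(y == lv[i])
    # index into idx to avoid sample()'s special case for a length-one vector
    idx[sample.int(length(idx), length(idx) - k[i])]
  })
  sort(unlist(keep))
}

# "sample"-style: a fixed number of random leave-k-out draws
k <- c(1, 2)
draws <- replicate(3, leave_k_out(y, k), simplify = FALSE)

# "combn"-style: every way of leaving out k[i] subjects from each group
mut.keep <- combn(which(y == "mut"), sum(y == "mut") - k[1], simplify = FALSE)
wt.keep  <- combn(which(y == "wt"),  sum(y == "wt")  - k[2], simplify = FALSE)
n.combn  <- length(mut.keep) * length(wt.keep)
```

With 3 and 5 subjects in the two groups and k = c(1, 2), the "combn"-style enumeration yields choose(3, 2) * choose(5, 3) = 30 resampled sets, which shows why the total depends on the sample size of the study.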

References

Xu, Y., Guo, X., Sun, J. and Zhao, Z. Snowball: resampling combined with distance-based regression to discover transcriptional consequences of driver mutation, manuscript.

Examples

library(DESnowball)
data(snowball.demoData)
# check the demo dataset
print(sb.mutation)
head(sb.expression)
## A test run
Bn <- 10000  # total number of gene subsets (B)
ncore <- 4   # number of processors for parallel computation
# call Snowball
## Not run: 
sb <- snowball(y = sb.mutation, X = sb.expression,
               ncore = ncore, d = 100, B = Bn,
               sample.n = 1)
# process the gene ranking and selection
sb.sel <- select.features(sb)
# plot the Jn values
plotJn(sb, sb.sel)
# get the significant gene list
top.genes <- toplist(sb.sel)

## End(Not run)