subtype: Cluster analysis to find molecular subtypes and their assessment


View source: R/subtype.R

subtype performs a biclustering procedure on an input dataset and assesses whether the resulting clusters are promising subtypes.

subtype(GEset, outcomeLabels, treatment = NULL, Npermutes = 10, Nchunks = 25,
        minClusterSizeB = 20, NclustersASet = 100, FDRpermutation = TRUE,
        nFDRperm = 50, seed = NULL, testMode = "quick", survivaltimes = NULL,
        method = "penalized", top_best_probes = 100, Niter = 20, showMovie = 0,
        redefineSubtypeMembers = 0, holdOut = 10)

`GEset`	p-by-n data matrix, where p is the number of variables (e.g. genes) and n is the number of subjects. Row and column names are necessary.
`outcomeLabels`	n-by-1 vector of binary prognosis labels assigned to the subjects. The subjects must be in the same order as the columns of GEset.
`treatment`	The default is NULL.
`Npermutes`	Number of permutations of the variables. In each permutation, the variables are assigned to different chunks. The default is 10.
`Nchunks`	Number of chunks of variables. When the number of variables is too large for clustering analysis, the variables are split into Nchunks chunks. The default is 25.
`minClusterSizeB`	The minimum number of subjects in each selected subtype. The default is 20.
`NclustersASet`	Number of groups into which the tree from hierarchical clustering is cut. The default is 100.
`FDRpermutation`	Whether the FDR computation is based on a permutation procedure. The default is TRUE.
`nFDRperm`	Number of permutations used to compute the FDR. The default is 50.
`seed`	Seed number for reproducibility. The default is NULL.
`testMode`	The mode is fixed at "quick".
`survivaltimes`	The default is NULL.
`method`	"penalized" is used.
`top_best_probes`	Number of top-ranked probes from the t-test that are used as input for penalized. The default is 100.
`Niter`	The number of iterations of (TrainingSet, TestSet) -> training -> test -> recordResults. The default is 20.
`showMovie`	Display ROC/survival curves and heatmaps. The default is 0.
`redefineSubtypeMembers`	Detect subtype members after every hold-out. The default is 0.
`holdOut`	Number of subjects held out of the subtype, i.e. Nsubtype - holdOut = Ntraining_set. The default is 10.
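
The input requirements listed above (named rows and columns, one binary label per subject, labels in column order) can be checked before calling subtype. The following is a minimal sketch in base R; the helper name check_subtype_inputs is hypothetical and is not part of the package, and the package may perform its own, different checks.

```r
# Hypothetical helper (an assumption, not part of the package): verifies the
# documented requirements on GEset and outcomeLabels before a long run.
check_subtype_inputs <- function(GEset, outcomeLabels) {
  stopifnot(is.matrix(GEset), is.numeric(GEset))
  # Row names (variables) and column names (subjects) are both required.
  if (is.null(rownames(GEset)) || is.null(colnames(GEset))) {
    stop("GEset needs both row names (variables) and column names (subjects)")
  }
  # One label per subject, i.e. per column of the p-by-n matrix.
  if (length(outcomeLabels) != ncol(GEset)) {
    stop("outcomeLabels must have one entry per column of GEset")
  }
  # The labels must be binary.
  if (length(unique(outcomeLabels)) != 2) {
    stop("outcomeLabels must take exactly two distinct values")
  }
  invisible(TRUE)
}

# Toy data: 100 variables (rows) by 10 subjects (columns).
set.seed(1)
GEset <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10,
                dimnames = list(paste0("G", 1:100), paste0("P", 1:10)))
outcomeLabels <- rep(c(1, 2), each = 5)
check_subtype_inputs(GEset, outcomeLabels)
```

The check cannot confirm that the labels are in the same subject order as the columns of GEset; that still has to be ensured when the two objects are built.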

subtype implements a biclustering algorithm to find hidden subtypes in a dataset. summary provides a measure based on FDR and its p-value for assessing the subtypes. Note that the R package rsmooth must be installed before running subtype; it can be downloaded from http://www.meb.ki.se/~yudpaw. For large datasets the computation can be heavy, so users should consider parallel processing in R.
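
To make the roles of Nchunks and NclustersASet concrete, the sketch below randomly splits the variables into chunks and cuts a hierarchical clustering tree within one chunk, using base R only. It is an illustration of the idea and not the package's internal code: the distance measure (Euclidean) and linkage (complete, the hclust default) are assumptions, and the toy matrix X is made up for the example.

```r
set.seed(1)
p <- 60   # number of variables
n <- 8    # number of subjects
X <- matrix(rnorm(p * n), nrow = p,
            dimnames = list(paste0("G", seq_len(p)), paste0("P", seq_len(n))))

Nchunks <- 5
NclustersASet <- 3

# Randomly permute the variables, then deal them into Nchunks chunks of
# (roughly) equal size. Repeating this Npermutes times gives different chunks.
perm <- sample(rownames(X))
chunk_id <- rep_len(seq_len(Nchunks), length(perm))
chunks <- split(perm, chunk_id)

# Within one chunk: hierarchical clustering of the variables, with the tree
# cut into NclustersASet groups.
chunk1 <- X[chunks[[1]], , drop = FALSE]
tree <- hclust(dist(chunk1))
groups <- cutree(tree, k = NclustersASet)
table(groups)
```

Splitting into chunks keeps each distance matrix small, which is what makes clustering feasible when p is large.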

resultsAll:	A matrix containing subtypeID and summary statistics for each subtypeID. For a specific subtypeID, it includes the number of genes, the number of subjects, and the area of low p-values (low_pValue_Area).
GenesDefiningSubtypes:	Variables in each subtype. These can be identified with subtypeID.
SubtypePatients:	Subjects in each subtype. These can be identified with subtypeID.

Andrey Alexeyenko, Woojoo Lee (maintainer: lwj221@gmail.com) and Yudi Pawitan

Alexeyenko, A. et al. (2011) Estimation of false discovery rate in a heterogeneous population.

set.seed(1234)
p <- 100   # number of variables
n1 <- 5    # number of samples from population 1
n2 <- 5    # number of samples from population 2

group <- c(rep(1, length.out = n1), rep(2, length.out = n2))
data <- matrix(rnorm((n1 + n2) * p), (n1 + n2), p)

############################

rownames(data) <- paste("P", runif(nrow(data), 0, 1), sep = "")  # row names (subjects)
colnames(data) <- paste("G", runif(ncol(data), 0, 1), sep = "")  # column names (variables)

### The following procedure takes about 1 minute.
A <- subtype(
   GEset = t(data),   # transpose to variables-by-subjects (p-by-n)
   outcomeLabels = group,
   Npermutes = 2,
   Nchunks = 5,
   NclustersASet = 3,
   seed = 1234
)

### f.out filters out uninteresting subtypes, e.g. with f.out = 2, subtypes having N01_0 <= 2 are ignored.
summary(A, f.out = 0)