adSplit: Annotation-Driven Splits
In adSplit: Annotation-Driven Clustering

This function searches for annotation-driven splits of patients in microarray data. A split is a partitioning of patients into two groups. To find candidate splits, the function refers to GO terms and KEGG pathways. DLD scores are used to judge the quality of a split. In addition, a significance measure can be computed by simulating a random distribution of scores.

adSplit(mydata, annotation.ids, chip.name, 
        min.probes = 20, max.probes = NULL, 
        B = NULL, min.group.size = 5, ngenes = 50, 
        ignore.genes = 5)

`mydata`	either an expression set as defined by the package `Biobase` or a matrix of expression levels (rows=genes, columns=samples).
`annotation.ids`	a vector of GO or KEGG identifiers in the form "GO:..." or "KEGG:..." respectively. The prefix "KEGG:" is removed from the KEGG-identifiers before accessing the chip's "...PATH2PROBES" hash.
`chip.name`	the name of the chip on which the expression data were measured. `adSplit` attempts to load a library of the same name and expects to find a hash called "<chip-name>GO2ALLPROBES" and one called "<chip-name>PATH2PROBES" there.
`min.probes`	annotation identifiers with fewer associated genes than this number are skipped.
`max.probes`	annotation identifiers with more associated genes than this number are skipped. The default is ten percent of the genes on the chip.
`B`	the number of random gene set samplings to be performed to compute empirical p-values.
`min.group.size`	filter criterion to avoid splits suggesting tiny groups. Splits where one of the two suggested groups is smaller than this number are removed from the split set.
`ngenes`	number of genes used to compute DLD scores.
`ignore.genes`	number of best scoring genes to be ignored when computing DLD scores.
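
The size filter described for `min.probes` and `max.probes` can be sketched in base R as follows. The probe names and the `gene.sets` list are hypothetical stand-ins for what the chip's annotation hashes would supply; this is only an illustration of the documented rule, not the package's own code.

```r
# Hypothetical chip with 1000 probes and three made-up gene sets.
set.seed(1)
all.probes <- paste0("probe", 1:1000)
gene.sets <- list(
  too.small = sample(all.probes, 8),
  in.range  = sample(all.probes, 40),
  too.large = sample(all.probes, 250)
)

min.probes <- 20
max.probes <- NULL
# Documented default: ten percent of the genes on the chip.
if (is.null(max.probes)) max.probes <- 0.1 * length(all.probes)

# Skip sets with fewer than min.probes or more than max.probes members.
set.sizes <- lengths(gene.sets)
kept <- gene.sets[set.sizes >= min.probes & set.sizes <= max.probes]
names(kept)
```

With these made-up sizes only the 40-probe set falls between the two bounds.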

This function applies the same splitting procedure to all annotation identifiers provided. Firstly, the associated genes for one identifier are determined and extracted from the expression data. Then the diana2means function is applied to the restricted data and the different splits generated are collected into a single splitSet object.
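
The per-identifier loop described above can be illustrated on simulated data. In this sketch plain k-means with two centers replaces `diana2means`, and both the expression matrix and the `gene.sets` list are invented for the example, so it shows the shape of the procedure rather than the package's actual algorithm.

```r
set.seed(42)
# Simulated expression matrix: 200 genes (rows) x 30 samples (columns).
expr <- matrix(rnorm(200 * 30), nrow = 200,
               dimnames = list(paste0("g", 1:200), paste0("s", 1:30)))
# Plant a group difference in genes g1..g20 for the first 15 samples.
expr[1:20, 1:15] <- expr[1:20, 1:15] + 3

# Hypothetical gene sets, one per annotation identifier.
gene.sets <- list(setA = paste0("g", 1:20), setB = paste0("g", 101:140))

# Stand-in for diana2means: two-group k-means on the samples.
split.two <- function(m) {
  kmeans(t(m), centers = 2, nstart = 10)$cluster - 1L
}

# Restrict the data to each gene set, split, and collect one row per set.
cuts <- t(sapply(gene.sets, function(ids) split.two(expr[ids, , drop = FALSE])))
dim(cuts)
```

The resulting matrix has one row per gene set and one column per sample, mirroring the layout of the `cuts` element described under the return value.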

Valid annotation identifiers are vectors of identifiers of the form KEGG:nnnnn and GO:nnnnnn. In addition, the keywords "KEGG", "GO" and "all" are allowed, representing all terms in the corresponding ontology.
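
A minimal sketch of how a mixed identifier vector could be sorted by prefix, using only base R. Stripping the "KEGG:" prefix follows the description of `annotation.ids`; leaving GO identifiers unchanged is an assumption made here, and the variable names are illustrative.

```r
annotation.ids <- c("GO:0006915", "KEGG:00251", "GO:0007165")

# Classify identifiers by their prefix.
is.kegg <- grepl("^KEGG:", annotation.ids)
is.go   <- grepl("^GO:", annotation.ids)

# KEGG identifiers lose their prefix before lookup;
# GO identifiers are kept as they are (assumption).
kegg.keys <- sub("^KEGG:", "", annotation.ids[is.kegg])
go.keys   <- annotation.ids[is.go]
```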

If B is set to an integer, that number of random samplings is used to generate a null distribution of DLD scores. This distribution is used to compute empirical p-values for each split. If more than one valid split is found, multiple testing is corrected for by applying the Benjamini-Hochberg correction from the multtest package.
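
The idea of an empirical p-value followed by a Benjamini-Hochberg correction can be sketched as below. Several things here are assumptions: the scores are invented, a single shared null distribution is drawn from a normal distribution for simplicity, larger scores are taken to be better, and base R's `p.adjust` is used in place of the multtest package.

```r
set.seed(7)
B <- 100

# Made-up observed scores for three splits and B simulated null scores.
observed <- c(splitA = 3.1, splitB = 0.4, splitC = 2.2)
null.scores <- rnorm(B)

# Empirical p-value: fraction of null scores at least as large as the observed one.
pvalue <- sapply(observed, function(s) mean(null.scores >= s))

# Benjamini-Hochberg adjustment across all splits.
qvalue <- p.adjust(pvalue, method = "BH")
```

With only `B` samplings the smallest p-value this estimator can resolve above zero is `1/B`, which is one reason to choose `B` with the number of tested identifiers in mind.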

Returns an object of class splitSet with the following list elements:

`cuts`	a matrix of split attributions. One row per annotation identifier (GO term or KEGG pathway) for which a split has been generated. One column per object in the dataset.
`score`	one score per generated split.
`pvalue`	one empirical p-value per generated split, or `NULL`.
`qvalue`	one q-value per generated split, computed according to the Benjamini-Hochberg correction for multiple testing, or `NULL`.
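
To show how these elements relate to the `min.group.size` filter, the sketch below builds a plain list with the same element names by hand and removes splits whose smaller group is too small. The list is not produced by `adSplit`, and the 0/1 encoding of the two groups in `cuts` is an assumption.

```r
# Hand-built stand-in with the element names listed above.
x <- list(
  cuts = rbind(
    balanced = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
    lopsided = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
  ),
  score  = c(balanced = 2.5, lopsided = 4.0),
  pvalue = NULL,
  qvalue = NULL
)
min.group.size <- 5

# Size of the smaller group per split; an empty group counts as size 0.
smaller.group <- apply(x$cuts, 1, function(row) {
  min(table(factor(row, levels = c(0, 1))))
})

# Keep only splits where both groups reach the minimum size.
keep <- smaller.group >= min.group.size
filtered <- list(cuts = x$cuts[keep, , drop = FALSE], score = x$score[keep])
rownames(filtered$cuts)
```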

Claudio Lottaz, Joern Toedling

diana2means, randomDiana2means, image.splitSet

 
# load the package and prepare data
library(adSplit)
library(golubEsets)
data(Golub_Merge)

# generate annotation-driven splits for apoptosis and signal transduction
x <- adSplit(Golub_Merge, "GO:0006915", "hu6800")
x <- adSplit(Golub_Merge, c("GO:0007165","GO:0006915"), "hu6800", max.probes=7000)

# generate a split for glutamate metabolism including 
# an empirical p-value
x <- adSplit(Golub_Merge, "KEGG:00251", "hu6800", B=100)

## Not run: 
# generate splits for all KEGG pathways.
x <- adSplit(Golub_Merge, "KEGG", "hu6800")
image(x)

## End(Not run)