Function for finding best features associated with clusters

Description

Calls limma on input data to determine features most associated with found clusters (based on an F-statistic, pairwise comparisons, or following a tree that clusters the clusters).

Usage

1
2
3
4
5
6
7
8
9
## S4 method for signature 'matrix'
getBestFeatures(x, cluster, contrastType = c("F", "Dendro",
  "Pairs", "OneAgainstAll"), dendro = NULL, pairMat = NULL,
  contrastAdj = c("All", "PerContrast", "AfterF"), isCount = FALSE,
  normalize.method = "none", ...)

## S4 method for signature 'ClusterExperiment'
getBestFeatures(x, contrastType = c("F",
  "Dendro", "Pairs", "OneAgainstAll"), isCount = FALSE, ...)

Arguments

x

data for the test. Can be a numeric matrix or a ClusterExperiment.

cluster

A numeric vector with cluster assignments. “-1” indicates the sample was not assigned to a cluster.

contrastType

What type of test to do. ‘F’ gives the omnibus F-statistic, ‘Dendro’ traverses the given dendrogram and does contrasts of the samples in each side, ‘Pairs’ does pair-wise contrasts based on the pairs given in pairMat (if pairMat=NULL, does all pairwise), and ‘OneAgainstAll’ compares each cluster to the average of all others. Passed to clusterContrasts

dendro

The dendrogram to traverse if contrastType="Dendro". Note that this should be the dendrogram of the clusters, not of the individual samples.

pairMat

matrix giving the pairs of clusters for which to do pair-wise contrasts (must match to elements of cl). If NULL, will do all pairwise of the clusters in cluster (excluding "-1" categories). Each row is a pair to be compared and must match the names of the clusters in the vector cluster.

contrastAdj

What type of FDR correction to do for contrasts tests (i.e. if contrastType='Dendro' or 'Pairs').

isCount

logical as to whether input data is count data, in which case to perform voom correction to data. See details.

normalize.method

character value, passed to voom in limma package. Only used if countData=TRUE. Note that the default value is set to "none", which is not the default value of voom.

...

options to pass to topTable or topTableF (see limma package)

Details

getBestFeatures returns the top ranked features corresponding to a cluster assignment. It uses limma to fit the models, and limma's functions topTable or topTableF to find the best features. See the options of these functions to put better control on what gets returned (e.g. only if significant, only if log-fc is above a certain amount, etc.). In particular, set 'number=' to define how many significant features to return (where number is per contrast for the 'Pairs' or 'Dendro' option)

When 'contrastType' argument implies that the best features should be found via contrasts (i.e. 'contrastType' is 'Pairs' or 'Dendro'), then then 'contrastAdj' determines the type of multiple testing correction to perform. 'PerContrast' does FDR correction for each set of contrasts, and does not guarantee control across all the different contrasts (so probably not the preferred method). 'All' calculates the corrected p-values based on FDR correction of all of the contrasts tested. 'AfterF' controls the FDR based on a hierarchical scheme that only tests the contrasts in those genes where the omnibus F statistic is significant. If the user selects 'AfterF', the user must also supply an option 'p.value' to have any effect, and then only those significant at that p.value level will be returned. Note that currently the correction for 'AfterF' is not guaranteed to control the FDR; improvements will be added in the future.

Note that the default option for topTable is to not filter based on adjusted p-values (p.value = 1) and return only the top 10 most significant (number = 10) – these are options the user can change (these arguments are passed via the ... in getBestFeatures). In particular, it only makes sense to set requireF = TRUE if p.value is meaningful (e.g. 0.1 or 0.05); the default value of p.value = 1 will not result in any effect on the adjusted p-value otherwise.

isCount triggers whether the "voom" correction will be performed in limma. If the input data is a matrix is counts (or a 'ClusterExperiment' object with counts as the primary data before transformation) this should be set to TRUE and they will be log-transformed internally by voom for the differential expression analysis in a way that accounts for the difference in the mean-variance relationships. Otherwise, dat should be on the correct (log) scale for differential expression analysis without a need a variance stabilization (e.g. microarray data). Currently the default is set to FALSE, simply because the isCount has not been heavily tested. If the But TRUE with x being counts really should be the default for RNA-Seq data. If the input data is a 'ClusterExperiment' object, setting 'isCount=TRUE' will cause the program to ignore the internally stored transformation function and instead use voom with log2(x+0.5). Alternatively, 'isCount=FALSE' for a 'ClusterExperiment' object will cause the DE to be performed with 'limma' after transforming the data with the stored transformation. Although some writing about "voom" seem to suggest that it would be appropriate for arbitrary transformations, the authors have cautioned against using it for anything other than count data on mailing lists. For this reason we are not implementing it for arbitrary transformations at this time (e.g. log(FPKM+epsilon) transformations).

Value

A data.frame in the same format as topTable, except for the following additional or changed columns:

  • Feature This is the column called 'ProbeID' by topTable

  • IndexInOriginal Gives the index of the feature to the original input dataset, x

  • Contrast The contrast that the results corresponds to (if applicable, depends on contrastType argument)

  • ContrastName The name of the contrast that the results corresponds to. For dendrogram searches, this will be the node of the tree of the dendrogram.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
data(simData)

#create a clustering, for 8 clusters (truth was 4)
cl <- clusterSingle(simData, clusterFunction="pam", subsample=FALSE,
sequential=FALSE, clusterDArgs=list(k=8))

#basic F test, return all, even if not significant:
testF <- getBestFeatures(cl, contrastType="F", number=nrow(simData),
isCount=FALSE)

#Do all pairwise, only return significant, try different adjustments:
pairsPerC <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="PerContrast",
p.value=0.05, isCount=FALSE)
pairsAfterF <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="AfterF",
p.value=0.05, isCount=FALSE)
pairsAll <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="All",
p.value=0.05, isCount=FALSE)

#not useful for this silly example, but could look at overlap with Venn
allGenes <- paste("Row", 1:nrow(simData),sep="")
if(require(limma)){
 vennC <- vennCounts(cbind(PerContrast= allGenes %in% pairsPerC$Feature,
 AllJoint=allGenes %in% pairsAll$Feature, FHier=allGenes %in%
 pairsAfterF$Feature))
vennDiagram(vennC, main="FDR Overlap")
}

#Do one cluster against all others
oneAll <- getBestFeatures(cl, contrastType="OneAgainstAll", contrastAdj="All",
p.value=0.05)

#Do dendrogram testing
hcl <- makeDendrogram(cl)
allDendro <- getBestFeatures(hcl, contrastType="Dendro", contrastAdj=c("All"),
number=ncol(simData), p.value=0.05)

# do DE on counts using voom
# compare results to if used simData instead (not on count scale).
# Again, not relevant for this silly example, but basic principle useful
testFVoom <- getBestFeatures(simCount, primaryCluster(cl), contrastType="F",
number=nrow(simData), isCount=TRUE)
plot(testF$P.Value[order(testF$Index)],
testFVoom$P.Value[order(testFVoom$Index)],log="xy")