getBestFeatures: Function for finding best features associated with clusters
In epurdom/clusterCells: Compare Clusterings for Single-Cell Sequencing

getBestFeatures

R Documentation

Function for finding best features associated with clusters

Description

Calls limma on input data to determine features most associated with found clusters (based on an F-statistic, pairwise comparisons, or following a tree that clusters the clusters).

Usage

## S4 method for signature 'matrixOrHDF5'
getBestFeatures(
  x,
  cluster,
  contrastType = c("F", "Dendro", "Pairs", "OneAgainstAll"),
  dendro = NULL,
  pairMat = NULL,
  weights = NULL,
  contrastAdj = c("All", "PerContrast", "AfterF"),
  DEMethod = c("edgeR", "limma", "limma-voom"),
  dgeArgs = NULL,
  ...
)

## S4 method for signature 'ClusterExperiment'
getBestFeatures(
  x,
  contrastType = c("F", "Dendro", "Pairs", "OneAgainstAll"),
  whichCluster = "primary",
  whichAssay = 1,
  DEMethod,
  weights = if ("weights" %in% assayNames(x)) "weights" else NULL,
  ...
)

Arguments

`x`	data for the test. Can be a numeric matrix or a `ClusterExperiment`.
`cluster`	A numeric vector with cluster assignments. “-1” indicates the sample was not assigned to a cluster.
`contrastType`	What type of test to do. ‘F’ gives the omnibus F-statistic, ‘Dendro’ traverses the given dendrogram and does contrasts of the samples in each side, ‘Pairs’ does pair-wise contrasts based on the pairs given in pairMat (if pairMat=NULL, does all pairwise), and ‘OneAgainstAll’ compares each cluster to the average of all others. Passed to `clusterContrasts`
`dendro`	The dendrogram to traverse if contrastType="Dendro". Note that this should be the dendrogram of the clusters, not of the individual samples, either of class "dendrogram" or "phylo4"
`pairMat`	matrix giving the pairs of clusters for which to do pair-wise contrasts (must match to elements of cl). If NULL, will do all pairwise of the clusters in `cluster` (excluding "-1" categories). Each row is a pair to be compared and must match the names of the clusters in the vector `cluster`.
`weights`	weights to use in by edgeR. If `x` is a matrix, then weights should be a matrix of weights, of the same dimensions as `x`. If `x` is a `ClusterExperiment` object `weights` can be a either a matrix, as previously described, or a character or numeric index to an assay in `x` that contains the weights. We recommend that weights be stored as an assay with name `"weights"` so that the weights will also be used with `mergeClusters`, and this is the default. Setting `weights=NULL` ensures that weights will NOT be used, and only the standard edgeR.
`contrastAdj`	What type of FDR correction to do for contrasts tests (i.e. if contrastType='Dendro' or 'Pairs').
`DEMethod`	character vector describing how the differential expression analysis should be performed (replaces previous argument `isCount`. See details.
`dgeArgs`	a list of arguments to pass to `DGEList` which is the starting point for both `edgeR` and `limma-voom` methods of DE. This includes normalization factors/total count values etc.
`...`	If `x` is a matrix, these are options to pass to `topTable` (see `limma` package). If `x` is a `ClusterExperiment` object, these arguments can also be those to pass to the matrix version.
`whichCluster`	argument that can be a single numeric or character value indicating the single clustering to be used. Giving values that result in more than one clustering will result in an error. See details of `getClusterIndex`.
`whichAssay`	numeric or character specifying which assay to use. See `assay` for details.

Details

getBestFeatures returns the top ranked features corresponding to a cluster assignment. It uses either limma or edgeR to fit the models, and limma/edgeR functions topTable to find the best features. See the options of this function to put better control on what gets returned (e.g. only if significant, only if log-fc is above a certain amount, etc.). In particular, set 'number=' to define how many significant features to return (where number is per contrast for the 'Pairs' or 'Dendro' option)

DEMethod triggers what type of differential expression analysis will be performed. Three options are available: limma, edgeR, and limma with a voom corrections. The last two options are only appropriate for count data. If the input x is a ClusterExperiment object, and DEMethod="limma", then the data analyzed for DE will be after taking the transformation of the data (as given in the transformation slot of the object). For the options "limma-voom" and "edgeR", the transformation slot will be ignored and only the counts data (as specified by the whichAssay slot) will be passed to the programs. Note that for "limma-voom" this implies that the data will be transformed by voom with the function log2(x+0.5). If weights is not NULL, and DEMethod="edgeR", then the function glmWeightedF from the zinbwave package is run; otherwise glmLRT from edgeR.

Note that the argument DEMethod replaces the previous option isCount, to decide on the method of DE.

When 'contrastType' argument implies that the best features should be found via contrasts (i.e. 'contrastType' is 'Pairs' or 'Dendro'), then then 'contrastAdj' determines the type of multiple testing correction to perform. 'PerContrast' does FDR correction for each set of contrasts, and does not guarantee control across all the different contrasts (so probably not the preferred method). 'All' calculates the corrected p-values based on FDR correction of all of the contrasts tested. 'AfterF' controls the FDR based on a hierarchical scheme that only tests the contrasts in those genes where the omnibus F statistic is significant. If the user selects 'AfterF', the user must also supply an option 'p.value' to have any effect, and then only those significant at that p.value level will be returned. Note that currently the correction for 'AfterF' is not guaranteed to control the FDR; improvements will be added in the future.

Note that the default option for topTable is to not filter based on adjusted p-values (p.value = 1) and return only the top 10 most significant (number = 10) – these are options the user can change (these arguments are passed via the ... in getBestFeatures). In particular, it only makes sense to set requireF = TRUE if p.value is meaningful (e.g. 0.1 or 0.05); the default value of p.value = 1 will not result in any effect on the adjusted p-value otherwise.

Value

A data.frame in the same format as topTable, or topTags. The output differs between these two programs, mainly in the naming of columns. Furthermore, if weights are used, an additional column padjFilter is included as the result of running glmWeightedF with default option independentFiltering = TRUE. The following column names are the same between all of the DE methods.

Feature This is the column called 'ProbeID' by topTable
IndexInOriginal Gives the index of the feature to the original input dataset, x
Contrast The contrast that the results corresponds to (if applicable, depends on contrastType argument)
ContrastName The name of the contrast that the results corresponds to. For dendrogram searches, this will be the node of the tree of the dendrogram. If x is a ClusterExperiment object, this name will make use of the user defined names of the cluster or node in x.
InternalName Only present if x is a ClusterExperiment object. In this case this column will give the name of the contrast using the internal ids of the clusters and nodes, not the user-defined names. This provides stability in matching the contrast if the user has changed the names since running getBestFeatures
P.Value The unadjusted p-value (changed from PValue in topTags)
adj.P.Val The adjusted p-value (changed from FDR or FWER in topTags)

References

Ritchie, ME, Phipson, B, Wu, D, Hu, Y, Law, CW, Shi, W, and Smyth, GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. http://nar.oxfordjournals.org/content/43/7/e47

Law, CW, Chen, Y, Shi, W, and Smyth, GK (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15, R29. http://genomebiology.com/2014/15/2/R29

Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, Volume 3, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf

Examples

data(simData)

#create a clustering, for 8 clusters (truth was 4)
cl <- clusterSingle(simData, subsample=FALSE,
sequential=FALSE, mainClusterArgs=list(clusterFunction="pam", clusterArgs=list(k=8)))

#basic F test, return all, even if not significant:
testF <- getBestFeatures(cl, contrastType="F", number=nrow(simData),
DEMethod="limma")

#Do all pairwise, only return significant, try different adjustments:
pairsPerC <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="PerContrast",
p.value=0.05, DEMethod="limma")
pairsAfterF <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="AfterF",
p.value=0.05, DEMethod="limma")
pairsAll <- getBestFeatures(cl, contrastType="Pairs", contrastAdj="All",
p.value=0.05, DEMethod="limma")

#not useful for this silly example, but could look at overlap with Venn
allGenes <- paste("Row", 1:nrow(simData),sep="")
if(require(limma)){
 vennC <- vennCounts(cbind(PerContrast= allGenes %in% pairsPerC$Feature,
 AllJoint=allGenes %in% pairsAll$Feature, FHier=allGenes %in%
 pairsAfterF$Feature))
vennDiagram(vennC, main="FDR Overlap")
}

#Do one cluster against all others
oneAll <- getBestFeatures(cl, contrastType="OneAgainstAll", contrastAdj="All",
p.value=0.05,DEMethod="limma")

#Do dendrogram testing
hcl <- makeDendrogram(cl)
allDendro <- getBestFeatures(hcl, contrastType="Dendro", contrastAdj=c("All"),
number=ncol(simData), p.value=0.05,DEMethod="limma")

# do DE on counts using voom
# compare results to if used simData instead (not on count scale).
testFVoom <- getBestFeatures(simCount, primaryCluster(cl), contrastType="F",
number=nrow(simData), DEMethod="limma-voom")
plot(testF$P.Value[order(testF$Index)],
testFVoom$P.Value[order(testFVoom$Index)],log="xy")

# do DE on counts using edgeR, compare voom
testFEdge <- getBestFeatures(simCount, primaryCluster(cl), contrastType="F",
n=nrow(simData), DEMethod="edgeR")
plot(testFVoom$P.Value[order(testFVoom$Index)],
testFEdge$P.Value[order(testFEdge$Index)],log="xy")

epurdom/clusterCells documentation built on April 28, 2024, 8:14 p.m.