denoisePCA | R Documentation |
Denoise log-expression data by removing principal components corresponding to technical noise.
getDenoisedPCs(x, ...)
## S4 method for signature 'ANY'
getDenoisedPCs(
x,
technical,
subset.row,
min.rank = 5,
max.rank = 50,
fill.missing = FALSE,
BSPARAM = bsparam(),
BPPARAM = SerialParam()
)
## S4 method for signature 'SummarizedExperiment'
getDenoisedPCs(x, ..., assay.type = "logcounts")
denoisePCA(
x,
...,
value = c("pca", "lowrank"),
preserve.shape = TRUE,
assay.type = "logcounts",
name = NULL
)
denoisePCANumber(var.exp, var.tech, var.total)
x |
For For |
... |
For the For the |
technical |
An object containing the technical components of variation for each gene in
|
subset.row |
A logical, character or integer vector specifying the rows of |
min.rank , max.rank |
Integer scalars specifying the minimum and maximum number of PCs to retain. |
fill.missing |
Logical scalar indicating whether entries in the rotation matrix should be imputed for genes that were not used in the PCA.
Only relevant if |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA. |
BPPARAM |
A BiocParallelParam object to use for parallel processing. |
assay.type |
A string specifying which assay values to use. |
value |
String specifying the type of value to return.
|
preserve.shape |
Logical scalar indicating whether or not the output SingleCellExperiment should be subsetted to |
name |
String containing the name which which to store the results.
Defaults to |
var.exp |
A numeric vector of the variances explained by successive PCs, starting from the first (but not necessarily containing all PCs). |
var.tech |
A numeric scalar containing the variance attributable to technical noise. |
var.total |
A numeric scalar containing the total variance in the data. |
This function performs a principal components analysis to eliminate random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological processes should be captured by earlier PCs as more variance can be explained by the correlated behavior of sets of genes in a particular pathway. The idea is to discard later PCs to remove noise and improve resolution of population structure. This also has the benefit of reducing computational work for downstream steps.
The choice of the number of PCs to discard is based on the estimates of technical variance in technical
.
This argument accepts a number of different values, depending on how the technical noise is calculated - this generally involves functions such as modelGeneVarWithSpikes
or modelGeneVarByPoisson
.
The percentage of variance explained by technical noise is estimated by summing the technical components across genes and dividing by the summed total variance.
Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the overall technical estimate.
Now, consider the retention of the first d
PCs.
For a given value of d
, we compute the variance explained by all of the later PCs.
We aim to find the smallest value of d
such that the sum of variances explained by the later PCs is still less than the variance attributable to technical noise.
This choice of d
represents a lower bound on the number of PCs that can be retained before biological variation is definitely lost.
We use this value to obtain a “reasonable” dimensionality for the PCA output.
Note that d
will be coerced to lie between min.rank
and max.rank
.
This mitigates the effect of occasional extreme results when the percentage of noise is very high or low.
For getDenoisedPCs
, a list is returned containing:
components
, a numeric matrix containing the selected PCs (columns) for all cells (rows).
This has number of columns between min.rank
and max.rank
inclusive.
rotation
, a numeric matrix containing rotation vectors (columns) for some or all genes (rows).
This has number of columns between min.rank
and max.rank
inclusive.
var.explained
, a numeric vector containing the variance explained by the first max.rank
PCs.
percent.var
, a numeric vector containing the percentage of variance explained by the first max.rank
PCs.
Note that this may not sum to 100% if max.rank
is smaller than the total number of PCs.
used.rows
, a integer vector specifying the rows of x
that were used in the PCA.
denoisePCA
will return a modified x
with:
the PC results stored in the reducedDims
as a "PCA"
entry, if type="pca"
.
a low-rank approximation as a new "lowrank"
assay, if type="lowrank"
.
This is represented as a LowRankMatrix.
denoisePCANumber
will return an integer scalar specifying the number of PCs to retain.
This is equivalent to the output from getDenoisedPCs
after setting value="n"
, but ignoring any setting of min.rank
or max.rank
.
We can use subset.row
to perform the PCA on a subset of genes of interest.
This is typically used to subset to HVGs to reduce computational time and increase the signal-to-noise ratio of downstream analyses.
Note that only rows with positive components are actually used in the PCA, even if we explicitly specified them in subset.row
.
The final set of genes used in the PCA is returned in used.rows
.
If fill.missing=TRUE
, entries of the rotation matrix are imputed for all genes in x
.
This includes “unselected” genes, i.e., with negative biological components or that were not selected with subset.row
.
Rotation vectors are extrapolated to these genes by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes.
This is useful for guaranteeing that any low-rank approximation has the same dimensions as the input x
.
For example, calling denoisePCA
with preserve.shape=TRUE
will use fill.missing=TRUE
internally,
which guarantees that any value="lowrank"
setting will be of the same dimensions as the input x
.
Otherwise, if fill.missing=FALSE
and preserve.shape=FALSE
, the output is exactly the same as if the function had been run on x[subset.row,]
.
The function's choice of d
is only optimal if the early PCs capture all the biological variation with minimal noise.
This is unlikely to be true as the PCA cannot distinguish between technical noise and weak biological signal in the later PCs.
In practice, the chosen d
can only be treated as a lower bound for the retention of signal, and it is debatable whether this has any particular relation to the “best” choice of the number of PCs.
For example, many aspects of biological variation are not that interesting (e.g., transcriptional bursting, metabolic fluctuations) and it is often the case that we do not need to retain this signal, in which case the chosen d
- despite being a lower bound - may actually be higher than necessary.
Interpretation of the choice of d
is even more complex if technical
was generated with modelGeneVar
rather than modelGeneVarWithSpikes
or modelGeneVarByPoisson
.
The former includes “uninteresting” biological variation in its technical component estimates, increasing the proportion of variance attributed to technical noise and yielding a lower value of d
.
Indeed, use of results from modelGeneVar
often results in d
being set to to min.rank
, which can be problematic if secondary factors of biological variation are discarded.
Aaron Lun
Lun ATL (2018). Discussion of PC selection methods for scRNA-seq data. https://github.com/LTLA/PCSelection2018
modelGeneVarWithSpikes
and modelGeneVarByPoisson
, for methods of computing technical components.
runSVD
, for the underlying SVD algorithm(s).
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
# Modelling the variance:
var.stats <- modelGeneVar(sce)
hvgs <- getTopHVGs(var.stats, n=5000)
# Denoising:
pcs <- getDenoisedPCs(sce, technical=var.stats)
head(pcs$components)
head(pcs$rotation)
head(pcs$percent.var)
# Automatically storing the results.
sce <- denoisePCA(sce, technical=var.stats, subset.row=hvgs)
reducedDimNames(sce)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.