fixedPCA: PCA with a fixed number of components

View source: R/fixedPCA.R

fixedPCA R Documentation

PCA with a fixed number of components

Description

Perform a PCA where the desired number of components is known ahead of time.

Usage

fixedPCA(
  x,
  rank = 50,
  value = c("pca", "lowrank"),
  subset.row,
  preserve.shape = TRUE,
  assay.type = "logcounts",
  name = NULL,
  BSPARAM = bsparam(),
  BPPARAM = SerialParam()
)

Arguments

x

A SingleCellExperiment object containing a log-expression matrix.

rank

Integer scalar specifying the number of components.

value

String specifying the type of value to return. "pca" will return the PCs and "lowrank" will return a low-rank approximation.

subset.row

A logical, character or integer vector specifying the rows of x to use in the PCA. If not supplied, defaults to NULL (i.e., all rows are used) with a warning.

preserve.shape

Logical scalar indicating whether the output SingleCellExperiment should retain all rows of x (TRUE) or be subsetted to subset.row (FALSE). Only used if subset.row is not NULL.

assay.type

A string specifying which assay values to use.

name

String containing the name with which to store the results. Defaults to "PCA" in the reducedDimNames for value="pca" and "lowrank" in the assays for value="lowrank".

BSPARAM

A BiocSingularParam object specifying the algorithm to use for PCA.

BPPARAM

A BiocParallelParam object to use for parallel processing.

Details

In theory, there is an optimal number of components for any given application, but in practice, the criterion for the optimum is difficult to define. As a result, it is often satisfactory to take an a priori-defined “reasonable” number of PCs for downstream analyses. A good rule of thumb is to set this to the upper bound on the expected number of subpopulations in the dataset (see the reasoning in getClusteredPCs).

We can use subset.row to perform the PCA on a subset of genes. This is typically used to subset to HVGs to reduce computational time and increase the signal-to-noise ratio of downstream analyses. If preserve.shape=TRUE, the rotation matrix is extrapolated to include loadings for “unselected” genes, i.e., those not in subset.row. This is done by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes. By doing so, we ensure that the output always has the same number of rows as x, so that the low-rank approximation from value="lowrank" can be stored in the assays.
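The projection step described above can be illustrated with a minimal base R sketch. This is not the package's implementation; the toy matrix, the variable names (selected, k, rot.other) and the use of svd() are assumptions made purely for illustration.

```r
set.seed(42)

# Hypothetical toy data: 20 genes (rows) by 30 cells (columns).
x <- matrix(rnorm(20 * 30), nrow = 20)
selected <- 1:8   # stand-in for subset.row
k <- 3            # stand-in for rank

# Centre each gene across cells before the decomposition.
centered <- x - rowMeans(x)

# SVD on the selected genes only, with cells as rows.
sv <- svd(t(centered[selected, ]), nu = k, nv = k)
d <- sv$d[seq_len(k)]
pcs <- sweep(sv$u, 2, d, "*")   # cell coordinates (30 x k)

# Loadings for unselected genes: project their centred profiles onto the
# left singular vectors and rescale by the singular values.
rot.other <- sweep(centered[-selected, ] %*% sv$u, 2, d, "/")

# Assemble a rotation matrix with one row per gene of x.
rotation <- matrix(0, nrow(x), k)
rotation[selected, ] <- sv$v
rotation[-selected, ] <- rot.other

# Low-rank approximation of the centred values, same dimensions as x.
lowrank <- rotation %*% t(pcs)
dim(lowrank)
```

Applying the same projection to the selected genes recovers their original loadings, which is why the extrapolation is consistent with the SVD on the subset.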

Otherwise, if preserve.shape=FALSE, the output is subsetted by any non-NULL value of subset.row. This is equivalent to the return value after calling the function on x[subset.row,].
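As a hedged illustration of the two settings, the sketch below compares the number of rows in the output under each value of preserve.shape. The mock dataset, the choice of rank=10 and the object names (full, trimmed) are assumptions for demonstration only.

```r
library(scran)
library(scuttle)

set.seed(100)
sce <- logNormCounts(mockSCE())
hvgs <- getTopHVGs(modelGeneVar(sce), n = 500)

# Default: the output keeps every row of the input.
full <- fixedPCA(sce, rank = 10, subset.row = hvgs)

# preserve.shape=FALSE: the output is subsetted to the selected genes.
trimmed <- fixedPCA(sce, rank = 10, subset.row = hvgs, preserve.shape = FALSE)

nrow(full)
nrow(trimmed)
dim(reducedDim(trimmed, "PCA"))
```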

Value

A modified x with:

  • the PC results stored in the reducedDims as a "PCA" entry, if value="pca". The attributes contain the rotation matrix, the variance explained and the percentage of variance explained. (Note that the last may not sum to 100% if rank is smaller than the total number of PCs.)

  • a low-rank approximation stored as a new "lowrank" assay, if value="lowrank". This is represented as a LowRankMatrix.
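The two kinds of return value can be inspected along the following lines. This is a sketch on mock data with an arbitrary rank=10; the object names (pca, lr) are hypothetical, and the attribute names are listed at run time rather than assumed, apart from "percentVar", which appears in the Examples.

```r
library(scran)
library(scuttle)

set.seed(100)
sce <- logNormCounts(mockSCE())
hvgs <- getTopHVGs(modelGeneVar(sce), n = 500)

# value="pca": results are stored in the reducedDims.
pca <- fixedPCA(sce, rank = 10, subset.row = hvgs, value = "pca")
reducedDimNames(pca)
names(attributes(reducedDim(pca, "PCA")))

# With only 10 components retained, the percentages need not sum to 100.
sum(attr(reducedDim(pca, "PCA"), "percentVar"))

# value="lowrank": the approximation is stored as a new assay.
lr <- fixedPCA(sce, rank = 10, subset.row = hvgs, value = "lowrank")
assayNames(lr)
class(assay(lr, "lowrank"))
```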

Author(s)

Aaron Lun

See Also

denoisePCA, where the number of PCs is automatically chosen.

getClusteredPCs, another method to choose the number of PCs.

Examples

library(scran)
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Modelling the variance:
var.stats <- modelGeneVar(sce)
hvgs <- getTopHVGs(var.stats, n=1000)

# Defaults to pulling out the top 50 PCs.
set.seed(1000)
sce <- fixedPCA(sce, subset.row=hvgs)
reducedDimNames(sce)

# Get the percentage of variance explained. 
attr(reducedDim(sce), "percentVar")


MarioniLab/scran documentation built on March 7, 2024, 1:45 p.m.