chooseMarchenkoPastur: Choosing PCs with the Marchenko-Pastur limit
In PCAtools: PCAtools: Everything Principal Components Analysis


Use the Marchenko-Pastur limit to choose the number of top PCs to retain.

chooseMarchenkoPastur(x, .dim = dim(x), var.explained, noise)

`x`	The data matrix used for the PCA, containing variables in rows and observations in columns. Ignored if `.dim` is supplied.
`.dim`	An integer vector containing the dimensions of the data matrix used for PCA. The first element should contain the number of variables and the second element should contain the number of observations.
`var.explained`	A numeric vector containing the variance explained by successive PCs. This should be sorted in decreasing order. Note that this should be the variance explained, NOT the percentage of variance explained!
`noise`	Numeric scalar specifying the variance of the random noise.

For a random matrix with i.i.d. values, the Marchenko-Pastur (MP) limit defines the maximum eigenvalue of its sample covariance matrix, in the asymptotic sense. Let us assume that `x` is the sum of some low-rank truth and some i.i.d. random matrix with variance `noise`. We can use the MP limit to determine the maximum variance that could be explained by a fully random PC; all PCs that explain more variance are thus likely to contain real structure and should be retained.
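As a rough illustration of the limit described above, the standard asymptotic form of the upper MP edge can be written as follows. The symbols p (number of variables), n (number of observations) and \sigma^2 (the value passed as `noise`) are notation introduced here, and whether the function applies exactly this expression (for example, using n or n - 1) is an assumption rather than something stated on this page.

```latex
\lambda_{+} = \sigma^2 \left( 1 + \sqrt{\frac{p}{n}} \right)^2

% Worked example with p = 100, n = 1000, \sigma^2 = 4:
\lambda_{+} = 4 \left( 1 + \sqrt{0.1} \right)^2 \approx 4 \times 1.7325 \approx 6.93
```

Under this form, a PC would be retained only if its variance explained exceeds roughly 6.93 for a 100-by-1000 matrix with a noise variance of 4.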

Of course, this has some obvious caveats, such as the unrealistic i.i.d. assumption and the need to estimate `noise`. Moreover, PCs below the MP limit are not necessarily uninformative or lacking structure; it is just that their variance explained does not exceed the most extreme case that random noise has to offer.
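To give a feel for how much the result can depend on the noise estimate, here is a minimal base-R sketch that counts PCs above an MP-style limit for several guesses of the noise variance. The helper `count_above_mp` is hypothetical and uses the asymptotic form `noise * (1 + sqrt(p / n))^2`, which is an assumption and may differ from what `chooseMarchenkoPastur` computes internally.

```r
set.seed(42)

# Simulate a low-rank signal plus i.i.d. noise:
# 100 variables (rows) by 1000 observations (columns).
p <- 100
n <- 1000
truth <- matrix(rnorm(p * 10), nrow = p)
truth <- truth[, sample(ncol(truth), n, replace = TRUE)]
obs <- truth + rnorm(length(truth), sd = 2)  # true noise variance is 4

# Variance explained by each PC; prcomp() expects observations in rows.
var.explained <- prcomp(t(obs))$sdev^2

# Hypothetical helper (not part of PCAtools): count PCs above the limit.
# ASSUMPTION: the limit is the asymptotic upper MP edge.
count_above_mp <- function(var.explained, p, n, noise) {
    limit <- noise * (1 + sqrt(p / n))^2
    sum(var.explained > limit)
}

# Try a range of guesses for the noise variance.
noise.guesses <- c(1, 2, 4, 8, 16)
retained <- vapply(
    noise.guesses,
    function(s2) count_above_mp(var.explained, p, n, s2),
    integer(1)
)
print(data.frame(noise = noise.guesses, retained = retained))
```

Because the limit scales linearly with the noise variance, the number of retained PCs can only stay the same or drop as the guess grows, so an underestimated noise variance tends to retain PCs that are mostly noise.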

An integer scalar specifying the number of PCs with variance explained beyond the MP limit. The limit itself is returned in the attributes.

Aaron Lun

chooseGavishDonoho, parallelPCA and findElbowPoint, for other approaches to choosing the number of PCs.

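library(PCAtools)

# Simulate a low-rank truth: 10 distinct profiles over 100 variables,
# resampled with replacement to give 1000 observations.
truth <- matrix(rnorm(1000), nrow=100)
truth <- truth[,sample(ncol(truth), 1000, replace=TRUE)]

# Add i.i.d. noise with sd=2, i.e., a noise variance of 4.
obs <- truth + rnorm(length(truth), sd=2)

# Note, we need the variance explained, NOT the percentage
# of variance explained!
pcs <- pca(obs)
chooseMarchenkoPastur(obs, var.explained=pcs$sdev^2, noise=4)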