In schoonees/spca: Lasso-Penalized Singular Value Decomposition and Sparse Principal Components Analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7, 
  fig.height = 7,
  fig.align = "center",
  out.width = "70%"
)

This vignette compares the results from the PMA package's implementation of sparse principal components analysis (SPCA) to the implementation in the spca package. Both these are based on the $L_1$-penalized matrix decomposition of Witten, Hastie and Tibshirani (2009). The PMA package also includes functions for other analysis methods, such as canonical correlation analysis (CCA), and the fused lasso penalty. These are not currently implemented in spca.

We will use the following packages:

library("spca")
library("PMA")
library("tidyverse")

Application to random data

We will use a random data matrix to compare the two implementations:

set.seed(801)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)

We column-centre this so that we do not have to ask the PCA implementations to do it first:

X <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = `-`)

This is important since the implementations use different standardizations. See Details under ?spca for some information.

We start by conducting sparse PCA on X using the spca package:

spca_X <- mspca(X, k = 5, c = 1.25, center = FALSE, scale = FALSE)

Here the value of $c$ must be in the interval $[1, \sqrt{5}]$, so that 1.25 should result is a relatively sparse results.

We can do the same analysis using SPC() from PMA:

PMA_X <- SPC(X, K = 5, sumabsv = 1.25, trace = FALSE, center = FALSE, niter = 100)

Comparing the loadings

The loadings for spca are:

spca_X$loadings

The loadings for PMA are:

PMA_X$v

These are similar, but not identical.

The $L_1$-norms of the columns are as follows:

colSums(abs(spca_X$loadings))
colSums(abs(PMA_X$v))

The $L_2$-norms of the columns are as follows:

colSums(spca_X$loadings^2)
colSums(PMA_X$v^2)

Clearly, both solutions satisfy the constraints.

Comparing the explained variance

The explained variance for spca is:

spca_X$pve

The result for PMA is:

PMA_X$prop.var.explained

These are identical. Both packages use the method proposed by Shen & Huang (2008) to calculate the rank-$k$ approximation of a matrix $\mathbf{X}$ as: $$ \mathbf{X}_k = \mathbf{X} \mathbf{V}_k (\mathbf{V}_k^{'} \mathbf{V}_k)^{-1} \mathbf{V}_k^{'}. $$ This is necessary since the components in the columns of $\mathbf{V}_k$ are not orthogonal anymore. The total variance explained by the rank-$k$ approximation is then the trace of $\mathbf{X}_k^{'}\mathbf{X}_k$.

Ordinary PCA

The result of ordinary PCA applied to X is:

pca_X <- prcomp(X)
pca_X

The variance information is as follows:

summary(pca_X)

This can be emulated by setting $c$ high in mspca():

mspca(X, k = 5, c = sqrt(5), center = FALSE, scale = FALSE)$pve

Changing the starting values

Note that the results from spca and PMA are essentially equivalent when the same starting values are used:

spca_X_pred <- mspca(X, k = 5, c = 1.25, start = "predetermined", 
                     center = FALSE, scale = FALSE, maxit = 100, eps = 1e-12)

Accounting for sign flips, these are almost equivalent:

all.equal(abs(spca_X_pred$svd$v), abs(PMA_X$v), check.attributes = FALSE)

Application to empirical data

Here we use the fifa_nl data set including in the spca package:

data("fifa_nl")

We filter out the goalkeepers and retain the skill variables:

fifa_nl <- fifa_nl %>% filter(Position != "Gk")
fifa_x <- fifa_nl %>% select(crossing:sliding_tackle) %>% 
  as.matrix()

We column-centre this so that we do not have to ask the PCA implementations to do it first:

fifa_x <- sweep(fifa_x, MARGIN = 2, STATS = colMeans(fifa_x), FUN = `-`)

We can now apply sparse PCA using spca and PMA for three components and with $c$ quite small:

spca_fifa <- mspca(fifa_x, k = 3, c = 1.5, maxit = 20, start = "predetermined")
PMA_fifa <- SPC(fifa_x, K = 3, sumabsv = 1.5, trace = FALSE, 
                center = FALSE, niter = 20)

Note that we run the algorithms for the same number of iterations, since they have different stopping criteria. We also specify that the starting values for the $\boldsymbol{v}$ vectors should be "predetermined" from the SVD of the input matrix.

Comparing the loadings

Let's compare the loadings for these. Here are the loadings the first component:

data.frame(spca = spca_fifa$loadings[, 1], PMA = PMA_fifa$v[, 1])

These are numerically identical:

all.equal(spca_fifa$loadings[, 1], PMA_fifa$v[, 1], check.attributes = FALSE)

The second component loadings are:

data.frame(spca = spca_fifa$loadings[, 2], PMA = PMA_fifa$v[, 2])

These are also almost equivalent:

all.equal(spca_fifa$loadings[, 2], PMA_fifa$v[, 2], check.attributes = FALSE)

The same applies to the third component:

data.frame(spca = spca_fifa$loadings[, 3], PMA = PMA_fifa$v[, 3])

These are quite different:

all.equal(abs(spca_fifa$loadings[, 3]), abs(PMA_fifa$v[, 3]), check.attributes = FALSE)

Note that a sign flip was ignored here.

Note that both the loading matrices satisfy the inequality constraints. The $L_1$-norms are:

apply(spca_fifa$loadings, 2, function(x) sum(abs(x)))
apply(PMA_fifa$v, 2, function(x) sum(abs(x)))

The $L_2$-norms are:

apply(spca_fifa$loadings, 2, function(x) sqrt(sum(x^2)))
apply(PMA_fifa$v, 2, function(x) sqrt(sum(x^2)))

Comparing the variance explained

The variance explained for the spca and PMA solutions are practicallt the same:

spca_fifa$pve$cum_prop
PMA_fifa$prop.var.explained

Final remarks

The method for calculating $\mathbf{X}_k$ proposed by Shen & Huang (2008) suggests a different deflation algorithm for constructing sparse PCA solutions of rank higher than one (See spca:::mspca2()).
The possibility exists to adjust mspca() such that the lasso penalty is relaxed after selecting the variables with non-zero loadings (See spca:::mspca2()).
The convention of SPC() whereby the grand mean of the input matrix is set to zero when scale = TRUE requires explanation.
An investigation of the effect of the starting values is necessary. Note that the default starting values for mspca() gives different results for components two onwards. It seems that starting values are very important for stricter penalties.

References

Shen, H., & Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6), 1015-1034.

Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.

schoonees/spca documentation built on Jan. 31, 2021, 6:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com