knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 7,
  fig.align = "center",
  out.width = "70%"
)
This vignette compares the results from the PMA package's implementation of sparse principal components analysis (SPCA) to the implementation in the spca package. Both are based on the $L_1$-penalized matrix decomposition of Witten, Tibshirani, and Hastie (2009). The PMA package also includes functions for other analysis methods, such as canonical correlation analysis (CCA) and the fused lasso penalty; these are not currently implemented in spca.
We will use the following packages:
library("spca") library("PMA") library("tidyverse")
We will use a random data matrix to compare the two implementations:
set.seed(801)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)
We column-centre this so that we do not have to ask the PCA implementations to do it first:
X <- sweep(X, MARGIN = 2, STATS = colMeans(X), FUN = `-`)
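As a quick sanity check, the column means should now be numerically zero:

max(abs(colMeans(X)))  # effectively zero after centring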
Centring in advance is important since the implementations use different standardizations; see the Details section of ?spca for more information.
We start by conducting sparse PCA on X using the spca package:
spca_X <- mspca(X, k = 5, c = 1.25, center = FALSE, scale = FALSE)
Here the value of $c$ must lie in the interval $[1, \sqrt{5}]$, so $c = 1.25$ should result in a relatively sparse solution.
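As a quick check, the upper bound of this interval is the square root of the number of columns of X:

sqrt(ncol(X))  # the maximum allowed value of c for this X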
We can do the same analysis using SPC() from PMA:
PMA_X <- SPC(X, K = 5, sumabsv = 1.25, trace = FALSE, center = FALSE, niter = 100)
The loadings for spca are:
spca_X$loadings
The loadings for PMA are:
PMA_X$v
These are similar, but not identical.
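One way to quantify the difference is to align the sign of each column pair (the loadings are only identified up to a sign flip) and inspect the largest element-wise deviation; a small sketch:

signs <- sign(colSums(spca_X$loadings * PMA_X$v))       # +1 or -1 per component
max(abs(spca_X$loadings - sweep(PMA_X$v, 2, signs, `*`)))  # largest element-wise difference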
The $L_1$-norms of the columns are as follows:
colSums(abs(spca_X$loadings))
colSums(abs(PMA_X$v))
The $L_2$-norms of the columns are as follows:
colSums(spca_X$loadings^2)
colSums(PMA_X$v^2)
Clearly, both solutions satisfy the constraints.
The explained variance for spca is:
spca_X$pve
The result for PMA is:
PMA_X$prop.var.explained
These are identical. Both packages use the method proposed by Shen & Huang (2008) to calculate the rank-$k$ approximation of a matrix $\mathbf{X}$ as: $$ \mathbf{X}_k = \mathbf{X} \mathbf{V}_k (\mathbf{V}_k^{'} \mathbf{V}_k)^{-1} \mathbf{V}_k^{'}. $$ This is necessary since the components in the columns of $\mathbf{V}_k$ are not orthogonal anymore. The total variance explained by the rank-$k$ approximation is then the trace of $\mathbf{X}_k^{'}\mathbf{X}_k$.
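To make this concrete, the cumulative proportion of variance explained by, say, the first two spca components can be computed directly from the formula above (a sketch; it should agree with the cumulative proportions reported by both packages):

Vk <- as.matrix(spca_X$loadings[, 1:2])             # loadings of the first two components
Xk <- X %*% Vk %*% solve(crossprod(Vk)) %*% t(Vk)   # rank-2 approximation of X
sum(diag(crossprod(Xk))) / sum(diag(crossprod(X)))  # proportion of variance explained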
The result of ordinary PCA applied to X is:
pca_X <- prcomp(X)
pca_X
The variance information is as follows:
summary(pca_X)
This can be emulated by setting $c$ high in mspca():
mspca(X, k = 5, c = sqrt(5), center = FALSE, scale = FALSE)$pve
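For comparison, the cumulative proportions from prcomp() can be recovered from the returned standard deviations:

cumsum(pca_X$sdev^2) / sum(pca_X$sdev^2)  # cumulative proportion of variance from ordinary PCA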
Note that the results from spca and PMA are essentially equivalent when the same starting values are used:
spca_X_pred <- mspca(X, k = 5, c = 1.25, start = "predetermined", center = FALSE, scale = FALSE, maxit = 100, eps = 1e-12)
Accounting for sign flips, these are almost equivalent:
all.equal(abs(spca_X_pred$svd$v), abs(PMA_X$v), check.attributes = FALSE)
Here we use the fifa_nl data set included in the spca package:
data("fifa_nl")
We filter out the goalkeepers and retain the skill variables:
fifa_nl <- fifa_nl %>% filter(Position != "Gk")
fifa_x <- fifa_nl %>% select(crossing:sliding_tackle) %>% as.matrix()
We column-centre this so that we do not have to ask the PCA implementations to do it first:
fifa_x <- sweep(fifa_x, MARGIN = 2, STATS = colMeans(fifa_x), FUN = `-`)
We can now apply sparse PCA using spca and PMA for three components and with $c$ quite small:
spca_fifa <- mspca(fifa_x, k = 3, c = 1.5, maxit = 20, start = "predetermined")
PMA_fifa <- SPC(fifa_x, K = 3, sumabsv = 1.5, trace = FALSE, center = FALSE, niter = 20)
Note that we run the algorithms for the same number of iterations, since they have different stopping criteria. We also specify that the starting values for the $\boldsymbol{v}$ vectors should be "predetermined" from the SVD of the input matrix.
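With $c = 1.5$ and this many variables, most loadings should be exactly zero. A quick way to inspect the sparsity pattern is to count the zero entries in each component:

colSums(spca_fifa$loadings == 0)  # number of zero loadings per component (spca)
colSums(PMA_fifa$v == 0)          # number of zero loadings per component (PMA)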
Let's compare the loadings. Here are the loadings for the first component:
data.frame(spca = spca_fifa$loadings[, 1], PMA = PMA_fifa$v[, 1])
These are numerically identical:
all.equal(spca_fifa$loadings[, 1], PMA_fifa$v[, 1], check.attributes = FALSE)
The second component loadings are:
data.frame(spca = spca_fifa$loadings[, 2], PMA = PMA_fifa$v[, 2])
These are also almost equivalent:
all.equal(spca_fifa$loadings[, 2], PMA_fifa$v[, 2], check.attributes = FALSE)
The loadings for the third component are:
data.frame(spca = spca_fifa$loadings[, 3], PMA = PMA_fifa$v[, 3])
At first sight these look quite different, but they are nearly equivalent in absolute value:
all.equal(abs(spca_fifa$loadings[, 3]), abs(PMA_fifa$v[, 3]), check.attributes = FALSE)
Note that a sign flip was ignored here.
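To make the sign flip explicit, one can instead align the two columns via their inner product before comparing:

s3 <- sign(sum(spca_fifa$loadings[, 3] * PMA_fifa$v[, 3]))  # -1 if the signs are flipped
all.equal(spca_fifa$loadings[, 3], s3 * PMA_fifa$v[, 3], check.attributes = FALSE)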
Note that both the loading matrices satisfy the inequality constraints. The $L_1$-norms are:
apply(spca_fifa$loadings, 2, function(x) sum(abs(x)))
apply(PMA_fifa$v, 2, function(x) sum(abs(x)))
The $L_2$-norms are:
apply(spca_fifa$loadings, 2, function(x) sqrt(sum(x^2)))
apply(PMA_fifa$v, 2, function(x) sqrt(sum(x^2)))
The variance explained for the spca and PMA solutions is practically the same:
spca_fifa$pve$cum_prop
PMA_fifa$prop.var.explained
A few remaining differences and open questions deserve mention:

- mspca() could be extended such that the lasso penalty is relaxed after selecting the variables with non-zero loadings (see spca:::mspca2()).
- The behaviour of SPC() whereby the grand mean of the input matrix is set to zero when scale = TRUE requires explanation.
- mspca() gives different results for components two onwards. It seems that starting values are very important for stricter penalties.

Shen, H., & Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6), 1015-1034.
Witten, D. M., Tibshirani, R., & Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.