knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library("LSSPCA") library("ggplot2") data("hitters") data("hitters_labels")
Hitters data are already scaled to zero mean and unit variance.
By default lsspca computes four orthogonal USPCs explaining 95% of the variance explained by the PCs (alpha = 0.95) selecting the variables by evaluating all subsets on which the PCs are regressed.
Default values alpha = 0.95, ncomps = 4, spcaMethod = "u", variableSelection = "exhaustive", variableSelection = "e" for exhaustive .
hit_uspca95 = lsspca(hitters)
Don't worry about the warnings. They are produced by the exhaustive search
The function produces a list with the following items
names(hit_uspca95)
See the function documentation for details.
Summaries can be extracted with the utility summary_spca. This produces the net variance and cumulated variance explained by the SPCs, the percentage of cumulative variance explained with respect to the same number of PCs and the cardinality of each SPC.
summary_spca(hit_uspca95)
The matrices of loadings and contributions (scaled to unit $L_1$ norm) can be printed with
print_spca(hit_uspca95, contributions = TRUE, only.nonzero = TRUE, digits = 1, threshold = 0.001, prn = TRUE, rtn = FALSE)
By defult it prints the nonzero percentage contributions. Unit $L_2$ norm loadings are displayed by setting contributions = FALSE and also the zero losdings with only.nonzero = FALSE.
The matrices of loadings and contributions are not easy to read. A better way to display the sparse laodings and contribution is
lapply(hit_uspca95$loadingslist, function(x) round(x, 2)) ## contributions lapply(hit_uspca95$loadingslist[1:2], function(x) round(100 * x/sum(abs(x)), 1))
The contributions can be visualized in different ways. Barplot of the contribution can be created, for example, as follows
df1 = data.frame(contributions = c(hit_uspca95$contributions*100), variable = rep(hitters_labels$short.name, 4), play = rep(hitters_labels$type, 4), component = factor(rep(1:4, each = 16), labels = paste("SPC", 1:4))) ggplot(df1, aes(x = variable, y = contributions, fill = play)) + geom_bar(stat = "identity") + facet_grid(cols = vars(component)) + theme_light() + theme(panel.grid.major.x = element_blank(), axis.text.x = element_text(angle = 90, size = 8), strip.background = element_rect(fill = "white"), strip.text = element_text(size = 16, colour = "black"), legend.key.size = unit(16, "pt"), legend.text = element_text(size = 16))
The argumentss available are (the names should be self-explanatory ):
scalex = FALSE, if TRUE the variables are centered and scaled to unit variance. Variables are always centered if they aren't already.
variableSelection = c("exhaustive", "seqrep", "backward", "forward", "lasso"),
These can be vectors (appply to all SPCs) or a list of vectors
Like other variable selection algorithms, Lasso can be used also to obtain the regressed PCs
Different sparse solutions can be obtained by changing:
Of course, variance explained, cardinality and variables selected for the components are interrelated. So, expect differences.
Increase the minimal proportion of cumulative variance eexplained (with repsect to that explained by the corresponding PCs) to 99%, by setting alpha = 0.99
hit_uspca99 = lsspca(hitters, alpha = 0.99, ncomps = 4) summary_spca(hit_uspca99) hit_uspca95 = lsspca(hitters, alpha = 0.95, ncomps = 4) summary_spca(hit_uspca95)
Larger cvexp but higher cardinality as well.
Or better fit for higher order SPCs can be obtained by removing the orthoganality constraint for the last two SPCs, by setting spcaMethod = c(rep("u", 2), rep("c", 2))
hit_uandcspca95 = lsspca(hitters, alpha = 0.95, # ncomps = 4, default spcaMethod = c(rep("u", 2), rep("c", 2))) summary_spca(hit_uandcspca95) ## correlation between PCs round(hit_uandcspca95$corComp, 2) round(cor(hit_uspca95$scores), 6) ## of course $corComp not produced for USPCs
For larger matrices, or just for a second opinion, it is possible to change the variable selection algorithm. This the output when using forward selection
hit_uspca95f = lsspca(hitters, alpha = 0.95, ncomps = 4, variableSelection = "f") summary_spca(hit_uspca95f) summary_spca(hit_uspca95) ### loadings from exhaustive search hit_uspca95f$loadingslist[[1]] ### loadings from forward selection hit_uspca95$loadingslist[[1]]
The function lsspca_blocked computes SPCs of subsets of variables. This is not the same as computing standard LSSPCs with restrictions on the variables selected because blocked LSSPCA aims at maximizing only the variance explained of the block of variables considered with each SPC.
I don't recommend this procedure but, since I used in the tutorial paper, here it is. lsspca_blocked takes three extra arguments
For example, for the three types of playing statistics:
playlist = lapply(levels(hitters_labels$type), function(x, a) which(a == x), a = hitters_labels$type) hit_bspca95 = lsspca_blocked(hitters, alpha = 0.95, blocks_list = playlist, variableSelection = "e")
This will compute one component per block, more components can be obtained with the parameter ncomps_per_block.
summary_spca(hit_bspca95) summary_spca(hit_uspca95)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.