SPCR regresses a dependent variable onto a few supervised principal components computed from a large set of predictors. The steps followed by SPCR are the following:
A key aspect of the method is that both the number of PCs and the threshold value used in step 2 can be determined by cross-validation. GSPCR extends SPCR by allowing the dependent variable to be of any measurement level (i.e., ratio, interval, ordinal, nominal) by introducing likelihood-based association measures (or threshold types) in step 1. Furthermore, GSPCR allows the predictors to be of any type by combining the PCAmix framework (Kiers, 1991; Chavent et al., 2017) with SPCR in step 3.
The gspcr
R package allows to:
Before we do anything else, let us load the packages we will need for this vignette.
If you don't have these packages, please install them using install.packages()
.
# Load R packages library(gspcr)
We start this vignette by estimating gspcr
in a very simple scenario with a continuous dependent variable and a set of continuous predictors. First, we store the example dataset GSPCRexdata
(see the helpfile for details ?GSPCRexdata
) in two separate objects:
# Comment goal of code X <- GSPCRexdata$X$cont y <- GSPCRexdata$y$cont
Then, we randomly select a subset of the data to use as a training set. We use 90\% of the data as training data.
# Set a seed set.seed(20230415) # Sample a subset of the data train <- sample(x = 1:nrow(X), size = nrow(X) * .9)
Now we are ready to use the cv_gscpr()
function to cross-validate the threshold value and the number of pcs to be used.
# Train the GSPCR model out <- cv_gspcr( dv = y[train], ivs = X[train, ] )
We can then extract the cross-validated solutions from the resulting object.
# Extract solutions out$sol_table
thr_value thr_number Q standard -1268.236 2 1 oneSE -1268.236 2 1
We can visually examine the solution paths produced by the cross-validation procedure by using the plot()
functions.
# Plot the solution paths plot(out)
Figure 1: Solution paths produced by `cv_gspcr`.
In this figure, the out-of-sample fit measure obtained with a given threshold value (X-axis) and a given number of principal components is reported on the Y-axis. The values of the threshold considered are reported on the X-axis, and the X-axis title reports the type of threshold used, in this case, the simple regression model likelihoods. For a different number of components considered, a different line is reported. The number of PCs is reported on the line.
Because the fit measure used by default for a continuous dependent variable is the F-statistic, we should look for the highest point on the Y-axis of this plot. This point represents the best K-fold cross-validation fit. As you see, the standard solution reported above matches the one presented in this plot.
Once the cross-validation procedure has identified the values of the threshold and the number of PCs that should be used, we can estimate the GASPCR model on the whole training data with the function est_gspcr()
.
# Estimate GSPCR on the whole training data gspcr_est <- est_gspcr(out)
We can now obtain predictions for new unseen data using the predict()
function
# Predict new data y_hat <- predict( object = gspcr_est, newdata = X[-train, ] ) # Look at the first six predictions head(y_hat)
8 23 26 64 70 75 0.16692084 -1.10013695 1.11181527 0.73243982 0.04385471 0.73221417
Bair E, Hastie T, Paul D, Tibshirani R (2006). “Prediction by supervised principal components.” J. Am. Stat. Assoc., 101(473), 119-137.
Chavent, M., Kuentz-Simonet, V., Labenne, A., & Saracco, J. (2014). Multivariate analysis of mixed data: The R package PCAmixdata. arXiv preprint arXiv:1411.4911.
Kiers, H. A. (1991). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56(2), 197-212.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.