# ortho_projection: Orthogonal projections using principal component analysis and... In resemble: Memory-Based Learning in Spectral Chemometrics

## Description

Functions to perform orthogonal projections of high-dimensional data matrices using principal component analysis (pca) and partial least squares (pls).

## Usage

    ortho_projection(Xr, Xu = NULL,
                     Yr = NULL,
                     method = "pca",
                     pc_selection = list(method = "var", value = 0.01),
                     center = TRUE, scale = FALSE, ...)

    pc_projection(Xr, Xu = NULL, Yr = NULL,
                  pc_selection = list(method = "var", value = 0.01),
                  center = TRUE, scale = FALSE,
                  method = "pca",
                  tol = 1e-6, max_iter = 1000, ...)

    pls_projection(Xr, Xu = NULL, Yr,
                   pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
                   scale = FALSE,
                   tol = 1e-6, max_iter = 1000, ...)

    ## S3 method for class 'ortho_projection'
    predict(object, newdata, ...)

## Arguments

• Xr a matrix of observations.

• Xu an optional matrix containing data of a second set of observations.

• Yr if the method used in the pc_selection argument is "opc" or if method = "pls", then it must be a matrix containing the side information corresponding to the spectra in Xr. It is equivalent to the side_info parameter of the sim_eval function. In case method = "pca", a matrix (with one or more continuous variables) can also be used as input. The root mean square of differences (rmsd) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See the sim_eval function for more details.

• method the method for projecting the data. Options are:

  • "pca": principal component analysis using the singular value decomposition algorithm.

  • "pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.

  • "pls": partial least squares.

• pc_selection a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components of a given set of observations is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

  The list list(method = "var", value = 0.01) is the default. Optionally, the pc_selection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" is used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.

• center a logical indicating if the data Xr (and Xu if specified) must be centered. If Xu is specified the data is centered on the basis of $Xr \cup Xu$. NOTE: This argument only applies to the principal components projection. For pls projections the data is always centered.

• scale a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is specified the data is scaled on the basis of $Xr \cup Xu$.

• ... additional arguments to be passed to pc_projection or pls_projection.

• tol tolerance limit for convergence of the nipals algorithm (default is 1e-06). In the case of method = "pls" this applies only to Yr with more than one variable.

• max_iter maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.

• object object of class "ortho_projection".

• newdata an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns as the data used to build the projection, in the same order.
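The "var" and "cumvar" selection rules described above can be illustrated with a minimal base R sketch. It uses simulated data and plain svd() rather than the package's internals, so the thresholds, the variable names and the exact handling of boundary cases are assumptions made for illustration only.

```r
# Simulated data; in practice X would be a matrix of spectra
set.seed(1)
X <- matrix(rnorm(50 * 8), nrow = 50) %*%
  diag(c(5, 3, 2, 1, 0.5, 0.2, 0.1, 0.05))

# Center the columns and compute the singular value decomposition
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)

# Proportion of variance explained by each component and its cumulative sum
explained_var <- sv$d^2 / sum(sv$d^2)
cumulative_var <- cumsum(explained_var)

# "var" rule (value = 0.01): keep the components that each explain
# at least 1% of the variance
n_var <- sum(explained_var >= 0.01)

# "cumvar" rule (value = 0.99): smallest number of components whose
# cumulative explained variance reaches 99%
n_cumvar <- which(cumulative_var >= 0.99)[1]

n_var
n_cumvar
```

Because the singular values are returned in decreasing order, counting the components above the threshold is equivalent to keeping the leading ones.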

## Details

In the case of method = "pca", the algorithm used is the singular value decomposition in which a given data matrix ($X$) is factorized as follows:

$$X = UDV^{\mathrm{T}}$$

where $U$ and $V$ are orthogonal matrices containing the left and right singular vectors of $X$ respectively, and $D$ is a diagonal matrix containing the singular values of $X$. The matrix of principal component scores is obtained by a matrix multiplication of $U$ and $D$, and the matrix of principal component loadings is equivalent to the matrix $V$.
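The relationship between the decomposition, the scores and the loadings can be checked with a short base R sketch. It relies only on svd() and prcomp() with simulated data; it is not the package's implementation, and the object names are chosen for illustration.

```r
set.seed(42)
X <- matrix(rnorm(20 * 5), nrow = 20)

# Center the data, as done when center = TRUE
Xc <- scale(X, center = TRUE, scale = FALSE)

# Singular value decomposition of the centered matrix
sv <- svd(Xc)

# Scores are U %*% D and loadings are V
scores <- sv$u %*% diag(sv$d)
loadings <- sv$v

# The scores are also the projection of the centered data onto the loadings
max(abs(scores - Xc %*% loadings))

# They agree with prcomp() up to the sign of each component
pc <- prcomp(X, center = TRUE, scale. = FALSE)
max(abs(abs(pc$x) - abs(scores)))
```

Both differences should be at the level of floating point error.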

When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals).
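As an illustration of how nipals extracts one component at a time, the following base R sketch implements a textbook version of the algorithm. The function name nipals_pca, the starting vector and the convergence criterion are assumptions made for this example and may differ from what the package does internally; tol and max_iter play the same role as the arguments of the same name.

```r
# Hypothetical helper: textbook nipals for principal component analysis
nipals_pca <- function(X, ncomp, tol = 1e-6, max_iter = 1000) {
  X <- scale(X, center = TRUE, scale = FALSE)
  scores <- matrix(0, nrow(X), ncomp)
  loadings <- matrix(0, ncol(X), ncomp)
  for (a in seq_len(ncomp)) {
    # Start from the column with the largest sum of squares
    t_old <- X[, which.max(colSums(X^2))]
    for (iter in seq_len(max_iter)) {
      p <- crossprod(X, t_old) / sum(t_old^2) # regress columns of X on the scores
      p <- p / sqrt(sum(p^2))                 # normalize the loading vector
      t_new <- X %*% p                        # update the scores
      converged <- sum((t_new - t_old)^2) / sum(t_new^2) < tol
      t_old <- t_new
      if (converged) break
    }
    scores[, a] <- t_old
    loadings[, a] <- p
    X <- X - t_old %*% t(p)                   # deflate before the next component
  }
  list(scores = scores, loadings = loadings)
}

# Compare the first two components with the svd solution (up to sign)
set.seed(10)
X <- matrix(rnorm(30 * 6), nrow = 30) %*% diag(c(6, 4, 2.5, 1.5, 1, 0.5))
fit <- nipals_pca(X, ncomp = 2, tol = 1e-14)
sv <- svd(scale(X, center = TRUE, scale = FALSE))
svd_scores <- sv$u[, 1:2] %*% diag(sv$d[1:2])
max(abs(abs(fit$scores) - abs(svd_scores)))
```

Unlike the singular value decomposition, this approach only computes the components that are requested, which is why a tolerance and an iteration limit are needed.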

In the case of the partial least squares projection (a.k.a. projection to latent structures), the nipals regression algorithm is used. Details on the "nipals" algorithm are presented in Martens (1991).
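For a single response variable the nipals regression steps need no inner iterations, which is consistent with tol and max_iter applying only to Yr with more than one variable. The sketch below is a textbook single-response version written in base R. The function pls1_nipals and the names of the elements it returns are hypothetical; they only mirror the returned components for readability and are not taken from the package source.

```r
# Hypothetical helper: textbook nipals pls for one response variable
pls1_nipals <- function(X, y, ncomp) {
  X <- scale(X, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), ncomp)
  scores <- matrix(0, nrow(X), ncomp)
  q <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- crossprod(X, y)                  # weights: covariance of X with y
    w <- w / sqrt(sum(w^2))
    t_a <- X %*% w                        # scores
    p <- crossprod(X, t_a) / sum(t_a^2)   # X loadings
    q[a] <- sum(y * t_a) / sum(t_a^2)     # y loading
    X <- X - t_a %*% t(p)                 # deflate X
    y <- y - t_a * q[a]                   # deflate y
    W[, a] <- w
    P[, a] <- p
    scores[, a] <- t_a
  }
  # Matrix that maps centered data directly onto the scores
  projection_mat <- W %*% solve(crossprod(P, W))
  list(scores = scores, weights = W, X_loadings = P,
       Y_loadings = q, projection_mat = projection_mat)
}

set.seed(5)
X <- matrix(rnorm(40 * 8), nrow = 40)
y <- X[, 1] - 2 * X[, 3] + rnorm(40, sd = 0.2)
fit <- pls1_nipals(X, y, ncomp = 3)

# The projection matrix reproduces the scores from the centered data
Xc <- scale(X, center = TRUE, scale = FALSE)
max(abs(Xc %*% fit$projection_mat - fit$scores))
```

The last line illustrates why a projection matrix is useful: new observations, centered with the training means, can be mapped onto the pls space with a single matrix product.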

When "opc" is the method specified in pc_selection, the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). Let $P$ be a sequence of numbers of retained components (so that $P = 1, 2, \ldots, k$). At each iteration, the function computes a dissimilarity matrix retaining $p_i$ components. The side information value of each observation is then compared against the side information value of its most spectrally similar observation (closest Xr observation). The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process, the sim_eval function is used. Note that for the "opc" method Yr is required (i.e. the side information of the observations).
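The idea behind "opc" for a continuous side information variable can be sketched in base R as follows. This is a simplified illustration with simulated data: it uses Euclidean distances on unscaled principal component scores and a plain loop instead of sim_eval, both of which are assumptions and not necessarily what the package does.

```r
set.seed(7)
n <- 60

# Simulated spectra driven by three latent variables plus noise
latent <- matrix(rnorm(n * 3), nrow = n)
X <- latent %*% matrix(rnorm(3 * 10), nrow = 3) +
  matrix(rnorm(n * 10, sd = 0.1), nrow = n)

# Side information related to the first latent variable
Yr <- latent[, 1] + rnorm(n, sd = 0.1)

# Principal component scores for all components
sv <- svd(scale(X, center = TRUE, scale = FALSE))
all_scores <- sv$u %*% diag(sv$d)

max_comp <- 8 # maximum number of components to be tested
rmsd <- numeric(max_comp)
for (k in seq_len(max_comp)) {
  # Dissimilarity matrix retaining k components
  d <- as.matrix(dist(all_scores[, 1:k, drop = FALSE]))
  diag(d) <- Inf # an observation cannot be its own neighbor
  nearest <- apply(d, 1, which.min)
  # Differences between each Yr value and that of its closest observation
  rmsd[k] <- sqrt(mean((Yr - Yr[nearest])^2))
}

# Number of components that minimizes the rmsd
n_opc <- which.min(rmsd)
n_opc
```

For a categorical side information variable the same loop would track the kappa index and select the maximum instead of the minimum.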

## Value

a list of class ortho_projection with the following components:

• scores a matrix of scores corresponding to the observations in Xr (and Xu if it was provided). The components retrieved correspond to the ones optimized or specified.

• X_loadings a matrix of loadings corresponding to the explanatory variables. The components retrieved correspond to the ones optimized or specified.

• Y_loadings a matrix of partial least squares loadings corresponding to Yr. The components retrieved correspond to the ones optimized or specified. This object is only returned if the partial least squares algorithm was used.

• weigths a matrix of partial least squares ("pls") weights. This object is only returned if the "pls" algorithm was used.

• projection_mat a matrix that can be used to project new data onto a "pls" space. This object is only returned if the "pls" algorithm was used.

• variance a matrix indicating the standard deviation of each component (sd), the variance explained by each single component (explained_var) and the cumulative explained variance (cumulative_explained_var). These values are computed based on the data used to create the projection matrices. For example, if the "pls" method was used, then these values are computed based only on the data that contains information on Yr (i.e. the Xr data). If the principal component method is used, then these values are computed on the basis of Xr and Xu (if it applies) since both matrices are employed in the computation of the projection matrix (loadings in this case).

• sdv the standard deviation of the retrieved scores. This vector can be different from the "sd" in variance.

• n_components the number of components (either principal components or partial least squares components) used for computing the global dissimilarity scores.

• opc_evaluation a matrix containing the statistics computed for optimizing the number of principal components based on the variable(s) specified in the Yr argument. If Yr was a continuous vector or matrix then this object indicates the root mean square of differences (rmsd) for each number of components. If Yr was a categorical variable this object indicates the kappa values for each number of components. This object is returned only if "opc" was used within the pc_selection argument. See the sim_eval function for more details.

• method the ortho_projection method used.

predict.ortho_projection returns a matrix of scores projected for newdata.
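Conceptually, projecting new observations onto an existing principal component space amounts to centering them with the means of the data used to build the projection and multiplying by the loadings. The base R sketch below illustrates this with simulated data. It does not call the package's predict method, and all object names are chosen for illustration.

```r
set.seed(3)
Xr <- matrix(rnorm(40 * 6), nrow = 40)     # data used to build the projection
newdata <- matrix(rnorm(5 * 6), nrow = 5)  # same columns, in the same order

# Column means of the training data
center_values <- colMeans(Xr)

# Loadings of the first three principal components
sv <- svd(sweep(Xr, 2, center_values))
loadings <- sv$v[, 1:3]

# Center the new data with the training means and project it
new_scores <- sweep(newdata, 2, center_values) %*% loadings
dim(new_scores)
```

This is why newdata must have the same number of columns, in the same order, as the data used to build the projection: each column is matched to a row of the loadings matrix by position.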

## References

Martens, H. (1991). Multivariate calibration. John Wiley & Sons.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

## See Also

ortho_diss, sim_eval, mbl

## Examples

    library(prospectr)
    library(resemble)
    data(NIRsoil)

    # Preprocess the data using detrend plus first derivative with Savitzky and
    # Golay smoothing filter
    sg_det <- savitzkyGolay(
      detrend(NIRsoil$spc,
        wav = as.numeric(colnames(NIRsoil$spc))
      ),
      m = 1,
      p = 1,
      w = 7
    )
    NIRsoil$spc_pr <- sg_det

    # split into training and testing sets
    test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
    test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]

    train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
    train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

    # A principal component analysis using 5 components
    pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
    pca_projected

    # A principal components projection using the "opc" method
    # for the selection of the optimal number of components
    pca_projected_2 <- ortho_projection(
      Xr = train_x, Xu = test_x, Yr = train_y,
      method = "pca",
      pc_selection = list("opc", 40)
    )
    pca_projected_2
    plot(pca_projected_2)

    # A partial least squares projection using the "opc" method
    # for the selection of the optimal number of components
    pls_projected <- ortho_projection(
      Xr = train_x, Xu = test_x, Yr = train_y,
      method = "pls",
      pc_selection = list("opc", 40)
    )
    pls_projected
    plot(pls_projected)

    # A partial least squares projection using the "cumvar" method
    # for the selection of the optimal number of components
    pls_projected_2 <- ortho_projection(
      Xr = train_x, Xu = test_x, Yr = train_y,
      method = "pls",
      pc_selection = list("cumvar", 0.99)
    )