partial_least_squares: Fit a Partial Least Squares Regression Model (PLSR)

partial_least_squares

R Documentation

Description

partial_least_squares() is a wrapper around the pls::plsr() function that fits a partial least squares regression model. It supports univariate and multivariate models, for numeric responses only.

Usage

partial_least_squares(
  x,
  y,
  method = "kernel",
  scale = FALSE,
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

`x`	(`matrix`) The predictor (independent) variable(s). It must be a numeric matrix. You can use the `to_matrix()` function to convert your data to a `matrix`.
`y`	(`data.frame` \| `vector` \| `matrix`) The response (dependent) variable(s). If it is a `data.frame` or a `matrix` with 2 or more columns, a multivariate model is assumed, a univariate model otherwise. All the variables are coerced to numeric before training the model.
`method`	(`character(1)`) (case insensitive) The algorithm used to fit the model. The available options are the kernel algorithm (`"kernel"`), the wide kernel algorithm (`"wide_kernel"`), SIMPLS (`"simpls"`), and the classical orthogonal scores algorithm (`"orthogonal"`). `"kernel"` by default.
`scale`	(`logical`) A logical vector indicating the variables in `x` to be scaled. If `scale` is of length 1, the value is recycled as many times as needed. `FALSE` by default.
`validate_params`	(`logical(1)`) Should the parameters be validated? Setting this parameter to `FALSE` is not recommended because, if something fails, an uninformative error will be thrown. `TRUE` by default.
`seed`	(`numeric(1)`) A value to be used as internal seed for reproducible results. `NULL` by default.
`verbose`	(`logical(1)`) Should the progress information be printed? `TRUE` by default.
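
The case-insensitive handling of `method` can be pictured with the base R sketch below. The algorithm names on the right are the ones accepted by the `method` argument of pls::plsr(); the correspondence with the option names above, and the helper resolve_method(), are assumptions made for illustration and are not taken from the SKM source.

```r
# Assumed correspondence between the option names documented above and the
# algorithm names used by pls::plsr(); not taken from the SKM source.
pls_method_names <- c(
  kernel = "kernelpls",
  wide_kernel = "widekernelpls",
  simpls = "simpls",
  orthogonal = "oscorespls"
)

# Hypothetical helper: resolves an option name regardless of its case
resolve_method <- function(method) {
  key <- tolower(method)
  if (!key %in% names(pls_method_names)) {
    stop("Unknown method: ", method)
  }
  unname(pls_method_names[key])
}

resolve_method("Wide_Kernel")
```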

Details

All columns of x without variance (that is, columns where every record has the same value) are removed. The positions of these columns are returned in the removed_x_cols field of the returned object.

This function performs random 10-fold cross-validation to find the optimal number of components. This optimal value is used by default when you call predict with the fitted model, but you can specify a different number of components for the predictions.
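
As a rough illustration of how a number of components can be chosen by random 10-fold cross-validation, the sketch below uses a small base R PLS1 fit (orthogonal scores algorithm) for a single numeric response. pls1_fit() and cv_select_components() are hypothetical helpers written for this example; they are not part of SKM or pls, and picking the number of components with the lowest cross-validated RMSE is an assumption about the selection rule.

```r
# Hypothetical PLS1 fit (orthogonal scores algorithm). Returns, for each
# number of components 1..ncomp, the intercept and the coefficients on the
# original scale of x.
pls1_fit <- function(x, y, ncomp) {
  x_means <- colMeans(x)
  y_mean <- mean(y)
  E <- sweep(x, 2, x_means)  # centered predictors, deflated at each step
  f <- y - y_mean            # centered response, deflated at each step
  W <- P <- matrix(0, ncol(x), ncomp)
  q <- numeric(ncomp)
  fits <- vector("list", ncomp)
  for (a in seq_len(ncomp)) {
    w <- crossprod(E, f)
    w <- w / sqrt(sum(w^2))              # unit-length weight vector
    scores <- E %*% w
    ss <- sum(scores^2)
    p <- crossprod(E, scores) / ss       # x loadings
    q[a] <- sum(f * scores) / ss         # y loading
    E <- E - scores %*% t(p)
    f <- f - scores * q[a]
    W[, a] <- w
    P[, a] <- p
    idx <- seq_len(a)
    Wa <- W[, idx, drop = FALSE]
    Pa <- P[, idx, drop = FALSE]
    beta <- Wa %*% solve(crossprod(Pa, Wa), q[idx])
    fits[[a]] <- list(intercept = y_mean - sum(x_means * beta), beta = beta)
  }
  fits
}

# Hypothetical helper: random k-fold cross-validation over 1..max_comp
cv_select_components <- function(x, y, max_comp, k = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # note: changes the global RNG state
  fold_id <- sample(rep_len(seq_len(k), nrow(x)))  # random fold assignment
  sq_err <- matrix(0, k, max_comp)
  for (fold in seq_len(k)) {
    test <- fold_id == fold
    fits <- pls1_fit(x[!test, , drop = FALSE], y[!test], max_comp)
    for (a in seq_len(max_comp)) {
      pred <- fits[[a]]$intercept + x[test, , drop = FALSE] %*% fits[[a]]$beta
      sq_err[fold, a] <- sum((y[test] - pred)^2)
    }
  }
  rmse <- sqrt(colSums(sq_err) / nrow(x))
  list(rmse = rmse, optimal = which.min(rmse))
}

x <- as.matrix(iris[, 2:4])
y <- iris$Sepal.Length
cv <- cv_select_components(x, y, max_comp = 3, seed = 1)
cv$rmse     # cross-validated RMSE for 1, 2 and 3 components
cv$optimal  # number of components with the lowest RMSE
```

With as many components as predictors, this PLS1 fit reproduces the ordinary least squares coefficients, which is a convenient sanity check for the sketch.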

All records with missing values (NA), either in x or in y, are removed. The positions of the removed records are returned in the removed_rows field of the returned object.
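
The removal of incomplete records and of columns without variance described in this section can be sketched in base R as follows. clean_xy() is a hypothetical helper, not a function of SKM, and the order of the two steps (records first, then columns) is an assumption; the package may proceed differently internally.

```r
# Hypothetical helper (not part of SKM): drops incomplete records and
# constant predictor columns, reporting the original positions of both.
clean_xy <- function(x, y) {
  y_mat <- as.matrix(y)
  incomplete <- !stats::complete.cases(x) | !stats::complete.cases(y_mat)
  removed_rows <- which(incomplete)
  x <- x[!incomplete, , drop = FALSE]
  y_mat <- y_mat[!incomplete, , drop = FALSE]

  # A column has no variance when all of its remaining values are identical
  constant <- apply(x, 2, function(col) length(unique(col)) == 1)
  removed_x_cols <- unname(which(constant))
  x <- x[, !constant, drop = FALSE]

  list(
    x = x,
    y = drop(y_mat),
    removed_rows = removed_rows,
    removed_x_cols = removed_x_cols
  )
}

x <- cbind(a = c(1, 2, NA, 4, 5), b = 7, c = c(2, 1, 4, 3, 6))
y <- c(10, NA, 30, 40, 50)
cleaned <- clean_xy(x, y)
cleaned$removed_rows    # records 2 and 3 contain an NA
cleaned$removed_x_cols  # column 2 (b) is constant
dim(cleaned$x)
```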

Value

An object of class "PartialLeastSquaresModel" that inherits from classes "Model" and "R6" with the fields:

fitted_model: The object returned by pls::plsr() with the fitted model.
x: The final matrix used to fit the model.
y: The final vector or matrix used to fit the model.
optimal_components_num: A numeric value with the optimal number of components obtained with cross validation and used to fit the model.
execution_time: A difftime object with the total time taken to tune and fit the model.
removed_rows: A numeric vector with the indices (positions in the provided data) of the records that were removed and not taken into account in tuning or training.
removed_x_cols: A numeric vector with the indices (positions in the provided data) of the columns that were removed and not taken into account in tuning or training.
...: Some other parameters for internal use.

Examples

# Use all default hyperparameters -------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- partial_least_squares(x, y)

# Obtain the optimal number of components to use with predict
model$optimal_components_num

# Obtain the model's coefficients
coef(model)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Predict with a non-optimal number of components ---------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- partial_least_squares(x, y, method = "orthogonal")

# Obtain the optimal number of components to use with predict
model$optimal_components_num

# Predict using the fitted model with the optimal number of components
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Predict using the fitted model without the optimal number of components
predictions <- predict(model, x, components_num = 2)
# Obtain the predicted values
predictions$predicted

# Obtain the model's coefficients
coef(model)

# Obtain the execution time taken to tune and fit the model
model$execution_time

# Multivariate analysis -----------------------------------------------------
x <- to_matrix(iris[, -c(1, 2)])
y <- iris[, c(1, 2)]
model <- partial_least_squares(x, y, method = "wide_kernel")

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values of the first response variable
predictions$Sepal.Length$predicted
# Obtain the predicted values of the second response variable
predictions$Sepal.Width$predicted

# Obtain the predictions in a data.frame instead of a list
predictions <- predict(model, x, format = "data.frame")
head(predictions)

# Genomic selection ------------------------------------------------------------
data(Wheat)

# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Wheat$Pheno)
# Compute cholesky
Geno <- cholesky(Wheat$Geno)
# G matrix
X <- Line %*% Geno
y <- Wheat$Pheno$Y

# Set seed for reproducible results
set.seed(2022)
folds <- cv_kfold(records_number = nrow(X), k = 3)

Predictions <- data.frame()

# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- partial_least_squares(
    x = X_training,
    y = y_training,
    scale = TRUE,
    method = "kernel"
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Wheat$Pheno$Line[fold$testing],
    Env = Wheat$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)
}

head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)

# Summaries by Line
head(summaries$line)

# Summaries by Environment
summaries$env

# Summaries by Fold
summaries$fold

brandon-mosqueda/SKM documentation built on Feb. 8, 2025, 5:24 p.m.