fdaML_train: Train a Machine Learning model


View source: R/fdaML_train.R

Description

Train a machine learning model.

Usage

fdaML_train(X, y, Z = NULL, task, model = NULL, reduction,
  intercept = TRUE, smooth_w = NULL, balanced = FALSE, reps = 100,
  Q_vec = NULL, Q_len = NULL, Q_opt = NULL, tau_Q_opt = 0,
  lam_cv_type = "n", lam_vec = NULL, split_size = c(0.5, 0.25, 0.25),
  estimation_w = NULL, bspline_dim = ncol(X), t_range = 350:2500,
  verbose = TRUE, ll = NULL)
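
As a rough illustration of the signature above, the sketch below simulates a small spectra-like dataset and passes it to fdaML_train. The toy data, the seed and the chosen argument values are assumptions made for illustration only. The call is attempted only when the mlevcm package is installed, and it has not been checked against any particular package version.

```r
set.seed(1)

# Hypothetical toy data: N spectra measured at P wavelengths
N <- 60
t_range <- seq(350, 2500, by = 50)   # P = 44 wavelengths
P <- length(t_range)
X <- matrix(rnorm(N * P), nrow = N, ncol = P)
y <- rnorm(N)                        # continuous response, so task = "regr"

# Attempt the call only if the package is available
if (requireNamespace("mlevcm", quietly = TRUE)) {
  model <- mlevcm::fdaML_train(
    X = X, y = y, task = "regr", model = "lm", reduction = "pls",
    reps = 10, t_range = t_range, verbose = FALSE
  )
}
```

Here t_range is set explicitly so that its length matches ncol(X); the default 350:2500 assumes spectra measured at 2151 wavelengths.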

Arguments

X

A numeric functional predictor matrix of size N*P, where N is the number of observations (spectra) and P is the number of points (wavelengths) at which spectra are measured.

y

A numeric or factor response vector of size N, where N is the number of observations (spectra).

Z

A numeric non-functional predictor matrix of size N*S, where N is the number of observations (spectra) and S is the number of non-functional predictors after binning/one-hot-encoding has taken place.

task

Regression ("regr") for continuous response problems or Classification ("clas") for categorical response problems.

model

Currently Linear Model ("lm") for Regression or Generalised Linear Model ("glm") for Classification.

reduction

Partial Least Squares ("pls"), Principal Component Analysis ("pca"), or no further dimension reduction ("n").
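
To make the idea of dimension reduction concrete, the sketch below projects a toy predictor matrix onto its first Q principal components using stats::prcomp. This is only an illustration of what a "pca" reduction generally does; whether the package uses prcomp, and how it centres or scales the data, is not stated here and is assumed.

```r
set.seed(1)
X <- matrix(rnorm(40 * 20), nrow = 40)   # hypothetical: 40 spectra, 20 wavelengths
Q <- 5                                   # number of components to keep

pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:Q, drop = FALSE]     # N x Q matrix of component scores
dim(scores)
```

The N x Q score matrix would then replace the N x P matrix X as the predictor in the downstream model.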

intercept

Whether to include a model intercept (TRUE) or not (FALSE).

smooth_w

A numeric vector of length equal to length(t_range) with weights for smoothing spectra. If spectra smoothing is desired, this vector has to be specified. For unweighted smoothing, choose all elements the same (e.g., rep(1,length(t_range))).
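
The following sketch shows how a weight vector of the required length can be built and what weighted smoothing of a single spectrum might look like. stats::smooth.spline is used here as a stand-in smoother; the smoother actually used by the package is not documented on this page, and the toy spectrum is invented for illustration.

```r
set.seed(1)
t_range <- 350:2500
spectrum <- sin(t_range / 300) + rnorm(length(t_range), sd = 0.05)  # toy spectrum

smooth_w <- rep(1, length(t_range))      # equal weights: unweighted smoothing
fit <- smooth.spline(x = t_range, y = spectrum, w = smooth_w)
smoothed <- predict(fit, x = t_range)$y  # smoothed values at the same wavelengths
```

Down-weighting a noisy wavelength region would amount to setting the corresponding entries of smooth_w to smaller values.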

balanced

Whether the dataset should be balanced (TRUE) or not (FALSE). If TRUE, observations are discarded so that the number of observations for each level of the response variable is approximately the same. It applies to both Classification and Regression with integer-valued response.
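
One simple way to balance a dataset by discarding observations is to downsample every level of the response to the size of the smallest level, as sketched below. This is an assumed strategy for illustration; the package may balance differently (the description only promises approximately equal counts).

```r
set.seed(1)
y <- factor(c(rep("a", 30), rep("b", 12), rep("c", 18)))  # unbalanced toy response

n_min <- min(table(y))                   # size of the smallest class
keep <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), n_min)]    # keep n_min random observations per level
}))
table(y[keep])
```

The retained indices in keep would be applied to the rows of X (and Z, if supplied) as well as to y.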

reps

Number of randomisations of the training/validating/testing subsets to average over in cross-validation.

Q_vec

Vector of numbers of PCA/PLS components to be tried in cross-validation. The default is a vector of evenly spaced values (approximately, due to rounding) between 2 and min(80, N_train-1), where N_train is the number of observations in the training subset.

Q_len

Length of Q_vec when the latter is not supplied. The default is 30.
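
The default grid described for Q_vec and Q_len could be reproduced along the following lines. The exact rounding and de-duplication used by the package are assumptions, and N_train is a hypothetical value.

```r
N_train <- 50                            # hypothetical training-subset size
Q_len <- 30                              # documented default length

Q_max <- min(80, N_train - 1)
# Evenly spaced candidates, rounded to integers; unique() guards against
# duplicates when the range is narrower than Q_len
Q_vec <- unique(round(seq(2, Q_max, length.out = Q_len)))
```

With a small training subset the grid can end up shorter than Q_len, because fewer than Q_len distinct integers fit between 2 and N_train-1.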

Q_opt

Optimal number of PCA/PLS components. If this is supplied, cross-validation for the number of PCA/PLS components is bypassed.

tau_Q_opt

Threshold for choosing the optimal number of components Q. If tau_Q_opt=0, then Q is the value that minimises RMSD (for Regression) or maximises AUC (for Classification). If tau_Q_opt>0, then Q is the smallest value that gives an RMSD/AUC within a margin tau_Q_opt of the optimal RMSD/AUC.
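
The selection rule can be sketched for the Regression case as follows. The helper choose_Q and the RMSD values are hypothetical, and the margin is assumed to be absolute (RMSD no more than tau_Q_opt above the minimum) rather than relative.

```r
Q_vec <- c(2, 5, 10, 20, 40)
rmsd  <- c(1.50, 1.10, 1.02, 1.00, 1.01) # hypothetical validation RMSD per Q

choose_Q <- function(Q_vec, rmsd, tau_Q_opt = 0) {
  ok <- rmsd <= min(rmsd) + tau_Q_opt    # within the margin of the best RMSD
  min(Q_vec[ok])                         # smallest qualifying Q
}

choose_Q(Q_vec, rmsd, tau_Q_opt = 0)     # 20
choose_Q(Q_vec, rmsd, tau_Q_opt = 0.05)  # 10
```

For Classification the same logic would apply with AUC, keeping values of Q whose AUC is at least max(AUC) - tau_Q_opt. A positive threshold trades a small loss in performance for a more parsimonious model.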

lam_cv_type

Cross-validation strategy to be used when choosing the penalty parameter lambda: ordinary cross-validation ("ocv"), generalised cross-validation ("gcv") or no penalisation ("n").

lam_vec

Vector of penalty parameters to be tried in cross-validation. The default is a set of 10 values between 0.001 and 20 on an exponential scale.
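
A grid matching this description can be built by spacing values evenly on the log scale, as in the sketch below. Reading "exponential scale" as log-equispaced is an assumption, so the package's actual default values may differ.

```r
# 10 values from 0.001 to 20, equally spaced in log(lambda)
lam_vec <- exp(seq(log(0.001), log(20), length.out = 10))
signif(lam_vec, 3)
```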

split_size

Either a vector of length 3 specifying the proportion of observations to be assigned to the training, validation and testing subsets, in this order; or a scalar specifying the proportion of observations to be assigned to the training subset, in which case the validation and testing subsets are assigned a proportion (1-split_size)/2 of observations each.
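
The two accepted forms of split_size can be illustrated with a small helper that expands the scalar form and then draws one random partition. The helper expand_split, the requirement that the proportions sum to one, and the rounding of subset sizes are all assumptions made for this sketch.

```r
expand_split <- function(split_size) {
  if (length(split_size) == 1) {
    rest <- (1 - split_size) / 2         # validation and testing share the remainder
    split_size <- c(split_size, rest, rest)
  }
  # Assumed here: three proportions that sum to one
  stopifnot(length(split_size) == 3, abs(sum(split_size) - 1) < 1e-8)
  split_size
}

set.seed(1)
N <- 100
props <- expand_split(0.6)               # training 0.6, validation 0.2, testing 0.2
n_train <- round(props[1] * N)
n_valid <- round(props[2] * N)
idx <- sample.int(N)                     # random permutation of observations
train <- idx[seq_len(n_train)]
valid <- idx[n_train + seq_len(n_valid)]
test  <- idx[-seq_len(n_train + n_valid)]
```

Repeating the random permutation reps times would give the randomised subsets over which cross-validation results are averaged.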

estimation_w

A numeric vector of length equal to length(y) with weights for coefficient function estimation.

bspline_dim

The dimension of the cubic B-spline functional representation system.
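
For intuition, a cubic B-spline basis of a given dimension can be evaluated at the measurement wavelengths with splines::bs, as sketched below. The knot placement (quantile-based, as chosen by bs) and the value of bspline_dim are assumptions; the package may construct its basis differently.

```r
library(splines)

t_range <- 350:2500
bspline_dim <- 20                        # hypothetical basis dimension

# Cubic (degree 3) B-spline basis evaluated at the measurement wavelengths
B <- bs(t_range, df = bspline_dim, degree = 3, intercept = TRUE)
dim(B)                                   # length(t_range) rows, bspline_dim columns
```

A smaller bspline_dim than the default ncol(X) gives a coarser, lower-dimensional representation of each spectrum.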

t_range

A numeric vector giving the wavelengths at which spectra were measured.

verbose

Whether to print a progress bar (TRUE) or not (FALSE).

ll

A list whose named elements are the parameters for this function. Provide either the function parameters as usual, or this list, but not both.
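
A parameter list of this kind might be assembled and sanity-checked as in the sketch below. The toy data and the name check are illustrative additions; the final call is shown as a comment because it requires the package and has not been verified here.

```r
# Argument names as listed in the Usage section (excluding ll itself)
documented <- c("X", "y", "Z", "task", "model", "reduction", "intercept",
                "smooth_w", "balanced", "reps", "Q_vec", "Q_len", "Q_opt",
                "tau_Q_opt", "lam_cv_type", "lam_vec", "split_size",
                "estimation_w", "bspline_dim", "t_range", "verbose")

set.seed(1)
ll <- list(
  X = matrix(rnorm(30 * 10), nrow = 30), # toy spectra
  y = rnorm(30),
  task = "regr",
  reduction = "pca",
  t_range = 1:10
)

unknown <- setdiff(names(ll), documented) # names not documented for fdaML_train
stopifnot(length(unknown) == 0)

# Intended call form, per the description of ll (not run here):
# model <- fdaML_train(ll = ll)
```

Keeping the parameters in a list makes it easy to store a configuration and reuse it across runs.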

Value

An object of class fdaModel, which is a list containing the trained model.

References

P.M. Esperança, Thomas S. Churcher (2019). "Machine learning based epidemiological vector control monitoring using functional data analysis techniques for near-infrared spectral data". arXiv.

P.T. Reiss, R.T. Ogden (2007). "Functional Principal Component Regression and Functional Partial Least Squares". Journal of the American Statistical Association, 102(479), 984-996.

See Also

fdaML_predict


pmesperanca/mlevcm documentation built on March 17, 2021, 10:03 p.m.