ModelTrain | R Documentation |
ModelTrain is a generic S3 function that fits a series of classification or regression models to sets of descriptors and computes cross-validated measures of model performance.
ModelTrain(...)
## Default S3 method:
ModelTrain(
x,
y,
nfolds = 10,
nsplits = 3,
seed.in = NA,
des.names = NA,
models = c("NNet", "PLS", "LAR", "Lasso", "PLSLDA", "Tree", "SVM", "KNN", "RF"),
user.params = NULL,
verbose = FALSE,
...
)
## S3 method for class 'data.frame'
ModelTrain(
d,
ids = FALSE,
xcol.lengths = ifelse(ids, length(d) - 2, length(d) - 1),
xcols = NA,
nfolds = 10,
nsplits = 3,
seed.in = NA,
des.names = NA,
models = c("NNet", "PLS", "LAR", "Lasso", "PLSLDA", "Tree", "SVM", "KNN", "RF"),
user.params = NULL,
verbose = FALSE,
...
)
... |
Additional parameters. |
x |
a list of numeric descriptor set matrices. At the moment, only binary and continuous descriptors are supported. Binary descriptors should be numeric (0 or 1). |
y |
a numeric vector containing the binary or continuous response. |
nfolds |
the number of folds to use for each cross validation split. |
nsplits |
the number of splits to use for repeated cross validation. |
seed.in |
a numeric vector with length equal to |
des.names |
a character vector specifying the names for each descriptor set. The length of the vector must match the number of descriptor sets.
If |
models |
a character vector specifying the regression or classification models to use. The strings must match models implemented in 'chemmodlab' (see Details). |
user.params |
a list of data frames where each data frame contains
the parameter values for a model. The list should have the format of
the list constructed by |
verbose |
verbose mode or not? |
d |
a data frame containing an (optional) ID column, a response column, and descriptor columns. The columns should be provided in this order. |
ids |
a logical. Is an ID column provided? |
xcol.lengths |
a vector of integers. It is assumed that the columns
in |
xcols |
A list of integer vectors. Each vector contains
column indices
of |
Multiple descriptor sets can be specified by the user. For each descriptor set, repeated k-fold cross validation is performed for the specified regression and/or classification models.
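As a rough illustration of how contiguous descriptor sets relate to column positions, the sketch below turns a vector of set sizes into a list of column index vectors. It assumes the column order described for d (optional ID, then response, then descriptors) and reuses the sizes from the aid364 example; whether ModelTrain computes indices this way internally is an assumption, and the variable names first.x, starts and ends are hypothetical.

```r
# Assumed layout: ID column, response column, then descriptor columns
ids <- TRUE
xcol.lengths <- c(24, 147)  # sizes of two contiguous descriptor sets

# First descriptor column comes after the ID (if any) and the response
first.x <- if (ids) 3 else 2

# Last and first column index of each descriptor set
ends <- first.x - 1 + cumsum(xcol.lengths)
starts <- ends - xcol.lengths + 1

# One integer vector of column indices per descriptor set
xcols <- Map(seq, starts, ends)
lengths(xcols)
```

A list of this shape matches the description of xcols (a list of integer vectors of column indices), which is presumably the more flexible option when descriptor sets are not stored contiguously.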
Not all modeling strategies will be appropriate for all response types. For example, partial least squares linear discriminant analysis ("PLSLDA") is not directly appropriate for continuous response assays such as percent inhibition, but it can be applied once a threshold value for percent inhibition is used to create a binary (active/inactive) response.
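The thresholding step mentioned above can be sketched in base R as follows. The percent inhibition values and the cutoff of 50 are made up for illustration and do not come from any chemmodlab data set.

```r
# Hypothetical percent-inhibition values (illustrative only)
pct.inhibition <- c(12.5, 48.0, 73.2, 5.1, 91.4, 50.0)

# Assumed activity cutoff, chosen only for this example
threshold <- 50

# Code the binary response as numeric 0/1 (inactive/active)
active <- as.numeric(pct.inhibition >= threshold)
active
```

The resulting numeric 0/1 vector has the form described for a binary response y.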
See https://jrash.github.io/chemmodlab/ for more information about the models available (including model default parameters). The default value for argument models includes only some of the possible values.
Sensible default values are selected for each tunable model parameter; however, users may set any parameter manually using MakeModelDefaults and user.params.
ModelTrain predictions are based on k-fold cross-validation, where the dataset is randomly divided into k parts, each containing approximately equal numbers of compounds. One part is held out as a "test set", and the remaining k-1 parts are combined into a "training set" that is used to build a model from the desired modeling technique and descriptor set. This model is then applied to the "test set" to obtain predictions. The process is repeated, holding out each of the k parts in turn. One advantage of k-fold cross-validation is a reduction in the bias that arises from using the same data to both build and assess a model. Another advantage is the increased precision of error estimation offered by k-fold cross-validation over a one-time split.
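The procedure described above can be sketched in a few lines of base R. This is a minimal illustration, not the ModelTrain implementation: it uses an ordinary linear model (lm) in place of the chemmodlab models, the built-in USArrests data with Murder as the response, and an arbitrary choice of k = 5 and seed.

```r
# Base R only; USArrests ships with the 'datasets' package
set.seed(1)  # arbitrary seed so the fold assignment is reproducible

k <- 5
n <- nrow(USArrests)

# Randomly assign each observation to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = n))

# Held-out predictions, filled in one fold at a time
cv.pred <- numeric(n)
for (i in seq_len(k)) {
  test <- fold == i
  # Build the model on the k-1 training folds only
  fit <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests[!test, ])
  # Predict the fold that was held out
  cv.pred[test] <- predict(fit, newdata = USArrests[test, ])
}

# Cross-validated root mean squared error
cv.rmse <- sqrt(mean((USArrests$Murder - cv.pred)^2))
cv.rmse
```

Every observation is predicted exactly once, by a model that never saw it during fitting, which is what removes the optimistic bias of assessing a model on its own training data.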
Recognizing that the definition of folds in k-fold cross validation
may have an impact on the observed performance measures, all models
are built using the same definition of folds. This process is repeated
to obtain multiple separate k-fold cross validation runs resulting in
multiple separate definitions of folds. The number of these "splits"
is specified by nsplits.
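The idea of sharing fold definitions across models, and repeating the whole exercise for several splits, can be sketched as below. This is an illustration under stated assumptions rather than the package's code: two lm formulas stand in for different models, and the helper cv.rmse, the object fold.defs, and the seed are hypothetical names and values chosen for the example.

```r
set.seed(2)  # arbitrary seed
nsplits <- 3
nfolds <- 5
n <- nrow(USArrests)

# One column of fold labels per split: an n x nsplits matrix
fold.defs <- replicate(nsplits, sample(rep(seq_len(nfolds), length.out = n)))

# Two stand-in "models" that will be compared on identical folds
formulas <- list(small = Murder ~ Assault,
                 full  = Murder ~ Assault + UrbanPop + Rape)

# Cross-validated RMSE for one formula under one fold definition
cv.rmse <- function(fold, form) {
  pred <- numeric(n)
  for (i in unique(fold)) {
    test <- fold == i
    fit <- lm(form, data = USArrests[!test, ])
    pred[test] <- predict(fit, newdata = USArrests[test, ])
  }
  sqrt(mean((USArrests$Murder - pred)^2))
}

# Rows are splits, columns are models; each row uses the same folds for both models
rmse <- sapply(formulas, function(form) apply(fold.defs, 2, cv.rmse, form = form))
rmse
```

Because both models are scored on the same fold assignments within a row, differences between columns reflect the models, while differences between rows reflect sensitivity to the fold definition.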
Observed performance measures are assessed across all splits using
CombineSplits. This function assesses how sensitive performance
measures are to fold assignments, or changes to the training and test sets.
Statistical tests are used to determine the best performing model and
descriptor set combination.
A list of class chemmodlab is returned, containing:
all.preds |
a list of lists of data frames. The elements of the outer
list correspond to each CV split performed by |
all.probs |
a list of lists of data frames. Constructed only if there is
a binary response. The structure is the same as |
model.acc |
a list of lists of model accuracy measures. The elements of
the outer list correspond to each CV split performed by |
.
classify |
a logical. Were classification models used for binary response? |
responses |
a numeric vector. The observed value of the response. |
data |
a list of numeric matrices. Each matrix is a descriptor set used as model input. |
params |
a list of data frames as made by
|
des.names |
a character vector specifying the descriptor set names. NA if unspecified. |
models |
a character vector specifying the models fit to the data. |
nsplits |
number of CV splits performed. |
default: Default S3 method
data.frame: S3 method for class 'data.frame'
Jacqueline Hughes-Oliver, Jeremy Ash, Atina Brooks
chemmodlab, plot.chemmodlab, CombineSplits
## Not run:
# A data set with binary response and multiple descriptor sets
data(aid364)
cml <- ModelTrain(aid364, ids = TRUE, xcol.lengths = c(24, 147),
des.names = c("BurdenNumbers", "Pharmacophores"))
cml
## End(Not run)
# A continuous response
cml <- ModelTrain(USArrests, nsplits = 2, nfolds = 2,
models = c("KNN", "Lasso", "Tree"))
cml