fit: Fit a supervised data mining model (classification or...

View source: R/model.R

fitR Documentation

Fit a supervised data mining model (classification or regression) model

Description

Fit a supervised data mining model (classification or regression) model. Wrapper function that allows to fit distinct data mining (16 classification and 18 regression) methods under the same coherent function structure. Also, it tunes the hyperparameters of the models (e.g., kknn, mlpe and ksvm) and performs some feature selection methods.

Usage

fit(x, data = NULL, model = "default", task = "default", 
    search = "heuristic", mpar = NULL, feature = "none", 
    scale = "default", transform = "none", 
    created = NULL, fdebug = FALSE, ...)

Arguments

x

a symbolic description (formula) of the model to be fit.
If data=NULL it is assumed that x contains a formula expression with known variables (see first example below).

data

an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.

model

Typically this should be a character object with the model type name (data mining method, as explained in valid character options).

First usage: individual fit. Valid character options are the typical R base learning functions (individual models), namely one of:

  • naive most common class (classification) or mean output value (regression)

  • ctree – conditional inference tree (classification and regression, uses ctree)

  • cv.glmnet – generalized linear model (GLM) with lasso or elasticnet regularization (classification and regression, uses cv.glmnet; note: cross-validation is used to automatically set the lambda parameter that is needed to compute the predictions)

  • rpart or dt – decision tree (classification and regression, uses rpart)

  • kknn or knn – k-nearest neighbor (classification and regression, uses kknn)

  • ksvm or svm – support vector machine (classification and regression, uses ksvm)

  • lssvm – least squares support vector machine (pure classification only, uses lssvm)

  • mlp – multilayer perceptron with one hidden layer (classification and regression, uses nnet (in this version, for both mlp and mlpe, the maximum number of weights was increased and fixed to MaxNWts=10000))

  • mlpe – multilayer perceptron ensemble (classification and regression, uses nnet)

  • randomForest or randomforest – random forest algorithm (classification and regression, uses randomForest)

  • xgboost – eXtreme Gradient Boosting (Tree) (classification and regression, uses xgboost; note: nrounds parameter is set by default to 2)

  • bagging – bagging from Breiman, 1996 (classification, uses bagging)

  • boosting – adaboost.M1 method from Freund and Schapire, 1996 (classification, uses boosting)

  • lda – linear discriminant analysis (classification, uses lda)

  • multinom or lr – logistic regression (classification, uses multinom)

  • naiveBayes or naivebayes – naive bayes (classification, uses naiveBayes)

  • qda – quadratic discriminant analysis (classification, uses qda)

  • cubist – M5 rule-based model (regression, uses cubist)

  • lm – standard multiple/linear regression (uses lm)

  • mr – multiple regression (regression, equivalent to lm but uses nnet with zero hidden nodes and linear output function)

  • mars – multivariate adaptive regression splines (regression, uses mars)

  • pcr – principal component regression (regression, uses pcr)

  • plsr – partial least squares regression (regression, uses plsr)

  • cppls – canonical powered partial least squares (regression, uses cppls)

  • rvm – relevance vector machine (regression, uses rvm)

Second usage: multiple models. model can be used to perform Automated Machine Learning (AutoML) or ensembles of several individual models:

  • auto – first, the best model is automatically set by searching all models defined in search and selecting the one with the best “validation” metric on a validation set (depending on the method defined in search); then, the selected best model is fit to all training data. When auto is used, a ranked leaderboard of the models (and their selected hyperparameters) is returned as a new $LB field of the @mpar returned slot (e.g., try: print(M@mpar$LB), where M is an object returned by fit).

  • AE, WE or SE – all individual models are first fit to the data; then an ensemble is built by: AE – Average Ensemble, majority (if task=="class") or average of the predictions; WE) – Weighted Ensemble, similar to AE but each prediction is weighted according to the validation metric (for task=="class" it is equal to AE); SE – Stacking Ensemble, applies a second-level GLM to weight the individual predictions. For any ensemble, when an individual model produces an error then it is excluded from the ensemble. After excluding invalid models, if there is just a single model then such model is returned (and no ensemble is produced).

Third usage: model can be a list with 2 possibilities of fields A) and B).
A) if you have your one fit function, then you can embed it using:

  • $fit – a fit function that accepts the arguments x, data and ..., the goal is to accept here any R classification or regression model, mainly for its use within the mining or Importance functions, or to use a hyperparameter search (via search).

  • $predict – a predict function that accepts the arguments object, newdata, this function should behave as any rminer prediction, i.e., return: a factor when task=="class"; a matrix with Probabilities x Instances when task=="prob"; and a vector when task=="reg".

  • $name – optional field with the name of the method.

B) automatically produced by some ensemble methods, for the sake of documentation the fields for the ensembles ("AE", "WE" or "SE") are listed here:

  • $m – a vector character with the fit object model names.

  • $f – a vector list with several fit objects.

  • $w – a vector with the "weighting" of the individual models.

Note: current rminer version emphasizes the use of native fitting functions from their respective packages, since these functions contain several specific hyperparameters that can now be searched or set using the search or ... arguments. For compatibility with previous rminer versions, older model options are kept.

task

data mining task. Valid options are:

  • prob (or p) – classification with output probabilities (i.e. the sum of all outputs equals 1).

  • class (or c) – classification with discrete outputs (factor)

  • reg (or r) – regression (numeric output)

  • default tries to guess the best task (prob or reg) given the model and output variable type (if factor then prob else reg)

search

used to tune hyperparameter(s) of the model, such as: kknn – number of neighbors (k); mlp or mlpe – number of hidden nodes (size) or decay; ksvm – gaussian kernel parameter (sigma); randomForestmtry parameter).
This is a very flexible argument that can be used under several options: simpler use, complex tuning of an individual model or multiple models. The simpler use is kept for compatibility issues but it is advised to define this argument via the easier mparheuristic function.
Valid options for a simpler character type search use:

  • heuristic – simple heuristic, one search parameter (e.g., size=inputs/2 for mlp or size=10 if classification and inputs/2>10, sigma is set using kpar="automatic" and kernel="rbfdot" of ksvm). Important Note: instead of the "heuristic" options, it is advisable to use the explicit mparheuristic function that is designed for a wider option of models (all "heuristic" options were kept due to compatibility issues and work only for: kknn; mlp or mlpe; ksvm, with kernel="rbfdot"; and randomForest).

  • heuristic5 – heuristic with a 5 range grid-search (e.g., seq(1,9,2) for kknn, seq(0,8,2) for mlp or mlpe, 2^seq(-15,3,4) for ksvm, 1:5 for randomRorest)

  • heuristic10 – heuristic with a 10 range grid-search (e.g., seq(1,10,1) for kknn, seq(0,9,1) for mlp or mlpe, 2^seq(-15,3,2) for ksvm, 1:10 for randomRorest)

  • UD, UD1 or UD2 – uniform design 2-Level with 13 (UD or UD2) or 21 (UD1) searches (only works for ksvm and kernel="rbfdot").

Another simpler use of the search argument is a numeric type:

  • a-vector – numeric vector with all hyperparameter values that will be searched within an internal grid-search (the number of searches is length(search) when convex=0)

A more complex but advised use of search is to use a list. Non expert users should create this list via the mparheuristic function, which is very easy to use. Nevertheless, the fields of the list for a single fit (individual model) are shown here:

  • $smethod – type of search method. Valid options are:

    • none – no search is executed, one single fit is performed.

    • matrix – matrix search (tests only n searches, all search parameters are of size n).

    • grid – normal grid search (tests all combinations of search parameters).

    • 2L - nested 2-Level grid search. First level range is set by $search and then the 2nd level performs a fine tuning, with length($search) searches around (original range/2) best value in first level (2nd level is only performed on numeric searches).

    • UD, UD1 or UD2 – uniform design 2-Level with 13 (UD or UD2) or 21 (UD1) searches (note: only works for model="ksvm" and kernel="rbfdot"). Under this option, $search should contain the first level ranges, such as c(-15,3,-5,15) for classification (gamma min and max, C min and max, after which a 2^ transform is applied) or c(-8,0,-1,6,-8,-1) for regression (last two values are epsilon min and max, after which a 2^ transform is applied).

  • $search – a-list with all hyperparameter values to be searched or character with previous described options (e.g., "heuristic", "heuristic5", "UD"). If a character, then $smethod equal to "none" or "grid" or "UD" is automatically assumed.

  • $convex – number that defines how many searches are performed after a local minimum/maximum is found (if >0, the search can be stopped without testing all grid-search values)

  • $method – type of internal (validation) estimation method used during the search (see method argument of mining for details)

  • $metric – used to compute a metric value during internal estimation. Can be a single character such as "SAD" or a list with all the arguments used by the mmetric function except y and x, such as:
    search$metric=list(metric="AUC",TC=3,D=0.7). See mmetric for more details.

A more sophisticated definition of search involves the tuning of several models (used by the model= auto, AE, WE or SE). Again, this sophisticated definition should be automatically set using the mparheuristic function. The list of fields for the multiple tuning mode are:

  • $models - a vector character with LM individual model values. This field can also include ensembles ("AE", "WE", "SE") provided they appear at the end of this vector. They will work if more than one valid individual model is included.

  • $ls - a vector list with LM search values (for each individual model, the values are the same as in individual search $search field).

  • $smethod - must have the auto value.

  • $smethod - must have the auto value.

  • $method - internal (validation) estimation method (equal to the individual search $method field).

  • $metric - internal (validation) estimation metric (equal to the individual search $metric field).

  • $convex - equal to the individual search $convex field.

Note: the mpar argument only appears due to compatibility issues. If used, then the mpar values are automatically fed into search. However, a direct use of the search argument is advised instead of mpar, since search is more flexible and powerful.

mpar

(important note: this argument only is kept in this version due to compatibility with previous rminer versions. Instead of mpar, you should use the more flexible and powerful search argument.)

vector with extra default (fixed) model parameters (used for modeling, search and feature selection) with:

  • c(vmethod,vpar,metric) – generic use of mpar (including most models);

  • c(C,epsilon,vmethod,vpar,metric) – if ksvm and C and epsilon are explicitly set;

  • c(nr,maxit,vmethod,vpar,metric) – if mlp or mlpe and nr and maxit are explicitly set;

C and epsilon are default values for svm (if any of these is =NA then heuristics are used to set the value).
nr is the number of mlp runs or mlpe individual models, while maxit is the maximum number of epochs (if any of these is =NA then heuristics are used to set the value).
For help on vmethod and vpar see mining.
metric is the internal error function (e.g., used by search to select the best model), valid options are explained in mmetric. When mpar=NULL then default values are used. If there are NA values (e.g., mpar=c(NA,NA)) then default values are used.

feature

feature selection and sensitivity analysis control. Valid fit function options are:

  • none – no feature selection;

  • a fmethod character value, such as sabs (see below);

  • a-vector – vector with c(fmethod,deletions,Runs,vmethod,vpar,defaultsearch)

  • a-vector – vector with c(fmethod,deletions,Runs,vmethod,vpar)

fmethod sets the type. Valid options are:

  • sbs – standard backward selection;

  • sabs – sensitivity analysis backward selection (faster);

  • sabsv – equal to sabs but uses variance for sensitivity importance measure;

  • sabsr – equal to sabs but uses range for sensitivity importance measure;

  • sabsg – equal to sabs (uses gradient for sensitivity importance measure);

deletions is the maximum number of feature deletions (if -1 not used).
Runs is the number of runs for each feature set evaluation (e.g., 1).
For help on vmethod and vpar see mining.
defaultsearch is one hyperparameter used during the feature selection search, after selecting the best feature set then search is used (faster). If not defined, then search is used during feature selection (may be slow).
When feature is a vector then default values are used to fill missing values or NA values. Note: feature selection capabilities are expected to be enhanced in next rminer versions.

scale

if data needs to be scaled (i.e. for mlp or mlpe). Valid options are:

  • default – uses scaling when needed (i.e. for mlp or mlpe)

  • none – no scaling;

  • inputs – standardizes (0 mean, 1 st. deviation) input attributes;

  • all – standardizes (0 mean, 1 st. deviation) input and output attributes;

If needed, the predict function of rminer performs the inverse scaling.

transform

if the output data needs to be transformed (e.g., log transform). Valid options are:

  • none – no transform;

  • log – y=(log(y+1)) (the inverse function is applied in the predict function);

  • positive – all predictions are positive (negative values are turned into zero);

  • logpositive – both log and logpositive;

created

time stamp for the model. By default, the system time is used. Else, you can specify another time.

fdebug

if TRUE show some search details.

...

additional and specific parameters send to each fit function model (e.g., dt, rpart, randomforest, kernlab). A few examples:
– the rpart function is used for decision trees, thus you can have:
control=rpart.control(cp=.05) (see crossvaldata example).
– the ksvm function is used for support vector machines, thus you can change the kernel type: kernel="polydot" (see examples below).
Important note: if you use package functions and get an error, then try to explicitly define the package. For instance, you might need to use fit(several-arguments,control=Cubist::cubistControl()) instead of
fit(several-arguments,control=cubistControl()).

Details

Fits a classification or regression model given a data.frame (see [Cortez, 2010] for more details). The ... optional arguments should be used to fix values used by specific model functions (see examples). Notes:
- if there is an error in the fit, then a warning is issued (see example).
- the new search argument is very flexible and allows a powerful design of supervised learning models.
- the search correct use is very dependent on the R learning base functions. For example, if you are tuning model="rpart" then read carefully the help of function rpart.
- mpar argument is only kept due to compatibility issues and should be avoided; instead, use the more flexible search.

Details about some models:

  • Neural Network: mlp trains nr multilayer perceptrons (with maxit epochs, size hidden nodes and decay value according to the nnet function) and selects the best network according to minimum penalized error ($value). mlpe uses an ensemble of nr networks and the final prediction is given by the average of all outputs. To tune mlp or mlpe you can use the search parameter, which performs a grid search for size or decay.

  • Support Vector Machine: svm adopts by default the gaussian (rbfdot) kernel. For classification tasks, you can use search to tune sigma (gaussian kernel parameter) and C (complexity parameter). For regression, the epsilon insensitive function is adopted and there is an additional hyperparameter epsilon.

  • Other methods: Random Forest – if needed, you can tune several parameters, including the default mtry parameter adopted by search heuristics; k-nearest neighbor – search by default tunes k. The remaining models can also be tunned but a full definition of search is required (e.g., with $smethod, $search and other fields); please check mparheuristic function for further tuning examples (e.g., rpart).

Value

Returns a model object. You can check all model elements with str(M), where M is a model object. The slots are:

  • @formula – the x;

  • @model – the model;

  • @task – the task;

  • @mpar – data.frame with the best model parameters (interpretation depends on model);

  • @attributes – the attributes used by the model;

  • @scale – the scale;

  • @transform – the transform;

  • @created – the date when the model was created;

  • @time – computation effort to fit the model;

  • @object – the R object model (e.g., rpart, nnet, ...);

  • @outindex – the output index (of @attributes);

  • @levels – if task=="prob"||task=="class" stores the output levels;

  • @error – similarly to mining this is the "validation" error for some search options;

Note

See also http://hdl.handle.net/1822/36210 and http://www3.dsi.uminho.pt/pcortez/rminer.html

Author(s)

Paulo Cortez https://pcortez.dsi.uminho.pt

References

  • To check for more details about rminer and for citation purposes:
    P. Cortez.
    Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
    In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
    @Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
    http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf

  • This tutorial shows additional code examples:
    P. Cortez.
    A tutorial on using the rminer R package for data mining tasks.
    Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes, Portugal, July 2015.
    http://hdl.handle.net/1822/36210

  • For the grid search and other optimization methods:
    P. Cortez.
    Modern Optimization with R.
    Use R! series, Springer, 2nd edition, July 2021, ISBN 978-3-030-72818-2.
    https://link.springer.com/book/10.1007/978-3-030-72819-9

  • The automl is inspired in this work:
    L. Ferreira, A. Pilastri, C. Martins, P. Santos, P. Cortez.
    An Automated and Distributed Machine Learning Framework for Telecommunications Risk Management. In J. van den Herik et al. (Eds.), Proceedings of 12th International Conference on Agents and Artificial Intelligence – ICAART 2020, Volume 2, pp. 99-107, Valletta, Malta, February, 2020, SCITEPRESS, ISBN 978-989-758-395-7.
    @INSTICC: http://hdl.handle.net/1822/66818

  • For the sabs feature selection:
    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
    Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
    \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1016/j.dss.2009.05.016")}

  • For the uniform design details:
    C.M. Huang, Y.J. Lee, D.K.J. Lin and S.Y. Huang.
    Model selection for support vector machines via uniform design,
    In Computational Statistics & Data Analysis, 52(1):335-346, 2007.

See Also

mparheuristic,mining, predict.fit, mgraph, mmetric, savemining, CasesSeries, lforecast, holdout and Importance. Check all rminer functions using: help(package=rminer).

Examples

### dontrun is used when the execution of the example requires some computational effort.

### simple regression (with a formula) example.
x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
M=fit(y~x1+x2,model="mlpe")
new1=rnorm(100,100,20); new2=rnorm(100,100,20)
ynew=0.7*sin(new1/(25*pi))+0.3*sin(new2/(25*pi))
P=predict(M,data.frame(x1=new1,x2=new2,y=rep(NA,100)))
print(mmetric(ynew,P,"MAE"))

### simple classification example.
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
plot(M@object); text(M@object) # show model
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF"))
print(mmetric(iris$Species,P,"ALL"))
mgraph(iris$Species,P,graph="ROC",TC=2,main="versicolor ROC",
baseline=TRUE,leg="Versicolor",Grid=10)

M2=fit(Species~.,iris,model="ctree")
plot(M2@object) # show model
P2=predict(M2,iris)
print(mmetric(iris$Species,P2,"CONF"))

# ctree with different setup:
# (ctree_control is from the party package)
M3=fit(Species~.,iris,model="ctree",controls = party::ctree_control(testtype="MonteCarlo"))
plot(M3@object) # show model

## End(Not run)

### simple binary classification example with cv.glmnet and xgboost
## Not run: 
data(sa_ssin_2)
H=holdout(sa_ssin_2$y,ratio=2/3)
# cv.glmnet:
M=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",task="cla") # pure classes
P=predict(M,sa_ssin_2[H$ts,])
cat("1st prediction, class:",as.character(P[1]),"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P,"CONF")$conf)

M2=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet") # probabilities
P2=predict(M2,sa_ssin_2[H$ts,])
L=M2@levels
cat("1st prediction, prob:",L[1],"=",P2[1,1],",",L[2],"=",P2[1,2],"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"CONF")$conf)
cat("AUC of ROC curve:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"AUC"))

M3=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",nfolds=3) # use 3 folds instead of 10
plot(M3@object) # show cv.glmnet object
P3=predict(M3,sa_ssin_2[H$ts,])

# xgboost:
M4=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",verbose=1) # nrounds=2, show rounds:
P4=predict(M4,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P4,"AUC"))
M5=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",nrounds=3,verbose=1) # nrounds=3, show rounds:
P5=predict(M5,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P5,"AUC"))

## End(Not run)

### classification example with discrete classes, probabilities and holdout
## Not run: 
data(iris)
H=holdout(iris$Species,ratio=2/3)
M=fit(Species~.,iris[H$tr,],model="ksvm",task="class")
M1=fit(Species~.,iris[H$tr,],model="lssvm") # default task="class" is assumed
M2=fit(Species~.,iris[H$tr,],model="ksvm",task="prob")
P=predict(M,iris[H$ts,]) # classes
P1=predict(M1,iris[H$ts,]) # classes
P2=predict(M2,iris[H$ts,]) # probabilities
print(mmetric(iris$Species[H$ts],P,"CONF"))
print(mmetric(iris$Species[H$ts],P1,"CONF"))
print(mmetric(iris$Species[H$ts],P2,"CONF"))
print(mmetric(iris$Species[H$ts],P,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"AUC"))

### exploration of some rminer classification models:
models=c("lda","naiveBayes","kknn","randomForest","cv.glmnet","xgboost")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(Species~.,iris[H$tr,],model=m)
   P=predict(M,iris[H$ts,])
   print(mmetric(iris$Species[H$ts],P,"AUC")[[1]])
 }

## End(Not run)

### classification example with hyperparameter selection 
###    note: for regression, similar code can be used
### SVM 
## Not run: 
data(iris)
# large list of SVM configurations:
# SVM with kpar="automatic" sigma rbfdot kernel estimation and default C=1:
#  note: each execution can lead to different M@mpar due to sigest stochastic nature:
M=fit(Species~.,iris,model="ksvm")
print(M@mpar) # model hyperparameters/arguments
# same thing, explicit use of mparheuristic:
M=fit(Species~.,iris,model="ksvm",search=list(search=mparheuristic("ksvm")))
print(M@mpar) # model hyperparameters

# SVM with C=3, sigma=2^-7
M=fit(Species~.,iris,model="ksvm",C=3,kpar=list(sigma=2^-7))
print(M@mpar)
# SVM with different kernels:
M=fit(Species~.,iris,model="ksvm",kernel="polydot",kpar="automatic") 
print(M@mpar)
# fit already has a scale argument, thus the only way to fix scale of "tanhdot"
# is to use the special search argument with the "none" method:
s=list(smethod="none",search=list(scale=2,offset=2))
M=fit(Species~.,iris,model="ksvm",kernel="tanhdot",search=s) 
print(M@mpar)
# heuristic: 10 grid search values for sigma, rbfdot kernel (fdebug is used only for more verbose):
s=list(search=mparheuristic("ksvm",10)) # advised "heuristic10" usage
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# same thing, uses older search="heuristic10"
M=fit(Species~.,iris,model="ksvm",search="heuristic10",fdebug=TRUE)
print(M@mpar)
# identical search under a different and explicit code:
s=list(search=2^seq(-15,3,2))
M=fit(Species~.,iris,model="ksvm",search=2^seq(-15,3,2),fdebug=TRUE)
print(M@mpar)

# uniform design "UD" for sigma and C, rbfdot kernel, two level of grid searches, 
# under exponential (2^x) search scale:
M=fit(Species~.,iris,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)
M=fit(Species~.,iris,model="ksvm",search="UD1",fdebug=TRUE)
print(M@mpar)
# now the more powerful search argument is used for modeling SVM:
# grid 3 x 3 search:
s=list(smethod="grid",search=list(sigma=2^c(-15,-5,3),C=2^c(-5,0,15)),convex=0,
            metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# identical search with different argument smethod="matrix" 
s$smethod="matrix"
s$search=list(sigma=rep(2^c(-15,-5,3),times=3),C=rep(2^c(-5,0,15),each=3))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best kernel (only works for kpar="automatic"):
s=list(smethod="grid",search=list(kernel=c("rbfdot","laplacedot","polydot","vanilladot")),
       convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best parameters of "rbfdot" or "laplacedot" (which use same kpar):
s$search=list(kernel=c("rbfdot","laplacedot"),sigma=2^seq(-15,3,5))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)

### randomForest
# search for mtry and ntree
s=list(smethod="grid",search=list(mtry=c(1,2,3),ntree=c(100,200,500)),
            convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="randomForest",search=s,fdebug=TRUE)
print(M@mpar)

### rpart
# simpler way to tune cp in 0.01 to 0.9 (10 searches):
s=list(search=mparheuristic("rpart",n=10,lower=0.01,upper=0.9),method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

# same thing but with more lines of code
# note: this code can be adapted to tune other rpart parameters,
#       while mparheuristic only tunes cp
# a vector list needs to be used for the search$search parameter
lcp=vector("list",10) # 10 grid values for the complexity cp
names(lcp)=rep("cp",10) # same cp name 
scp=seq(0.01,0.9,length.out=10) # 10 values from 0.01 to 0.18
for(i in 1:10) lcp[[i]]=scp[i] # cycle needed due to [[]] notation
s=list(smethod="grid",search=list(control=lcp),
            convex=0,metric="AUC",method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

### ctree 
# simpler way to tune mincriterion in 0.1 to 0.98 (9 searches):
mint=c("kfold",3,123) # internal validation method
s=list(search=mparheuristic("ctree",n=8,lower=0.1,upper=0.99),method=mint)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)
# same thing but with more lines of code
# note: this code can be adapted to tune other ctree parameters,
#       while mparheuristic only tunes mincriterion
# a vector list needs to be used for the search$search parameter
lmc=vector("list",9) # 9 grid values for the mincriterion
smc=seq(0.1,0.99,length.out=9)
for(i in 1:9) lmc[[i]]=party::ctree_control(mincriterion=smc[i]) 
s=list(smethod="grid",search=list(controls=lmc),method=mint,convex=0)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)

### some MLP fitting examples:
# simplest use:
M=fit(Species~.,iris,model="mlpe")  
print(M@mpar)
# same thing, with explicit use of mparheuristic:
M=fit(Species~.,iris,model="mlpe",search=list(search=mparheuristic("mlpe")))
print(M@mpar) # hidden nodes and number of ensemble mlps
# setting some nnet parameters:
M=fit(Species~.,iris,model="mlpe",size=3,decay=0.1,maxit=100,rang=0.9) 
print(M@mpar) # mlpe hyperparameters
# MLP, 5 grid search fdebug is only used to put some verbose in the console:
s=list(search=mparheuristic("mlpe",n=5)) # 5 searches for size
print(s) # show search
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# previous searches used a random holdout (seed=NULL), now a fixed seed (123) is used:
s=list(smethod="grid",search=mparheuristic("mlpe",n=5),convex=0,metric="AUC",
            method=c("holdout",2/3,123))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# faster and greedy grid search:
s$convex=1;s$search=list(size=0:9)
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# 2 level grid with total of 5 searches 
#  note of caution: some "2L" ranges may lead to non integer (e.g., 1.3) values at
#  the 2nd level search. And some R functions crash if non integer values are used for
#  integer parameters.
s$smethod="2L";s$convex=0;s$search=list(size=c(4,8,12))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)

# testing of all 17 rminer classification methods:
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","lssvm","mlp","mlpe",
 "randomForest","xgboost","bagging","boosting","lda","multinom","naiveBayes","qda")
inputs=ncol(iris)-1
ho=holdout(iris$Species,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=iris[ho$ts,ncol(iris)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(Species~.,data=iris[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,iris[ho$ts,])
  cat("predicted ACC:",round(mmetric(Y,P,metric="ACC"),1),"\n")
 }


## End(Not run)

### example of an error (warning) generated using fit:
## Not run: 
data(iris)
# size needs to be a positive integer, thus 0.1 leads to an error:
M=fit(Species~.,iris,model="mlp",size=0.1)  
print(M@object)

## End(Not run)

### exploration of some rminer regression models:
## Not run: 
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=2/3,seed=12345)
models=c("lm","mr","ctree","mars","cubist","cv.glmnet","xgboost","rvm")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(y~.,sa_ssin[H$tr,],model=m)
   P=predict(M,sa_ssin[H$ts,])
   print(mmetric(sa_ssin$y[H$ts],P,"MAE"))
 }

## End(Not run)

# testing of all 18 rminer regression methods:
## Not run: 
model=c("naive","ctree","cv.glmnet","rpart","kknn","ksvm","mlp","mlpe",
 "randomForest","xgboost","cubist","lm","mr","mars","pcr","plsr","cppls","rvm")
# note: in this example, default values are considered for the hyperparameters.
# better results can be achieved by tuning hyperparameters via improved usage
# of the search argument (via mparheuristic function or written code)
data(iris)
ir2=iris[,1:4] # predict iris "Petal.Width"
names(ir2)[ncol(ir2)]="y" # change output name
inputs=ncol(ir2)-1
ho=holdout(ir2$y,2/3,seed=123) # 2/3 for training and 1/3 for testing
Y=ir2[ho$ts,ncol(ir2)]
for(i in 1:length(model))
 {
  cat("i:",i,"model:",model[i],"\n")
  search=list(search=mparheuristic(model[i])) # rminer default values
  M=fit(y~.,data=ir2[ho$tr,],model=model[i],search=search,fdebug=TRUE)
  P=predict(M,ir2[ho$ts,])
  cat("predicted MAE:",round(mmetric(Y,P,metric="MAE"),1),"\n")
 }

## End(Not run)

### regression example with hyperparameter selection:
## Not run: 
data(sa_ssin)
# some SVM experiments:
# default SVM:
M=fit(y~.,data=sa_ssin,model="svm")
print(M@mpar)
# SVM with (Cherkassy and Ma, 2004) heuristics to set C and epsilon:
M=fit(y~.,data=sa_ssin,model="svm",C=NA,epsilon=NA)
print(M@mpar)
# SVM with Uniform Design set sigma, C and epsilon:
M=fit(y~.,data=sa_ssin,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)

# sensitivity analysis feature selection
M=fit(y~.,data=sa_ssin,model="ksvm",search=list(search=mparheuristic("ksvm",n=5)),feature="sabs") 
print(M@mpar)
print(M@attributes) # selected attributes (1, 2 and 3 are the relevant inputs)

# example that shows how transform works:
M=fit(y~.,data=sa_ssin,model="mr") # linear regression
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P should be negative
print(P)
M=fit(y~.,data=sa_ssin,model="mr",transform="positive")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P is not negative
print(P)

## End(Not run)

### pure classification example with a generic R (not rminer default) model ###
## Not run: 
### nnet is adopted here but virtually ANY fitting function/package could be used:

# since the default nnet prediction is to provide probabilities, there is
# a need to create this "wrapping" function:
predictprob=function(object,newdata)
{ predict(object,newdata,type="class") }
# list with a fit and predict function:
# nnet::nnet (package::function)
model=list(fit=nnet::nnet,predict=predictprob,name="nnet")
data(iris)
# note that size is not a fit parameter and it is sent directly to nnet:
M=fit(Species~.,iris,model=model,size=3,task="class") 
P=predict(M,iris)
print(P)

## End(Not run) 

### multiple models: automl and ensembles 
## Not run: 
data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"

# consult the help of mparheuristic for more automl and ensemble examples:
#
# automatic machine learining (automl) with 5 distinct models and "SE" ensemble.
# the single models are tuned with 10 internal hyperparameter searches, 
# except ksvm that uses 13 searches via "UD".
# fit performs an internal validation 
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
M=fit(y~.,data=d,model="auto",search=search,fdebug=TRUE)
P=predict(M,d)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


# average ensemble of 5 distinct models
# the single models are tuned with 1 (heuristic) hyperparameter search 
sm2=mparheuristic(model="automl",n=NA,task="prob", inputs= inputs )
method=c("kfold",3,123)
search2=list(search=sm2,smethod="auto",method=method,metric=metric,convex=0)
M2=fit(y~.,data=d,model="AE",search=search2,fdebug=TRUE)
P2=predict(M,d)
cat("best model is:",M2@model,"\n")
cat(metric,"=",round(mmetric(d$y,P2,metric=metric),2),"\n")

# example with an invalid model exclusion: 
# in this case, randomForest produces an error and warning
# thus it is excluded from the leaderboard
sm=mparheuristic(model="automl3",n=NA,task="prob", inputs= inputs )
method=c("holdout",2/3,123)
search=list(search=sm,smethod="auto",method=method,metric=metric,convex=0)
d2=d
#
d2[,2]=as.factor(1:150) # force randomForest error
M=fit(y~.,data=d2,model="auto",search=search,fdebug=TRUE)
P=predict(M,d2)
# show leaderboard:
cat("> leaderboard models:",M@mpar$LB$model,"\n")
cat(">  validation values:",round(M@mpar$LB$eval,4),"\n")
cat("best model is:",M@model,"\n")
cat(metric,"=",round(mmetric(d$y,P,metric=metric),2),"\n")


## End(Not run)


rminer documentation built on Oct. 29, 2024, 9:06 a.m.

Related to fit in rminer...