fit: Fit a supervised data mining model (classification or regression)


View source: R/model.R

Description

Fit a supervised data mining (classification or regression) model. Wrapper function that allows fitting distinct data mining methods (16 classification and 18 regression) under the same coherent function structure. It also tunes the hyperparameters of the models (e.g. kknn, mlpe and ksvm) and performs some feature selection methods.

Usage

fit(x, data = NULL, model = "default", task = "default", 
    search = "heuristic", mpar = NULL, feature = "none", 
    scale = "default", transform = "none", 
    created = NULL, fdebug = FALSE, ...)

Arguments

x

a symbolic description (formula) of the model to be fit.
If data=NULL it is assumed that x contains a formula expression with known variables (see first example below).

data

an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula.

model

Typically this should be a character object with the model type name (data mining method, as explained in valid character options).

Valid character options are the typical R base learning functions, namely one of:

  • naive – most common class (classification) or mean output value (regression)

  • ctree – conditional inference tree (classification and regression, uses ctree from party package)

  • cv.glmnet – generalized linear model with lasso or elasticnet regularization (classification and regression, uses cv.glmnet from glmnet package; note: cross-validation is used to automatically set the lambda parameter that is needed to compute the predictions)

  • rpart or dt – decision tree (classification and regression, uses rpart from rpart package)

  • kknn or knn – k-nearest neighbor (classification and regression, uses kknn from kknn package)

  • ksvm or svm – support vector machine (classification and regression, uses ksvm from kernlab package)

  • mlp – multilayer perceptron with one hidden layer (classification and regression, uses nnet from nnet package)

  • mlpe – multilayer perceptron ensemble (classification and regression, uses nnet from nnet package)

  • randomForest or randomforest – random forest algorithm (classification and regression, uses randomForest from randomForest package)

  • xgboost – eXtreme Gradient Boosting (Tree) (classification and regression, uses xgboost from xgboost package; note: nrounds parameter is set by default to 2)

  • bagging – bagging (classification, uses bagging from adabag package)

  • boosting – boosting (classification, uses boosting from adabag package)

  • lda – linear discriminant analysis (classification, uses lda from MASS package)

  • multinom or lr – logistic regression (classification, uses multinom from nnet package)

  • naiveBayes or naivebayes – naive bayes (classification, uses naiveBayes from e1071 package)

  • qda – quadratic discriminant analysis (classification, uses qda from MASS package)

  • cubist – M5 rule-based model (regression, uses cubist from Cubist package)

  • lm – standard multiple/linear regression (uses lm)

  • mr – multiple regression (regression, equivalent to lm but uses nnet from nnet package with zero hidden nodes and linear output function)

  • mars – multivariate adaptive regression splines (regression, uses mars from mda package)

  • pcr – principal component regression (regression, uses pcr from pls package)

  • plsr – partial least squares regression (regression, uses plsr from pls package)

  • cppls – canonical powered partial least squares (regression, uses cppls from pls package)

  • rvm – relevance vector machine (regression, uses rvm from kernlab package)

model can also be a list with the fields (see example below):

  • $fit – a fit function that accepts the arguments x, data and ...; the goal is to accept here any R classification or regression model, mainly for its use within the mining or Importance functions, or to use a hyperparameter search (via search).

  • $predict – a predict function that accepts the arguments object and newdata; this function should behave as any rminer prediction, i.e., return: a factor when task=="class"; a matrix with Probabilities x Instances when task=="prob"; and a vector when task=="reg".

  • $name – optional field with the name of the method.

Note: current rminer version emphasizes the use of native fitting functions from their respective packages, since these functions contain several specific hyperparameters that can now be searched or set using the search or ... arguments. For compatibility with previous rminer versions, older model options are kept.
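
For illustration, a minimal sketch of such a list (a complete nnet-based example is given in the Examples section; the lm wrapper, the data frame d and the name "mylm" below are merely illustrative):

  d=data.frame(x1=rnorm(100),y=rnorm(100))
  mylm=list(
    fit=function(x,data=NULL,...) lm(x,data=data,...), # x is the formula
    predict=function(object,newdata) predict(object,newdata), # numeric vector, task="reg"
    name="mylm")
  M=fit(y~.,data=d,model=mylm,task="reg")
  P=predict(M,d)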

task

data mining task. Valid options are:

  • prob (or p) – classification with output probabilities (i.e. the sum of all outputs equals 1).

  • class (or c) – classification with discrete outputs (factor)

  • reg (or r) – regression (numeric output)

  • default tries to guess the best task (prob or reg) given the model and output variable type (if factor then prob else reg)
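
For instance (a small sketch; since iris$Species is a factor, "prob" would also be the default guess):

  data(iris)
  M1=fit(Species~.,iris,model="rpart",task="class") # predict(M1,iris) returns a factor
  M2=fit(Species~.,iris,model="rpart",task="prob")  # predict(M2,iris) returns a matrix of probabilities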

search

used to tune the hyperparameter(s) of the model, such as: kknn – number of neighbors (k); mlp or mlpe – number of hidden nodes (size) or decay; ksvm – Gaussian kernel parameter (sigma); randomForest – mtry parameter. Valid options for a simpler search use:

  • heuristic – simple heuristic, one search parameter (e.g. size=inputs/2 for mlp, or size=10 if classification and inputs/2>10; sigma is set using kpar="automatic" and kernel="rbfdot" of ksvm). Important note: instead of the "heuristic" options, it is advisable to use the explicit mparheuristic function, which is designed for a wider range of models (the "heuristic" options were kept only for compatibility and work only for: kknn; mlp or mlpe; ksvm, with kernel="rbfdot"; and randomForest).

  • heuristic5 – heuristic with a 5 range grid-search (e.g. seq(1,9,2) for kknn, seq(0,8,2) for mlp or mlpe, 2^seq(-15,3,4) for ksvm, 1:5 for randomForest)

  • heuristic10 – heuristic with a 10 range grid-search (e.g. seq(1,10,1) for kknn, seq(0,9,1) for mlp or mlpe, 2^seq(-15,3,2) for ksvm, 1:10 for randomForest)

  • UD, UD1 or UD2 – uniform design 2-Level with 13 (UD or UD2) or 21 (UD1) searches (only works for ksvm and kernel="rbfdot").

  • a-vector – numeric vector with all hyperparameter values that will be searched within an internal grid-search (the number of searches is length(search) when convex=0)

A more complex but advised use of search is to use a list with:

  • $smethod – type of search method. Valid options are (more options will be developed in next versions):

    • none – no search is executed, one single fit is performed.

    • matrix – matrix search (tests only n searches, all search parameters are of size n).

    • grid – normal grid search (tests all combinations of search parameters).

    • 2L – nested 2-level grid search. The first level range is set by $search and then the second level performs a fine tuning, with length($search) searches around the best value found in the first level, within half of the original range (the second level is only performed on numeric searches).

    • UD, UD1 or UD2 – uniform design 2-Level with 13 (UD or UD2) or 21 (UD1) searches (note: only works for model="ksvm" and kernel="rbfdot"). Under this option, $search should contain the first level ranges, such as c(-15,3,-5,15) for classification (gamma min and max, C min and max, after which a 2^ transform is applied) or c(-8,0,-1,6,-8,-1) for regression (last two values are epsilon min and max, after which a 2^ transform is applied).

  • $search – a-list with all hyperparameter values to be searched, or a character with the previously described options (e.g. "heuristic", "heuristic5", "UD"). If a character, then $smethod equal to "none", "grid" or "UD" is automatically assumed.

  • $convex – number that defines how many searches are performed after a local minimum/maximum is found (if >0, the search can be stopped without testing all grid-search values)

  • $method – type of internal estimation method used during the search (see method argument of mining for details)

  • $metric – used to compute a metric value during internal estimation. Can be a single character such as "SAD" or a list with all the arguments used by the mmetric function except y and x, such as
    search$metric=list(metric="AUC",TC=3,D=0.7). See mmetric for more details.

Note: if mpar argument is used, then the mpar values are automatically fed into search. However, a direct use of the search argument is advised instead of mpar, since search is more flexible and powerful.
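
Two compact sketches of search (the particular grid values below are merely illustrative): a plain numeric vector, tuning the kknn number of neighbors, and the advised list form, tuning the mlpe size hyperparameter with an internal 3-fold estimation and the "AUC" metric:

  data(iris)
  # vector form: internal grid search over 10 values of k:
  M1=fit(Species~.,iris,model="kknn",search=seq(1,19,2))
  # list form:
  s=list(smethod="grid",search=list(size=c(2,4,6)),convex=0,
         method=c("kfold",3,123),metric="AUC")
  M2=fit(Species~.,iris,model="mlpe",search=s)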

mpar

Important note: this argument is only kept in this version for compatibility with previous rminer versions. Instead of mpar, you should use the more flexible and powerful search argument.

vector with extra default (fixed) model parameters (used for modeling, search and feature selection) with:

  • c(vmethod,vpar,metric) – generic use of mpar (including most models);

  • c(C,epsilon,vmethod,vpar,metric) – if ksvm and C and epsilon are explicitly set;

  • c(nr,maxit,vmethod,vpar,metric) – if mlp or mlpe and nr and maxit are explicitly set;

C and epsilon are default values for svm (if any of these is NA, then heuristics are used to set the value).
nr is the number of mlp runs or mlpe individual models, while maxit is the maximum number of epochs (if any of these is NA, then heuristics are used to set the value).
For help on vmethod and vpar see mining.
metric is the internal error function (e.g. used by search to select the best model), valid options are explained in mmetric. When mpar=NULL then default values are used. If there are NA values (e.g. mpar=c(NA,NA)) then default values are used.
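
An illustrative sketch of the older vector format for mlpe, following c(nr,maxit,vmethod,vpar,metric) (the particular values are only examples; search remains the advised alternative):

  data(iris)
  # 3 mlpe individual models, up to 100 epochs, internal 2/3 holdout, "AUC" metric:
  M=fit(Species~.,iris,model="mlpe",mpar=c(3,100,"holdout",2/3,"AUC"))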

feature

feature selection and sensitivity analysis control. Valid fit function options are:

  • none – no feature selection;

  • a fmethod character value, such as sabs (see below);

  • a-vector – vector with c(fmethod,deletions,Runs,vmethod,vpar,defaultsearch)

  • a-vector – vector with c(fmethod,deletions,Runs,vmethod,vpar)

fmethod sets the type. Valid options are:

  • sbs – standard backward selection;

  • sabs – sensitivity analysis backward selection (faster);

  • sabsv – equal to sabs but uses variance for sensitivity importance measure;

  • sabsr – equal to sabs but uses range for sensitivity importance measure;

  • sabsg – equal to sabs (uses gradient for sensitivity importance measure);

deletions is the maximum number of feature deletions (if -1, it is not used).
Runs is the number of runs for each feature set evaluation (e.g. 1).
For help on vmethod and vpar see mining.
defaultsearch is one hyperparameter used during the feature selection search; after the best feature set is selected, search is then used (faster). If not defined, then search is used during the feature selection itself (may be slow).
When feature is a vector, then default values are used to fill missing or NA values. Note: feature selection capabilities are expected to be enhanced in future rminer versions.
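
For instance, a sketch of the c(fmethod,deletions,Runs,vmethod,vpar) vector form, with values chosen only for illustration (see also the feature="sabs" regression example in the Examples section):

  data(sa_ssin)
  M=fit(y~.,data=sa_ssin,model="ksvm",feature=c("sabs",-1,1,"holdout",2/3))
  print(M@attributes) # indexes of the selected attributes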

scale

if data needs to be scaled (i.e. for mlp or mlpe). Valid options are:

  • default – uses scaling when needed (i.e. for mlp or mlpe)

  • none – no scaling;

  • inputs – standardizes (0 mean, 1 st. deviation) input attributes;

  • all – standardizes (0 mean, 1 st. deviation) input and output attributes;

If needed, the predict function of rminer performs the inverse scaling.
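
A minimal sketch that makes the scaling choice explicit, standardizing both inputs and output (predictions from predict are then returned in the original scale):

  data(sa_ssin)
  M=fit(y~.,data=sa_ssin,model="mlpe",scale="all")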

transform

if the output data needs to be transformed (e.g. log transform). Valid options are:

  • none – no transform;

  • log – y=(log(y+1)) (the inverse function is applied in the predict function);

  • positive – all predictions are positive (negative values are turned into zero);

  • logpositive – both the log and positive transforms;
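
A small sketch of the log transform, assuming the synthetic skewed data frame d below (the model is fit on log(y+1) and predict applies the inverse):

  d=data.frame(x=1:100,y=exp(seq(0,5,length.out=100)))
  M=fit(y~x,data=d,model="mr",transform="log")
  P=predict(M,d) # predictions returned in the original y scale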

created

time stamp for the model. By default, the system time is used; alternatively, another time can be specified.

fdebug

if TRUE, shows some search details.

...

additional and specific parameters sent to each fit function model (e.g. dt, randomforest, kernlab). A few examples:
– the rpart function is used for decision trees, thus you can have:
control=rpart.control(cp=.05) (see crossvaldata example).
– the ksvm function is used for support vector machines, thus you can change the kernel type: kernel="polydot" (see examples below).
Important note: if you use package functions and get an error, then try to explicitly define the package. For instance, you might need to use fit(several-arguments,control=Cubist::cubistControl()) instead of
fit(several-arguments,control=cubistControl()).
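
For instance, a small sketch passing an rpart control directly through ... (with the explicit package definition advised above):

  data(iris)
  M=fit(Species~.,iris,model="rpart",control=rpart::rpart.control(cp=.05))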

Details

Fits a classification or regression model given a data.frame (see [Cortez, 2010] for more details). The ... optional arguments should be used to fix values used by specific model functions (see examples). Notes:
- if there is an error in the fit, then a warning is issued (see example).
- the new search argument is very flexible and allows a powerful design of supervised learning models.
- the correct use of search is highly dependent on the underlying R learning functions. For example, if you are tuning model="rpart", then carefully read the help of the rpart function.
- the mpar argument is kept only for compatibility and should be avoided; use the more flexible search instead.

Details about some models:

Value

Returns a model object. You can check all model elements with str(M), where M is a model object. The slots include, among others: @object (the fitted base model), @mpar (the model hyperparameters), @attributes (the selected attributes) and @levels (the output factor levels), as used in the Examples below.

Note

See also http://hdl.handle.net/1822/36210 and http://www3.dsi.uminho.pt/pcortez/rminer.html

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez

References

P. Cortez. Data Mining with Neural Networks and Support Vector Machines using the R/rminer Tool. In P. Perner (Ed.), Advances in Data Mining – Applications and Theoretical Aspects, 10th Industrial Conference on Data Mining (ICDM 2010), LNAI 6171, pp. 572-583, Berlin, Germany, July 2010. Springer.

See Also

mparheuristic, mining, predict.fit, mgraph, mmetric, savemining, CasesSeries, lforecast, holdout and Importance. Check all rminer functions using: help(package=rminer).

Examples

### dontrun is used when the execution of the example requires some computational effort.

### simple regression (with a formula) example.
x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
M=fit(y~x1+x2,model="mlpe")
new1=rnorm(100,100,20); new2=rnorm(100,100,20)
ynew=0.7*sin(new1/(25*pi))+0.3*sin(new2/(25*pi))
P=predict(M,data.frame(x1=new1,x2=new2,y=rep(NA,100)))
print(mmetric(ynew,P,"MAE"))

### simple classification example.
## Not run: 
data(iris)
M=fit(Species~.,iris,model="rpart")
plot(M@object); text(M@object) # show model
P=predict(M,iris)
print(mmetric(iris$Species,P,"CONF"))
print(mmetric(iris$Species,P,"ALL"))
mgraph(iris$Species,P,graph="ROC",TC=2,main="versicolor ROC",
baseline=TRUE,leg="Versicolor",Grid=10)

M2=fit(Species~.,iris,model="ctree")
plot(M2@object) # show model
P2=predict(M2,iris)
print(mmetric(iris$Species,P2,"CONF"))

# ctree with different setup:
# (ctree_control is from the party package)
M3=fit(Species~.,iris,model="ctree",controls = party::ctree_control(testtype="MonteCarlo"))
plot(M3@object) # show model

## End(Not run)

### simple binary classification example with cv.glmnet and xgboost
## Not run: 
data(sa_ssin_2)
H=holdout(sa_ssin_2$y,ratio=2/3)
# cv.glmnet:
M=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",task="class") # pure classes
P=predict(M,sa_ssin_2[H$ts,])
cat("1st prediction, class:",as.character(P[1]),"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P,"CONF")$conf)

M2=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet") # probabilities
P2=predict(M2,sa_ssin_2[H$ts,])
L=M2@levels
cat("1st prediction, prob:",L[1],"=",P2[1,1],",",L[2],"=",P2[1,2],"\n")
cat("Confusion matrix:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"CONF")$conf)
cat("AUC of ROC curve:\n")
print(mmetric(sa_ssin_2[H$ts,]$y,P2,"AUC"))

M3=fit(y~.,sa_ssin_2[H$tr,],model="cv.glmnet",nfolds=3) # use 3 folds instead of 10
plot(M3@object) # show cv.glmnet object
P3=predict(M3,sa_ssin_2[H$ts,])

# xgboost:
M4=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",verbose=1) # nrounds=2, show rounds:
P4=predict(M4,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P4,"AUC"))
M5=fit(y~.,sa_ssin_2[H$tr,],model="xgboost",nrounds=3,verbose=1) # nrounds=3, show rounds:
P5=predict(M5,sa_ssin_2[H$ts,])
print(mmetric(sa_ssin_2[H$ts,]$y,P5,"AUC"))

## End(Not run)

### classification example with discrete classes, probabilities and holdout
## Not run: 
data(iris)
H=holdout(iris$Species,ratio=2/3)
M=fit(Species~.,iris[H$tr,],model="ksvm",task="class")
M2=fit(Species~.,iris[H$tr,],model="ksvm",task="prob")
P=predict(M,iris[H$ts,])
P2=predict(M2,iris[H$ts,])
print(mmetric(iris$Species[H$ts],P,"CONF"))
print(mmetric(iris$Species[H$ts],P2,"CONF"))
print(mmetric(iris$Species[H$ts],P,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"CONF",TC=1))
print(mmetric(iris$Species[H$ts],P2,"AUC"))

### exploration of some rminer classification models:
models=c("lda","naiveBayes","kknn","randomForest","cv.glmnet","xgboost")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(Species~.,iris[H$tr,],model=m)
   P=predict(M,iris[H$ts,])
   print(mmetric(iris$Species[H$ts],P,"AUC")[[1]])
 }

## End(Not run)

### classification example with hyperparameter selection 
###    note: for regression, similar code can be used
### SVM 
## Not run: 
data(iris)
# large list of SVM configurations:
# SVM with kpar="automatic" sigma rbfdot kernel estimation and default C=1:
#  note: each execution can lead to different M@mpar due to sigest stochastic nature:
M=fit(Species~.,iris,model="ksvm")
print(M@mpar) # model hyperparameters/arguments
# same thing, explicit use of mparheuristic:
M=fit(Species~.,iris,model="ksvm",search=list(search=mparheuristic("ksvm")))
print(M@mpar) # model hyperparameters

# SVM with C=3, sigma=2^-7
M=fit(Species~.,iris,model="ksvm",C=3,kpar=list(sigma=2^-7))
print(M@mpar)
# SVM with different kernels:
M=fit(Species~.,iris,model="ksvm",kernel="polydot",kpar="automatic") 
print(M@mpar)
# fit already has a scale argument, thus the only way to fix scale of "tanhdot"
# is to use the special search argument with the "none" method:
s=list(smethod="none",search=list(scale=2,offset=2))
M=fit(Species~.,iris,model="ksvm",kernel="tanhdot",search=s) 
print(M@mpar)
# heuristic: 10 grid search values for sigma, rbfdot kernel (fdebug is used only for more verbose output):
s=list(search=mparheuristic("ksvm",10)) # advised "heuristic10" usage
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# same thing, uses older search="heuristic10" that works for fewer rminer models
M=fit(Species~.,iris,model="ksvm",search="heuristic10",fdebug=TRUE)
print(M@mpar)
# identical search using different and more explicit code:
s=list(search=2^seq(-15,3,2))
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)

# uniform design "UD" for sigma and C, rbfdot kernel, two levels of grid search, 
# under exponential (2^x) search scale:
M=fit(Species~.,iris,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)
M=fit(Species~.,iris,model="ksvm",search="UD1",fdebug=TRUE)
print(M@mpar)
M=fit(Species~.,iris,model="ksvm",search=2^seq(-15,3,2),fdebug=TRUE)
print(M@mpar)
# now the more powerful search argument is used for modeling SVM:
# grid 3 x 3 search:
s=list(smethod="grid",search=list(sigma=2^c(-15,-5,3),C=2^c(-5,0,15)),convex=0,
            metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# identical search with different argument smethod="matrix" 
s$smethod="matrix"
s$search=list(sigma=rep(2^c(-15,-5,3),times=3),C=rep(2^c(-5,0,15),each=3))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best kernel (only works for kpar="automatic"):
s=list(smethod="grid",search=list(kernel=c("rbfdot","laplacedot","polydot","vanilladot")),
       convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)
# search for best parameters of "rbfdot" or "laplacedot" (which use same kpar):
s$search=list(kernel=c("rbfdot","laplacedot"),sigma=2^seq(-15,3,5))
print(s)
M=fit(Species~.,iris,model="ksvm",search=s,fdebug=TRUE)
print(M@mpar)

### randomForest
# search for mtry and ntree
s=list(smethod="grid",search=list(mtry=c(1,2,3),ntree=c(100,200,500)),
            convex=0,metric="AUC",method=c("kfold",3,12345))
print(s)
M=fit(Species~.,iris,model="randomForest",search=s,fdebug=TRUE)
print(M@mpar)

### rpart
# simpler way to tune cp in 0.01 to 0.9 (10 searches):
s=list(search=mparheuristic("rpart",n=10,lower=0.01,upper=0.9),method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

# same thing but with more lines of code
# note: this code can be adapted to tune other rpart parameters,
#       while mparheuristic only tunes cp
# a vector list needs to be used for the search$search parameter
lcp=vector("list",10) # 10 grid values for the complexity cp
names(lcp)=rep("cp",10) # same cp name 
scp=seq(0.01,0.9,length.out=10) # 10 values from 0.01 to 0.9
for(i in 1:10) lcp[[i]]=scp[i] # cycle needed due to [[]] notation
s=list(smethod="grid",search=list(control=lcp),
            convex=0,metric="AUC",method=c("kfold",3,12345))
M=fit(Species~.,iris,model="rpart",search=s,fdebug=TRUE)
print(M@mpar)

### ctree 
# simpler way to tune mincriterion in 0.1 to 0.99 (9 searches):
mint=c("kfold",3,123) # internal validation method
s=list(search=mparheuristic("ctree",n=9,lower=0.1,upper=0.99),method=mint)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)
# same thing but with more lines of code
# note: this code can be adapted to tune other ctree parameters,
#       while mparheuristic only tunes mincriterion
# a vector list needs to be used for the search$search parameter
lmc=vector("list",9) # 9 grid values for the mincriterion
smc=seq(0.1,0.99,length.out=9)
for(i in 1:9) lmc[[i]]=party::ctree_control(mincriterion=smc[i]) 
s=list(smethod="grid",search=list(controls=lmc),method=mint,convex=0)
M=fit(Species~.,iris,model="ctree",search=s,fdebug=TRUE)
print(M@mpar)

### some MLP fitting examples:
# simplest use:
M=fit(Species~.,iris,model="mlpe")  
print(M@mpar)
# same thing, with explicit use of mparheuristic:
M=fit(Species~.,iris,model="mlpe",search=list(search=mparheuristic("mlpe")))
print(M@mpar) # hidden nodes and number of ensemble mlps
# setting some nnet parameters:
M=fit(Species~.,iris,model="mlpe",size=3,decay=0.1,maxit=100,rang=0.9) 
print(M@mpar) # mlpe hyperparameters
# MLPE, 5 value grid search (fdebug is only used to add some verbose output in the console):
s=list(search=mparheuristic("mlpe",n=5)) # 5 searches for size
print(s) # show search
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# previous searches used a random holdout (seed=NULL), now a fixed seed (123) is used:
s=list(smethod="grid",search=mparheuristic("mlpe",n=5),convex=0,metric="AUC",
            method=c("holdout",2/3,123))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# faster and greedy grid search:
s$convex=1;s$search=list(size=0:9)
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)
# 2-level grid with a total of 5 searches 
#  note of caution: some "2L" ranges may lead to non-integer (e.g. 1.3) values at
#  the 2nd level search, and some R functions crash if non-integer values are used
#  for integer parameters.
s$smethod="2L";s$convex=0;s$search=list(size=c(4,8,12))
print(s)
M=fit(Species~.,iris,model="mlpe",search=s,fdebug=TRUE)
print(M@mpar)

## End(Not run)

### example of an error (warning) generated using fit:
## Not run: 
data(iris)
# size needs to be a positive integer, thus 0.1 leads to an error:
M=fit(Species~.,iris,model="mlp",size=0.1)  
print(M@object)

## End(Not run)

### exploration of some rminer regression models:
## Not run: 
data(sa_ssin)
H=holdout(sa_ssin$y,ratio=2/3,seed=12345)
models=c("lm","mr","ctree","mars","cubist","cv.glmnet","xgboost","rvm")
for(m in models)
 { cat("model:",m,"\n") 
   M=fit(y~.,sa_ssin[H$tr,],model=m)
   P=predict(M,sa_ssin[H$ts,])
   print(mmetric(sa_ssin$y[H$ts],P,"MAE"))
 }

## End(Not run)

### regression example with hyperparameter selection:
## Not run: 
data(sa_ssin)
# some SVM experiments:
# default SVM:
M=fit(y~.,data=sa_ssin,model="svm")
print(M@mpar)
# SVM with (Cherkassky and Ma, 2004) heuristics to set C and epsilon:
M=fit(y~.,data=sa_ssin,model="svm",C=NA,epsilon=NA)
print(M@mpar)
# SVM with Uniform Design set sigma, C and epsilon:
M=fit(y~.,data=sa_ssin,model="ksvm",search="UD",fdebug=TRUE)
print(M@mpar)

# sensitivity analysis feature selection
M=fit(y~.,data=sa_ssin,model="ksvm",search=list(search=mparheuristic("ksvm",n=5)),feature="sabs") 
print(M@mpar)
print(M@attributes) # selected attributes (1, 2 and 3 are the relevant inputs)

# example that shows how transform works:
M=fit(y~.,data=sa_ssin,model="mr") # linear regression
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P should be negative
print(P)
M=fit(y~.,data=sa_ssin,model="mr",transform="positive")
P=predict(M,data.frame(x1=-1000,x2=0,x3=0,x4=0,y=NA)) # P is not negative
print(P)

## End(Not run)

### pure classification example with a generic R model ###
## Not run: 
### nnet is adopted here but virtually ANY fitting function/package could be used:

# since the default nnet prediction provides probabilities, there is
# a need to create this "wrapping" function that returns discrete classes:
predictclass=function(object,newdata)
{ predict(object,newdata,type="class") }
# list with a fit and predict function:
# nnet::nnet (package::function)
model=list(fit=nnet::nnet,predict=predictclass,name="nnet")
data(iris)
# note that size is not a fit parameter and it is sent directly to nnet:
M=fit(Species~.,iris,model=model,size=3,task="class") 
P=predict(M,iris)
print(P)

## End(Not run) 
