mining | R Documentation |
Powerful function that trains and tests a particular fit model under several runs and a given validation method. Since there can be a huge number of models, the fitted models are not stored. Yet, several useful statistics (e.g. predictions) are returned.
mining(x, data = NULL, Runs = 1, method = NULL, model = "default",
task = "default", search = "heuristic", mpar = NULL,
feature="none", scale = "default", transform = "none",
debug = FALSE, ...)
x |
a symbolic description (formula) of the model to be fit. If |
data |
an optional data frame (columns denote attributes, rows show examples) containing the training data, when using a formula. |
Runs |
number of runs used (e.g. 1, 5, 10, 20, 30) |
method |
a vector with c(vmethod,vpar,seed) or c(vmethod,vpar,window,increment), where vmethod is:
vpar – number used by vmethod (optional, if not defined 2/3 for
|
model |
See |
task |
See |
search |
See |
mpar |
Only kept for compatibility with previous |
feature |
See
|
scale |
See |
transform |
See |
debug |
If TRUE shows some information about each run. |
... |
See |
Powerful function that trains and tests a particular fit model under several runs and a given validation method (see [Cortez, 2010] for more details).
Several Runs are performed. In each run, the same validation method is adopted (e.g. holdout) and several relevant statistics are stored.
Note: this function can require some computational effort, especially if a large dataset and/or a high number of Runs is adopted.
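The idea of repeating one validation method over several runs can be sketched in base R. This is a conceptual illustration only, not rminer's implementation: lm stands in for the fit model, and the names runs, ratio and res are hypothetical placeholders that loosely mirror the Runs argument, the holdout vpar value and the returned list.

```r
# Conceptual sketch only: repeated random holdout in base R (not rminer code).
set.seed(123)
x1 <- rnorm(200, 100, 20); x2 <- rnorm(200, 100, 20)
d <- data.frame(x1 = x1, x2 = x2)
d$y <- 0.7 * sin(x1 / (25 * pi)) + 0.3 * sin(x2 / (25 * pi))

runs <- 5        # hypothetical: plays the role of the Runs argument
ratio <- 2 / 3   # hypothetical: share of rows used for training in each run
res <- list(test = vector("list", runs), pred = vector("list", runs),
            error = numeric(runs), time = numeric(runs))
for (i in seq_len(runs)) {
  start <- proc.time()[["elapsed"]]
  tr <- sample(nrow(d), size = round(ratio * nrow(d)))  # random training rows
  fit_i <- lm(y ~ x1 + x2, data = d[tr, ])              # stand-in for the fit model
  res$test[[i]] <- d$y[-tr]                              # target values of the test rows
  res$pred[[i]] <- unname(predict(fit_i, newdata = d[-tr, ]))
  res$error[i] <- mean(abs(res$test[[i]] - res$pred[[i]]))  # MAE of this run
  res$time[i] <- proc.time()[["elapsed"]] - start
}
print(round(res$error, 4))  # one MAE value per run
```

Only the per-run statistics are kept; the fitted models themselves are discarded after each run, which matches the behaviour described for mining.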
A list with the components:
$object – fitted object values of the last run (used by multiple model fitting: "auto" mode). For "holdout", it is equal to a fit object, while for "kfold" it is a list.
$time – vector with the time elapsed for each run.
$test – vector list, where each element contains the test (target) results for each run.
$pred – vector list, where each element contains the predicted results for each test set and each run.
$error – vector with a (validation) measure (often an error value) according to search$metric for each run (valid options are explained in mmetric).
$mpar – vector list, where each element contains the fit model mpar parameters (for each run).
$model – the model.
$task – the task.
$method – the external validation method.
$sen – a matrix with the 1-D sensitivity analysis input importances. The number of rows is Runs times vpar if kfold is used, otherwise it is Runs.
$sresponses – a vector list with a size equal to the number of attributes (useful for graph="VEC"). Each element contains a list with the 1-D sensitivity analysis input responses (n – name of the attribute; l – number of levels; x – attribute values; y – 1-D sensitivity responses). Important note: sresponses (and "VEC" graphs) are only available if feature="sabs" or "simp" related (see feature).
$runs – the Runs.
$attributes – vector list with all attributes (features) selected in each run (and fold if kfold) if a feature selection algorithm is used.
$feature – the feature.
See also http://hdl.handle.net/1822/36210 and http://www3.dsi.uminho.pt/pcortez/rminer.html
Paulo Cortez http://www3.dsi.uminho.pt/pcortez/
To check for more details about rminer and for citation purposes:
P. Cortez.
Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool.
In P. Perner (Ed.), Advances in Data Mining - Applications and Theoretical Aspects 10th Industrial Conference on Data Mining (ICDM 2010), Lecture Notes in Artificial Intelligence 6171, pp. 572-583, Berlin, Germany, July, 2010. Springer. ISBN: 978-3-642-14399-1.
@Springer: https://link.springer.com/chapter/10.1007/978-3-642-14400-4_44
http://www3.dsi.uminho.pt/pcortez/2010-rminer.pdf
This tutorial shows additional code examples:
P. Cortez.
A tutorial on using the rminer R package for data mining tasks.
Teaching Report, Department of Information Systems, ALGORITMI Research Centre, Engineering School, University of Minho, Guimaraes,
Portugal, July 2015.
http://hdl.handle.net/1822/36210
For the grid search and other optimization methods:
P. Cortez.
Modern Optimization with R.
Use R! series, Springer, 2nd edition, July 2021, ISBN 978-3-030-72818-2.
https://link.springer.com/book/10.1007/978-3-030-72819-9
fit, predict.fit, mparheuristic, mgraph, mmetric, savemining, holdout and Importance.
### dontrun is used when the execution of the example requires some computational effort.
### simple regression example
set.seed(123); x1=rnorm(200,100,20); x2=rnorm(200,100,20)
y=0.7*sin(x1/(25*pi))+0.3*sin(x2/(25*pi))
# mining with an ensemble of neural networks, each fixed with size=2 hidden nodes
# assumes a default holdout (random split) with 2/3 for training and 1/3 for testing:
M=mining(y~x1+x2,Runs=2,model="mlpe",search=2)
print(M)
print(mmetric(M,metric="MAE"))
### more regression examples:
## Not run:
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# 5 runs of an external holdout with 2/3 for training and 1/3 for testing, fixed seed 12345
# feature selection: sabs method
# model selection: 5 searches for size, internal 2-fold cross validation fixed seed 123
# with optimization for minimum MAE metric
M=mining(y~.,data=sin1reg,Runs=5,method=c("holdout",2/3,12345),model="mlpe",
search=list(search=mparheuristic("mlpe",n=5),method=c("kfold",2,123),metric="MAE"),
feature="sabs")
print(mmetric(M,metric="MAE"))
print(M$mpar)
print("median hidden nodes (size) and number of MLPs (nr):")
print(centralpar(M$mpar))
print("attributes used by the model in each run:")
print(M$attributes)
mgraph(M,graph="RSC",Grid=10,main="sin1 MLPE scatter plot")
mgraph(M,graph="REP",Grid=10,main="sin1 MLPE REP",sort=FALSE)
mgraph(M,graph="REC",Grid=10,main="sin1 MLPE REC")
mgraph(M,graph="IMP",Grid=10,main="input importances",xval=0.1,leg=names(sin1reg))
# average influence of x1 on the model:
mgraph(M,graph="VEC",Grid=10,main="x1 VEC curve",xval=1,leg=names(sin1reg)[1])
## End(Not run)
### regression example with holdout rolling windows:
## Not run:
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
# rolling with 20 test samples, training window size of 300 and increment of 50 in each run:
# note that Runs argument is automatically set to 14 in this example:
M=mining(y~.,data=sin1reg,method=c("holdoutrol",20,300,50),
model="mlpe",debug=TRUE)
## End(Not run)
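The count of 14 runs in the rolling window example can be checked with a little arithmetic. The sketch below is base R only. rolling_windows is a hypothetical helper (not an rminer function), and it assumes that sin1reg has 1000 rows and that each window starts increment rows after the previous one.

```r
# Hypothetical helper (not part of rminer): enumerate rolling windows.
rolling_windows <- function(n, test, window, increment) {
  # last admissible start leaves room for the training window plus the test samples
  starts <- seq(1, n - window - test + 1, by = increment)
  lapply(starts, function(s) {
    list(train = s:(s + window - 1),
         test = (s + window):(s + window + test - 1))
  })
}

# Assumption: 1000 rows, as in sin1reg; 20 test samples, window 300, increment 50.
w <- rolling_windows(n = 1000, test = 20, window = 300, increment = 50)
cat("number of windows (runs):", length(w), "\n")
cat("first test rows:", range(w[[1]]$test), "\n")
cat("last test rows:", range(w[[length(w)]]$test), "\n")
```

Under these assumptions the window starts are 1, 51, ..., 651, which yields 14 windows and agrees with the Runs value mentioned in the example comment.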
### regression example with all rminer models:
## Not run:
# simple nonlinear regression task; x3 is a random variable and does not influence y:
data(sin1reg)
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","mr","mars",
"cubist","pcr","plsr","cppls","rvm")
for(model in models)
{
M=mining(y~.,data=sin1reg,method=c("holdout",2/3,12345),model=model)
cat("model:",model,"MAE:",round(mmetric(M,metric="MAE")$MAE,digits=3),"\n")
}
## End(Not run)
### classification example (task="prob")
## Not run:
data(iris)
# 10 runs of a 3-fold cross validation with fixed seed 123 for generating the 3-fold runs
M=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="rpart")
print(mmetric(M,metric="CONF"))
print(mmetric(M,metric="AUC"))
print(meanint(mmetric(M,metric="AUC")))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
main="Versicolor ROC")
mgraph(M,graph="LIFT",TC=2,baseline=TRUE,Grid=10,leg="Versicolor",
main="Versicolor LIFT")
M2=mining(Species~.,iris,Runs=10,method=c("kfold",3,123),model="ksvm")
L=vector("list",2)
L[[1]]=M;L[[2]]=M2
mgraph(L,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg=c("DT","SVM"),main="ROC")
## End(Not run)
### other classification examples
## Not run:
### 1st example:
data(iris)
# 2 runs of an external 2-fold validation, random seed
# model selection: SVM model with rbfdot kernel, automatic search for sigma,
# internal 3-fold validation, random seed, maximum "AUC" is assumed
# feature selection: none, "s" is used only to store input importance values
M=mining(Species~.,data=iris,Runs=2,method=c("kfold",2,NA),model="ksvm",
search=list(search=mparheuristic("ksvm"),method=c("kfold",3)),feature="s")
print(mmetric(M,metric="AUC",TC=2))
mgraph(M,graph="ROC",TC=2,baseline=TRUE,Grid=10,leg="SVM",main="ROC",intbar=FALSE)
mgraph(M,graph="IMP",TC=2,Grid=10,main="input importances",xval=0.1,
leg=names(iris),axis=1)
mgraph(M,graph="VEC",TC=2,Grid=10,main="Petal.Width VEC curve",
data=iris,xval=4)
### 2nd example, ordered kfold, k-nearest neighbor:
M=mining(Species~.,iris,Runs=1,method=c("kfoldo",3),model="knn")
# confusion matrix:
print(mmetric(M,metric="CONF"))
### 3rd example, use of all rminer models:
models=c("naive","ctree","rpart","kknn","mlp","mlpe","ksvm","randomForest","bagging",
"boosting","lda","multinom","naiveBayes","qda")
for(model in models)
{
M=mining(Species~.,iris,Runs=1,method=c("kfold",3,123),model=model)
cat("model:",model,"ACC:",round(mmetric(M,metric="ACC")$ACC,digits=1),"\n")
}
## End(Not run)
### multiple models: automl or ensembles
## Not run:
data(iris)
d=iris
names(d)[ncol(d)]="y" # change output name
inputs=ncol(d)-1
metric="AUC"
# simple automl (1 search per individual model),
# internal holdout and external holdout:
sm=mparheuristic(model="automl",n=NA,task="prob",inputs=inputs)
mode="auto"
imethod=c("holdout",4/5,123) # internal validation method
emethod=c("holdout",2/3,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# 1 single model was selected:
cat("best",emethod[1],"selected model:",M$object@model,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")
# simple automl (1 search per individual model),
# internal kfold and external kfold:
imethod=c("kfold",3,123) # internal validation method
emethod=c("kfold",5,567) # external validation method
search=list(search=sm,smethod=mode,method=imethod,metric=metric,convex=0)
M=mining(y~.,data=d,model="auto",search=search,method=emethod,fdebug=TRUE)
# kfold models were selected:
kfolds=as.numeric(emethod[2])
models=vector(length=kfolds)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")
# example with weighted ensemble:
M=mining(y~.,data=d,model="WE",search=search,method=emethod,fdebug=TRUE)
for(i in 1:kfolds) models[i]=M$object$model[[i]]
cat("best",emethod[1],"selected models:",models,"\n")
cat(metric,"=",round(as.numeric(mmetric(M,metric=metric)),2),"\n")
## End(Not run)
### for more fitting examples check the help of function fit: help(fit,package="rminer")