To provide a unified, formula-based interface to various machine learning algorithms, these functions wrap a common interface around several existing implementations.
mlearning(formula, data, method, model.args, call = match.call(), ...,
subset, na.action = na.fail)
## S3 method for class 'mlearning'
print(x, ...)
## S3 method for class 'mlearning'
summary(object, ...)
## S3 method for class 'summary.mlearning'
print(x, ...)
## S3 method for class 'mlearning'
plot(x, y, ...)
## S3 method for class 'mlearning'
predict(object, newdata, type = c("class", "membership", "both"),
method = c("direct", "cv"), na.action = na.exclude, ...)
cvpredict(object, ...)
## S3 method for class 'mlearning'
cvpredict(object, type = c("class", "membership", "both"),
cv.k = 10, cv.strat = TRUE, ...)
mlLda(...)
## Default S3 method:
mlLda(train, response, ...)
## S3 method for class 'formula'
mlLda(formula, data, ..., subset, na.action)
## S3 method for class 'mlLda'
predict(object, newdata, type = c("class", "membership", "both",
"projection"), prior = object$prior, dimension,
method = c("plugin", "predictive", "debiased", "cv"), ...)
mlQda(...)
## Default S3 method:
mlQda(train, response, ...)
## S3 method for class 'formula'
mlQda(formula, data, ..., subset, na.action)
## S3 method for class 'mlQda'
predict(object, newdata, type = c("class", "membership", "both"),
prior = object$prior, method = c("plugin", "predictive", "debiased",
"looCV", "cv"), ...)
mlRforest(...)
## Default S3 method:
mlRforest(train, response, ntree = 500, mtry, replace = TRUE, classwt = NULL, ...)
## S3 method for class 'formula'
mlRforest(formula, data, ntree = 500, mtry, replace = TRUE, classwt = NULL, ...,
subset, na.action)
## S3 method for class 'mlRforest'
predict(object, newdata, type = c("class", "membership", "both",
"vote"), method = c("direct", "oob", "cv"), ...)
mlNnet(...)
## Default S3 method:
mlNnet(train, response, size = NULL, rang = NULL, decay = 0, maxit = 1000, ...)
## S3 method for class 'formula'
mlNnet(formula, data, size = NULL, rang = NULL, decay = 0, maxit = 1000, ...,
subset, na.action)
mlLvq(...)
## Default S3 method:
mlLvq(train, response, k.nn = 5, size, prior, algorithm = "olvq1", ...)
## S3 method for class 'formula'
mlLvq(formula, data, k.nn = 5, size, prior, algorithm = "olvq1", ...,
subset, na.action)
## S3 method for class 'lvq'
summary(object, ...)
## S3 method for class 'summary.lvq'
print(x, ...)
## S3 method for class 'mlLvq'
predict(object, newdata, type = "class", method = c("direct", "cv"),
na.action = na.exclude, ...)
mlSvm(...)
## Default S3 method:
mlSvm(train, response, scale = TRUE, type = NULL, kernel = "radial",
classwt = NULL, ...)
## S3 method for class 'formula'
mlSvm(formula, data, scale = TRUE, type = NULL, kernel = "radial",
classwt = NULL, ..., subset, na.action)
## S3 method for class 'mlSvm'
predict(object, newdata, type = c("class", "membership", "both"),
method = c("direct", "cv"), na.action = na.exclude, ...)
mlNaiveBayes(...)
## Default S3 method:
mlNaiveBayes(train, response, laplace = 0, ...)
## S3 method for class 'formula'
mlNaiveBayes(formula, data, laplace = 0, ..., subset, na.action)
response(object, ...)
## Default S3 method:
response(object, ...)
train(object, ...)
## Default S3 method:
train(object, ...)
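
The formula methods above all take formula, data, subset and na.action. As a rough illustration of what such arguments conventionally mean in R, the sketch below resolves them with base R's model.frame(). It is not code from the package, and the object names (dat, rows, mf, resp, preds) are made up for the example.

```r
## Illustration in base R only; this is NOT how the package is implemented
dat <- iris
dat[1, 1] <- NA                    # introduce one missing value
rows <- c(1:34, 51:83, 101:133)    # cases to keep as the training set

## The formula selects the variables, 'subset' the cases, and 'na.action'
## decides what happens to incomplete cases (here they are dropped)
mf <- model.frame(Species ~ ., data = dat, subset = rows,
                  na.action = na.omit)

resp  <- model.response(mf)        # left term: the factor to predict
preds <- mf[, -1, drop = FALSE]    # right term: the predictive variables

nrow(mf)       # 99: the 100 selected cases minus the one containing NA
levels(resp)
names(preds)
```

With na.action = na.fail, the default shown for mlearning(), the same call would stop with an error instead of silently dropping the incomplete case.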

formula 
a formula with left term being the factor variable to predict
(for supervised classification), a vector of numbers (for regression) or
nothing (for unsupervised classification) and the right term with the list
of independent, predictive variables, separated with a plus sign. If the
data frame provided contains only the dependent and independent variables,
one can use the 
data 
a data.frame to use as a training set. 
method 
a machine learning method to use. For 
model.args 
arguments for formula modeling with substituted data and subset... Not to be used by the end-user. 
call 
the function call. Not to be used by the end-user. 
... 
further arguments passed to the machine learning algorithm or
the 
subset 
index vector with the cases to define the training set in use (this argument must be named, if provided). 
na.action 
function to specify the action to be taken if NAs are found

cv.k 
k for k-fold cross-validation, cf 
cv.strat 
whether the subsampling is stratified in cross-validation,
cf 
x 
an mlearning object. 
y 
another object (depending on the machine learning algorithm, but it is usually not used). 
object 
one of the mlearning objects. 
newdata 
a data.frame with same variables as 
type 
the type of result to get. Usually, 
train 
a matrix or data frame with predictors. 
response 
a factor (classification) or a numeric vector (regression),
or 
prior 
prior probabilities of the classes (the proportions in the
training set are used by default). For 
dimension 
the dimension of the space to be used for prediction. 
ntree 
the number of trees to generate (use a value large enough to get at least a few predictions for each input row). 
mtry 
number of variables randomly sampled as candidates at each split. 
replace 
sample cases with or without replacement? 
classwt 
priors of the classes. Need not add up to one. 
size 
number of units in the hidden layer for 
rang 
initial random weights on [-rang, rang]. Value about 0.5 unless
the inputs are large, in which case it should be chosen so that
rang * max(x) is about 1. If 
decay 
parameter for weight decay. Default 0. 
maxit 
maximum number of iterations. Default 1000. 
k.nn 
k used for kNN test of correct classification. Default is 5. 
algorithm 
an algorithm among 'olvq1' (default, the optimized lvq1), 'lvq1', 'lvq2', or 'lvq3'. 
scale 
are all the variables scaled? If a vector is provided, it is applied to variables with recycling. 
kernel 
the kernel used by svm, see 
laplace 
positive double controlling Laplace smoothing for the naive Bayes classifier. The default (0) disables Laplace smoothing. 
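
To make cv.k and cv.strat more concrete, here is a minimal sketch of how cases can be assigned to k folds with or without stratification. The helper make_folds() is hypothetical and written for this illustration only; the way the package actually builds its folds is not shown here and may differ.

```r
## Hypothetical helper (not part of the package): assign each case to one of
## k folds, optionally keeping class proportions similar across folds
make_folds <- function(response, k = 10, stratified = TRUE) {
  n <- length(response)
  folds <- integer(n)
  if (stratified) {
    ## Shuffle the fold labels separately within each class
    for (idx in split(seq_len(n), response)) {
      folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
    }
  } else {
    ## Shuffle the fold labels over all cases at once
    folds <- sample(rep_len(seq_len(k), n))
  }
  folds
}

set.seed(42)
folds <- make_folds(iris$Species, k = 10, stratified = TRUE)
## With 50 cases per species and 10 folds, each fold gets 5 cases per species
table(folds, iris$Species)
```

Without stratification every fold still receives 15 cases here, but the number of cases per species in a fold varies from one random draw to the next.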
TODO: explain here the mechanism used to provide a common interface on top of various existing algorithms, and how one can add new items.
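
Pending that explanation, the usage section suggests a pattern: a generic taking only ..., a default method with a (train, response) interface, and a formula method. The sketch below reproduces that pattern in base R for a toy nearest-centroid classifier. All names (mlCentroid and its methods) are hypothetical, the subset and na.action arguments are left out for brevity, and nothing here should be read as the package's actual implementation.

```r
## Hypothetical names throughout: mlCentroid is NOT a function of the package
mlCentroid <- function(...) UseMethod("mlCentroid")

## Default method: the (train, response) interface
mlCentroid.default <- function(train, response, ...) {
  train <- as.data.frame(train)
  response <- as.factor(response)
  ## One row of column means per class, in the order of the factor levels
  centers <- do.call(rbind, lapply(split(train, response), colMeans))
  structure(list(centers = centers, levels = levels(response)),
            class = "mlCentroid")
}

## Formula method: resolve the formula, then delegate to the default method
mlCentroid.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data = data)
  mlCentroid.default(train = mf[, -1, drop = FALSE],
                     response = model.response(mf), ...)
}

## predict() method: assign each case to the class with the closest centre
predict.mlCentroid <- function(object, newdata, ...) {
  x <- as.matrix(newdata[, colnames(object$centers), drop = FALSE])
  ## Squared Euclidean distance from every case to every class centre
  d <- matrix(unlist(lapply(seq_len(nrow(object$centers)), function(i)
    rowSums(sweep(x, 2, object$centers[i, ])^2))), nrow = nrow(x))
  factor(object$levels[max.col(-d, ties.method = "first")],
         levels = object$levels)
}

fit1 <- mlCentroid(Species ~ ., data = iris)                     # formula
fit2 <- mlCentroid(train = iris[, -5], response = iris$Species)  # default
all.equal(fit1$centers, fit2$centers)
table(predict(fit1, newdata = iris), iris$Species)  # self-consistency only
```

Under this design, a new algorithm would need its own generic, a default method and a formula method, plus a predict() method for the resulting class.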
A machine learning object to which the predict() method can be applied to classify new items.
For response() and train(), the respective response vector and training matrix (the matrix with all predicting terms).
All these functions are just wrappers around existing R code, written by Philippe Grosjean <Philippe.Grosjean@umons.ac.be> in order to provide a similar interface and similar objects. All credit goes to the original authors (see the functions listed below).
confusion, errorest, lda, qda, randomForest, olvq1, nnet, naiveBayes, svm
## Prepare data: split into training set (2/3) and test set (1/3)
data("iris", package = "datasets")
train <- c(1:34, 51:83, 101:133)
irisTrain <- iris[train, ]
irisTest <- iris[-train, ]
## One case with missing data in train set, and another case in test set
irisTrain[1, 1] <- NA
irisTest[25, 2] <- NA
data("HouseVotes84", package = "mlbench")
data(airquality, package = "datasets")
## Supervised classification using linear discriminant analysis
irLda <- mlLda(Species ~ ., data = irisTrain)
irLda
summary(irLda)
plot(irLda, col = as.numeric(response(irLda)) + 1)
predict(irLda, newdata = irisTest) # class (default type)
predict(irLda, type = "membership") # posterior probability
predict(irLda, type = "both") # both class and membership in a list
## Sometimes, other types are allowed, like for lda:
predict(irLda, type = "projection") # Projection on the LD axes
## Add test set items to the previous plot
points(predict(irLda, newdata = irisTest, type = "projection"),
col = as.numeric(predict(irLda, newdata = irisTest)) + 1, pch = 19)
## predict() and confusion() should be used on a separate test set
## for unbiased estimation (or using cross-validation, bootstrap, ...)
confusion(irLda) # Wrong, cf. biased estimation (so-called self-consistency)
## Estimation using a separate test set
confusion(predict(irLda, newdata = irisTest), irisTest$Species)
## Another dataset (binary predictor... not optimal for lda, just for test)
summary(res <- mlLda(Class ~ ., data = HouseVotes84, na.action = na.omit))
confusion(res) # Self-consistency
print(confusion(res), error.col = FALSE) # Without error column
## More complex formulas
summary(mlLda(Species ~ . - Sepal.Width, data = iris)) # Exclude variable
summary(mlLda(Species ~ log(Petal.Length) + log(Petal.Width) +
I(Petal.Length/Sepal.Length), data = iris)) # With calculations
## Factor levels with missing items are allowed
ir2 <- iris[-(51:100), ] # No Iris versicolor in the training set
summary(res <- mlLda(Species ~ ., data = ir2)) # versicolor is NOT there
## Missing levels are reinjected in class or membership by predict()
predict(res, type = "both")
## ... but, of course, the classifier is wrong for Iris versicolor
confusion(predict(res, newdata = iris), iris$Species)
## Simpler interface, but more memory-effective
summary(mlLda(train = iris[, -5], response = iris$Species))
## Supervised classification using quadratic discriminant analysis
summary(res <- mlQda(Species ~ ., data = irisTrain))
confusion(res) # Self-consistency
confusion(predict(res, newdata = irisTest), irisTest$Species) # Performances
## Another dataset (binary predictor... not optimal for qda, just for test)
summary(res <- mlQda(Class ~ ., data = HouseVotes84, na.action = na.omit))
confusion(res) # Self-consistency
## Supervised classification using random forest
summary(res <- mlRforest(Species ~ ., data = irisTrain))
plot(res)
## For such a relatively simple case, 50 trees are enough
summary(res <- mlRforest(Species ~ ., data = irisTrain, ntree = 50))
predict(res) # Default type is class
predict(res, type = "membership")
predict(res, type = "both")
predict(res, type = "vote")
## Out-of-bag prediction
predict(res, method = "oob")
confusion(res) # Self-consistency
confusion(res, method = "oob") # Out-of-bag performances
## Cross-validation prediction is a good choice when there is no test set:
predict(res, method = "cv") # Idem: cvpredict(res)
confusion(res, method = "cv") # Cross-validation for performances estimation
## Evaluation of performances using a separate test set
confusion(predict(res, newdata = irisTest), irisTest$Species) # Test set perfs
## Regression using random forest (from ?randomForest)
set.seed(131)
summary(ozone.rf <- mlRforest(Ozone ~ ., data = airquality, mtry = 3,
importance = TRUE, na.action = na.omit))
## Show "importance" of variables: higher values mean more important:
round(randomForest::importance(ozone.rf), 2)
plot(na.omit(airquality)$Ozone, predict(ozone.rf))
abline(a = 0, b = 1)
## Unsupervised classification using random forest (from ?randomForest)
set.seed(17)
summary(iris.urf <- mlRforest(~ ., iris[, -5]))
randomForest::MDSplot(iris.urf, iris$Species)
plot(hclust(as.dist(1 - iris.urf$proximity), method = "average"),
labels = iris$Species)
## Supervised classification using neural network
set.seed(689)
summary(res <- mlNnet(Species ~ ., data = irisTrain))
predict(res) # Default type is class
predict(res, type = "membership")
predict(res, type = "both")
confusion(res) # Self-consistency
confusion(predict(res, newdata = irisTest), irisTest$Species) # Test set perfs
## Idem, but two-class prediction using factor predictors
set.seed(325)
summary(res <- mlNnet(Class ~ ., data = HouseVotes84, na.action = na.omit))
confusion(res) # Self-consistency
## Regression using neural network
set.seed(34)
summary(ozone.nnet <- mlNnet(Ozone ~ ., data = airquality, na.action = na.omit,
skip = TRUE, decay = 1e-3, size = 20, linout = TRUE))
plot(na.omit(airquality)$Ozone, predict(ozone.nnet))
abline(a = 0, b = 1)
## Supervised classification using learning vector quantization
summary(res <- mlLvq(Species ~ ., data = irisTrain))
predict(res) # This object only returns class
confusion(res) # Self-consistency
confusion(predict(res, newdata = irisTest), irisTest$Species) # Test set perfs
## Supervised classification using support vector machine
summary(res <- mlSvm(Species ~ ., data = irisTrain))
predict(res) # Default type is class
predict(res, type = "membership")
predict(res, type = "both")
confusion(res) # Self-consistency
confusion(predict(res, newdata = irisTest), irisTest$Species) # Test set perfs
## Another dataset
summary(res <- mlSvm(Class ~ ., data = HouseVotes84, na.action = na.omit))
confusion(res) # Self-consistency
## Regression using support vector machine
summary(ozone.svm <- mlSvm(Ozone ~ ., data = airquality, na.action = na.omit))
plot(na.omit(airquality)$Ozone, predict(ozone.svm))
abline(a = 0, b = 1)
## Supervised classification using naive Bayes
summary(res <- mlNaiveBayes(Species ~ ., data = irisTrain))
predict(res) # Default type is class
predict(res, type = "membership")
predict(res, type = "both")
confusion(res) # Self-consistency
confusion(predict(res, newdata = irisTest), irisTest$Species) # Test set perfs
## Another dataset
summary(res <- mlNaiveBayes(Class ~ ., data = HouseVotes84, na.action = na.omit))
confusion(res) # Self-consistency
