This package contains several ensemble learning algorithms based on the following papers:
The main ideas of these algorithms are:

These two ideas focus on efficiency and accuracy, respectively. In each case more than one SVM model is used to solve the whole problem, so the framework is also a form of ensemble learning.
In this package, we choose a small data set to demonstrate the usage of our functions. The data set is svmguide1
from libsvm's official website. The data is collected from an astroparticle application from Jan Conrad of Uppsala University, Sweden.
We can load it by
```r
require(SwarmSVM)
data(svmguide1)
```
It is a list object. Let's first take a look at the training data.
```r
head(svmguide1[[1]])
```
The first column contains the classification target value, the other columns contain the features. It is a binary classification task. The second part in the list is the test set:
```r
head(svmguide1[[2]])
```
We rename them with the following commands:

```r
svmguide1.t = svmguide1[[2]]
svmguide1 = svmguide1[[1]]
```

From now on, `svmguide1` is the training data set and `svmguide1.t` is the test data set.
In our package, two of the main algorithms require a clustering of the input data. Therefore we provide some clustering algorithms for users to choose from. Users can also implement their own function and pass it to our algorithms.
We currently provide two algorithms existing in R:

- `stats::kmeans`, named "kmeans";
- `kernlab::kkmeans`, named "kernkmeans".

In `clusterSVM` and `dcSVM`, we offer an argument `cluster.method`; you can choose one of the two algorithms and pass its name to this argument.
We also offer arguments for users to pass their own implementation of the clustering algorithm: `cluster.fun` and `cluster.predict`.

- `cluster.fun` is the clustering training function. It takes the data and the number of centers as its two main arguments. Its output should be a list object describing the clustering result, with two fields:
  - `object$cluster`: the cluster label for each row of the input data.
  - `object$centers`: the matrix of cluster centers.
- `cluster.predict` is the prediction function for the trained clustering object. It takes the data and the trained clustering object from `cluster.fun` as its two main arguments. Its output should simply be a vector of the cluster labels for the input data.
The algorithm is straightforward:

Training:

1. Cluster the training data, by default with `stats::kmeans`.
2. Train linear SVMs on the clustered data with `LiblineaR::LiblineaR`.

Test:

1. Assign each test point to a cluster, then predict with the corresponding linear model.
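The training steps above can be sketched as follows. This is a simplified illustration, not the package's actual implementation: the real `clusterSVM` additionally couples the per-cluster models through a shared global regularizer, and the function `sketch.train` below is a hypothetical name.

```r
# Simplified cluster-then-linear-SVM sketch: a k-means partition, then one
# independent linear SVM per cluster. (The actual clusterSVM also shares a
# global regularizer across clusters.)
library(LiblineaR)

sketch.train = function(x, y, centers = 8) {
  x = as.matrix(x)
  km = stats::kmeans(x, centers = centers)          # step 1: cluster the data
  models = lapply(seq_len(centers), function(j) {   # step 2: one SVM per cluster
    idx = which(km$cluster == j)
    LiblineaR(data = x[idx, , drop = FALSE], target = y[idx], type = 1)
  })
  list(kmeans = km, models = models)
}
```

At prediction time one would assign each new point to its nearest center and apply that cluster's linear model.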
We demonstrate the usage of this function with the following code:
```r
csvm.obj = clusterSVM(x = svmguide1[,-1], y = svmguide1[,1], type = 1,
                      valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1],
                      seed = 1, verbose = 1, centers = 8)
csvm.obj$valid.score
```
Here the parameters are grouped into four parts:

- `x` and `y` are the feature matrix and target vector of the training data, and `type` specifies the type of SVM to train.
- `valid.x` and `valid.y` are the feature matrix and target vector of the validation data.
- `seed` controls the random seed to make the result reproducible, and `verbose` controls the amount of output.
- `centers` is the parameter passed to the clustering algorithm.

Dense and sparse input
The sample data set is stored as a sparse matrix.

```r
class(svmguide1)
```

The function accepts either a dense or a sparse matrix as the input feature matrix, so the following code gives the same result:

```r
csvm.obj = clusterSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], type = 1,
                      valid.x = as.matrix(svmguide1.t[,-1]), valid.y = svmguide1.t[,1],
                      seed = 1, verbose = 1, centers = 8)
csvm.obj$valid.score
```
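The fitted object can also be used to score new, unseen data. A minimal sketch, assuming the package provides a `predict` method for `clusterSVM` objects (check `?predict.clusterSVM`; the exact return structure may differ):

```r
# Hedged sketch: predict on new data with a trained clusterSVM object.
# Assumes csvm.obj was fitted as in the examples above.
csvm.pred = predict(csvm.obj, newdata = svmguide1.t[,-1])
str(csvm.pred)  # inspect the returned structure before using it
```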
Self-defined clustering algorithm
In `clusterSVM`, the clustering is a very important step, so we don't restrict users to the `RcppMLPACK::mlKmeans` algorithm. Instead, we accept a user-defined clustering algorithm as an argument.
Note that we require the output of the clustering algorithm to contain two fields: `centers` and `cluster`. One example could be:
```r
cluster.fun = stats::kmeans
cluster.predict = function(x, cluster.object) {
  centers = cluster.object$centers
  eucliDist = function(x, centers)
    apply(centers, 1, function(C) colSums((t(x) - C)^2))
  euclidean.dist = eucliDist(x, centers)
  result = max.col(-euclidean.dist)
  return(result)
}
```
Here we use the default kmeans and implement the prediction function ourselves. Once the algorithm is defined, it is straightforward to pass it to `clusterSVM`:
```r
csvm.obj = clusterSVM(x = svmguide1[,-1], y = svmguide1[,1], centers = 8, seed = 1,
                      cluster.fun = cluster.fun, cluster.predict = cluster.predict,
                      valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
csvm.obj$valid.score
```
The algorithm could be described as follows:

Training:

1. Divide the training data into `k` groups, and recursively divide each group into `k` finer groups, for up to `max.levels` times.
2. At the lowest level, train an SVM on each group.
3. Moving up one level, for the `j`-th group on level `l` we collect the `alpha` (coefficient on support vector) values from the subgroups of the `j`-th group, and train an SVM on the group with its `alpha` values initialized from them.
4. At the top level, refine the `alpha` values by training an SVM on all the support vectors of the whole data set, starting from the initialized `alpha` values.
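As a rough illustration of the divide-and-conquer idea (not the package's actual implementation, which warm-starts the `alpha` values across levels), one level of the hierarchy could be sketched with `e1071::svm`:

```r
# One-level divide-and-conquer sketch using e1071::svm. The real dcSVM
# warm-starts alpha across levels; e1071 exposes no warm start, so this
# sketch only reuses the union of the subproblems' support vectors.
library(e1071)
x = as.matrix(svmguide1[,-1])
y = as.factor(svmguide1[,1])
km = stats::kmeans(x, centers = 4)
sv.idx = unlist(lapply(1:4, function(j) {
  idx = which(km$cluster == j)                   # the j-th subproblem
  fit = svm(x[idx, ], y[idx], kernel = "radial")
  idx[fit$index]                                 # its support vector indices
}))
# Refine by training once more on all collected support vectors.
final = svm(x[sv.idx, ], y[sv.idx], kernel = "radial")
```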
Test:

There are two ways to do prediction:

- Exact prediction: predict with the final SVM trained at the top level.
- Early prediction at level `l`: predict the cluster label of the new input data at level `l`, then predict with the SVM trained on that cluster.

We demonstrate the usage of this function with the following code:
```r
dcsvm.model = dcSVM(x = svmguide1[,-1], y = svmguide1[,1],
                    k = 4, max.levels = 4, seed = 0, cost = 32, gamma = 2,
                    kernel = 3, early = 0, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
```
Here the parameters can be grouped into five parts:

- `x` and `y` are the feature matrix and target vector of the training data.
- `valid.x` and `valid.y` are the feature matrix and target vector of the validation data.
- `seed` controls the random seed to make the result reproducible.
- `k` and `max.levels` control the size of the subproblem tree.
- `early` specifies whether we use early prediction. If `early = 0` we don't use the early prediction strategy; if `early = l` we perform early prediction at level `l`.

Early Prediction
We can do the early prediction by the following command:
```r
dcsvm.model = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1],
                    k = 10, max.levels = 1, early = 1, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
dcsvm.model$time$total.time
```
It is faster because we can stop at a chosen level and don't need to train SVMs on larger portions of the data.
Exact Prediction
To make the model more accurate, we can instead perform exact training and prediction by setting `early = 0`:

```r
dcsvm.model = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1],
                    k = 10, max.levels = 1, early = 0, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
dcsvm.model$time$total.time
```
This is more accurate but takes more time; it is a balance between accuracy and time complexity.
The algorithm can be described in the following iterative framework:

Training:

1. Divide the training data into `m` subsets and train an expert SVM on each subset.
2. Train a gater neural network to weight the predictions of the experts.
3. Reassign the data points to subsets according to the gater's weighting, and repeat for up to `max.iter` iterations.

Test:

1. Each expert SVM predicts on the new data, and the gater combines the experts' predictions into the final output.
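Under this framework, the combination step can be sketched as below. The weights are taken as given here; in `gaterSVM` they come from the trained neural network, and `combine.experts` is a hypothetical helper name.

```r
# Conceptual sketch of the gater combination: the final prediction is a
# gater-weighted sum of the expert SVMs' outputs.
combine.experts = function(expert.preds, gater.weights) {
  # expert.preds:  n x m matrix of expert outputs in {-1, +1}
  # gater.weights: n x m matrix of gater weights, each row summing to 1
  sign(rowSums(expert.preds * gater.weights))
}
```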
```r
gaterSVM.model = gaterSVM(x = svmguide1[,-1], y = svmguide1[,1], hidden = 10,
                          seed = 0, m = 10, max.iter = 3, learningrate = 0.01,
                          threshold = 1, stepmax = 1000,
                          valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1],
                          verbose = TRUE)
gaterSVM.model$valid.score
```
The parameters can be categorized into the following groups:

- `x` and `y` are the feature matrix and target vector of the training data.
- `valid.x` and `valid.y` are the feature matrix and target vector of the validation data.
- `seed` controls the random seed to make the result reproducible, and `verbose` controls the amount of output.
- `m` and `max.iter` control the iterations of the "experts-gater" process.
- `hidden`, `learningrate`, `threshold` and `stepmax` are parameters for the neural network model.

We offer benchmark code to compare the performance and efficiency of our implementation. You can find it under `inst/benchmark`.
- `utils.R` contains helper functions to prepare the data.
- `preprocess_data.R` prepares the data for the other benchmark scripts.
- `clustered_SVM.R` contains simple experiments on different data sets.
- `clustered_SVM_Repeat.R` contains repeated experiments measuring the average performance of `clusterSVM`, compared against `LiblineaR::LiblineaR` and `e1071::svm`.
- `dc_SVM.R` contains experiments measuring the performance of `dcSVM`, compared against `e1071::svm`.
- `gater_SVM.R` contains experiments measuring the performance of `gaterSVM`, compared against `e1071::svm`.

For the experiments that run in a reasonable time, we have already generated some results. These results are subject to changes in the machine, system environment and implementation.