This package contains several ensemble learning algorithms based on the following papers:

- Gu, Q., & Han, J. (2013). Clustered Support Vector Machines. AISTATS.
- Hsieh, C.-J., Si, S., & Dhillon, I. S. (2014). A Divide-and-Conquer Solver for Kernel Support Vector Machines. ICML.
- Collobert, R., Bengio, S., & Bengio, Y. (2002). A Parallel Mixture of SVMs for Very Large Scale Problems. Neural Computation, 14(5).
The main ideas of these algorithms are:

- dividing the whole problem into smaller subproblems that are cheaper to solve, and
- combining several SVM models into a stronger one.

These two ideas address efficiency and accuracy respectively. Because more than one SVM model is used to solve the whole problem, the package is also an ensemble learning framework.
In this package, we use a small data set to demonstrate the usage of our functions. The data set is svmguide1 from libsvm's official website. The data was collected from an astroparticle application by Jan Conrad of Uppsala University, Sweden.
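If the package is not yet installed, it can be installed from CRAN first:

```r
install.packages("SwarmSVM")
```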
We can load it by:

```r
require(SwarmSVM)
data(svmguide1)
```
It is a list object. Let's first take a look at the training data.
```r
head(svmguide1[[1]])
```
The first column contains the classification target, while the other columns contain the features; it is a binary classification task. The second element of the list is the test set:
```r
head(svmguide1[[2]])
```
We rename them with the following commands:

```r
svmguide1.t = svmguide1[[2]]
svmguide1 = svmguide1[[1]]
```
From now on, we have the training data set svmguide1 and the test data set svmguide1.t.
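As a quick sanity check, we can look at the sizes of the two sets and the distribution of the target; this only uses base R on the objects created above:

```r
dim(svmguide1)       # training set: one row per sample, first column is the label
dim(svmguide1.t)     # test set has the same layout
table(svmguide1[,1]) # class balance of the binary target
```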
In our package, two of the main algorithms require a clustering algorithm for the input data. Therefore we provide some clustering algorithms for users to choose from. Users can also implement their own functions and pass them to our algorithms.
We currently provide wrappers for two clustering algorithms available in R:

- stats::kmeans, named "kmeans";
- kernlab::kkmeans, named "kernkmeans".

In clusterSVM and dcSVM, we offer an argument cluster.method; you can choose one of the two algorithms by passing its name to this argument.
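As a sketch of how the argument is used, the kernel k-means variant can be selected by name; this call mirrors the clusterSVM examples shown later in this vignette, and a dense matrix is used because kernlab generally expects one:

```r
# Sketch: selecting the built-in kernel k-means as the clustering step
csvm.kk = clusterSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1],
                     centers = 8, seed = 1, cluster.method = "kernkmeans",
                     valid.x = as.matrix(svmguide1.t[,-1]), valid.y = svmguide1.t[,1])
csvm.kk$valid.score
```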
We also offer arguments for users to pass their own implementation of the clustering algorithm: cluster.fun and cluster.predict.
cluster.fun is the clustering training function. It should be a function taking the data and the number of centers as its two main arguments. Its output should be a list object describing the clustering result, with two fields:

- object$cluster: the cluster labels assigned to the input data;
- object$centers: the matrix of cluster centers.

cluster.predict is the prediction function for the trained clustering object. It should be a function taking the data and the trained clustering object from cluster.fun as its two main arguments. Its output should simply be a vector of the cluster labels for the input data. A complete example is given in the section on self-defined clustering algorithms below.
clusterSVM

The algorithm is straightforward:

Training

1. Cluster the training data, by default with stats::kmeans.
2. Transform the features according to the clustering result and train a linear SVM on them with LiblineaR::LiblineaR.

Test

1. Predict the cluster label of each new observation.
2. Apply the same feature transformation and predict with the trained linear SVM.
We demonstrate the usage of this function with the following code:
```r
csvm.obj = clusterSVM(x = svmguide1[,-1], y = svmguide1[,1], type = 1,
                      valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1],
                      seed = 1, verbose = 1, centers = 8)
csvm.obj$valid.score
```
Here the parameters are grouped into four parts:

- x and y are the feature matrix and target vector of the training data, and type specifies the task and the type of the SVM.
- valid.x and valid.y are the feature matrix and target vector of the validation data.
- seed controls the random seed to make the result reproducible, and verbose controls the amount of output.
- centers is the parameter passed to the clustering algorithm.

Dense and sparse input
The sample data set is stored in a sparse matrix format:
```r
class(svmguide1)
```
The function accepts either a dense or a sparse matrix as the input feature matrix, so the following code gives the same result:
```r
csvm.obj = clusterSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], type = 1,
                      valid.x = as.matrix(svmguide1.t[,-1]), valid.y = svmguide1.t[,1],
                      seed = 1, verbose = 1, centers = 8)
csvm.obj$valid.score
```
Self-defined clustering algorithm
In clusterSVM, the clustering is a very important step, so we don't restrict users to the built-in clustering algorithms. Instead, we accept a user-defined clustering algorithm as an argument.
Note that we require the output of the clustering algorithm to contain two fields: centers and cluster. One example could be:
```r
cluster.fun = stats::kmeans
cluster.predict = function(x, cluster.object) {
  centers = cluster.object$centers
  # Squared Euclidean distance from every row of x to every center
  eucliDist = function(x, centers) {
    apply(centers, 1, function(C) colSums((t(x) - C)^2))
  }
  euclidean.dist = eucliDist(x, centers)
  # Assign each observation to its nearest center
  result = max.col(-euclidean.dist)
  return(result)
}
```
Here we use the default kmeans and implement the prediction function ourselves. Once we have defined the algorithm, it is straightforward to pass it to clusterSVM:
```r
csvm.obj = clusterSVM(x = svmguide1[,-1], y = svmguide1[,1], centers = 8, seed = 1,
                      cluster.fun = cluster.fun, cluster.predict = cluster.predict,
                      valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
csvm.obj$valid.score
```
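Beyond the validation score computed inside the call, the fitted object can also be used on new data. A minimal sketch, assuming the package registers a predict method for clusterSVM objects (the field name predictions is an assumption here):

```r
# Hedged sketch: predict labels for the test features with the fitted model
csvm.pred = predict(csvm.obj, svmguide1.t[,-1])
head(csvm.pred$predictions) # assumed field holding the predicted labels
```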
dcSVM

The algorithm can be described as follows:

Training

1. Divide the whole data set into k groups, then recursively divide each group into k finer groups, for at most max.levels times.
2. Train an SVM on each group at the lowest level.
3. For the j-th group on level l, we initialize its alpha (coefficient on support vector) values from the subgroups of the j-th group.
4. With the alpha values initialized in this way, refine the solution of each group, moving up one level at a time.
5. Finally, we obtain the exact alpha values by training an SVM on all the support vectors of the whole data set, starting from the initialized alpha values.

Test
There are two ways to do prediction:

- Exact prediction: predict with the final SVM trained on the support vectors of the whole data set.
- Early prediction: if training stops at level l, we predict the cluster label at level l for the new input data, and then predict with the SVM of that cluster.

We demonstrate the usage of this function with the following code:
```r
dcsvm.model = dcSVM(x = svmguide1[,-1], y = svmguide1[,1], k = 4, max.levels = 4,
                    seed = 0, cost = 32, gamma = 2, kernel = 3, early = 0, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
```
Here the parameters can be grouped into five parts:

- x and y are the feature matrix and target vector of the training data.
- valid.x and valid.y are the feature matrix and target vector of the validation data.
- seed controls the random seed to make the result reproducible.
- k and max.levels control the size of the subproblem tree.
- early specifies whether we use early prediction. If early = 0 we don't use the early prediction strategy; if early = l we perform early prediction at level l.

Early Prediction
We can do the early prediction by the following command:

```r
dcsvm.model = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], k = 10,
                    max.levels = 1, early = 1, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
dcsvm.model$time$total.time
```
This is faster because we can stop at an intermediate level, so we never need to train SVMs on the larger subproblems.
Exact Prediction
To make the model more accurate, we can instead perform the exact training by setting early = 0:

```r
dcsvm.model = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], k = 10,
                    max.levels = 1, early = 0, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.model$valid.score
dcsvm.model$time$total.time
```
This is more accurate, but it takes longer: it is a trade-off between accuracy and time complexity.
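To make the trade-off concrete, the two strategies can be fit side by side and compared through the valid.score and time fields shown above:

```r
# Early prediction (stop at level 1) versus exact prediction (early = 0)
dcsvm.early = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], k = 10,
                    max.levels = 1, early = 1, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
dcsvm.exact = dcSVM(x = as.matrix(svmguide1[,-1]), y = svmguide1[,1], k = 10,
                    max.levels = 1, early = 0, gamma = 2, cost = 32,
                    tolerance = 1e-2, m = 800,
                    valid.x = svmguide1.t[,-1], valid.y = svmguide1.t[,1])
# Accuracy versus wall-clock time for the two strategies
c(early = dcsvm.early$valid.score, exact = dcsvm.exact$valid.score)
c(early = dcsvm.early$time$total.time, exact = dcsvm.exact$time$total.time)
```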
gaterSVM

The algorithm can be described in the following iterative framework:

Training

1. Divide the training data into m subsets and train an SVM expert on each subset.
2. Train the gater, a neural network that weights the predictions of the experts, on the whole training data.
3. Reassign the training samples to the experts according to the output of the gater, and repeat for max.iter iterations.

Test

The prediction for a new observation is the gater-weighted combination of the experts' predictions.

We fit the model with the following code:
```r
gaterSVM.model = gaterSVM(x = svmguide1[,-1], y = svmguide1[,1], hidden = 10, seed = 0,
                          m = 10, max.iter = 3, learningrate = 0.01, threshold = 1,
                          stepmax = 1000, valid.x = svmguide1.t[,-1],
                          valid.y = svmguide1.t[,1], verbose = TRUE)
gaterSVM.model$valid.score
```
The parameters can be categorized into the following groups:
- x and y are the feature matrix and target vector of the training data.
- valid.x and valid.y are the feature matrix and target vector of the validation data.
- seed controls the random seed to make the result reproducible, and verbose controls the amount of output.
- m and max.iter control the iterations of the "experts-gater" process.
- hidden, learningrate, threshold and stepmax are parameters for the neural network model.
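As with the other models, the fitted object can be applied to new data. A minimal sketch, assuming a predict method is provided for gaterSVM objects:

```r
# Hedged sketch: prediction with the fitted mixture of SVM experts
gater.pred = predict(gaterSVM.model, svmguide1.t[,-1])
head(gater.pred)
```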
We offer benchmark code to compare the performance and efficiency of our implementation. You can find the scripts under inst/benchmark:

- utils.R contains helper functions to prepare the data.
- preprocess_data.R prepares the data for the other benchmark scripts.
- clustered_SVM.R contains simple experiments on different data sets.
- clustered_SVM_Repeat.R contains repeated experiments that measure the average performance of clusterSVM, compared against LiblineaR::LiblineaR and e1071::svm.
- dc_SVM.R contains experiments that measure the performance of dcSVM, compared against e1071::svm.
- gater_SVM.R contains experiments that measure the performance of gaterSVM, compared against e1071::svm.

For the experiments that run in a reasonable time, we have already generated some results. Those results are subject to change with the machine, the system environment and the implementation.
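Once the package is installed, a benchmark script can be located and sourced directly; this sketch relies only on the standard R convention that inst/benchmark is installed as benchmark/ inside the package directory:

```r
# Locate and run one of the installed benchmark scripts
# (the scripts may expect data prepared by preprocess_data.R)
script = system.file("benchmark", "clustered_SVM.R", package = "SwarmSVM")
source(script)
```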