Nothing
#' @title Cross-validated grid search for GenSVM
#'
#' @description This function performs a cross-validated grid search of the
#' model parameters to find the best hyperparameter configuration for a given
#' dataset. This function takes advantage of GenSVM's ability to use warm
#' starts to speed up computation. The function uses the GenSVM C library for
#' speed.
#'
#' @param x training data matrix. We denote the size of this matrix by
#' n_samples x n_features.
#' @param y training vector of class labels of length n_samples. The number of
#' unique labels in this vector is denoted by n_classes.
#' @param param.grid String (\code{'tiny'}, \code{'small'}, or \code{'full'})
#' or data frame with parameter configurations to evaluate. Typically this is
#' the output of \code{expand.grid}. For more details, see "Using a Parameter
#' Grid" below.
#' @param refit boolean variable. If true, the best model from cross validation
#' is fitted again on the entire dataset.
#' @param scoring metric to use to evaluate the classifier performance during
#' cross validation. The metric should be an R function that takes two
#' arguments: y_true and y_pred and that returns a float such that higher
#' values are better. If it is NULL, the accuracy score will be used.
#' @param cv the number of cross-validation folds to use or a vector with the
#' same length as \code{y} where each unique value denotes a test split.
#' @param verbose integer to indicate the level of verbosity (higher is more
#' verbose)
#' @param return.train.score whether or not to return the scores on the
#' training splits
#'
#' @return A "gensvm.grid" S3 object with the following items:
#' \item{call}{Call that produced this object}
#' \item{param.grid}{Sorted version of the parameter grid used in training}
#' \item{cv.results}{A data frame with the cross validation results}
#' \item{best.estimator}{If refit=TRUE, this is the GenSVM model fitted with
#' the best hyperparameter configuration, otherwise it is NULL}
#' \item{best.score}{Mean cross-validated test score for the model with the
#' best hyperparameter configuration}
#' \item{best.params}{Parameter configuration that provided the highest mean
#' cross-validated test score}
#' \item{best.index}{Row index of the cv.results data frame that corresponds to
#' the best hyperparameter configuration}
#' \item{n.splits}{The number of cross-validation splits}
#' \item{n.objects}{The number of instances in the data}
#' \item{n.features}{The number of features of the data}
#' \item{n.classes}{The number of classes in the data}
#' \item{classes}{Array with the unique classes in the data}
#' \item{total.time}{Training time for the grid search}
#' \item{cv.idx}{Array with cross validation indices used to split the data}
#'
#' @section Using a Parameter Grid:
#' To evaluate certain parameter configurations, a data frame can be supplied
#' to the \code{param.grid} argument of the function. Such a data frame can
#' easily be generated using the R function \code{expand.grid}, or could be
#' created through other ways to test specific parameter configurations.
#'
#' Three parameter grids are predefined:
#' \describe{
#' \item{\code{'tiny'}}{This parameter grid is generated by the function
#' \code{\link{gensvm.load.tiny.grid}} and is the default parameter grid. It
#' consists of parameter configurations that are likely to perform well on
#' various datasets.}
#' \item{\code{'small'}}{This grid is generated by
#' \code{\link{gensvm.load.small.grid}} and generates a data frame with 90
#' configurations. It is typically fast to train but contains some
#' configurations that are unlikely to perform well. It is included for
#' educational purposes.}
#' \item{\code{'full'}}{This grid loads the parameter grid as used in the
#' GenSVM paper. It consists of 342 configurations and is generated by the
#' \code{\link{gensvm.load.full.grid}} function. Note that in the GenSVM paper
#' cross validation was done with this parameter grid, but the final training
#' step used \code{epsilon=1e-8}. The \code{\link{gensvm.refit}} function is
#' useful in this scenario.}
#' }
#'
#' When you provide your own parameter grid, beware that only certain column
#' names are allowed in the data frame corresponding to parameters for the
#' GenSVM model. These names are:
#'
#' \describe{
#' \item{p}{Parameter for the lp norm. Must be in [1.0, 2.0].}
#' \item{kappa}{Parameter for the Huber hinge function. Must be larger than
#' -1.}
#' \item{lambda}{Parameter for the regularization term. Must be larger than 0.}
#' \item{weights}{Instance weights specification. Allowed values are "unit" for
#' unit weights and "group" for group-size correction weights}
#' \item{epsilon}{Stopping parameter for the algorithm. Must be larger than 0.}
#' \item{max.iter}{Maximum number of iterations of the algorithm. Must be
#' larger than 0.}
#' \item{kernel}{The kernel to used, allowed values are "linear", "poly",
#' "rbf", and "sigmoid". The default is "linear"}
#' \item{coef}{Parameter for the "poly" and "sigmoid" kernels. See the section
#' "Kernels in GenSVM" in the code{ink{gensvm-package}} page for more info.}
#' \item{degree}{Parameter for the "poly" kernel. See the section "Kernels in
#' GenSVM" in the code{ink{gensvm-package}} page for more info.}
#' \item{gamma}{Parameter for the "poly", "rbf", and "sigmoid" kernels. See the
#' section "Kernels in GenSVM" in the code{ink{gensvm-package}} page for more
#' info.}
#' }
#'
#' For variables that are not present in the \code{param.grid} data frame the
#' default parameter values in the \code{\link{gensvm}} function will be used.
#'
#' Note that this function reorders the parameter grid to make the warm starts
#' as efficient as possible, which is why the param.grid in the result will not
#' be the same as the param.grid in the input.
#'
#' @note
#' 1. This function returns partial results when the computation is interrupted
#' by the user.
#' 2. The score.time reported in the results only covers the time needed to
#' compute the score from the predictions and true class labels. It does not
#' include the time to compute the predictions themselves.
#'
#' @author
#' Gerrit J.J. van den Burg, Patrick J.F. Groenen \cr
#' Maintainer: Gerrit J.J. van den Burg <gertjanvandenburg@gmail.com>
#'
#' @references
#' Van den Burg, G.J.J. and Groenen, P.J.F. (2016). \emph{GenSVM: A Generalized
#' Multiclass Support Vector Machine}, Journal of Machine Learning Research,
#' 17(225):1--42. URL \url{https://jmlr.org/papers/v17/14-526.html}.
#'
#' @seealso
#' \code{\link{predict.gensvm.grid}}, \code{\link{print.gensvm.grid}},
#' \code{\link{plot.gensvm.grid}}, \code{\link{gensvm}},
#' \code{\link{gensvm-package}}
#'
#' @export
#'
#' @importFrom stats sd
#'
#' @examples
#' x <- iris[, -5]
#' y <- iris[, 5]
#'
#' \donttest{
#' # use the default parameter grid
#' grid <- gensvm.grid(x, y, verbose=TRUE)
#' }
#'
#' # use a smaller parameter grid
#' pg <- expand.grid(p=c(1.0, 1.5, 2.0), kappa=c(-0.9, 1.0), epsilon=c(1e-3))
#' grid <- gensvm.grid(x, y, param.grid=pg)
#'
#' # print the result
#' print(grid)
#'
#' \donttest{
#' # Using a custom scoring function (accuracy as percentage)
#' acc.pct <- function(yt, yp) { return (100 * sum(yt == yp) / length(yt)) }
#' grid <- gensvm.grid(x, y, scoring=acc.pct)
#'
#' # With RBF kernel and very verbose progress printing
#' pg <- expand.grid(kernel=c('rbf'), gamma=c(1e-2, 1e-1, 1, 1e1, 1e2),
#' lambda=c(1e-8, 1e-6), max.iter=c(5000))
#' grid <- gensvm.grid(x, y, param.grid=pg, verbose=2)
#' }
#'
gensvm.grid <- function(x, y, param.grid='tiny', refit=TRUE, scoring=NULL, cv=3,
verbose=0, return.train.score=TRUE)
{
call <- match.call()
n.objects <- nrow(x)
n.features <- ncol(x)
n.classes <- length(unique(y))
if (n.objects != length(y)) {
cat("Error: x and y are not the same length.\n")
return(invisible(NULL))
}
if (is.character(param.grid)) {
if (param.grid == 'tiny') {
param.grid <- gensvm.load.tiny.grid()
} else if (param.grid == 'small') {
param.grid <- gensvm.load.small.grid()
} else if (param.grid == 'full') {
param.grid <- gensvm.load.full.grid()
}
}
# Validate the range of the values for the gridsearch
if (!gensvm.validate.param.grid(param.grid))
return(invisible(NULL))
# Sort the parameter grid for efficient warm starts
param.grid <- gensvm.sort.param.grid(param.grid)
# Expand and convert the parameter grid for use in the C function
C.param.grid <- gensvm.expand.param.grid(param.grid, n.features)
# Convert labels to integers
classes <- sort(unique(y))
y.clean <- match(y, classes)
if (is.vector(cv) && length(cv) == n.objects) {
folds <- sort(unique(cv))
cv.idx <- match(cv, folds) - 1
n.splits <- length(folds)
} else {
cv.idx <- gensvm.generate.cv.idx(n.objects, cv[1])
n.splits <- cv
}
results <- .Call("R_gensvm_grid",
data.matrix(x),
as.integer(y.clean),
as.matrix(C.param.grid),
as.integer(nrow(C.param.grid)),
as.integer(ncol(C.param.grid)),
as.integer(cv.idx),
as.integer(n.splits),
as.integer(verbose),
as.integer(n.objects),
as.integer(n.features),
as.integer(n.classes)
)
cv.results <- gensvm.cv.results(results, param.grid, cv.idx, y.clean,
scoring,
return.train.score=return.train.score)
best.index <- which.min(cv.results$rank.test.score)[1]
if (!is.na(best.index)) { # can occur when user interrupts
best.score <- cv.results$mean.test.score[best.index]
best.params <- param.grid[best.index, , drop=F]
# Remove duplicate attributes from best.params
attr(best.params, "out.attrs") <- NULL
} else {
best.score <- NA
best.params <- list()
}
if (refit && !is.na(best.index)) {
gensvm.args <- as.list(best.params)
# Stupid factors...
if ("weights" %in% names(gensvm.args))
gensvm.args$weights <- as.character(gensvm.args$weights)
if ("kernel" %in% names(gensvm.args))
gensvm.args$kernel <- as.character(gensvm.args$kernel)
gensvm.args$x <- x
gensvm.args$y <- y
gensvm.args$verbose <- if(verbose>1) 1 else 0
if (verbose > 1)
cat("Refitting with best parameters ...\n")
best.estimator <- do.call(gensvm, gensvm.args)
} else {
best.estimator <- NULL
}
object <- list(call = call, param.grid = param.grid,
cv.results = cv.results, best.estimator = best.estimator,
best.score = best.score, best.params = best.params,
best.index = best.index, n.splits = n.splits,
n.objects = n.objects, n.features = n.features,
n.classes = n.classes, classes = classes,
total.time = results$total.time, cv.idx = cv.idx)
class(object) <- "gensvm.grid"
return(object)
}
#' @title Load a tiny parameter grid for the GenSVM grid search
#'
#' @description This function returns a parameter grid to use in the GenSVM
#' grid search. This grid was obtained by analyzing the experiments done for
#' the GenSVM paper and selecting the configurations that achieve accuracy
#' within the 95th percentile on over 90% of the datasets. It is a good start
#' for a parameter search with a reasonably high chance of achieving good
#' performance on most datasets.
#'
#' Note that this grid is only tested to work well in combination with the
#' linear kernel.
#'
#' @author
#' Gerrit J.J. van den Burg, Patrick J.F. Groenen \cr
#' Maintainer: Gerrit J.J. van den Burg <gertjanvandenburg@gmail.com>
#'
#' @references
#' Van den Burg, G.J.J. and Groenen, P.J.F. (2016). \emph{GenSVM: A Generalized
#' Multiclass Support Vector Machine}, Journal of Machine Learning Research,
#' 17(225):1--42. URL \url{https://jmlr.org/papers/v17/14-526.html}.
#'
#' @export
#'
#' @seealso
#' \code{\link{gensvm.grid}}, \code{\link{gensvm.load.small.grid}},
#' \code{\link{gensvm.load.full.grid}}.
#'
gensvm.load.tiny.grid <- function()
{
df <- data.frame(
p=c(2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.5, 2.0, 2.0),
kappa=c(5.0, 5.0, 0.5, 5.0, -0.9, 5.0, 0.5, -0.9, 0.5, 0.5),
lambda=c(2^-16, 2^-18, 2^-18, 2^-18, 2^-18, 2^-14, 2^-18,
2^-18, 2^-16, 2^-16),
weights=c('unit', 'unit', 'unit', 'group', 'unit',
'unit', 'group', 'unit', 'unit', 'group')
)
return(df)
}
#' @title Load a large parameter grid for the GenSVM grid search
#'
#' @description This loads the parameter grid from the GenSVM paper. It
#' consists of 342 configurations and is constructed from all possible
#' combinations of the following parameter sets:
#'
#' \code{p = c(1.0, 1.5, 2.0)}
#' \code{lambda = 2^seq(-18, 18, 2)}
#' \code{kappa = c(-0.9, 0.5, 5.0)}
#' \code{weights = c('unit', 'group')}
#'
#' @author
#' Gerrit J.J. van den Burg, Patrick J.F. Groenen \cr
#' Maintainer: Gerrit J.J. van den Burg <gertjanvandenburg@gmail.com>
#'
#' @references
#' Van den Burg, G.J.J. and Groenen, P.J.F. (2016). \emph{GenSVM: A Generalized
#' Multiclass Support Vector Machine}, Journal of Machine Learning Research,
#' 17(225):1--42. URL \url{https://jmlr.org/papers/v17/14-526.html}.
#'
#' @export
#'
#' @seealso
#' \code{\link{gensvm.grid}}, \code{\link{gensvm.load.tiny.grid}},
#' \code{\link{gensvm.load.full.grid}}.
#'
gensvm.load.full.grid <- function()
{
df <- expand.grid(p=c(1.0, 1.5, 2.0), lambda=2^seq(-18, 18, 2),
kappa=c(-0.9, 0.5, 5.0), weights=c('unit', 'group'),
epsilon=c(1e-6))
return(df)
}
#' @title Load the small parameter grid for the GenSVM grid search
#'
#' @description This function loads a small parameter grid to use for the
#' GenSVM gridsearch. It contains all possible combinations of the following
#' parameter sets:
#'
#' \code{p = c(1.0, 1.5, 2.0)}
#' \code{lambda = c(1e-8, 1e-6, 1e-4, 1e-2, 1)}
#' \code{kappa = c(-0.9, 0.5, 5.0)}
#' \code{weights= c('unit', 'group')}
#'
#' @author
#' Gerrit J.J. van den Burg, Patrick J.F. Groenen \cr
#' Maintainer: Gerrit J.J. van den Burg <gertjanvandenburg@gmail.com>
#'
#' @references
#' Van den Burg, G.J.J. and Groenen, P.J.F. (2016). \emph{GenSVM: A Generalized
#' Multiclass Support Vector Machine}, Journal of Machine Learning Research,
#' 17(225):1--42. URL \url{https://jmlr.org/papers/v17/14-526.html}.
#'
#' @export
#'
#' @seealso
#' \code{\link{gensvm.grid}}, \code{\link{gensvm.load.tiny.grid}},
#' \code{\link{gensvm.load.small.grid}}.
#'
gensvm.load.small.grid <- function()
{
df <- expand.grid(p=c(1.0, 1.5, 2.0), lambda=c(1e-8, 1e-6, 1e-4, 1e-2, 1),
kappa=c(-0.9, 0.5, 5.0), weights=c('unit', 'group'))
return(df)
}
#' @title Generate a vector of cross-validation indices
#'
#' @description
#' This function generates a vector of length \code{n} with values from 0 to
#' \code{folds-1} to mark train and test splits.
#'
#' @param n the number of instances
#' @param folds the number of cross validation folds
#'
#' @return an array of length \code{n} with values in the range [0, folds-1]
#' indicating the test fold of each instance.
#'
#' @author
#' Gerrit J.J. van den Burg, Patrick J.F. Groenen \cr
#' Maintainer: Gerrit J.J. van den Burg <gertjanvandenburg@gmail.com>
#'
#' @references
#' Van den Burg, G.J.J. and Groenen, P.J.F. (2016). \emph{GenSVM: A Generalized
#' Multiclass Support Vector Machine}, Journal of Machine Learning Research,
#' 17(225):1--42. URL \url{https://jmlr.org/papers/v17/14-526.html}.
#'
#' @seealso
#' \code{\link{gensvm.grid}}
gensvm.generate.cv.idx <- function(n, folds)
{
cv.idx <- matrix(0, n, 1)
big.folds <- n %% folds
small.fold.size <- n %/% folds
j <- 0
for (i in 0:(small.fold.size * folds)) {
while (TRUE) {
idx <- round(runif(1, 1, n))
if (cv.idx[idx] == 0) {
cv.idx[idx] <- j
j <- j + 1
j <- (j %% folds)
break
}
}
}
j <- 1
i <- 0
while (i < big.folds) {
if (cv.idx[j] == 0) {
cv.idx[j] <- i
i <- i + 1
}
j <- j + 1
}
return(cv.idx)
}
gensvm.cv.results <- function(results, param.grid, cv.idx, y.true,
scoring, return.train.score=TRUE)
{
n.candidates <- nrow(param.grid)
n.splits <- length(unique(cv.idx))
score <- if(is.function(scoring)) scoring else gensvm.accuracy
# Build names and initialize the data.frame
names <- c("mean.fit.time", "mean.score.time", "mean.test.score")
if (return.train.score)
names <- c(names, "mean.train.score")
for (param in names(param.grid)) {
names <- c(names, sprintf("param.%s", param))
}
names <- c(names, "rank.test.score")
for (idx in sort(unique(cv.idx))) {
names <- c(names, sprintf("split%i.test.score", idx))
if (return.train.score)
names <- c(names, sprintf("split%i.train.score", idx))
}
names <- c(names, "std.fit.time", "std.score.time", "std.test.score")
if (return.train.score)
names <- c(names, "std.train.score")
df <- data.frame(matrix(ncol=length(names), nrow=n.candidates))
colnames(df) <- names
for (pidx in 1:n.candidates) {
param <- param.grid[pidx, , drop=F]
durations <- results$durations[pidx, ]
predictions <- results$predictions[pidx, ]
fit.times <- durations
score.times <- c()
test.scores <- c()
train.scores <- c()
is.missing <- any(is.na(durations))
for (test.idx in sort(unique(cv.idx))) {
score.time <- 0
if (return.train.score) {
y.train.pred <- predictions[cv.idx != test.idx]
y.train.true <- y.true[cv.idx != test.idx]
start.time <- proc.time()
train.score <- score(y.train.true, y.train.pred)
stop.time <- proc.time()
score.time <- score.time + (stop.time - start.time)[3]
train.scores <- c(train.scores, train.score)
}
y.test.pred <- predictions[cv.idx == test.idx]
y.test.true <- y.true[cv.idx == test.idx]
start.time <- proc.time()
test.score <- score(y.test.true, y.test.pred)
stop.time <- proc.time()
score.time <- score.time + (stop.time - start.time)[3]
test.scores <- c(test.scores, test.score)
score.times <- c(score.times, score.time)
}
df$mean.fit.time[pidx] <- mean(fit.times)
df$mean.score.time[pidx] <- if(is.missing) NA else mean(score.times)
df$mean.test.score[pidx] <- mean(test.scores)
df$std.fit.time[pidx] <- sd(fit.times)
df$std.score.time[pidx] <- if(is.missing) NA else sd(score.times)
df$std.test.score[pidx] <- sd(test.scores)
if (return.train.score) {
df$mean.train.score[pidx] <- mean(train.scores)
df$std.train.score[pidx] <- sd(train.scores)
}
for (parname in names(param.grid)) {
header <- sprintf("param.%s", parname)
val <- param[[parname]]
if (is.factor(val))
val <- levels(val)[val]
df[[header]][pidx] <- val
}
j <- 1
for (test.idx in sort(unique(cv.idx))) {
lbl <- sprintf("split%i.test.score", test.idx)
df[[lbl]][pidx] <- test.scores[j]
if (return.train.score) {
lbl <- sprintf("split%i.train.score", test.idx)
df[[lbl]][pidx] <- train.scores[j]
}
j <- j + 1
}
}
df$rank.test.score <- gensvm.rank.score(df$mean.test.score)
return(df)
}
gensvm.sort.param.grid <- function(param.grid)
{
all.cols <- c("kernel", "coef", "degree", "gamma", "weights", "kappa",
"lambda", "p", "epsilon", "max.iter")
order.args <- NULL
for (name in all.cols) {
if (name %in% colnames(param.grid)) {
if (name == "epsilon")
order.args <- cbind(order.args, -param.grid[[name]])
else
order.args <- cbind(order.args, param.grid[[name]])
}
}
sorted.pg <- param.grid[do.call(order, as.list(as.data.frame(order.args))), ]
rownames(sorted.pg) <- NULL
return(sorted.pg)
}
gensvm.expand.param.grid <- function(pg, n.features)
{
if ("kernel" %in% colnames(pg)) {
all.kernels <- c("linear", "poly", "rbf", "sigmoid")
pg$kernel <- match(pg$kernel, all.kernels) - 1
} else {
pg$kernel <- 0
}
if ("weights" %in% colnames(pg)) {
all.weights <- c("unit", "group")
pg$weights <- match(pg$weights, all.weights)
} else {
pg$weights <- 1
}
if ("gamma" %in% colnames(pg)) {
pg$gamma[pg$gamma == "auto"] <- 1.0/n.features
} else {
pg$gamma <- 1.0/n.features
}
if (!("degree" %in% colnames(pg)))
pg$degree <- 2.0
if (!("coef" %in% colnames(pg)))
pg$coef <- 0.0
if (!("p" %in% colnames(pg)))
pg$p <- 1.0
if (!("lambda" %in% colnames(pg)))
pg$lambda <- 1e-8
if (!("kappa" %in% colnames(pg)))
pg$kappa <- 0.0
if (!("epsilon" %in% colnames(pg)))
pg$epsilon <- 1e-6
if (!("max.iter" %in% colnames(pg)))
pg$max.iter <- 1e8
C.param.grid <- data.frame(kernel=pg$kernel, coef=pg$coef,
degree=pg$degree, gamma=pg$gamma,
weights=pg$weights, kappa=pg$kappa,
lambda=pg$lambda, p=pg$p, epsilon=pg$epsilon,
max.iter=pg$max.iter)
return(C.param.grid)
}
#' @title Compute the ranks for the numbers in a given vector
#'
#' @description This function computes the ranks for the values in an array.
#' The highest value gets the smallest rank. Ties are broken by assigning the
#' smallest value. The smallest rank is 1.
#'
#' @param x array of numeric values
#'
#' @return array with the ranks of the values in the input array.
#'
gensvm.rank.score <- function(x)
{
x <- as.array(x)
l <- length(x)
r <- 1
ranks <- as.vector(matrix(0, l, 1))
ranks[which(is.na(x))] <- NA
while (!all(mapply(is.na, x))) {
m <- max(x, na.rm=T)
idx <- which(x == m)
ranks[idx] <- r
r <- r + length(idx)
x[idx] <- NA
}
return(ranks)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.