mismm: Fit MILD-SVM model to the data

mismm R Documentation

Description

This function fits the MILD-SVM model, which takes a multiple-instance learning with distributions (MILD) data set and fits a modified SVM to it. The MILD-SVM methodology is based on research in progress.

Usage

## Default S3 method:
mismm(
  x,
  y,
  bags,
  instances,
  cost = 1,
  method = c("heuristic", "mip", "qp-heuristic"),
  weights = TRUE,
  control = list(kernel = "radial", sigma = if (is.vector(x)) 1 else 1/ncol(x),
    nystrom_args = list(m = nrow(x), r = nrow(x), sampling = "random"), max_step = 500,
    scale = TRUE, verbose = FALSE, time_limit = 60, start = FALSE),
  ...
)

## S3 method for class 'formula'
mismm(formula, data, ...)

## S3 method for class 'mild_df'
mismm(x, ...)

Arguments

x

A data.frame, matrix, or similar object of covariates, where each row represents a sample. If a mild_df object is passed, y, bags, instances are automatically extracted, and all other columns will be used as predictors.

y

A numeric, character, or factor vector of bag labels for each instance. Must satisfy length(y) == nrow(x). It is recommended that one of the levels be 1, '1', or TRUE, which becomes the positive class; otherwise, a positive class is chosen and a message is displayed.

bags

A vector specifying which bag each instance belongs to. Can be a string, numeric, or factor.

instances

A vector specifying which instance each sample belongs to. Can be a string, numeric, or factor.

cost

The cost parameter in SVM. If method = 'heuristic', this will be fed to kernlab::ksvm(); otherwise it is used similarly in internal functions.

method

The algorithm to use in fitting (default 'heuristic'). When method = 'heuristic', the algorithm iterates between selecting positive witnesses and solving an underlying smm() problem. When method = 'mip', the novel MIP method will be used. When method = 'qp-heuristic', the heuristic algorithm is computed using a slightly modified dual SMM. See Details.

weights

A named vector, or TRUE, to control the weight of the cost parameter for each possible y value. Weights multiply against the cost vector. If TRUE, weights are calculated based on inverse counts of instances with a given label, where only one positive instance is counted per bag. Otherwise, names must match the levels of y.

control

A list of additional parameters passed to the method that control computation, with the following components:

  • kernel either a character that describes the kernel ('linear' or 'radial') or a kernel matrix at the instance level.

  • sigma argument needed for radial basis kernel.

  • nystrom_args a list of parameters to pass to kfm_nystrom(). This is used when method = 'mip' and kernel = 'radial' to generate a Nystrom approximation of the kernel features.

  • max_step argument used when method = 'heuristic'. Maximum steps of iteration for the heuristic algorithm.

  • scale argument used for all methods. A logical for whether to rescale the input before fitting.

  • verbose argument used when method = 'mip'. Whether to message output to the console.

  • time_limit argument used when method = 'mip'. FALSE, or a time limit (in seconds) passed to gurobi() parameters. If FALSE, no time limit is given.

  • start argument used when method = 'mip'. If TRUE, the MIP program will be warm-started with the solution from method = 'qp-heuristic' to potentially improve speed.

...

Arguments passed to or from other methods.

formula

A formula with specification mild(y, bags, instances) ~ x which uses the mild function to create the bag-instance structure. This argument is an alternative to the x, y, bags, instances arguments, but requires the data argument. See examples.

data

If formula is provided, a data.frame or similar from which formula elements will be extracted.
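
The row alignment of x, y, bags, and instances, the default sigma, and one possible reading of weights = TRUE can be sketched in base R. The toy data and the exact weight formula below (inverse of the number of negative instances, and inverse of the number of positive bags) are assumptions for illustration, not the package's internal code.

```r
# Toy MILD layout: 3 bags, 2 instances per bag, 2 samples per instance.
bags      <- rep(c("bag1", "bag2", "bag3"), each = 4)
instances <- paste0(bags, "_inst", rep(rep(1:2, each = 2), times = 3))
y         <- rep(c(1, 0, 0), each = 4)   # bag label repeated on every row
x         <- matrix(seq_len(12 * 3), nrow = 12, ncol = 3)

# Every structure vector is aligned with the rows of x.
stopifnot(length(y) == nrow(x),
          length(bags) == nrow(x),
          length(instances) == nrow(x))

# Default sigma expression copied from the Usage block.
sigma <- if (is.vector(x)) 1 else 1 / ncol(x)

# Assumed reading of weights = TRUE: inverse counts per label,
# counting one positive instance for each positive bag.
inst_label <- tapply(y, instances, unique)   # one label per instance
n_pos <- length(unique(bags[y == 1]))        # one witness per positive bag
n_neg <- sum(inst_label == 0)                # all instances in negative bags
w <- c("0" = 1 / n_neg, "1" = 1 / n_pos)     # names match the levels of y
print(w)
```

A user-supplied weights vector would follow the same naming convention, with one entry per level of y.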

Details

Several choices of fitting algorithm are available, including a version of the heuristic algorithm proposed by Andrews et al. (2003) and a novel algorithm that explicitly solves the mixed-integer programming (MIP) problem using the gurobi package as the optimization back-end.
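
The alternation behind method = 'heuristic' (select a witness instance in each positive bag, refit, reselect) can be sketched in base R. This is a simplified illustration on toy data: a difference-of-means scorer on instance-level mean embeddings stands in for the smm() solve, and all names (emb, witness, inst_bag) are hypothetical.

```r
set.seed(1)
# Toy data: 4 bags x 3 instances x 5 samples, 2 features.
n_bag <- 4; n_inst <- 3; n_samp <- 5
bags      <- rep(paste0("b", 1:n_bag), each = n_inst * n_samp)
instances <- paste0(bags, "_i", rep(rep(1:n_inst, each = n_samp), times = n_bag))
y_bag     <- c(b1 = 1, b2 = 1, b3 = 0, b4 = 0)
x         <- matrix(rnorm(length(bags) * 2), ncol = 2)

# Shift one instance in each positive bag so it differs from the rest.
shifted <- instances %in% c("b1_i1", "b2_i1")
x[shifted, ] <- x[shifted, ] + 3

# Instance-level mean embeddings (one row per instance).
emb      <- do.call(rbind, lapply(split(as.data.frame(x), instances), colMeans))
inst_bag <- bags[match(rownames(emb), instances)]
inst_y   <- y_bag[inst_bag]

pos_bags <- names(y_bag)[y_bag == 1]
neg_idx  <- which(inst_y == 0)

# Start with the first instance of each positive bag as its witness.
witness  <- sapply(pos_bags, function(b) which(inst_bag == b)[1])
max_step <- 500

for (step in seq_len(max_step)) {
  # Stand-in for the SVM solve: direction from negatives to current witnesses.
  w <- colMeans(emb[witness, , drop = FALSE]) -
       colMeans(emb[neg_idx, , drop = FALSE])
  score <- drop(emb %*% w)

  # Reselect the highest-scoring instance in each positive bag.
  new_witness <- sapply(pos_bags, function(b) {
    idx <- which(inst_bag == b)
    idx[which.max(score[idx])]
  })

  if (all(new_witness == witness)) break   # selection is stable
  witness <- new_witness
}

rownames(emb)[witness]   # selected witness instances
step                     # number of steps taken
```

The loop stops once the selected witnesses no longer change or max_step is reached, which mirrors the role of the max_step entry in control.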

Value

An object of class mismm. The object contains at least the following components:

  • *_fit: A fit object depending on the method parameter. If method = 'heuristic', this will be a ksvm fit from the kernlab package. If method = 'mip', this will be a gurobi_fit from a model optimization.

  • call_type: A character indicating which method mismm() was called with.

  • x: The training data needed for computing the kernel matrix in prediction.

  • features: The names of features used in training.

  • levels: The levels of y that are recorded for future prediction.

  • cost: The cost parameter from function inputs.

  • weights: The calculated weights on the cost parameter.

  • sigma: The radial basis function kernel parameter.

  • repr_inst: The instances from positive bags that are selected to be most representative of the positive instances.

  • n_step: If method %in% c('heuristic', 'qp-heuristic'), the total steps used in the heuristic algorithm.

  • useful_inst_idx: The instances that were selected to represent the bags in the heuristic fitting.

  • inst_order: A character vector that is used to modify the ordering of input data.

  • x_scale: If scale = TRUE, the scaling parameters for new predictions.
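
How stored scaling parameters can be reused on new data is sketched below in base R. The layout of x_scale as a list with center and scale entries is an assumption made for this illustration; the actual structure inside a mismm object may differ.

```r
set.seed(42)
train <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2,
                dimnames = list(NULL, c("X1", "X2")))
new   <- matrix(rnorm(6, mean = 5, sd = 2), ncol = 2,
                dimnames = list(NULL, c("X1", "X2")))

# Scale the training data and keep the parameters that were used.
train_scaled <- scale(train)
x_scale <- list(                      # hypothetical layout
  center = attr(train_scaled, "scaled:center"),
  scale  = attr(train_scaled, "scaled:scale")
)

# New data must be scaled with the training parameters, not its own.
new_scaled <- scale(new, center = x_scale$center, scale = x_scale$scale)
```

Scaling new data with its own mean and standard deviation would place it on a different footing from the training data, which is why the parameters are kept with the fit.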

Methods (by class)

  • default: Method for data.frame-like objects

  • formula: Method for passing formula

  • mild_df: Method for mild_df objects

Author(s)

Sean Kent, Yifei Liu

References

Kent, S., & Yu, M. (2022). Non-convex SVM for cancer diagnosis based on morphologic features of tumor microenvironment. arXiv preprint arXiv:2206.14704.

See Also

predict.mismm() for prediction on new data.

Examples

set.seed(8)
mil_data <- generate_mild_df(nbag = 15, nsample = 20, positive_prob = 0.15,
                             sd_of_mean = rep(0.1, 3))

# Heuristic method
mdl1 <- mismm(mil_data)
mdl2 <- mismm(mild(bag_label, bag_name, instance_name) ~ X1 + X2 + X3, data = mil_data)

# MIP method
if (require(gurobi)) {
  mdl3 <- mismm(mil_data, method = "mip", control = list(nystrom_args = list(m = 10, r = 10)))
  predict(mdl3, mil_data)
}

predict(mdl1, new_data = mil_data, type = "raw", layer = "bag")

# summarize predictions at the bag layer
library(dplyr)
mil_data %>%
  bind_cols(predict(mdl2, mil_data, type = "class")) %>%
  bind_cols(predict(mdl2, mil_data, type = "raw")) %>%
  distinct(bag_name, bag_label, .pred_class, .pred)



mildsvm documentation built on July 14, 2022, 9:08 a.m.