Description
classify() automatically trains a Support Vector Classification model, tests it and returns the confusion matrix.
Usage

classify(data, y, coeff, kernel, prob, classimb, p, k, C, H, domain)
Arguments

data: Input data: a matrix or data.frame with predictor variables/features as columns. To perform MKL: a list of *m* datasets. All datasets should have the same number of rows.

y: Response variable (factor).

coeff: ONLY IN THE MKL CASE: a *t·m* matrix of coefficients, where *m* is the number of different data types and *t* the number of different coefficient combinations to evaluate via k-CV. If absent, the same weight is given to all data sources.

kernel: "lin" or "rbf" for the standard Linear and RBF kernels. "clin" for the compositional linear and "crbf" for the Aitchison-RBF kernel. "jac" for the quantitative Jaccard / Ruzicka kernel. "jsk" for the Jensen-Shannon kernel. "flin" and "frbf" for the functional linear and functional RBF kernels. "matrix" if a pre-computed kernel matrix is given as input. To perform MKL: a vector of *m* kernels, one per dataset.

prob: If TRUE, class probabilities (soft classifier) are computed instead of a true-or-false assignment (hard classifier).

classimb: "weights" to introduce class weights in the SVM algorithm, "ubOver" to oversample the minority class, or "ubUnder" to undersample the majority class.

p: The proportion of data reserved for the test set. Alternatively, a vector containing the indexes or the names of the rows for testing.

k: The k for k-fold Cross-Validation. Minimum k = 2. If no argument is provided, cross-validation is not performed.

C: The cost (SVM hyperparameter). A vector with the candidate costs to evaluate via k-Cross-Validation can be entered too.

H: Gamma hyperparameter (only in RBF-like kernels). A vector with candidate values, from which the best one is chosen via k-Cross-Validation, can be entered. For MKL, a list with *m* entries can be entered, where *m* is the number of different data types. Each element of the list must be a number or, if k-Cross-Validation is needed, a vector with the hyperparameters to evaluate for that data type.

domain: Only used with "frbf" or "flin".
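For the MKL case, the shapes of coeff and H described above can be sketched in base R. This is only an illustration of the expected structure (the numeric values are made up, not recommendations):

```r
# Hypothetical MKL setup with m = 2 data types. Each row of 'coeff'
# is one candidate combination of kernel weights (t = 3 here); each
# row sums to 1, so the weights are relative contributions.
coeff <- matrix(c(0.5, 0.1, 0.9,    # column 1: weights for data type 1
                  0.5, 0.9, 0.1),   # column 2: weights for data type 2
                nrow = 3, ncol = 2)

# 'H' as a list with one entry per data type: a vector triggers
# k-Cross-Validation over those gamma values, a scalar fixes gamma.
H <- list(c(0.001, 0.01, 0.1), 0.05)
```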
Details

Cross-validation is available to choose the best hyperparameters (e.g. the cost) during the training step.

The classification can be hard (predicting the class) or soft (predicting the probability of belonging to a given class).

Another feature is the possibility to deal with imbalanced data in the target variable, with several techniques:

- Resampling techniques (oversampling the minority class or undersampling the majority class)
- Changing the class weights according to their frequencies in the training set

To use one-class SVM to deal with imbalanced data, see: outliers()

If the input data has repeated rownames, classify() will consider that the rows sharing an id are repeated measures from the same individual. The function will ensure that all repeated measures are used either to train or to test the model, but not for both, thus preserving the independence between the training and test sets.
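The class-weighting idea behind the "weights" option can be illustrated with inverse-frequency weights, the usual scheme for weighted SVMs. This is a sketch in base R, not the package's exact implementation:

```r
# Inverse-frequency class weights: rarer classes get larger weights,
# so misclassifying them costs more during SVM training.
y <- factor(c("healthy", "healthy", "healthy", "sick"))
freq <- table(y) / length(y)   # healthy = 0.75, sick = 0.25
w <- 1 / as.numeric(freq)
names(w) <- names(freq)
w                              # healthy = 1.33..., sick = 4
```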
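The grouped split described above (all repeated measures of one individual kept on the same side of the split) can be sketched as follows. This is an illustration of the principle in base R, not the package's internal code:

```r
# Rows sharing a rowname are repeated measures of the same individual;
# sample whole individuals, not rows, for the test set.
ids <- c("a", "a", "b", "c", "c", "c", "d")   # rownames with repeats
set.seed(1)
test_ids   <- sample(unique(ids), size = 2)   # hold out 2 individuals
test_rows  <- which(ids %in% test_ids)
train_rows <- which(!(ids %in% test_ids))
# No individual appears in both sets:
length(intersect(ids[test_rows], ids[train_rows]))  # 0
```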
Value

Confusion matrix, chosen hyperparameters, test-set predicted and observed values, and variable importances (only with linear-like kernels).
Examples

# Simple classification
classify(data=soil$abund, y=soil$metadata[,"env_feature"], kernel="clin")

# Classification with MKL:
Nose <- list()
Nose$left <- CSSnorm(smoker$abund$nasL)
Nose$right <- CSSnorm(smoker$abund$nasR)
smoking <- smoker$metadata$smoker[seq(from=1,to=62*4,by=4)]
w <- matrix(c(0.5,0.1,0.9,0.5,0.9,0.1),nrow=3,ncol=2)
classify(data=Nose,kernel="jac",y=smoking,C=c(1,10,100), coeff = w, k=10)
# Classification with longitudinal data:
growth2 <- growth
colnames(growth2) <- c("time", "height")
growth_coeff <- lsq(data=growth2,degree=2)
target <- rep("Girl",93)
target[grep("boy", rownames(growth_coeff$coef))] <- "Boy"
target <- as.factor(target)
classify(data=growth_coeff,kernel="frbf",H=0.0001, y=target, domain=c(11,18))