MiBiClassGBODT: Binary classification using gradient boosting over decision...
In MiDA: Microarray Data Analysis


This function performs binary classification of specimens based on microarray gene (transcript) expression data, using gradient boosting over decision trees. One generalized boosted regression model is fitted per cross-validation fold; for each model, measures of classification quality and feature importance are returned.

MiBiClassGBODT(Matrix, specimens, n.crossval = 5, ntrees = 10000, shrinkage = 0.1, intdepth = 2, n.terminal = 10, bag.frac = 0.5)

`Matrix`	numeric matrix of expression data where each row corresponds to a probe (gene, transcript) and each column corresponds to a specimen (patient).
`specimens`	factor vector with two levels specifying specimens in the columns of the `Matrix`.
`n.crossval`	integer specifying number of cross-validation folds.
`ntrees`	integer specifying the total number of decision trees (boosting iterations).
`shrinkage`	numeric specifying the learning rate. Scales the step size in the gradient descent procedure.
`intdepth`	integer specifying the maximum depth of each tree.
`n.terminal`	integer specifying the actual minimum number of observations in the terminal nodes of the trees.
`bag.frac`	the fraction of the training set observations randomly selected to propose the next tree in the expansion.

Matrix must contain specimens from two classification groups only. To sample the expression matrix, use MiDataSample.
The order of the elements in specimens must match the order of the columns of Matrix. The two levels of specimens are the two classification groups. To sample specimens, use MiSpecimenSample.
The number of cross-validation folds defines the number of models to be fitted. For example, if n.crossval=5, all specimens are divided into 5 folds; each fold is used in turn for model testing while the remaining folds are used for training, so 5 models are fitted. See createFolds for details.
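The fold logic described above can be sketched in base R. This is only an illustration: `make_folds` is a hypothetical helper and not part of MiDA, and unlike createFolds it does not balance the two classes across folds.

```r
# Hypothetical helper (not part of MiDA): split n specimen indices into k folds.
make_folds <- function(n, k) {
  # Shuffle the indices, then deal them into k groups of near-equal size
  split(sample(n), rep_len(seq_len(k), n))
}

set.seed(1)
n_specimens <- 20
folds <- make_folds(n = n_specimens, k = 5)

# Each fold is held out once for testing, so k models would be fitted
for (i in seq_along(folds)) {
  test_idx  <- folds[[i]]                               # held-out specimens for model i
  train_idx <- setdiff(seq_len(n_specimens), test_idx)  # specimens used to fit model i
  cat("model", i, ": train =", length(train_idx),
      ", test =", length(test_idx), "\n")
}
```

With 20 specimens and 5 folds, every model in this sketch is trained on 16 specimens and tested on the remaining 4.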
During boosting, basis functions are added iteratively in a greedy fashion so that each additional basis function further reduces the selected loss function. Gaussian distribution (squared error) is used. ntrees, shrinkage and intdepth are parameters for model tuning. bag.frac introduces randomness into the model fit: if bag.frac < 1, running the same model twice will result in similar but different fits. The number of specimens in the training sample must be large enough to provide the minimum number of observations in terminal nodes, i.e.
N*(1-1/n.crossval)*bag.frac > n.terminal, where N is the total number of specimens.
See gbm for details.
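A quick pre-check of the sample size condition might look like the sketch below. It assumes the condition is meant to be applied to the total number of specimens N, that is N*(1-1/n.crossval)*bag.frac > n.terminal; both this reading and the helper name `enough_specimens` are assumptions, and gbm itself may apply a stricter rule.

```r
# Hypothetical helper (not part of MiDA or gbm): checks whether the bagged
# training sample can supply n.terminal observations per terminal node.
enough_specimens <- function(N, n.crossval = 5, bag.frac = 0.5, n.terminal = 10) {
  n_train  <- N * (1 - 1 / n.crossval)  # specimens left for training in each fold
  n_bagged <- n_train * bag.frac        # specimens drawn to propose each tree
  n_bagged > n.terminal
}

ok_30 <- enough_specimens(N = 30)                   # 30 * 0.8 * 0.5 = 12 bagged specimens
ok_20 <- enough_specimens(N = 20)                   # 20 * 0.8 * 0.5 = 8 bagged specimens
ok_20_small <- enough_specimens(N = 20, n.terminal = 5)
print(c(ok_30, ok_20, ok_20_small))
```

If the check fails, lowering n.terminal, raising bag.frac or using fewer folds are the available adjustments.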

A list of 2 elements:
QC - matrix containing quality measures for each fitted model and their summary. Accur - accuracy (percentage of correct predictions), AUC - area under the ROC curve (see roc), MCC - Matthews correlation coefficient,
formula ((TP*TN)-(FP*FN))/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)),
F1sc - F1 score,
formula 2*Prec*Rec/(Prec+Rec), where Prec = TP/(TP+FP) is precision and Rec = TP/(TP+FN) is recall.
If all the data points from one class are misclassified into the other, MCC and F1 score may take NaN values.
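The two formulas, and the NaN case, can be reproduced from confusion-matrix counts. The sketch below is illustrative only: `mcc_f1` is a hypothetical helper, and the package's internal implementation may differ.

```r
# Hypothetical helper (not part of MiDA): MCC and F1 score from
# true/false positive and negative counts.
mcc_f1 <- function(TP, TN, FP, FN) {
  mcc  <- ((TP * TN) - (FP * FN)) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  prec <- TP / (TP + FP)  # precision
  rec  <- TP / (TP + FN)  # recall
  f1   <- 2 * prec * rec / (prec + rec)
  c(MCC = mcc, F1sc = f1)
}

# An ordinary confusion matrix gives finite values
typical <- mcc_f1(TP = 8, TN = 7, FP = 2, FN = 3)
print(typical)

# Every positive specimen misclassified as negative: TP = 0 and FP = 0,
# so both measures reduce to 0/0, which R evaluates as NaN
degenerate <- mcc_f1(TP = 0, TN = 10, FP = 0, FN = 10)
print(degenerate)
```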

Importance - list of data frames containing for each fitted model: var - probe ID and rel.inf - its feature importance for classification (relative influence).
Feature importance (relative influence) graphs are also produced.

Elena N. Filatova

createFolds, gbm, MiSpecimenSample, MiDataSample, roc

#get gene expression and specimen data
data("IMexpression");data("IMspecimen")
#sample expression matrix and specimen data for binary classification,
#only "NORM" and "EBV" specimens are left
SampleMatrix<-MiDataSample(IMexpression, IMspecimen$diagnosis,"norm", "ebv")
SampleSpecimen<-MiSpecimenSample(IMspecimen$diagnosis, "norm", "ebv")
#Fitting, low tuning for faster running
BoostRes<-MiBiClassGBODT(SampleMatrix, SampleSpecimen, n.crossval = 3,
                       ntrees = 10, shrinkage = 1, intdepth = 2)
BoostRes[[1]] # QC values for n.crossval = 3 models and their summary
length(BoostRes[[2]]) # n.crossval = 3 data frames of probes feature importance for classification
head(BoostRes[[2]][[1]])