trainOcc: Fit a one-class classification model over Different Tuning...
In benmack/oneClass: One-class classification in the absence of test data


This function calculates resampling-based performance measures over a grid of tuning parameters for one of the implemented classifiers (one-class SVM, biased SVM, maxent).


trainOcc(x, y, positive = NULL, method = "biasedsvm", metric = NULL,
  trControl = NULL, index = NULL, summaryFunction = NULL,
  allowParallel = TRUE, verboseIter = TRUE, dirModelInfo = NULL, ...)

`x`	a data frame with the training data. The samples are in the rows and the features in the columns.
`y`	a vector containing the labels encoding if a sample is positive or unlabeled.
`positive`	The positive class in `y`.
`method`	a one-class classification method. Implemented are `ocsvm` (one-class SVM, via `ksvm`), `biasedsvm` (biased SVM, via `ksvm`), `maxent` (via `maxent`), or a custom method that can be passed to `train`.
`metric`	A performance metric for positive/unlabeled data used for model selection. Default for `ocsvm` and `biasedsvm` is `puF` and for `maxent` `puAuc`.
`trControl`	see `train` and `trainControl`. If this argument is given, make sure its settings are suitable for positive/unlabeled data (see Details).
`index`	a list of training indices for the resampling iterations. This is passed to the identically named argument of the `trainControl` function, but only if the argument `trControl` is `NULL`.
`summaryFunction`	a function to compute performance metrics across resamples. This is passed to the identically named argument of the `trainControl` function, but only if the argument `trControl` is `NULL`.
`allowParallel`	enable or disable parallel processing. Even if `TRUE`, parallel processing is only possible if a parallel backend is loaded and available.
`verboseIter`	Logical for printing progress; this only works if parallel processing is disabled (defaults to `TRUE`).
`...`	other arguments that can be passed to `train`. Be careful not to pass anything that conflicts with the `trainControl` settings.
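To make the shape of the `index` argument concrete, here is a minimal base-R sketch of a list of training indices for repeated k-fold cross-validation. The helper `makeRepeatedFoldIndex` is hypothetical and not part of the package; unlike caret's `createMultiFolds`, it does not stratify by class.

```r
# Hypothetical helper: build training-index lists for repeated k-fold CV.
# Assumption: no stratification by class (createMultiFolds would stratify).
makeRepeatedFoldIndex <- function(n, k = 5, times = 5) {
  index <- list()
  for (r in seq_len(times)) {
    # randomly assign each of the n samples to one of k folds
    fold <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # each element holds the rows used for TRAINING in that iteration
      index[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold != f)
    }
  }
  index
}

set.seed(123)
idx <- makeRepeatedFoldIndex(n = 20, k = 5, times = 2)
length(idx)   # one element per resampling iteration (k * times)
lengths(idx)  # number of training rows in each iteration
```

Each list element names one resampling iteration and contains the row numbers that are used for fitting; the remaining rows form the hold-out set of that iteration.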

trainOcc calls train and returns an object of class trainOcc which is a child of train, i.e. methods defined in caret for train can also be used.
Via the trControl argument you can customize how train behaves (see trainControl), but note the following (see also the example, where the trainOcc defaults of trControl are given):

Make sure that you define a suitable summaryFunction that returns metrics for positive/unlabeled data (default: puSummary).
classProbs has to be set to TRUE if any of the performance metrics rely on the continuous outputs of the one-class classifier, such as puAuc.
savePredictions should be set to TRUE and returnResamp to "all" in order to make all diagnostic methods available for later analysis.
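As an illustration of the first point, the following is a rough sketch of a summary function with the signature caret expects (`data`, `lev`, `model`). The name `puSummarySketch` is hypothetical and this is not the package's puSummary. It assumes that the positive class is the last factor level and that the PU F-measure is computed as tpr^2 / ppp, a form known from the positive/unlabeled learning literature that may differ from the package's definition.

```r
# Hypothetical stand-in for a PU summary function; NOT the package's puSummary.
puSummarySketch <- function(data, lev = NULL, model = NULL) {
  # caret passes observed labels in data$obs and predicted labels in data$pred
  if (is.null(lev)) lev <- levels(data$obs)
  pos <- lev[length(lev)]               # assumption: positive class is the last level
  is_pos_obs  <- data$obs == pos
  is_pos_pred <- data$pred == pos
  tpr <- mean(is_pos_pred[is_pos_obs])  # true positive rate on the labeled positives
  ppp <- mean(is_pos_pred)              # probability of positive prediction (all hold-out samples)
  puF <- if (ppp > 0) tpr^2 / ppp else 0  # assumed PU F-measure: tpr^2 / ppp
  c(tpr = tpr, ppp = ppp, puF = puF)
}

# Toy hold-out set: 4 labeled positives, 6 unlabeled samples
toy <- data.frame(
  obs  = factor(c("pos", "pos", "pos", "pos", "un", "un", "un", "un", "un", "un"),
                levels = c("un", "pos")),
  pred = factor(c("pos", "pos", "pos", "un", "pos", "un", "un", "un", "un", "un"),
                levels = c("un", "pos"))
)
res <- puSummarySketch(toy)
res
```

A function of this form could be passed as summaryFunction; a model scores well when it recovers most labeled positives (high tpr) while predicting only a small fraction of all samples as positive (low ppp).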

A trainOcc object, which is a child of the train object.
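What "child of train" means in practice can be sketched with plain S3 classes. The toy object and the generic `describe` below are hypothetical and only mimic the presumed class vector of a trainOcc result; they do not reproduce the real object.

```r
# Toy object that mimics the class vector of a trainOcc result (assumption)
obj <- structure(list(method = "biasedsvm"), class = c("trainOcc", "train"))

# A hypothetical generic with a method defined only for the parent class "train"
describe <- function(x, ...) UseMethod("describe")
describe.train <- function(x, ...) paste("train-like object, method:", x$method)

# No describe.trainOcc exists, so S3 dispatch falls back to describe.train
out <- describe(obj)
out
inherits(obj, "train")
```

This fallback along the class vector is why methods that caret defines for train objects can also be applied to the returned object.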

## Not run: 
library(oneClass)

## a synthetic data set
data(bananas)

## this is the default setting of trControl in trainOcc
cntrl <- trainControl(method = "cv",
                      number = 10,
                      summaryFunction = puSummary, #!
                      classProbs = TRUE,           #!
                      savePredictions = TRUE,      #!
                      returnResamp = "all",        #!
                      allowParallel = TRUE)

## but let's use repeated k-fold cross-validation
set.seed(123)
rcv.idx <- createMultiFolds(puFactor(bananas$tr[,1]), k=5, times=5)
cntrl <- trainControl(index = rcv.idx,
                      summaryFunction = puSummary, 
                      classProbs = TRUE, 
                      savePredictions = TRUE,
                      returnResamp = "all",
                      allowParallel = TRUE)

tocc <- trainOcc(x=bananas$tr[, -1], y=bananas$tr[, 1], trControl=cntrl)                         

## be aware that the PU-performance metrics do not always select the 
## optimal model.
## you may want to investigate other performance metrics stored in the 
## model selection table. 
tocc

## neatly arranged by sorting 
sort(tocc, by="puF") 

## particularly the true positive rate (tpr) and the probability of 
## positive prediction (ppp) are informative. you want to find a model
## with high tpr but low ppp.
plot_PPPvsTPR(tocc) 

## based on this plot you may want to select candidate models for a more
## thorough evaluation: use identifyPoints=TRUE
## (interactive, so the call is commented out here)
# candiModels <- plot_PPPvsTPR(tocc, identifyPoints=TRUE)

## the former assignment returns a list like the one created here: 
candiModels <- modelPosition(tocc, modRow=c(80, 86, 44))

## plot the resampling distributions 
resamps <- resamples(tocc, modRow=candiModels$row)
bwplot(resamps, scales="free")

## the diagnostic distributions plot can also be helpful;
## for it, (a large subset of) the unlabeled data needs to be predicted 
tocc.m80 <- update(tocc, modRow=candiModels$row[1]) # set the final model
pred.m80 <- predict(tocc.m80, bananas$x)            # predict with that model
tocc.m86 <- update(tocc, modRow=candiModels$row[2])
pred.m86 <- predict(tocc.m86, bananas$x) 
tocc.m44 <- update(tocc, modRow=candiModels$row[3])
pred.m44 <- predict(tocc.m44, bananas$x) 

par(mfrow=c(1,3))
hist(tocc.m80, pred.m80, th=0)
hist(tocc.m86, pred.m86, th=0)
hist(tocc.m44, pred.m44, th=0)

## here we can also see the model in the 2D feature space. this is usually
## not possible because the feature space is high dimensional.
par(mfrow=c(1,1))
featurespace(tocc.m80, th=0)
featurespace(tocc.m86, th=0)
featurespace(tocc.m44, th=0)

## End(Not run)