trainOcc: Fit a one-class classification model over different tuning parameters


Description

This function calculates resampling based performance measures over a grid of tuning parameters for one of the implemented classifiers (one-class SVM, biased SVM, maxent).

Usage

trainOcc(x, y, positive = NULL, method = "biasedsvm", metric = NULL,
  trControl = NULL, index = NULL, summaryFunction = NULL,
  allowParallel = TRUE, verboseIter = TRUE, dirModelInfo = NULL, ...)

Arguments

x

a data frame with the training data. The samples are in the rows and the features in the columns.

y

a vector containing the labels, encoding whether a sample is positive or unlabeled (see the sketch after this argument list).

positive

The positive class in y.

method

a one-class classification method. Implemented are ocsvm (one-class SVM, via ksvm), biasedsvm (biased SVM, via ksvm), and maxent (via maxent); alternatively, a custom method that can be passed to train.

metric

A performance metric for positive/unlabeled data used for model selection. The default is puF for ocsvm and biasedsvm, and puAuc for maxent.

trControl

see train and trainControl. If this argument is given, make sure its settings are compatible with the positive/unlabeled performance metrics (see Details).

index

a list of training indices for the resampling iterations. This is passed to the identically named argument of the trainControl function; it is ignored if trControl is not NULL.

summaryFunction

a function to compute performance metrics across resamples. This is passed to the identically named argument of the trainControl function; it is ignored if trControl is not NULL.

allowParallel

enable or disable parallel processing. Even if TRUE, parallel processing is only possible if a parallel backend is loaded and available.

verboseIter

Logical for printing training progress; this only works if parallel processing is disabled (defaults to TRUE).

...

other arguments that can be passed to train. Be careful when overriding trainControl settings (see Details).
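
For illustration, a minimal sketch of how y can be encoded. Treating 1 as positive and 0 as unlabeled is an assumption here; puFactor is the helper also used in the Examples below, and the positive argument shown is assumed:

## hypothetical integer labels: 1 = positive, 0 = unlabeled
y.raw <- c(1, 1, 1, 0, 0, 0, 0)
y <- puFactor(y.raw, positive = 1)  ## positive = 1 assumed to mark the positive class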

Details

trainOcc calls train and returns an object of class trainOcc, which inherits from train, i.e. methods defined in caret for train objects can also be used.
Via the trControl argument you can customize how train behaves (see trainControl). If you do, keep the settings required for the positive/unlabeled performance metrics (summaryFunction = puSummary, classProbs = TRUE, savePredictions = TRUE, returnResamp = "all"); the example below shows these trainOcc defaults of trControl.
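
As a minimal sketch of this inheritance (using the bananas data from the Examples), caret functions defined for train objects can be applied directly:

data(bananas)
## fit with the trainOcc defaults (method = "biasedsvm")
tocc <- trainOcc(x = bananas$tr[, -1], y = bananas$tr[, 1])
getTrainPerf(tocc)   ## caret helper for train objects
tocc$results         ## the full model selection table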

Value

A trainOcc object, which inherits from train.

Examples

## Not run: 
## a synthetic data set
data(bananas)

## this is the default setting of trControl in trainOcc
cntrl <- trainControl(method = "cv",
                      number = 10,
                      summaryFunction = puSummary, #!
                      classProbs = TRUE,           #!
                      savePredictions = TRUE,      #!
                      returnResamp = "all",        #!
                      allowParallel = TRUE)

## but let's use repeated k-fold cross-validation
set.seed(123)
rcv.idx <- createMultiFolds(puFactor(bananas$tr[,1]), k=5, times=5)
cntrl <- trainControl(index = rcv.idx,
                      summaryFunction = puSummary, 
                      classProbs = TRUE, 
                      savePredictions = TRUE,
                      returnResamp = "all",
                      allowParallel = TRUE)
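
## optional sketch: register a parallel backend so that
## allowParallel = TRUE actually runs the resampling in parallel
## (assumes the doParallel package is installed)
# library(doParallel)
# registerDoParallel(cores = 2)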

tocc <- trainOcc(x=bananas$tr[, -1], y=bananas$tr[, 1], trControl=cntrl)                         

## be aware that the PU-performance metrics do not always select the 
## optimal model. you may want to investigate other performance metrics 
## stored in the model selection table. 
tocc

## neatly arranged by sorting 
sort(tocc, by="puF") 

## particularly the true positive rate (tpr) and the probability of 
## positive prediction (ppp) are informative. you want to find a model
## with high tpr but low ppp.
plot_PPPvsTPR(tocc) 

## based on this plot you may want to select candidate models for a more
## thorough evaluation: use identifyPoints=TRUE (interactive, so run it
## manually):
# candiModels <- plot_PPPvsTPR(tocc, identifyPoints=TRUE)

## the former assignment returns a list like the one created here: 
candiModels <- modelPosition(tocc, modRow=c(80, 86, 44))

## plot the resampling distributions 
resamps <- resamples(tocc, modRow=candiModels$row)
bwplot(resamps, scales="free")

## the diagnostic distributions plot can also be helpful.
## for this, (a large subset of) the unlabeled data needs to be predicted 
tocc.m80 <- update(tocc, modRow=candiModels$row[1]) # set the final model
pred.m80 <- predict(tocc.m80, bananas$x)            # predict
tocc.m86 <- update(tocc, modRow=candiModels$row[2])
pred.m86 <- predict(tocc.m86, bananas$x) 
tocc.m44 <- update(tocc, modRow=candiModels$row[3])
pred.m44 <- predict(tocc.m44, bananas$x) 

par(mfrow=c(1,3))
hist(tocc.m80, pred.m80, th=0)
hist(tocc.m86, pred.m86, th=0)
hist(tocc.m44, pred.m44, th=0)

## here we can also see the model in the 2D feature space. this is usually
## not possible because the feature space is high dimensional.
par(mfrow=c(1,1))
featurespace(tocc.m80, th=0)
featurespace(tocc.m86, th=0)
featurespace(tocc.m44, th=0)
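
## a sketch: binarize the continuous predictions at the threshold used
## above (th = 0) to obtain positive/negative labels
label.m80 <- pred.m80 > 0
table(label.m80)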

## End(Not run)
