This function calculates resampling-based performance measures over a grid of tuning parameters for one of the implemented classifiers (one-class SVM, biased SVM, maxent).
x |
a data frame with the training data. The samples are in the rows and the features in the columns. |
y |
a vector containing the labels that encode whether a sample is positive or unlabeled. |
positive |
The positive class in |
method |
a one-class classification method. Implemented are |
metric |
A performance metric for positive/unlabeled data used for model selection.
Default for |
trControl |
see |
index |
a list of training indices for the resampling iterations. This will be passed
to the identically named argument of the |
summaryFunction |
a function to compute performance metrics across resamples. This will be passed
to the identically named argument of the |
allowParallel |
enable or disable parallel processing. Even if |
verboseIter |
Logical; whether to print progress. This only works if parallel processing is disabled (defaults to |
... |
other arguments that can be passed to train. Be careful with trainControl... ! |
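The index argument expects a list with one element per resampling iteration, each holding the row numbers used for training in that iteration. A minimal base R sketch of such a list for plain k-fold cross-validation follows; the names n, k, fold_id and the "Fold" labels are illustrative assumptions, and in practice a caret helper such as createMultiFolds (used in the example section) would typically build it.

```r
# Sketch: build a list of training indices for 5-fold cross-validation.
# n, k and the element names are hypothetical, chosen only for illustration.
set.seed(1)
n <- 20   # number of training samples
k <- 5    # number of folds

# assign every sample to exactly one fold, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# the training indices of iteration i are all samples NOT held out in fold i
index <- lapply(seq_len(k), function(i) which(fold_id != i))
names(index) <- paste0("Fold", seq_len(k))

str(index)
```

Each element then contains 16 of the 20 row numbers, and every sample is held out exactly once.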
trainOcc calls train and returns an object of class
trainOcc, which is a child of train, i.e. methods defined in caret
for train objects can also be used.
Via the trControl argument you can customize how train behaves
(see trainControl), but note the following (see also the example, where the trainOcc defaults of
trControl are given):
Make sure that you define a suitable summaryFunction that
returns metrics for positive/unlabeled data (default: puSummary).
classProbs has to be set to TRUE if the continuous outputs of
the one-class classifier are required to calculate the performance metrics, i.e. those
that rely on continuous predictions, such as puAuc.
savePredictions should also be set to TRUE and returnResamp to "all" (as in the
example) in order to make all diagnostic methods available for later analysis.
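To illustrate what a summary function for positive/unlabeled data has to deliver, here is a minimal sketch in base R. It is a hypothetical stand-in and not the package's puSummary: the name myPuSummary, the assumption that the first class level is the positive class, the choice to estimate ppp from the unlabeled samples only, and the F-like criterion tpr^2 / ppp are all assumptions made for this sketch. It only follows the general shape of a caret summary function, which receives a data frame with the columns obs and pred and returns a named numeric vector.

```r
# Hypothetical stand-in for a PU summary function; NOT oneClass::puSummary.
# Assumption: lev[1] is the positive class, the other level is "unlabeled".
myPuSummary <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  pos <- lev[1]
  is_pos <- data$obs == pos

  # true positive rate, estimated from the labeled positives
  tpr <- mean(data$pred[is_pos] == pos)
  # probability of positive prediction, estimated here from the unlabeled samples
  ppp <- mean(data$pred[!is_pos] == pos)
  # assumed F-like PU criterion: high tpr and low ppp give a high value
  puF <- if (ppp > 0) tpr^2 / ppp else NA_real_

  c(tpr = tpr, ppp = ppp, puF = puF)
}

# toy held-out predictions: 4 labeled positives, 4 unlabeled samples
toy <- data.frame(
  obs  = factor(c("pos", "pos", "pos", "pos", "un", "un", "un", "un"),
                levels = c("pos", "un")),
  pred = factor(c("pos", "pos", "pos", "un", "pos", "un", "un", "un"),
                levels = c("pos", "un"))
)
res <- myPuSummary(toy, lev = c("pos", "un"))
res  # tpr = 0.75, ppp = 0.25, puF = 2.25
```

A function of this shape could in principle be passed as summaryFunction in trainControl; whether the result works with all trainOcc diagnostics is not covered by this sketch.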
A trainOcc object, which is a child of the train class.
## Not run:
## a synthetic data set
data(bananas)
## this is the default setting of trControl in trainOcc
cntrl <- trainControl(method = "cv",
                      number = 10,
                      summaryFunction = puSummary, #!
                      classProbs = TRUE,           #!
                      savePredictions = TRUE,      #!
                      returnResamp = "all",        #!
                      allowParallel = TRUE)
## but let's use repeated k-fold cross-validation
set.seed(123)
rcv.idx <- createMultiFolds(puFactor(bananas$tr[,1]), k=5, times=5)
cntrl <- trainControl(index = rcv.idx,
                      summaryFunction = puSummary,
                      classProbs = TRUE,
                      savePredictions = TRUE,
                      returnResamp = "all",
                      allowParallel = TRUE)
tocc <- trainOcc(x=bananas$tr[, -1], y=bananas$tr[, 1], trControl=cntrl)
## be aware that the PU performance metrics do not always select the
## optimal model.
## you may want to investigate the other performance metrics stored in the
## model selection table.
tocc
## neatly arranged by sorting
sort(tocc, by="puF")
## particularly the true positive rate (tpr) and the probability of
## positive prediction (ppp) are informative. you want to find a model
## with high tpr but low ppp.
plot_PPPvsTPR(tocc)
## based on this plot you may want to select candidate models for a more
## thorough evaluation: use identifyPoints=TRUE
## (interactive, therefore commented out here)
# candiModels <- plot_PPPvsTPR(tocc, identifyPoints=TRUE)
## the assignment above returns a list like the one created here:
candiModels <- modelPosition(tocc, modRow=c(80, 86, 44))
## plot the resampling distributions
resamps <- resamples(tocc, modRow=candiModels$row)
bwplot(resamps, scales="free")
## the diagnostic distribution plots can also be helpful.
## for these, (a large subset of) the unlabeled data needs to be predicted
tocc.m80 <- update(tocc, modRow=candiModels$row[1]) # set the final model
pred.m80 <- predict(tocc.m80, bananas$x) # predict with the updated model
tocc.m86 <- update(tocc, modRow=candiModels$row[2])
pred.m86 <- predict(tocc.m86, bananas$x)
tocc.m44 <- update(tocc, modRow=candiModels$row[3])
pred.m44 <- predict(tocc.m44, bananas$x)
par(mfrow=c(1,3))
hist(tocc.m80, pred.m80, th=0)
hist(tocc.m86, pred.m86, th=0)
hist(tocc.m44, pred.m44, th=0)
## here we can also see the model in the 2D feature space. this is usually
## not possible because the feature space is high-dimensional.
par(mfrow=c(1,1))
featurespace(tocc.m80, th=0)
featurespace(tocc.m86, th=0)
featurespace(tocc.m44, th=0)
## End(Not run)
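The rule of thumb from the example ("high tpr but low ppp") can also be applied to a table of results without the interactive plot. The following sketch uses a toy data frame as a stand-in for a model selection table; the column layout, the made-up values and the threshold min_tpr are assumptions for illustration only and do not reflect the actual structure of a trainOcc object.

```r
# Toy stand-in for a model selection table; all values are made up.
tab <- data.frame(
  row = 1:5,
  tpr = c(0.99, 0.95, 0.90, 0.60, 1.00),
  ppp = c(0.60, 0.30, 0.25, 0.10, 0.95)
)

min_tpr <- 0.9  # hypothetical minimum acceptable true positive rate

# keep models that find most positives, then prefer those that
# predict "positive" least often
cand <- tab[tab$tpr >= min_tpr, ]
cand <- cand[order(cand$ppp), ]
cand$row  # 3 2 1 5
```

The resulting row numbers play the same role as the modRow values passed to modelPosition in the example.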