getRDN: Compute the RDN applicability domain for increasing...


Description

Function that automates the characterization of the RDN applicability domain, by computing the neighbourhood thresholds for an increasing number of nearest neighbours, and for each set of thresholds collecting the respective in-domain accuracy and outlying test instances.

Usage

getRDN(performance, trainingSet, testSet, agreementInput, STDInput,
  stepLimit = 65, initialcompression = 31, decompression = 41)

Arguments

performance

Data frame of dimensions (N,1) with 1 for correctly predicted instances in testSet, and 0 otherwise.

trainingSet

Data frame with the training descriptors (raw) used to calculate NNthreshold.

testSet

Data frame with descriptors (raw) of the new instances to be tested against trainingSet.

agreementInput

A data frame (N,1) with the agreement measure from M ensemble models, for each instance in trainingSet.

STDInput

A data frame (N,1) with the ensemble standard deviation for each instance in trainingSet.

stepLimit

Number of domain expansion iterations to be computed (optional)

initialcompression

Integer setting the first iteration at which the compression regime changes: for iterations below initialcompression, threshold distances are compressed to one third of their original values; from initialcompression up to decompression - 1, they are relaxed to one half of their original values.

decompression

Integer setting the iteration from which threshold distances are fully decompressed (raw Euclidean distances are used as is).
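The compression schedule implied by initialcompression and decompression can be sketched as a scaling function (an illustration only; the function name thresholdScale is hypothetical, and the regime boundaries are inferred from the argument descriptions and the example output below):

```r
# Hypothetical helper illustrating the threshold scaling schedule described above:
# thresholds are scaled to 1/3 before initialcompression, to 1/2 from
# initialcompression up to decompression - 1, and left unscaled afterwards.
thresholdScale <- function(step, initialcompression = 31, decompression = 41) {
  if (step < initialcompression) {
    1/3  # compressed to a third of the original distance
  } else if (step < decompression) {
    1/2  # relaxed to half of the original distance
  } else {
    1    # fully decompressed: raw Euclidean distances
  }
}

sapply(c(1, 30, 31, 41, 65), thresholdScale)
```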

Details

This function calls getEDmatrix, getThreshold and TestInTrain to compute the applicability-domain accuracy and the in-domain test instances at iteratively larger threshold distances. getEDmatrix is called once and its result stored in Dij.sort (an internal variable); getThreshold and TestInTrain are then called for an increasing number of k nearest neighbours, and at each step the in-domain accuracy and the outlying-instance count are stored in resultSummary.
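Schematically, the procedure above can be summarized as follows (pseudocode only; the actual argument lists of getEDmatrix, getThreshold and TestInTrain are those of the package and are not reproduced here):

```
Dij.sort <- getEDmatrix(...)      # computed once
for (k in 1:stepLimit) {
  thresholds <- getThreshold(...) # thresholds for k nearest neighbours
  inDomain <- TestInTrain(...)    # test instances falling inside the domain
  resultSummary[k, ] <- c(<outlying instance count>, <in-domain accuracy>)
}
```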

Value

A matrix, resultSummary, with one column storing the outlying-instance count and another the in-domain accuracy for each calculation step.

Examples

101
102
library(randomForest)
library(mlbench)
data("BreastCancer")
# remove first Col (IDs)
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]

#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))

#QSAR model; gather external predictions
# ntree was purposefully set to 1 to challenge the RDN calculation with more prediction error.
m <- randomForest(Class~., data=bcdata[trainID,], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p<-predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate #class pred external
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)

ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])

# train ensemble for AD calculation
for (i in 2:11){
  #sampling
  samp <- sample(trainID, round(length(trainID)*0.8))
  # train ensemble
  m <- randomForest(Class~., data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
  pred <- predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
  pred_class <- predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
  P_AD <- pred$aggregate[,1]        # probability of the first class on the training set
  class_AD <- pred_class$aggregate  # class predictions on the training set
  ensembleClass.t[,i] <- class_AD
  ensembleProb.t[,i] <- P_AD

}

# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
  agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])

#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)

# Prepare descriptors to be passed into getRDN
train <- data.matrix(bcdata[trainID, 1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID, 1:(ncol(bcdata)-1)])

# Compute RDN; this will take care of scaling trainingSet and testSet internally before any use.
resultSummary <- getRDN(performance=performance, trainingSet=train,
                        testSet=test, agreementInput=agree, STDInput=std)
# The results stored in resultSummary show the overall quality of predictions decreasing
# as the AD is expanded (i.e. as the count of out-of-AD instances decreases).
#> resultSummary
#NNout ACC in AD
#[1,]    108 0.9793814
#[2,]     97 0.9722222
#[3,]     92 0.9734513
#[4,]     89 0.9741379
#[5,]     88 0.9743590
#[6,]     88 0.9743590
#[7,]     88 0.9743590
#[8,]     87 0.9745763
#[9,]     84 0.9752066
#[10,]     83 0.9754098
#[11,]     80 0.9760000
#[12,]     78 0.9763780
#[13,]     77 0.9765625
#[14,]     75 0.9769231
#[15,]     73 0.9772727
#[16,]     71 0.9776119
#[17,]     67 0.9710145
#[18,]     67 0.9710145
#[19,]     67 0.9710145
#[20,]     64 0.9645390
#[21,]     64 0.9645390
#[22,]     64 0.9645390
#[23,]     64 0.9645390
#[24,]     61 0.9652778
#[25,]     60 0.9655172
#[26,]     59 0.9657534
#[27,]     58 0.9659864
#[28,]     58 0.9659864
#[29,]     58 0.9659864
#[30,]     58 0.9659864
#[31,]     14 0.9476440
#[32,]     14 0.9476440
#[33,]     14 0.9476440
#[34,]     14 0.9476440
#[35,]     12 0.9430052
#[36,]     12 0.9430052
#[37,]     12 0.9430052
#[38,]     12 0.9430052
#[39,]     12 0.9430052
#[40,]     12 0.9430052
#[41,]      0 0.9365854
#[42,]      0 0.9365854
#...          ...
#[65,]      0 0.9365854
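One way to use resultSummary is to pick the largest expansion step that still keeps the in-domain accuracy above a target (a sketch with a toy stand-in matrix; the target value is arbitrary):

```r
# Toy stand-in for resultSummary: one column of out-of-AD counts, one of in-AD accuracies
resultSummary <- cbind(NNout = c(108, 97, 67, 14, 0),
                       ACCinAD = c(0.979, 0.972, 0.971, 0.948, 0.937))

target <- 0.97  # arbitrary accuracy target
best <- max(which(resultSummary[, "ACCinAD"] >= target))  # last step meeting the target
resultSummary[best, ]
```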

machLearnNA/RDN documentation built on May 21, 2019, 10:51 a.m.