Function that automates the characterization of the RDN applicability domain, by computing the neighbourhood thresholds for an increasing number of nearest neighbours, and for each set of thresholds collecting the respective in-domain accuracy and outlying test instances.
getRDN(performance, trainingSet, testSet, agreementInput, STDInput,
  stepLimit = 65, initialcompression = 31, decompression = 41)
performance
Data frame of dimensions (N,1) with 1 for correctly predicted instances in testSet, and 0 otherwise.
trainingSet
Data frame with the training descriptors (raw) used to calculate
testSet
Data frame with descriptors (raw) of the new instances to be tested against
agreementInput
A data frame (N,1) with the agreement measure from M ensemble models, for each instance in
STDInput
A data frame (N,1) with the ensemble standard deviation for each instance in
stepLimit
Number of domain expansion iterations to be computed (optional)
initialcompression
Integer setting the iteration limit up to (but not including) which the threshold distances are compressed to a third of their original values; from initialcompression + 1, threshold values get decompressed to half of their original distance values.
decompression
Integer setting the starting iteration at which threshold distances get fully decompressed (Euclidean distances used as is)
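The compression schedule described by initialcompression and decompression can be pictured with a small sketch. The helper compressionFactor below is hypothetical and is not part of the RDN package; it only illustrates the three regimes (a third, a half, full distance). How the iteration exactly at initialcompression is treated is an assumption here, since the argument description can be read either way.

```r
# Hypothetical helper, NOT part of the RDN package: returns the factor that
# would multiply the threshold distances at a given expansion step.
# Assumption: the step equal to `initialcompression` already uses 1/2.
compressionFactor <- function(step, initialcompression = 31, decompression = 41) {
  if (step < initialcompression) {
    1 / 3          # thresholds compressed to a third
  } else if (step < decompression) {
    1 / 2          # thresholds decompressed to a half
  } else {
    1              # Euclidean distances used as is
  }
}

# Factor applied at each of the default 65 steps
factors <- vapply(1:65, compressionFactor, numeric(1))
table(factors)
```

Under these assumptions the schedule has two jumps, at steps 31 and 41 with the default arguments.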
This function calls getEDmatrix, getThreshold and TestInTrain to compute the applicability domain accuracy and the in-domain test instances at iteratively larger threshold distances. It calls getEDmatrix once, storing the result under Dij.sort (an implicit variable), and then calls getThreshold and TestInTrain for an iteratively larger number k of nearest neighbours; at each such step the in-domain accuracy and the outlying instance count are stored in resultSummary.
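To make the expansion loop more concrete, here is a self-contained toy sketch in base R. It does not call getEDmatrix, getThreshold or TestInTrain and is not the package's implementation: the threshold rule (mean distance to the k nearest training neighbours), the in-domain rule, the random data and the name toySummary are all simplifying assumptions chosen for illustration, and the agreement and standard deviation inputs are ignored entirely.

```r
# Toy sketch of an expanding-neighbourhood loop; NOT the RDN implementation.
set.seed(1)
train   <- matrix(rnorm(40 * 2), ncol = 2)           # 40 training instances
test    <- matrix(rnorm(15 * 2, sd = 2), ncol = 2)   # 15 test instances
correct <- rbinom(nrow(test), 1, 0.8)                # stand-in for `performance`

# Training-to-training distances, computed once; each row sorted ascending
# (column 1 is the zero distance of an instance to itself).
dSorted <- t(apply(as.matrix(dist(train)), 1, sort))

# Test-to-training distances (rows: test instances, columns: training instances)
dAll  <- as.matrix(dist(rbind(test, train)))
dTest <- dAll[seq_len(nrow(test)), nrow(test) + seq_len(nrow(train))]

stepLimit  <- 10
toySummary <- matrix(NA_real_, nrow = stepLimit, ncol = 2,
                     dimnames = list(NULL, c("NNout", "ACC in AD")))

for (k in seq_len(stepLimit)) {
  # Assumed threshold: mean distance to the k nearest training neighbours
  thr <- rowMeans(dSorted[, 2:(k + 1), drop = FALSE])
  # Assumed rule: in-domain if within the threshold of any training instance
  inDomain <- apply(dTest, 1, function(d) any(d <= thr))
  toySummary[k, "NNout"]     <- sum(!inDomain)
  toySummary[k, "ACC in AD"] <- if (any(inDomain)) mean(correct[inDomain]) else NA
}
toySummary
```

Because the assumed thresholds can only grow with k, the number of out-of-domain test instances in this sketch never increases from one step to the next.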
A matrix called resultSummary, which stores one column with the outlying instance count and another with the in-domain accuracy, for each calculation step.
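One possible way to use such a matrix is to pick the widest domain that still meets an accuracy target. The sketch below uses a handful of rows copied from the example output on this page; the object names res, target and best, and the 0.97 target itself, are illustrative assumptions rather than anything the package prescribes.

```r
# Rows 1, 16, 20, 31 and 41 of the example output shown on this page
res <- matrix(c(108, 0.9793814,
                 71, 0.9776119,
                 64, 0.9645390,
                 14, 0.9476440,
                  0, 0.9365854),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("1", "16", "20", "31", "41"),
                              c("NNout", "ACC in AD")))

target <- 0.97                          # hypothetical accuracy requirement
ok     <- res[, "ACC in AD"] >= target  # steps meeting the requirement

# Among those steps, keep the one leaving the fewest test instances outside
best <- names(which.min(res[ok, "NNout"]))
best
```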
library(randomForest)
library(mlbench)
data("BreastCancer")
# remove first Col (IDs)
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]
#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))
#QSAR model; gather external predictions
# ntree was purposefully set to 1 to challenge the RDN calculation with more error.
m<-randomForest(Class~.,data=bcdata[c(trainID),], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p<-predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate #class pred external
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)
ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])
# train ensemble for AD calculation
for (i in 2:11){
#sampling
samp <- sample(trainID, round(length(trainID)*0.8))
# train ensemble
m<-randomForest(Class~.,data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
pred<-predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
pred_class<-predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
P_AD <- pred$aggregate[,1] #prob train
class_AD <-pred_class$aggregate #class train
ensembleClass.t[,i]=class_AD
ensembleProb.t[,i]=P_AD
}
# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])
#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)
# Prepare descriptors to be passed into getRDN
train <- data.matrix(bcdata[trainID,1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID,1:(ncol(bcdata)-1)])
# Compute RDN; this will take care of scaling trainingSet and testSet internally before any use.
resultSummary <- getRDN(performance=performance, trainingSet=train,
testSet=test, agreementInput=agree, STDInput=std)
# The results saved in resultSummary show a decreasing overall quality of predictions as
# AD gets expanded (i.e. instances out of AD decrease).
#> resultSummary
#NNout ACC in AD
#[1,] 108 0.9793814
#[2,] 97 0.9722222
#[3,] 92 0.9734513
#[4,] 89 0.9741379
#[5,] 88 0.9743590
#[6,] 88 0.9743590
#[7,] 88 0.9743590
#[8,] 87 0.9745763
#[9,] 84 0.9752066
#[10,] 83 0.9754098
#[11,] 80 0.9760000
#[12,] 78 0.9763780
#[13,] 77 0.9765625
#[14,] 75 0.9769231
#[15,] 73 0.9772727
#[16,] 71 0.9776119
#[17,] 67 0.9710145
#[18,] 67 0.9710145
#[19,] 67 0.9710145
#[20,] 64 0.9645390
#[21,] 64 0.9645390
#[22,] 64 0.9645390
#[23,] 64 0.9645390
#[24,] 61 0.9652778
#[25,] 60 0.9655172
#[26,] 59 0.9657534
#[27,] 58 0.9659864
#[28,] 58 0.9659864
#[29,] 58 0.9659864
#[30,] 58 0.9659864
#[31,] 14 0.9476440
#[32,] 14 0.9476440
#[33,] 14 0.9476440
#[34,] 14 0.9476440
#[35,] 12 0.9430052
#[36,] 12 0.9430052
#[37,] 12 0.9430052
#[38,] 12 0.9430052
#[39,] 12 0.9430052
#[40,] 12 0.9430052
#[41,] 0 0.9365854
#[42,] 0 0.9365854
#... ...
#[65,] 0 0.9365854