Searches for training instances whose neighbourhood radius covers new (test) instances, i.e., whose radius is equal to or larger than the distance to the new instance. Test instances that fall within at least one training neighbourhood are used to calculate in-domain accuracy.
TestInTrain(performance, testSet, trainingSet, NNthreshold)
performance
Data frame of dimensions (N,1) with 1 for correctly predicted instances in testSet, and 0 otherwise.
testSet
Data frame with the scaled descriptors of the new instances to be tested against trainingSet. Scaling should be done with scale(testSet, center = mins, scale = maxs - mins), where mins and maxs are inherited from previously calling
trainingSet
Data frame with the scaled descriptors of the training instances used to calculate NNthreshold. Scaling should be done with scale(trainingSet, center = mins, scale = maxs - mins), where mins and maxs are inherited from previously calling
NNthreshold
Data frame with threshold distances output by getThreshold.
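The scaling convention described for testSet and trainingSet can be illustrated with a small base R sketch. It assumes that mins and maxs are the per-column minima and maxima of the unscaled training descriptors; here they are computed directly for illustration, whereas in the package workflow they are inherited from an earlier call. The toy matrices train_raw and test_raw are hypothetical.

```r
# Toy descriptor matrices (hypothetical data, for illustration only)
train_raw <- rbind(c(1, 10), c(3, 20), c(5, 40))
test_raw  <- rbind(c(2, 25))

# Assumption: mins and maxs are column-wise extremes of the training set
mins <- apply(train_raw, 2, min)
maxs <- apply(train_raw, 2, max)

# Both sets are scaled with the training mins and maxs, so that test
# distances are measured in the same space as the training thresholds
train_scaled <- scale(train_raw, center = mins, scale = maxs - mins)
test_scaled  <- scale(test_raw,  center = mins, scale = maxs - mins)
```

Scaled training descriptors fall in [0, 1] by construction; scaled test descriptors may fall outside that range when a test value lies beyond the training extremes.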
This function computes the Euclidean distance between each test instance and every training instance, and determines how many training instances are within threshold distance of each test instance. Test instances that have one or more training neighbours are considered in-domain, and only those are used to compute the within-applicability domain accuracy (Acc). This function uses the output of getThreshold.
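The neighbour-counting step can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper name count_train_neighbours is hypothetical, and it assumes the thresholds are available as a plain numeric vector with one radius per training instance, which may differ from the layout of the data frame returned by getThreshold.

```r
# Hypothetical helper (not part of the package): for each test instance,
# count the training instances whose radius covers it.
count_train_neighbours <- function(testSet, trainingSet, radii) {
  testSet <- as.matrix(testSet)
  trainingSet <- as.matrix(trainingSet)
  stopifnot(ncol(testSet) == ncol(trainingSet),
            length(radii) == nrow(trainingSet))
  vapply(seq_len(nrow(testSet)), function(i) {
    # Euclidean distance from test instance i to every training instance
    diffs <- sweep(trainingSet, 2, testSet[i, ], "-")
    d <- sqrt(rowSums(diffs^2))
    # A training instance covers the test instance when radius >= distance
    sum(d <= radii)
  }, integer(1))
}

# Toy data: two training points with radii 1 and 0.5 (assumed values)
train_toy <- rbind(c(0, 0), c(3, 0))
radii_toy <- c(1, 0.5)
test_toy  <- rbind(c(0.5, 0), c(3, 0.4), c(10, 10))
nn_toy <- count_train_neighbours(test_toy, train_toy, radii_toy)
```

In this toy case the first two test points each fall inside exactly one training neighbourhood and the third falls inside none, so only the first two would count as in-domain.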
TestInTrain outputs a list with three elements:
TE.NNcount
A vector with the number of training nearest neighbours found within the threshold distance of each test instance
outAD
Number of test instances outside all threshold distances (i.e., outside of the applicability domain)
Acc
Within-applicability domain accuracy; the sum of in-domain performances divided by the number of in-domain test instances
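How the three elements relate to one another can be sketched as below. The helper summarise_domain is hypothetical and only mirrors the definitions given here; returning NA when no test instance is in-domain is an assumption, not documented behaviour of TestInTrain.

```r
# Hypothetical helper: derive outAD and Acc from neighbour counts and
# a 0/1 performance column, following the definitions above.
summarise_domain <- function(performance, nn_count) {
  perf <- as.numeric(unlist(performance))
  stopifnot(length(perf) == length(nn_count))
  in_domain <- nn_count >= 1  # at least one covering training neighbour
  list(
    TE.NNcount = nn_count,
    outAD = sum(!in_domain),
    # Assumption: Acc is NA when no test instance is in-domain
    Acc = if (any(in_domain)) sum(perf[in_domain]) / sum(in_domain) else NA_real_
  )
}

# Four test instances; the third has no training neighbour
res <- summarise_domain(data.frame(performance = c(1, 0, 1, 1)), c(2, 1, 0, 3))
res$outAD  # 1 instance outside the applicability domain
res$Acc    # 2 correct out of 3 in-domain instances
```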
library(randomForest)
library(mlbench)
data("BreastCancer")
#remove ID col
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]
#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))
#QSAR model; gather external predictions
# ntree was set to 1 on purpose, to challenge the RDN calculation with more error.
m<-randomForest(Class~.,data=bcdata[c(trainID),], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p<-predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate #class pred external
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)
ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])
# train ensemble for AD calculation
for (i in 2:11){
#sampling
samp <- sample(trainID, round(length(trainID)*0.8))
# train ensemble
m<-randomForest(Class~.,data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
pred<-predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
pred_class<-predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
P_AD <- pred$aggregate[,1] #prob train
class_AD <-pred_class$aggregate #class train
ensembleClass.t[,i]=class_AD
ensembleProb.t[,i]=P_AD
}
# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])
#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)
# Prepare descriptors to be passed into getRDN; the data frame should be numeric, not factor levels.
# Parentheses make the intent explicit: all columns except the last (Class).
train <- data.matrix(bcdata[trainID,1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID,1:(ncol(bcdata)-1)])
## calculate sorted distance matrix to training neighbours
Dij.sort <- getEDmatrix(train, train)
# data needs to be scaled as all functions rely on the Euclidean distance;
# maxs and mins have been implicitly
# produced from the previous line.
train <- scale(train, center = mins, scale = maxs - mins)
test <- scale(test, center = mins, scale = maxs - mins)
## Compute the coverage threshold corresponding to k=3 nearest neighbours
NNthreshold <- getThreshold(agree, std, train, k=3)
# Place test into train neighbourhoods
resultOutput <- TestInTrain(performance=performance, testSet=test, trainingSet=train, NNthreshold=NNthreshold)
# Collect results
test.NNcount <- resultOutput[[1]]
outADcount <- resultOutput[[2]]
Acc <- resultOutput[[3]]