TestInTrain: Determine if new queries fall inside training neighbourhood

Description

Searches for training instances whose neighbourhood radius covers new (test) instances, i.e., whose radius is equal to or larger than the distance to the new instance. Test instances that fall within at least one training neighbourhood are used to calculate the in-domain accuracy.

Usage

TestInTrain(performance, testSet, trainingSet, NNthreshold)

Arguments

performance

Data frame of dimensions (N,1) with 1 for correctly predicted instances in testSet, and 0 otherwise.

testSet

Data frame with the scaled descriptors of the new instances to be tested against trainingSet. Scaling should be done with scale(testSet, center = mins, scale = maxs - mins), where mins and maxs are produced by a previous call to getEDmatrix.

trainingSet

Data frame with the scaled descriptors of the training instances used to calculate NNthreshold. Scaling should be done with scale(trainingSet, center = mins, scale = maxs - mins), where mins and maxs are produced by a previous call to getEDmatrix.

NNthreshold

Data frame with threshold distances output by getThreshold.
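
The scaling required for testSet and trainingSet can be sketched in base R as follows. This is only an illustration on made-up data: here mins and maxs are computed by hand as the per-descriptor range of the training set, which is an assumption about what getEDmatrix provides.

```r
# Toy descriptor matrices (hypothetical data, for illustration only)
train_raw <- matrix(c(1, 5, 9,
                      10, 20, 30), ncol = 2,
                    dimnames = list(NULL, c("d1", "d2")))
test_raw  <- matrix(c(3, 25), ncol = 2,
                    dimnames = list(NULL, c("d1", "d2")))

# Assumption: mins and maxs are the per-descriptor minimum and maximum
# of the training set (in the package they come from getEDmatrix)
mins <- apply(train_raw, 2, min)
maxs <- apply(train_raw, 2, max)

# Both sets are scaled with the SAME training-derived mins and maxs
train_scaled <- scale(train_raw, center = mins, scale = maxs - mins)
test_scaled  <- scale(test_raw,  center = mins, scale = maxs - mins)
```

Scaling the test set with the training-derived values keeps both sets in the same coordinate system, so Euclidean distances between them are comparable.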

Details

This function computes the Euclidean distance between each test instance and every training instance, and determines how many training instances are within the threshold distance of each test instance. Test instances with one or more training neighbours are considered in-domain, and only those are used to compute the in-applicability-domain accuracy (Acc). This function uses the output of getThreshold.
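
The procedure described above can be sketched in base R. This is not the package's implementation: count_covering_neighbours is a hypothetical helper, the data are made up, and the thresholds are assumed to be a plain numeric vector with one radius per training instance (the actual structure of the getThreshold output may differ).

```r
# Hypothetical helper, NOT the RDN implementation: for each test instance,
# count the training instances whose threshold radius covers it.
count_covering_neighbours <- function(test, train, radius) {
  test  <- as.matrix(test)
  train <- as.matrix(train)
  vapply(seq_len(nrow(test)), function(i) {
    # Euclidean distance from test instance i to every training instance
    d <- sqrt(rowSums(sweep(train, 2, test[i, ])^2))
    # A training instance covers the test instance if its radius >= distance
    sum(d <= radius)
  }, integer(1))
}

# Toy data (assumed already scaled)
train  <- rbind(c(0, 0), c(1, 1))
radius <- c(0.5, 0.2)            # assumed: one threshold per training instance
test   <- rbind(c(0.3, 0), c(1, 0.9), c(0.5, 0.5))
performance <- c(1, 0, 1)        # 1 = correctly predicted, 0 = otherwise

nn_count  <- count_covering_neighbours(test, train, radius)
in_domain <- nn_count >= 1       # at least one covering training neighbour
out_count <- sum(!in_domain)     # test instances outside the domain
acc       <- mean(performance[in_domain])  # accuracy over in-domain instances only
```

In this toy case the third test instance lies outside both training neighbourhoods, so it is excluded from the accuracy calculation.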

Value

TestInTrain outputs a list with three elements, retrieved in order in the example below as test.NNcount, outADcount and Acc.

Examples

library(RDN)          # provides getEDmatrix, getThreshold and TestInTrain
library(randomForest)
library(mlbench)
data("BreastCancer")
#remove ID col
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]

#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))

# QSAR model; gather external predictions
# ntree was purposefully set to 1 to challenge the RDN calculation with more error.
m <- randomForest(Class~., data=bcdata[trainID,], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p <- predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate # class predictions for the external (test) set
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)

ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])

# train ensemble for AD calculation
for (i in 2:11){
  # sample 80% of the training instances
  samp <- sample(trainID, round(length(trainID)*0.8))
  # train one ensemble member
  m <- randomForest(Class~., data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
  pred <- predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
  pred_class <- predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
  P_AD <- pred$aggregate[,1] # probability of the first class on the training set
  class_AD <- pred_class$aggregate # class predictions on the training set
  ensembleClass.t[,i] <- class_AD
  ensembleProb.t[,i] <- P_AD
}

# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
  agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])

#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)

# Prepare descriptors to be passed into getRDN; the data should be numeric, not factor levels.
# Parentheses are needed: 1:ncol(bcdata)-1 would evaluate as (1:ncol(bcdata))-1.
train <- data.matrix(bcdata[trainID,1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID,1:(ncol(bcdata)-1)])

## calculate sorted distance matrix to training neighbours
Dij.sort <- getEDmatrix(train, train)

# Data need to be scaled because all functions rely on the Euclidean distance;
# maxs and mins were implicitly produced by the getEDmatrix call above.
train <- scale(train, center = mins, scale = maxs - mins)
test <- scale(test, center = mins, scale = maxs - mins)

## Compute the coverage threshold corresponding to k=3 nearest neighbours
NNthreshold <- getThreshold(agree, std, train, k=3)

# Place test instances into training neighbourhoods
resultOutput <- TestInTrain(performance=performance, testSet=test, trainingSet=train, NNthreshold=NNthreshold)

# Collect results
test.NNcount <- resultOutput[[1]]
outADcount <- resultOutput[[2]]
Acc <- resultOutput[[3]]

machLearnNA/RDN documentation built on May 21, 2019, 10:51 a.m.