getThreshold: Calculate the training neighbourhood threshold distances

Description

Calculates the neighbourhood threshold distance associated with each training instance. Each threshold is computed from the average distance to those nearest neighbours that lie within the overall average distance of all training instances to their respective kth nearest neighbour, where k is chosen by the user.
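The distance part of this description can be illustrated with a minimal sketch. It is one reading of the text above, not the package's implementation: the toy data and the names X, D_sorted, ref_dist and sketch_threshold are hypothetical, and the way agreementInput and STDInput enter the calculation is not shown.

```r
# Hypothetical illustration only; not the RDN package's implementation.
set.seed(1)
X <- matrix(runif(40), nrow = 20, ncol = 2)  # 20 toy instances, 2 scaled descriptors
k <- 3

# Pairwise Euclidean distances; each row sorted ascending, self-distance dropped.
D <- as.matrix(dist(X))
D_sorted <- t(apply(D, 1, function(d) sort(d)[-1]))

# Overall average distance of every instance to its kth nearest neighbour.
ref_dist <- mean(D_sorted[, k])

# Per-instance value: mean distance to the neighbours lying within ref_dist
# (set to 0 here when an instance has no neighbour within that radius;
# this fallback is an assumption of the sketch).
sketch_threshold <- apply(D_sorted, 1, function(d) {
  near <- d[d <= ref_dist]
  if (length(near) == 0) 0 else mean(near)
})

# Same (N,1) data frame shape as the documented return value.
thresholds <- data.frame(threshold = sketch_threshold)
```

Under this reading, instances in sparse regions have few or no neighbours within ref_dist, so their neighbourhood is small or empty.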

Usage

getThreshold(agreementInput, STDInput, trainingSet, k)

Arguments

agreementInput

A data frame (N,1) with the agreement measure from M ensemble models, for each instance in trainingSet.

STDInput

A data frame (N,1) with the ensemble standard deviation for each instance in trainingSet.

trainingSet

Data frame with the scaled descriptors of the training instances used to calculate NNthreshold. Scaling should be done with scale(trainingSet, center = mins, scale = maxs - mins), where mins and maxs are inherited from a previous call to getEDmatrix.

k

Number of nearest neighbours to account for when computing the neighbourhood distance.
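As a rough illustration of the scaling expected for trainingSet, the sketch below applies the same scale() call to a toy table. Here mins and maxs are computed locally for the sake of a self-contained example, whereas the text above states that they come from getEDmatrix; the column names d1 and d2 are hypothetical.

```r
# Toy descriptor table; column names are hypothetical.
trainingSet <- data.frame(d1 = c(1, 5, 9), d2 = c(10, 20, 40))

# Column-wise ranges, computed locally for this sketch only.
mins <- apply(trainingSet, 2, min)
maxs <- apply(trainingSet, 2, max)

# Min-max scaling: subtract each column's minimum, divide by its range,
# so every column is mapped onto [0, 1].
scaled <- scale(trainingSet, center = mins, scale = maxs - mins)
```

A constant column has maxs - mins equal to zero and would produce non-finite values, so such descriptors would need to be removed before scaling.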

Details

Agreement is the number of ensemble models whose predicted response matches the observed response, divided by the total number of models M in the ensemble: Agreement = \frac{|Obs \cap Pred|}{M}. The ensemble standard deviation is calculated according to Tetko et al. [2]. Both the agreementInput and STDInput data frames should be provided with dimensions (N,1). This function implicitly uses the object Dij.sort output by getEDmatrix, a data frame with the distances between each trainingSet instance and its training neighbours, sorted in ascending order of distance.
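The two inputs can be sketched on toy data as follows. The class labels, probabilities and object names (observed, predicted, probs) are hypothetical, and taking the row-wise sd() of the ensemble probabilities follows the Examples section rather than a verified reading of Tetko et al. [2].

```r
# Observed classes for N = 4 training instances (toy data).
observed <- c("benign", "malignant", "benign", "malignant")

# Predicted classes from M = 5 ensemble models, one column per model.
predicted <- cbind(
  m1 = c("benign", "malignant", "malignant", "malignant"),
  m2 = c("benign", "malignant", "benign",    "benign"),
  m3 = c("benign", "malignant", "benign",    "malignant"),
  m4 = c("benign", "benign",    "benign",    "malignant"),
  m5 = c("benign", "malignant", "malignant", "malignant")
)

# Agreement: fraction of the M models whose prediction matches the observation.
agreement <- rowMeans(predicted == observed)

# Predicted class probabilities from the same M models (toy values).
probs <- rbind(
  c(0.90, 0.95, 0.85, 0.90, 0.92),
  c(0.20, 0.10, 0.30, 0.60, 0.20),
  c(0.40, 0.80, 0.70, 0.90, 0.30),
  c(0.10, 0.70, 0.20, 0.10, 0.15)
)

# Ensemble standard deviation per instance (row-wise sample SD).
std <- apply(probs, 1, sd)

# Both inputs are passed as (N,1) data frames.
agreementInput <- data.frame(agreement = agreement)
STDInput <- data.frame(std = std)
```

Comparing a matrix with a length-N vector recycles the vector down each column, which is why the observed classes line up with the rows of predicted.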

Value

getThreshold returns a data frame with dimensions (N,1) with the threshold neighbourhood distances for each input instance.

References

[2] IV Tetko, I Sushko, et al. Critical Assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. Journal of Chemical Information and Modeling. 2008. 48(9):1733-46. doi:10.1021/ci800151m

Examples

library(randomForest)
library(mlbench)
data("BreastCancer")
#remove ID col
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]

#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))

#QSAR model; gather external predictions
# ntree was purposefully set to 1 to challenge the RDN calculation with more error.
m<-randomForest(Class~.,data=bcdata[c(trainID),], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p<-predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate #class pred external
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)

ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])

# train ensemble for AD calculation
for (i in 2:11){
  #sampling
  samp <- sample(trainID, round(length(trainID)*0.8))
  # train ensemble
  m<-randomForest(Class~.,data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
  pred<-predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
  pred_class<-predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
  P_AD <- pred$aggregate[,1] #prob train
  class_AD <-pred_class$aggregate #class train
  ensembleClass.t[,i]=class_AD
  ensembleProb.t[,i]=P_AD
}

# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
  agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])

#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)

# Prepare descriptors to be used (all columns except the last one, Class)
train <- data.matrix(bcdata[trainID,1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID,1:(ncol(bcdata)-1)])


## compute distance to neighbours
Dij.sort <- getEDmatrix(train, train)

# data need to be scaled, as all functions rely on the Euclidean distance;
# mins and maxs are inherited from the getEDmatrix call above
train <- scale(train, center = mins, scale = maxs - mins)

## Compute the threshold corresponding to k=3 nearest neighbours, using scaled data
NNthreshold <- getThreshold(agree, std, train, k=3)

machLearnNA/RDN documentation built on May 21, 2019, 10:51 a.m.