Calculates the neighbourhood threshold distance associated with each training instance. Each threshold is computed from the average distance to those nearest neighbours that lie within a reference distance, namely the average distance of all training instances to their respective kth nearest neighbour, where k is chosen by the user.
getThreshold(agreementInput, STDInput, trainingSet, k)
agreementInput
A data frame (N,1) with the agreement measure from M ensemble models, for each instance in
STDInput
A data frame (N,1) with the ensemble standard deviation for each instance in
trainingSet
Data frame with the scaled descriptors of the training instances used to calculate
k
Number of nearest neighbours to account for when computing the neighbourhood distance.
Agreement is the number of models in the ensemble whose predicted response matches the observed response, divided by the total number of models in the ensemble, M: Agreement = \frac{|Obs \cap Pred|}{M}. The ensemble standard deviation is calculated according to Tetko et al. [2]. Both the agreementInput and STDInput data frames should be provided with dimensions (N,1).
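As a minimal sketch of how the two inputs might be assembled, assume a toy ensemble of M = 4 models and N = 3 instances. The names obs, predClass and predProb are hypothetical and not part of the package, and the standard deviation is taken as the plain sample standard deviation across models, as in the example on this page.

```r
# Hypothetical toy data: observed classes for N = 3 instances
obs <- c("benign", "malignant", "benign")

# Hypothetical class predictions from M = 4 models (rows = instances)
predClass <- matrix(c("benign",    "benign",    "benign",    "malignant",
                      "malignant", "malignant", "malignant", "malignant",
                      "malignant", "malignant", "benign",    "malignant"),
                    nrow = 3, byrow = TRUE)

# Hypothetical predicted probabilities from the same M models
predProb <- matrix(c(0.9, 0.8, 0.7, 0.4,
                     0.1, 0.2, 0.1, 0.2,
                     0.4, 0.3, 0.6, 0.3),
                   nrow = 3, byrow = TRUE)

M <- ncol(predClass)

# Agreement: fraction of models whose prediction matches the observation
agreement <- data.frame(agreement = rowSums(predClass == obs) / M)

# Ensemble standard deviation of the predicted probabilities (assumption: sd())
std <- data.frame(std = apply(predProb, 1, sd))

dim(agreement)  # both data frames have dimensions (N, 1)
dim(std)
```

Both objects are single-column data frames with one row per instance, which is the (N,1) shape described above.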
This function implicitly uses the object Dij.sort output by getEDmatrix, which consists of a data frame with the distances between each trainingSet instance and its training neighbours (sorted in ascending order of distance).
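The distance-based part of the calculation can be illustrated with a rough base-R sketch. This is not the package implementation: the data X, the helper names sortedD and refDist, and the choice of 0 for instances with no neighbour inside the reference distance are all assumptions. How agreementInput and STDInput enter the calculation is not described here, so the sketch leaves them out.

```r
set.seed(1)

# Hypothetical scaled descriptors: 6 training instances, 2 descriptors
X <- matrix(runif(12), nrow = 6)
k <- 3

# Pairwise Euclidean distances between training instances
D <- as.matrix(dist(X))

# For each instance, distances to the other instances in ascending order
# (a stand-in for the sorted distance object described above)
sortedD <- t(sapply(seq_len(nrow(D)), function(i) unname(sort(D[i, -i]))))

# Reference distance: average distance to the k-th nearest neighbour
refDist <- mean(sortedD[, k])

# Per-instance threshold: mean distance to the neighbours within refDist
# (assumption: 0 when no neighbour lies inside the reference distance)
threshold <- apply(sortedD, 1, function(d) {
  inside <- d[d <= refDist]
  if (length(inside) == 0) 0 else mean(inside)
})
threshold <- data.frame(threshold)

dim(threshold)  # (N, 1)
```

Instances in dense regions have several neighbours inside the reference distance, while isolated instances have few or none, so their thresholds shrink accordingly.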
getThreshold returns a data frame with dimensions (N,1) with the threshold neighbourhood distances for each input instance.
[2] I. V. Tetko, I. Sushko, et al. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model. 2008, 48(9):1733-1746. doi:10.1021/ci800151m
library(randomForest)
library(mlbench)
data("BreastCancer")
#remove ID col
bcdata <- BreastCancer[,c(2:ncol(BreastCancer))]
bcdata <- bcdata[complete.cases(bcdata),]
#sample 70% for training
trainID <- sample(1:nrow(bcdata),round(nrow(bcdata)*0.7))
#QSAR model; gather external predictions
# ntree was purposefully set to 1 to challenge the RDN calculation with more error.
m<-randomForest(Class~.,data=bcdata[c(trainID),], keep.forest=TRUE, ntree=1, norm.votes=TRUE)
p<-predict(m, bcdata[-trainID,], predict.all=TRUE, type="class")
test_pred <- p$aggregate #class pred external
performance <- as.double(test_pred==bcdata[-trainID,"Class"])
performance <- data.frame(performance)
ensembleClass.t <- data.frame(bcdata$Class[trainID])
ensembleProb.t <- data.frame(bcdata$Class[trainID])
# train ensemble for AD calculation
for (i in 2:11){
#sampling
samp <- sample(trainID, round(length(trainID)*0.8))
# train ensemble
m<-randomForest(Class~.,data=bcdata[samp,], keep.forest=TRUE, ntree=100, norm.votes=TRUE)
pred<-predict(m, bcdata[trainID,], predict.all=TRUE, type="prob")
pred_class<-predict(m, bcdata[trainID,], predict.all=TRUE, type="class")
P_AD <- pred$aggregate[,1] #prob train
class_AD <-pred_class$aggregate #class train
ensembleClass.t[,i]=class_AD
ensembleProb.t[,i]=P_AD
}
# compute agreement for TRAIN
agree <- data.frame(trainID)
for (i in 1:length(trainID)){
agree[i,2] <- sum(ensembleClass.t[i,-1]==toString(ensembleClass.t[i,1]))/10
}
agree <- data.frame(agree[,-1])
#compute std for TRAIN
std <- apply(ensembleProb.t[,-1],1,sd)
std <- data.frame(std)
# Prepare descriptors to be used
# all columns except the last one (Class); parentheses make the range explicit
train <- data.matrix(bcdata[trainID,1:(ncol(bcdata)-1)])
test <- data.matrix(bcdata[-trainID,1:(ncol(bcdata)-1)])
#data needs to be scaled as all functions rely on the Euclidean distance
mins <- apply(train, 2, min)
maxs <- apply(train, 2, max)
train <- scale(train, center = mins, scale = maxs - mins)
## compute distance to neighbours on the scaled data
Dij.sort <- getEDmatrix(train, train)
## Compute the threshold corresponding to k=3 nearest neighbours, using scaled data
NNthreshold <- getThreshold(agree, std, train, k=3)