trainDI: Calculate Dissimilarity Index of training data

trainDI R Documentation

Calculate Dissimilarity Index of training data

Description

This function estimates the Dissimilarity Index (DI) within the training data set used for a prediction model. Predictors can be weighted based on the internal variable importance of the machine learning algorithm used for model training.

Usage

trainDI(
  model = NA,
  train = NULL,
  variables = "all",
  weight = NA,
  CVtest = NULL,
  CVtrain = NULL,
  method = "L2",
  useWeight = TRUE
)

Arguments

model

A train object created with caret. Weights (based on variable importance) and cross-validation folds are extracted from it.

train

A data.frame containing the data used for model training. Only required when no model is given

variables

Character vector of predictor variables. If "all", all variables of the model are used or, if no model is given, all variables of the train dataset.

weight

A data.frame containing weights for each variable. Only required if no model is given.

CVtest

List or vector. Either a list in which each element contains the data points used for testing during the respective cross-validation iteration (i.e. the held-back data), or a vector that contains the fold ID of each training point. Only required if no model is given.

CVtrain

List. Each element contains the data points used for training during the respective cross-validation iteration. Only required if no model is given and if CVtrain is not the complement of CVtest (by default, a data point that is not used for testing is used for training). Relevant if some data points are excluded, e.g. when using nndm.

method

Character. Method used for distance calculation. Currently Euclidean distance (L2) and Mahalanobis distance (MD) are implemented, but only L2 is tested. Note that MD takes considerably longer.

useWeight

Logical. Only used if a model is given. Should variables be weighted according to their importance in the model?
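
When no model is given, train, weight, CVtest and CVtrain have to be built by hand. The following base R sketch shows one way to prepare them. The data, fold assignment and weight values are made up for illustration, and the one-row layout of weight (one named column per variable) is an assumption about the expected format rather than something stated on this page.

```r
# Hypothetical training data with two predictors (values are made up)
train <- data.frame(
  DEM = c(10, 12, 15, 20, 22, 30),
  TWI = c(1.0, 1.5, 0.8, 2.1, 1.9, 0.5)
)

# weight: assumed layout is one row with one named column per predictor
weight <- data.frame(DEM = 2, TWI = 1)

# CVtest as a vector: the fold ID of every training point
fold_id <- c(1, 1, 2, 2, 3, 3)

# CVtest as a list: row indices held back for testing in each fold
CVtest <- split(seq_along(fold_id), fold_id)

# CVtrain only needs to be given when it is not the complement of CVtest,
# for example when point 6 should never be used for training
excluded <- 6L
CVtrain <- lapply(CVtest, function(test_idx) {
  setdiff(seq_len(nrow(train)), c(test_idx, excluded))
})
```

These objects would then be passed as trainDI(train = train, variables = c("DEM", "TWI"), weight = weight, CVtest = CVtest, CVtrain = CVtrain).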

Value

A list of class trainDI containing:

train

A data frame containing the training data

weight

A data frame with weights based on the variable importance.

variables

Names of the used variables

catvars

Which variables are categorical

scaleparam

Scaling parameters. Output from scale

trainDist_avrg

A data frame with the average distance of each training point to every other point

trainDist_avrgmean

The mean of trainDist_avrg. Used for normalizing the DI

trainDI

Dissimilarity Index of the training data

threshold

The DI threshold used to separate the inside from the outside of the area of applicability (AOA)
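
How these elements relate to each other can be illustrated with a minimal base R sketch. It follows the general idea in Meyer and Pebesma (2021) for the simplest case: unweighted predictors, Euclidean distance and no cross-validation folds. It is not the code of trainDI, the data are made up, and the whisker-based threshold rule is stated here as an assumption.

```r
# Hypothetical training predictors (values are made up)
train <- data.frame(
  DEM = c(10, 12, 15, 20, 22, 30),
  TWI = c(1.0, 1.5, 0.8, 2.1, 1.9, 0.5)
)

# Standardize the predictors; scale() keeps centre and spread as attributes
train_scaled <- scale(train)

# Pairwise Euclidean (L2) distances between all training points
d <- as.matrix(dist(train_scaled))
diag(d) <- NA  # a point is not compared with itself

# Average distance of each point to every other point, and the overall mean
trainDist_avrg <- rowMeans(d, na.rm = TRUE)
trainDist_avrgmean <- mean(trainDist_avrg)

# DI: distance to the nearest other training point, normalized by the mean
trainDist_min <- apply(d, 1, min, na.rm = TRUE)
trainDI <- trainDist_min / trainDist_avrgmean

# Assumed threshold rule: upper whisker of the training DI values,
# capped at the largest observed DI
threshold <- min(
  unname(quantile(trainDI, 0.75)) + 1.5 * IQR(trainDI),
  max(trainDI)
)
```

When cross-validation folds are available, the nearest-point search would be restricted to training points outside the fold of the point in question. This sketch leaves that step out.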

Note

This function is called within aoa to estimate the DI and AOA of new data. However, it may also be used on its own if only the DI of training data is of interest, or to facilitate a parallelization of aoa by avoiding a repeated calculation of the DI within the training data.
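
The reuse described in this note could look like the sketch below, where the training DI is computed once and then passed to every aoa call. The helper aoa_in_chunks and its aoa_fun argument are hypothetical names introduced for illustration. Only the aoa arguments model and trainDI are taken from this page, and it is assumed that aoa accepts each chunk as its new data.

```r
# Hypothetical helper: run aoa() on several chunks of new data while
# reusing one precomputed trainDI object instead of recalculating it.
# aoa_fun defaults to CAST::aoa and is only looked up when it is needed.
aoa_in_chunks <- function(chunks, model, DI, aoa_fun = CAST::aoa) {
  lapply(chunks, function(chunk) {
    aoa_fun(chunk, model = model, trainDI = DI)
  })
}
```

With a trained model and DI <- trainDI(model = model), a call such as aoa_in_chunks(list(chunk1, chunk2), model, DI) would return one aoa result per chunk (chunk1 and chunk2 being placeholder names). The lapply call can be swapped for a parallel equivalent such as parallel::parLapply.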

Author(s)

Hanna Meyer, Marvin Ludwig

References

Meyer, H., Pebesma, E. (2021): Predicting into unknown space? Estimating the area of applicability of spatial prediction models. doi:10.1111/2041-210X.13650

See Also

aoa

Examples

## Not run: 
library(CAST)
library(sf)
library(terra)
library(caret)
library(viridis)
library(latticeExtra)
library(ggplot2)

# prepare sample data:
dat <- get(load(system.file("extdata","Cookfarm.RData",package="CAST")))
dat <- aggregate(dat[,c("VW","Easting","Northing")],by=list(as.character(dat$SOURCEID)),mean)
pts <- st_as_sf(dat,coords=c("Easting","Northing"))
pts$ID <- 1:nrow(pts)
set.seed(100)
pts <- pts[1:30,]
studyArea <- rast(system.file("extdata","predictors_2012-03-25.grd",package="CAST"))[[1:8]]
trainDat <- extract(studyArea,pts,na.rm=FALSE)
trainDat <- merge(trainDat,pts,by.x="ID",by.y="ID")

# visualize data spatially:
plot(studyArea)
plot(studyArea$DEM)
plot(pts[,1],add=TRUE,col="black")

# train a model:
set.seed(100)
variables <- c("DEM","NDRE.Sd","TWI")
model <- train(trainDat[,which(names(trainDat)%in%variables)],
               trainDat$VW, method="rf", importance=TRUE, tuneLength=1,
               trControl=trainControl(method="cv",number=5,savePredictions=TRUE))
print(model) #note that this is a quite poor prediction model
prediction <- predict(studyArea,model,na.rm=TRUE)
plot(varImp(model,scale=FALSE))

# ...then calculate the DI of the trained model:
DI <- trainDI(model=model)
plot(DI)

# the DI can now be used to compute the AOA:
AOA <- aoa(studyArea, model = model, trainDI = DI)
print(AOA)
plot(AOA)

## End(Not run)


CAST documentation built on May 31, 2023, 7:07 p.m.