knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(raster)
library(viridisLite)

Overview

Running Ensemble Random Forests (ERFs) on a given dataset is easy. The function ens_random_forests() takes a dataset in R data.frame format, prepares it for modeling using erf_data_prep() and erf_formula_prep(), runs each RF in the ensemble using rf_ens_fn(), and returns a fitted ERF object. This object can then be passed to various output functions, erf_plotter() and ..., to visualize and summarize the results.

First, we must load the R package.

library(EnsembleRandomForests)

Datasets

Using the provided simulated dataset

The provided dataset, simData, is a list object that contains a data.frame of the sampled locations (simData$samples), the beta coefficients of the logistic model used to predict the probability of occurrence (simData$betas), and a raster brick object (simData$grid) containing the gridded covariates (layers 1 to 5), the log-odds of occurrence (layer 6), and the probabilities of occurrence (layer 7).

# Visualize the five covariates on a common color scale
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1))
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0),2,6,byrow=TRUE))
r <- range(cellStats(simData$grid[[1:5]],'range'))
for(i in 1:5){
  image(simData$grid[[i]], col=inferno(100), zlim = r, 
        xaxt='n', yaxt='n', xlab="", ylab="")
  title(paste0('Covariate ', i))
}

We can also see the beta coefficients that produced the probability of presence using the model below: $$\begin{equation} \log\left[\frac{\hat{P}_{obs=1}}{1-\hat{P}_{obs=1}}\right] = \alpha + \beta_1X_1 + \dots + \beta_nX_n \end{equation}$$

print(round(simData$betas,3))
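
To make the link between coefficients, log-odds, and probabilities concrete, here is a minimal base R sketch of the model above. The intercept, coefficients, and covariate values are made up for illustration and are not the values stored in simData$betas or simData$grid; plogis() and qlogis() are the base R inverse logit and logit.

```r
# Hypothetical intercept and coefficients (NOT the values stored in simData$betas)
alpha <- -2
beta  <- c(0.8, -0.5, 1.2)

# Hypothetical covariate values: 4 locations (rows) x 3 covariates (columns)
X <- matrix(c( 0.1,  1.0, -0.3,
               0.5, -0.2,  0.7,
              -1.0,  0.4,  0.0,
               2.0,  0.3,  1.5),
            nrow = 4, byrow = TRUE)

# Linear predictor: log-odds of presence at each location
log_odds <- as.vector(alpha + X %*% beta)

# Inverse logit turns log-odds into probabilities in (0, 1)
p_hat <- plogis(log_odds)

# qlogis() is the logit, so it recovers the log-odds from the probabilities
round(cbind(log_odds, p_hat, logit_p = qlogis(p_hat)), 3)
```

The last column should match the first, which is the same relationship that links the log-odds layer and the probability layer of the simulated grid.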

# We can visualize the log-odds and the probability of presence
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1), mfrow=c(1,2))
image(simData$grid[[6]], col=inferno(100), xaxt='n', yaxt='n', xlab="", ylab="")
title("Log-odds")
image(simData$grid[[7]], col=viridis(100), xaxt='n', yaxt='n', xlab="", ylab="")
with(simData$samples[simData$samples$obs==1,],
     points(x,y,pch=16,col='white'))
title("Probability of Presence")

Running an Ensemble Random Forests model

Now that we have covered the datasets, let's run an ERF. This takes a single call to ens_random_forests().

ens_rf_ex <- ens_random_forests(df=simData$samples, var="obs",
                                covariates=grep("cov",colnames(simData$samples),value=TRUE),
                                header = NULL,
                                save=FALSE,
                                out.folder=NULL,
                                duplicate = TRUE,
                                n.forests = 10L,
                                importance = TRUE,
                                ntree = 1000,
                                mtry = 5,
                                var.q = c(0.1,0.5,0.9),
                                # leave two cores free, but never request fewer than one
                                cores = max(1L, parallel::detectCores()-2L, na.rm=TRUE))
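
The covariates argument in the call above is built by pattern-matching column names with grep(). The sketch below shows how that selection behaves on a toy data.frame. The column names here are hypothetical and are not claimed to be the actual columns of simData$samples.

```r
# Toy data.frame with hypothetical column names
toy <- data.frame(x = 1:3, y = 4:6, obs = c(0, 1, 0),
                  cov1 = rnorm(3), cov2 = rnorm(3), discovery = rnorm(3))

# value = TRUE returns the matching names rather than their positions;
# an unanchored pattern also picks up "discovery"
loose <- grep("cov", colnames(toy), value = TRUE)
loose

# Anchoring the pattern keeps only names that start with "cov"
covs <- grep("^cov", colnames(toy), value = TRUE)
covs
```

If a dataset has other columns containing the string "cov", an anchored pattern such as "^cov" is the safer choice.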

The arguments to ens_random_forests() are described in help(ens_random_forests).

We can look at some of the output produced by the random forests (see help(ens_random_forests) for a full list):

#view the dataset used in the model
head(ens_rf_ex$data) 

#view the ensemble model predictions
head(ens_rf_ex$ens.pred)

#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')])

#view the mean test threshold-free performance metrics for each RF
ens_rf_ex$mu.te.perf

#structure of the individual model predictions
str(ens_rf_ex$pred)

As we can see, the ensemble performs better than the mean test performance of the individual RFs. This is the advantage of ERF over other RF modifications for extreme class imbalance. Siders et al. (2020) discuss the performance of these other modifications if you are curious.
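
The intuition that averaging helps can be illustrated without fitting any forests. The sketch below is a toy simulation in base R, not the ERF algorithm: the sample size, the number of members, and the noise model are all assumptions chosen for illustration. Each "member" is the true probability perturbed by independent noise, and the averaged prediction is compared against the mean performance of the members.

```r
set.seed(42)

n_obs     <- 500   # hypothetical number of test observations
n_members <- 10    # hypothetical number of ensemble members

# Simulated "true" probabilities (rare presences) and binary outcomes
p_true <- plogis(rnorm(n_obs, mean = -2))
obs    <- rbinom(n_obs, size = 1, prob = p_true)

# Each member's prediction is the truth plus independent noise on the logit scale
preds <- sapply(seq_len(n_members), function(i) {
  plogis(qlogis(p_true) + rnorm(n_obs, sd = 1))
})

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Error of each member versus error of the averaged prediction
rmse_each <- apply(preds, 2, rmse, obs = obs)
rmse_ens  <- rmse(rowMeans(preds), obs)

c(mean_member_rmse = mean(rmse_each), ensemble_rmse = rmse_ens)
```

For RMSE, the averaged prediction can never do worse than the mean of the member RMSEs (this follows from the triangle inequality). Other metrics, such as AUC, carry no such guarantee, so the comparison in the output above is still worth checking for each fitted model.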



zsiders/EnsembleRandomForests documentation built on Oct. 8, 2024, 11:41 p.m.