In zsiders/EnsembleRandomForests: Ensemble Random Forests

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(raster)
library(viridisLite)

Overview

Running ERFs on a given dataset is easy. The function ens_random_forests() will take a given dataset in R data.frame format, amend it for modeling using erf_data_prep() and erf_formula_prep(), run each RF in the ensemble using rf_ens_fn(), and return a fitted ERF object. This object can then be passed to various output functions: erf_plotter() and ... to visualize and summarize.

First, we must load the R library.

library(EnsembleRandomForests)

Datasets

Using the provided simulated dataset

The provided dataset is a list object that contains a data.frame of the sampled locations, the beta coefficients of the logistic model used to predict the probability of occurrence, and a raster brick object containing the gridded covariates, log-odds of occurrence, and probabilities of occurrence.

# We can also visualize the covariates
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1))
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0),2,6,byrow=TRUE))
r <- range(cellStats(simData$grid[[1:5]],'range'))
for(i in 1:5){
  image(simData$grid[[i]], col=inferno(100), zlim = r, 
        xaxt='n', yaxt='n', xlab="", ylab="")
  title(paste0('Covariate ', i))
}

We can also see the beta coefficients that produced the probability of presence using the model below: $$\begin{equation} log\left[\frac{\hat{P}{obs=1}}{1-\hat{P}{obs=1}}\right] = \alpha + \beta_1X_1 + ... +\beta_nX_n \end{equation}$$

print(round(simData$betas,3))

# We can visualize the log-odds and the probability of presence
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1), mfrow=c(1,2))
image(simData$grid[[6]], col=inferno(100), xaxt='n', yaxt='n', xlab="", ylab="")
title("Log-odds")
image(simData$grid[[7]], col=viridis(100), xaxt='n', yaxt='n', xlab="", ylab="")
with(simData$samples[simData$samples$obs==1,],
     points(x,y,pch=16,col='white'))
title("Probability of Presence")

Running an Ensemble Random Forests model

Now that we have covered the datasets, let's run an ERF. This is simple using ens_random_forests.

ens_rf_ex <- ens_random_forests(df=simData$samples, var="obs",
                                covariates=grep("cov",colnames(simData$samples),value=T),
                                header = NULL,
                                save=FALSE,
                                out.folder=NULL,
                                duplicate = TRUE,
                                n.forests = 10L,
                                importance = TRUE,
                                ntree = 1000,
                                mtry = 5,
                                var.q = c(0.1,0.5,0.9),
                                cores = parallel::detectCores()-2)

The arguments to ens_random_forests are:

Data arguments:
df: this is the data.frame containing the presences/absences and the covariates
var: this is the column name of the presence/absence
covariates: these are the column names of the covariates to use. Here, we grabbed anything with "cov" in the column name
header: these are additional column names you may wish to append to data.frame produced internally
Output arguments:
save: this is a logical whether to save the model to the working directory or an optional out.folder directory
Control arguments:
duplicate: a logical flag to control whether to duplicate observations with more than one presence.
n.forests: this controls the number of forests to generate in the ensemble. See the optimization vignette for more information on tuning this parameter.
importance: a logical flag to calculate variable importance or not
ntree: number of trees in each Random Forests in the ensemble
mtry: number of covariates to try at each node in each tree in each Random Forests in the ensemble
var.q: quantiles for the distribution of the variable importance; only exectuted if importance=TRUE
cores: how many cores to run the model on.

We can look at some of the output produced by the random forests (see help(ens_random_forests) for a full list):

#view the dataset used in the model
head(ens_rf_ex$data) 

#view the ensemble model predictions
head(ens_rf_ex$ens.pred)

#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')])

#view the mean test threshold-free performance metrics for each RF
ens_rf_ex$mu.te.perf

#structure of the individual model predictions
str(ens_rf_ex$pred)

As we can see, the ensemble performs better than the mean test predictions. This is advantage of ERF over other RF modifications for extreme class imbalance. Siders et al. 2020 discusses the various performance of these other modifications if you are curious.

zsiders/EnsembleRandomForests documentation built on Oct. 8, 2024, 11:41 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com