In BigelowLab/calanusthreshold: calanusthreshold

Learning to navigate tidymodels through a nice easy walk through using the example dataset.

We define a grazeable patch where abundance is >= 10000.

suppressPackageStartupMessages({
  library(dplyr)
  library(calanusthreshold)
  library(tidymodels)
})

THRESHOLD <- 10000

We load the example dataset, remove variables we don't need, drop incomplete rows, aggregate the species of interest as a new var called patch, and record patch into levels. Note that we drop sea ice (thickness and concentration) because they are missing in 93% of the original dataset.

x <- calanusthreshold::read_dataset() |>
  calanusthreshold::prep_dataset(drop_var = c("Calanus glacialis", 
                                       "Calanus hyperboreus",
                                       "geometry",
                                       "longitude",
                                       "latitude",
                                       "station",
                                       "year",
                                       "siconc",
                                       "sithick"),
                                 lump_var = "Calanus finmarchicus",
                                 log_var = c("bathymetry", "chlor_a"),
                                 newname = "patch",
                                 threshold = THRESHOLD) |>
  dplyr::glimpse()

Divide into test/train datasets (essential a list with two tibbles)

x_split <- rsample::initial_split(x, prop = 0.6)

Prep the data. Notes are from help for each function - there are quite a few different step_* functions.

step_corr creates a specification of a recipe step that will potentially remove variables that have large absolute correlations with other variables.
step_center creates a specification of a recipe step that will normalize numeric data to have a mean of zero.
step_scale creates a specification of a recipe step that will normalize numeric data to have a standard deviation of one.

x_recipe <- rsample::training(x_split) |>
  recipes::recipe(patch ~.) |>
  recipes::step_corr(recipes::all_predictors()) |>
  recipes::step_center(recipes::all_predictors(), -recipes::all_outcomes()) |>
  recipes::step_scale(recipes::all_predictors(), -recipes::all_outcomes()) |>
  recipes::prep()

Apply the recipe! See ?recipes::bake

For a recipe with at least one preprocessing operation that has been trained by prep.recipe(), apply the computations to new data.

The outputs are tibbles with data transformed by

x_testing <- x_recipe |>
  recipes::bake(rsample::testing(x_split)) 

x_training <- x_recipe |>
  recipes::bake(rsample::training(x_split))

Now we make a model - choosing the engine to create the random forest model. This is where tidymodels has put in a lot of work to provide one consistent interface to the myriad random forest implementations available. I am chosing ranger which is also the default.

x_model <- parsnip::rand_forest(trees = 100, mode = "classification") |>
  parsnip::set_engine("ranger") |>
  parsnip::fit(patch ~ ., data = x_training)
x_model

The model can be saved to disk and restored later.

save_model(x_model, filename = "my_model.Rdata", overwrite = TRUE)
y_model <- read_model(filename = "my_model.Rdata")

Now we can use the model to predict our testing data, and then comapre the predictions to "truth" which we store in variable "patch". Below we make the prediction, which is stored a .pred_class, bind with the test data table and move the patch column to the first position for clarity.

x_pred <-  predict(x_model, x_testing) |>
  dplyr::bind_cols(x_testing) |>
  dplyr::relocate(patch, .before = 1)
dplyr::glimpse(x_pred)

We can compute basic test metrics:

accuracy - the proportion of the data that are predicted correctly,
kappa statistic - similar measure to accuracy, but normalized by the accuracy that would be expected by chance alone.

x_pred |>
  yardstick::metrics(truth = patch, estimate = .pred_class)

Next we can recompute the prediction but include probabilities.

x_prob <-  predict(x_model, x_testing, type = "prob") |>
  dplyr::bind_cols(x_testing) |>
  dplyr::relocate(patch, .before = 1)
glimpse(x_prob)

Finally, we can compute the AUC from the receiver-operator curve. Note that our patch variable is out truth, and the probabilities to examine are .pred_1 (as opposed to .pred_0). In the absence of any other information, roc_auc has no way to determine if .pred_1 predicts the first of second factor of patch. (Recall that we have only two classes: no-patch (=0) and patch (=1). By default is assumes the first factor is the 'event' we are trying to process which is not correct. So, to correct this wew explicitly state the event of interest is the second.

x_auc <- yardstick::roc_auc(x_prob, patch, .pred_1, event_level = 'second') |>
  glimpse()

And now plot...

x_roc <- yardstick::roc_curve(x_prob, truth = patch, .pred_1, event_level = "second")
print(autoplot(x_roc) + 
  ggplot2::labs(title = "ROC-AUC") + 
  ggplot2::annotate("text", x = 0, y = 1, hjust = 0,
                    label = sprintf("AUC: %0.3f", x_auc$.estimate[1])))