knitr::opts_chunk$set(echo = TRUE)
Provides R tools for producing threshold prey models for Calanus.
remotes::install_github("BigelowLab/calanusthreshold")
Convention
We prepend the package namespace to function not exposed by the
calanusthreshold
package. Bare function names (no package prepended) are used for base R functions and forcalanusthreshold
functions.
We provide an anonymized example dataset.
suppressPackageStartupMessages({ library(dplyr) library(ggplot2) library(calanusthreshold) library(tidymodels) }) x <- read_dataset() dplyr::glimpse(x)
If you have your own dataset just provide the filename and path, but be sure to see ?read_dataset
dcoumentation. Note that the variables with species names are abundance measurements by life stage.
x <- read_dataset("/path/to/input/data.csv")
We are modeling patches of the species which we define by excess abundance over user specified thresholds. For this example we define just one threshold...
THRESHOLD <- 10000
For the purpose of illustration, we lump all stages of a given species together.
Species <- calanusthreshold::known_species() abund <- lapply(Species, function(s) { dplyr::tibble(species = s, abund = calanusthreshold::lump_vars(x, vars = s, selector = dplyr::starts_with) |> dplyr::pull("lumped")) })|> dplyr::bind_rows() |> dplyr::mutate(patch = factor(findInterval(abund, THRESHOLD), levels = seq(from = 0, to = length(THRESHOLD)), labels = c("sparse", THRESHOLD))) |> dplyr::group_by(species) |> dplyr::glimpse()
We can plot the raw abundance by species group, note that we drop records where abundance is less than 0 for the sake of clarity. Note that C glacials generally has low patchiness when the threshold is held high.
ggplot2::ggplot(abund %>% dplyr::filter(abund > 0), ggplot2::aes(species, abund)) + ggplot2::geom_violin() + ggplot2::scale_y_log10() + ggplot2::labs(title = "Filtered input raw abundances (>0) by species", x = "Species", y = "Abundance (log10)") + ggplot2::geom_hline(yintercept = THRESHOLD, linetype = 'dashed', color = 'grey')
The tidymodels attempts to provide a uniform programmatic interface for myriad model functions/packages in R, including random forest models. The idea is to provide consistent argument interfaces, "engine" selection, and uniform model assessment metrics.
We are going to model just Calanus Finmarchicus
. First we prep the dataset - dropping non-essential variables (columns), log=scaling some variables and by aggregating (lumping) the various life-stages. The prep_dataset()
function includes a 'patch' label indicating that the abundance for that observation exceeded the threshold. We then split the data into a user specified proportion of training and testing groups.
x_split <- prep_dataset(x, lump_var = "Calanus finmarchicus", threshold = THRESHOLD) |> dplyr::relocate(patch, .before = 1) |> rsample::initial_split(prop = 0.6) x_split
Next we build up a series of 'recipe' steps that prepare the data. The prep
steps implements all of the other steps, including removal highly correlated variable (like roughness).
x_recipe <- rsample::training(x_split) |> recipes::recipe(patch ~.) |> recipes::step_corr(recipes::all_predictors()) |> recipes::step_center(recipes::all_predictors(), -recipes::all_outcomes()) |> recipes::step_scale(recipes::all_predictors(), -recipes::all_outcomes()) |> recipes::prep() x_recipe
Now we can 'bake' the training and testing datasets.
x_testing <- x_recipe |> recipes::bake(rsample::testing(x_split)) x_training <- x_recipe |> recipes::bake(rsample::training(x_split))
Now we build and fit a model. Note the we are accepting the defaults arguments for the ranger::ranger() function.
x_model <- parsnip::rand_forest(mode = "classification") |> parsnip::set_engine("ranger") |> parsnip::fit(patch ~ ., data = x_training) x_model
We can now predict the outcome if we pass in another dataset, in the case the testing data we reserved as 40% of the original. We'll bind the prediction to the testing data and reorganize for clarity. In the output patch
is out original (aka 'Truth') while .pred_class
is the predicted class (aka 'Predicted')
x_pred <- predict(x_model, x_testing) |> dplyr::bind_cols(x_testing) |> dplyr::relocate(patch, .before = 1) dplyr::glimpse(x_pred)
We can compute and examine a confusion matrix.
x_cm <- yardstick::conf_mat(x_pred, patch, .pred_class) x_cm
Or present graphically...
heat_map(x_cm, title = "Calanus finmarchicus")
Similar the the ease of computing a predicted class value, we can also produce predicted class probabilities. Here we
x_prob <- predict(x_model, x_testing, type = "prob") |> dplyr::bind_cols(x_testing) |> dplyr::relocate(patch, .before = 1) |> dplyr::glimpse()
From the probabilities for each class we can compute ACU on the ROC.
x_auc <- yardstick::roc_auc(x_prob, patch, .pred_1, event_level = 'second') |> dplyr::glimpse()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.