knitr::opts_chunk$set( warning = FALSE, collapse = TRUE, comment = "#>", fig.width = 7.5, fig.height = 5 )
This page was entirely written by our AMR for R Assistant, a ChatGPT manually-trained model able to answer any question about the
AMR
package.
Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The AMR
R package provides robust tools for analysing AMR data, including convenient antimicrobial selector functions like aminoglycosides()
and betalactams()
.
In this post, we will explore how to use the tidymodels
framework to predict resistance patterns in the example_isolates
dataset in two examples.
This post contains the following examples:
By leveraging the power of tidymodels
and the AMR
package, we’ll build a reproducible machine learning workflow to predict the Gramstain of the microorganism to two important antibiotic classes: aminoglycosides and beta-lactams.
Our goal is to build a predictive model using the tidymodels
framework to determine the Gramstain of the microorganism based on microbial data. We will:
aminoglycosides()
and betalactams()
.tidymodels
workflow to preprocess, train, and evaluate the model.We begin by loading the required libraries and preparing the example_isolates
dataset from the AMR
package.
# Load required libraries library(AMR) # For AMR data analysis library(tidymodels) # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
Prepare the data:
# Your data could look like this: example_isolates # Select relevant columns for prediction data <- example_isolates %>% # select AB results dynamically select(mo, aminoglycosides(), betalactams()) %>% # replace NAs with NI (not-interpretable) mutate(across(where(is.sir), ~replace_na(.x, "NI")), # make factors of SIR columns across(where(is.sir), as.integer), # get Gramstain of microorganisms mo = as.factor(mo_gramstain(mo))) %>% # drop NAs - the ones without a Gramstain (fungi, etc.) drop_na()
Explanation:
aminoglycosides()
and betalactams()
dynamically select columns for antimicrobials in these classes.drop_na()
ensures the model receives complete cases for training.We now define the tidymodels
workflow, which consists of three steps: preprocessing, model specification, and fitting.
We create a recipe to preprocess the data for modelling.
# Define the recipe for data preprocessing resistance_recipe <- recipe(mo ~ ., data = data) %>% step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9) resistance_recipe
For a recipe that includes at least one preprocessing operation, like we have with step_corr()
, the necessary parameters can be estimated from a training set using prep()
:
prep(resistance_recipe)
Explanation:
recipe(mo ~ ., data = data)
will take the mo
column as outcome and all other columns as predictors.step_corr()
removes predictors (i.e., antibiotic columns) that have a higher correlation than 90%.Notice how the recipe contains just the antimicrobial selector functions - no need to define the columns specifically. In the preparation (retrieved with prep()
) we can see that the columns or variables r paste0("'", suppressMessages(prep(resistance_recipe))$steps[[1]]$removals, "'", collapse = " and ")
were removed as they correlate too much with existing, other variables.
We define a logistic regression model since resistance prediction is a binary classification task.
# Specify a logistic regression model logistic_model <- logistic_reg() %>% set_engine("glm") # Use the Generalised Linear Model engine logistic_model
Explanation:
logistic_reg()
sets up a logistic regression model.set_engine("glm")
specifies the use of R's built-in GLM engine.We bundle the recipe and model together into a workflow
, which organises the entire modelling process.
# Combine the recipe and model into a workflow resistance_workflow <- workflow() %>% add_recipe(resistance_recipe) %>% # Add the preprocessing recipe add_model(logistic_model) # Add the logistic regression model resistance_workflow
To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.
# Split data into training and testing sets set.seed(123) # For reproducibility data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing training_data <- training(data_split) # Training set testing_data <- testing(data_split) # Testing set # Fit the workflow to the training data fitted_workflow <- resistance_workflow %>% fit(training_data) # Train the model
Explanation:
initial_split()
splits the data into training and testing sets.fit()
trains the workflow on the training set.Notice how in fit()
, the antimicrobial selector functions are internally called again. For training, these functions are called since they are stored in the recipe.
Next, we evaluate the model on the testing data.
# Make predictions on the testing set predictions <- fitted_workflow %>% predict(testing_data) # Generate predictions probabilities <- fitted_workflow %>% predict(testing_data, type = "prob") # Generate probabilities predictions <- predictions %>% bind_cols(probabilities) %>% bind_cols(testing_data) # Combine with true labels predictions # Evaluate model performance metrics <- predictions %>% metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics metrics # To assess some other model properties, you can make our own `metrics()` function our_metrics <- metric_set(accuracy, kap, ppv, npv) # add Positive Predictive Value and Negative Predictive Value metrics2 <- predictions %>% our_metrics(truth = mo, estimate = .pred_class) # run again on our `our_metrics()` function metrics2
Explanation:
predict()
generates predictions on the testing set.metrics()
computes evaluation metrics like accuracy and kappa.It appears we can predict the Gram stain with a r round(metrics$.estimate[1], 3) * 100
% accuracy based on AMR results of only aminoglycosides and beta-lactam antibiotics. The ROC curve looks like this:
predictions %>% roc_curve(mo, `.pred_Gram-negative`) %>% autoplot()
In this post, we demonstrated how to build a machine learning pipeline with the tidymodels
framework and the AMR
package. By combining selector functions like aminoglycosides()
and betalactams()
with tidymodels
, we efficiently prepared data, trained a model, and evaluated its performance.
This workflow is extensible to other antimicrobial classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.
In this second example, we demonstrate how to use <mic>
columns directly in tidymodels
workflows using AMR-specific recipe steps. This includes a transformation to log2
scale using step_mic_log2()
, which prepares MIC values for use in classification models.
This approach and idea formed the basis for the publication DOI: 10.3389/fmicb.2025.1582703 to model the presence of extended-spectrum beta-lactamases (ESBL).
Our goal is to:
tidymodels
recipe.We use the esbl_isolates
dataset that comes with the AMR package.
# Load required libraries library(AMR) library(tidymodels) # View the esbl_isolates data set esbl_isolates # Prepare a binary outcome and convert to ordered factor data <- esbl_isolates %>% mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))
Explanation:
esbl_isolates
: Contains MIC test results and ESBL status for each isolate.mutate(esbl = ...)
: Converts the target column to an ordered factor for classification.We use our step_mic_log2()
function to log2-transform MIC values, ensuring that MICs are numeric and properly scaled. All MIC predictors can easily and agnostically selected using the new all_mic_predictors()
:
# Split into training and testing sets set.seed(123) split <- initial_split(data) training_data <- training(split) testing_data <- testing(split) # Define the recipe mic_recipe <- recipe(esbl ~ ., data = training_data) %>% remove_role(genus, old_role = "predictor") %>% # Remove non-informative variable step_mic_log2(all_mic_predictors()) #%>% # Log2 transform all MIC predictors # prep() mic_recipe
Explanation:
remove_role()
: Removes irrelevant variables like genus.step_mic_log2()
: Applies log2(as.numeric(...))
to all MIC predictors in one go.prep()
: Finalises the recipe based on training data.We use a simple logistic regression to model ESBL presence, though recent models such as xgboost (link to parsnip
manual) could be much more precise.
# Define the model model <- logistic_reg(mode = "classification") %>% set_engine("glm") model
Explanation:
logistic_reg()
: Specifies a binary classification model.set_engine("glm")
: Uses the base R GLM engine.# Create workflow workflow_model <- workflow() %>% add_recipe(mic_recipe) %>% add_model(model) workflow_model
# Fit the model fitted <- fit(workflow_model, training_data) # Generate predictions predictions <- predict(fitted, testing_data) %>% bind_cols(testing_data) # Evaluate model performance our_metrics <- metric_set(accuracy, kap, ppv, npv) metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class) metrics
Explanation:
fit()
: Trains the model on the processed training data.predict()
: Produces predictions for unseen test data.metric_set()
: Allows evaluating multiple classification metrics.It appears we can predict ESBL gene presence with a positive predictive value (PPV) of r round(metrics$.estimate[3], 3) * 100
% and a negative predictive value (NPV) of r round(metrics$.estimate[4], 3) * 100
using a simplistic logistic regression model.
We can visualise predictions by comparing predicted and actual ESBL status.
library(ggplot2) ggplot(predictions, aes(x = esbl, fill = .pred_class)) + geom_bar(position = "stack") + labs(title = "Predicted vs Actual ESBL Status", x = "Actual ESBL", y = "Count") + theme_minimal()
In this example, we showcased how the new AMR
-specific recipe steps simplify working with <mic>
columns in tidymodels
. The step_mic_log2()
transformation converts ordered MICs to log2-transformed numerics, improving compatibility with classification models.
This pipeline enables realistic, reproducible, and interpretable modelling of antimicrobial resistance data.
In this third example, we aim to predict antimicrobial resistance (AMR) trends over time using tidymodels
. We will model resistance to three antibiotics (amoxicillin AMX
, amoxicillin-clavulanic acid AMC
, and ciprofloxacin CIP
), based on historical data grouped by year and hospital ward.
Our goal is to:
tidymodels
to preprocess, train, and evaluate the model.We start by transforming the example_isolates
dataset into a structured time-series format.
# Load required libraries library(AMR) library(tidymodels) # Transform dataset data_time <- example_isolates %>% top_n_microorganisms(n = 10) %>% # Filter on the top #10 species mutate(year = as.integer(format(date, "%Y")), # Extract year from date gramstain = mo_gramstain(mo)) %>% # Get taxonomic names group_by(year, gramstain) %>% summarise(across(c(AMX, AMC, CIP), function(x) resistance(x, minimum = 0), .names = "res_{.col}"), .groups = "drop") %>% filter(!is.na(res_AMX) & !is.na(res_AMC) & !is.na(res_CIP)) # Drop missing values data_time
Explanation:
mo_name(mo)
: Converts microbial codes into proper species names.resistance()
: Converts AMR results into numeric values (proportion of resistant isolates).group_by(year, ward, species)
: Aggregates resistance rates by year and ward.We now define the modelling workflow, which consists of a preprocessing step, a model specification, and the fitting process.
# Define the recipe resistance_recipe_time <- recipe(res_AMX ~ year + gramstain, data = data_time) %>% step_dummy(gramstain, one_hot = TRUE) %>% # Convert categorical to numerical step_normalize(year) %>% # Normalise year for better model performance step_nzv(all_predictors()) # Remove near-zero variance predictors resistance_recipe_time
Explanation:
step_dummy()
: Encodes categorical variables (ward
, species
) as numerical indicators.step_normalize()
: Normalises the year
variable.step_nzv()
: Removes near-zero variance predictors.We use a linear regression model to predict resistance trends.
# Define the linear regression model lm_model <- linear_reg() %>% set_engine("lm") # Use linear regression lm_model
Explanation:
linear_reg()
: Defines a linear regression model.set_engine("lm")
: Uses R’s built-in linear regression engine.We combine the preprocessing recipe and model into a workflow.
# Create workflow resistance_workflow_time <- workflow() %>% add_recipe(resistance_recipe_time) %>% add_model(lm_model) resistance_workflow_time
We split the data into training and testing sets, fit the model, and evaluate performance.
# Split the data set.seed(123) data_split_time <- initial_split(data_time, prop = 0.8) train_time <- training(data_split_time) test_time <- testing(data_split_time) # Train the model fitted_workflow_time <- resistance_workflow_time %>% fit(train_time) # Make predictions predictions_time <- fitted_workflow_time %>% predict(test_time) %>% bind_cols(test_time) # Evaluate model metrics_time <- predictions_time %>% metrics(truth = res_AMX, estimate = .pred) metrics_time
Explanation:
initial_split()
: Splits data into training and testing sets.fit()
: Trains the workflow.predict()
: Generates resistance predictions.metrics()
: Evaluates model performance.We plot resistance trends over time for amoxicillin.
library(ggplot2) # Plot actual vs predicted resistance over time ggplot(predictions_time, aes(x = year)) + geom_point(aes(y = res_AMX, color = "Actual")) + geom_line(aes(y = .pred, color = "Predicted")) + labs(title = "Predicted vs Actual AMX Resistance Over Time", x = "Year", y = "Resistance Proportion") + theme_minimal()
Additionally, we can visualise resistance trends in ggplot2
and directly add linear models there:
ggplot(data_time, aes(x = year, y = res_AMX, color = gramstain)) + geom_line() + labs(title = "AMX Resistance Trends", x = "Year", y = "Resistance Proportion") + # add a linear model directly in ggplot2: geom_smooth(method = "lm", formula = y ~ x, alpha = 0.25) + theme_minimal()
In this example, we demonstrated how to analyze AMR trends over time using tidymodels
. By aggregating resistance rates by year and hospital ward, we built a predictive model to track changes in resistance to amoxicillin (AMX
), amoxicillin-clavulanic acid (AMC
), and ciprofloxacin (CIP
).
This method can be extended to other antibiotics and resistance patterns, providing valuable insights into AMR dynamics in healthcare settings.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.