Reduce Forecast Error with Cleaned Anomalies
In anomalize: Tidy Anomaly Detection

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  fig.align = "center"
)

devtools::load_all()

Forecasting error can often be reduced by 20% to 50% by repairing anomalous data.

Example - Reducing Forecasting Error by 32%

We can often get better forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function into your forecast workflow.

library(tidyverse)
library(tidyquant)
library(anomalize)
library(timetk)

# NOTE: timetk now has anomaly detection built in and
#  will receive the new functionality going forward.
#  Pin the legacy anomalize functions so that timetk does not mask them:

anomalize <- anomalize::anomalize
plot_anomalies <- anomalize::plot_anomalies

Here is a short example with the tidyverse_cran_downloads dataset that comes with anomalize. We'll see how we can reduce the forecast error by 32% simply by repairing anomalies.

tidyverse_cran_downloads

Let's take one package with some extreme events. We can home in on lubridate, which has some outliers that we can fix.

tidyverse_cran_downloads %>%
  ggplot(aes(date, count, color = package)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ package, ncol = 3, scales = "free_y") +
  scale_color_viridis_d() +
  theme_tq()

Forecasting Lubridate Downloads

Let's focus on downloads of the lubridate R package.

lubridate_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate")

First, we'll make a function, forecast_mae(), that can train on either the uncleaned or the cleaned series and then calculates the forecast error against future, uncleaned observations.

The modeling function works as follows:

It splits the data into training and testing sets that preserve the time-series order, with the split point set by the prop argument.
It models the daily time series of the training set using either observed (no cleaning) or observed_cleaned (to demonstrate the improvement from cleaning), as specified by the col_train argument.
It compares the predictions to the observed values in the column specified by the col_test argument.

forecast_mae <- function(data, col_train, col_test, prop = 0.8) {

  predict_expr <- enquo(col_train)
  actual_expr <- enquo(col_test)

  idx_train <- seq_len(floor(prop * nrow(data)))

  train_tbl <- data %>% filter(row_number() %in% idx_train)
  test_tbl  <- data %>% filter(!row_number() %in% idx_train)

  # Fit the model on the training data only
  model_formula <- as.formula(paste0(quo_name(predict_expr), " ~ index.num + year + quarter + month.lbl + day + wday.lbl"))

  model_glm <- train_tbl %>%
    tk_augment_timeseries_signature() %>%
    glm(model_formula, data = .)

  # Make Prediction
  suppressWarnings({
    # Suppress rank-deficient fit warning
    prediction <- predict(model_glm, newdata = test_tbl %>% tk_augment_timeseries_signature()) 
    actual     <- test_tbl %>% pull(!! actual_expr)
  })

  # Calculate MAE
  mae <- mean(abs(prediction - actual))

  return(mae)

}
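
To make the time-ordered split and the error metric concrete, here is a minimal base-R sketch on a toy series. The data values and the naive mean forecast are assumptions made purely for illustration; they stand in for the lubridate data and the glm() model used by forecast_mae().

```r
# Toy daily series (hypothetical values, for illustration only)
toy_tbl <- data.frame(
  date  = as.Date("2020-01-01") + 0:9,
  count = c(10, 12, 11, 13, 12, 14, 13, 15, 14, 16)
)

# Same split rule as forecast_mae(): the first `prop` share of rows trains,
# the remaining (most recent) rows test, so time order is preserved
prop      <- 0.8
idx_train <- seq_len(floor(prop * nrow(toy_tbl)))

train_tbl <- toy_tbl[idx_train, ]
test_tbl  <- toy_tbl[-idx_train, ]

# Naive stand-in model: predict the training mean for every test day
prediction <- rep(mean(train_tbl$count), nrow(test_tbl))

# Mean Absolute Error between predictions and held-out observations
mae <- mean(abs(prediction - test_tbl$count))
mae
```

Because the split is by row position rather than random sampling, every test observation is later than every training observation, which is what keeps the evaluation honest for a forecast.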

Workflow for Cleaning Anomalies

We will use the anomalize workflow of decomposing (time_decompose()) and identifying anomalies (anomalize()). We then use clean_anomalies() to add a new column called "observed_cleaned", in which every anomaly is replaced with the trend + seasonal components from the decomposition. We can now measure the improvement in forecasting performance by comparing a forecast made with "observed" against one made with "observed_cleaned".

lubridate_anomalized_tbl <- lubridate_tbl %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%

  # Function to clean & repair anomalous data
  clean_anomalies()

lubridate_anomalized_tbl
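
The repair rule described above can be sketched on a toy table. This is an illustration of the idea, not the package's internal code: the values are hypothetical, and the column names (observed, season, trend, anomaly) are assumed to match what time_decompose() and anomalize() return.

```r
library(dplyr)

# Toy decomposition output (hypothetical values, for illustration only)
toy_tbl <- tibble(
  observed = c(100, 105, 400, 98),
  season   = c(2, -1, 3, -2),
  trend    = c(99, 100, 101, 102),
  anomaly  = c("No", "No", "Yes", "No")
)

# Replace flagged points with season + trend; keep all other points as observed
toy_cleaned_tbl <- toy_tbl %>%
  mutate(
    observed_cleaned = if_else(anomaly == "Yes", season + trend, observed)
  )

toy_cleaned_tbl
```

Only the row flagged "Yes" changes: the spike is replaced by the value the seasonal and trend components imply for that day, while normal observations pass through untouched.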

Before Cleaning with anomalize

lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed, col_test = observed, prop = 0.8)

After Cleaning with anomalize

lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed_cleaned, col_test = observed, prop = 0.8)

32% Reduction in Forecast Error

This is approximately a 32% reduction in forecast error as measured by Mean Absolute Error (MAE).

(2755 - 4054) / 4054
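
The same arithmetic can be written with named values so the direction of the change is explicit. The variable names here are hypothetical; the two numbers are the MAE values used in the calculation above.

```r
# MAE values taken from the calculation above
mae_before <- 4054  # model trained on `observed`
mae_after  <- 2755  # model trained on `observed_cleaned`

# Relative change in MAE; a negative value means the error went down
pct_change <- (mae_after - mae_before) / mae_before

# Express as a percentage, rounded to one decimal place
round(100 * pct_change, 1)
```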

Interested in Learning Anomaly Detection?

Business Science offers two 1-hour courses on Anomaly Detection:

Learning Lab 18 - Time Series Anomaly Detection with anomalize
Learning Lab 17 - Anomaly Detection with H2O Machine Learning

anomalize documentation built on Nov. 2, 2023, 5:13 p.m.
