knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  fig.align = "center"
)
devtools::load_all()
Forecasting error can often be reduced 20% to 50% by repairing anomalous data
We can often get better forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function into your forecast workflow.
library(tidyverse)
library(tidyquant)
library(anomalize)
library(timetk)

# NOTE: timetk now has anomaly detection built in, which
# will get the new functionality going forward.
# Use this script to prevent overwriting legacy anomalize:
anomalize <- anomalize::anomalize
plot_anomalies <- anomalize::plot_anomalies
Here is a short example with the tidyverse_cran_downloads dataset that comes with anomalize. We'll see how we can reduce the forecast error by 32% simply by repairing anomalies.
tidyverse_cran_downloads
Let's take one package with some extreme events. We can home in on lubridate, which has some outliers that we can fix.
tidyverse_cran_downloads %>%
  ggplot(aes(date, count, color = package)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ package, ncol = 3, scales = "free_y") +
  scale_color_viridis_d() +
  theme_tq()
Let's focus on downloads of the lubridate R package.
lubridate_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate")
First, we'll make a function, forecast_mae(), that can train on either the cleaned or the uncleaned series and calculate the forecast error against future uncleaned observations.
The modeling function uses the following criteria:
- Split the data into training and testing data that maintains the correct time-series sequence using the prop argument.
- Model the training data using the column specified by the col_train argument.
- Compare the predictions against the column specified by the col_test argument.

forecast_mae <- function(data, col_train, col_test, prop = 0.8) {

  predict_expr <- enquo(col_train)
  actual_expr  <- enquo(col_test)

  # Time-ordered split: the first `prop` share of rows is used for training
  idx_train <- seq_len(floor(prop * nrow(data)))

  train_tbl <- data %>% filter(row_number() %in% idx_train)
  test_tbl  <- data %>% filter(!row_number() %in% idx_train)

  # Model using training data
  model_formula <- as.formula(paste0(
    quo_name(predict_expr),
    " ~ index.num + year + quarter + month.lbl + day + wday.lbl"
  ))

  model_glm <- train_tbl %>%
    tk_augment_timeseries_signature() %>%
    glm(model_formula, data = .)

  # Make prediction
  suppressWarnings({
    # Suppress rank-deficient warning
    prediction <- predict(model_glm, newdata = test_tbl %>% tk_augment_timeseries_signature())
    actual     <- test_tbl %>% pull(!! actual_expr)
  })

  # Calculate MAE
  mae <- mean(abs(prediction - actual))

  return(mae)
}
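The split and error steps can be tried in isolation with base R. The following is a minimal sketch, not the vignette's model: toy_tbl is hypothetical data, and a simple linear trend fitted with lm() stands in for the GLM on the time-series signature.

```r
# Toy daily series (hypothetical data, for illustration only)
toy_tbl <- data.frame(
  date  = seq(as.Date("2020-01-01"), by = "day", length.out = 10),
  count = c(10, 12, 11, 13, 12, 14, 13, 15, 14, 16)
)

prop <- 0.8

# Time-ordered split: the first 80% of rows train, the remaining rows test
idx_train <- seq_len(floor(prop * nrow(toy_tbl)))
train_tbl <- toy_tbl[idx_train, ]
test_tbl  <- toy_tbl[-idx_train, ]

# Stand-in model: linear trend on the row index (an assumption, not the GLM above)
train_tbl$index <- idx_train
test_tbl$index  <- setdiff(seq_len(nrow(toy_tbl)), idx_train)
fit <- lm(count ~ index, data = train_tbl)

# Predict the held-out rows and compare against the actual values
prediction <- predict(fit, newdata = test_tbl)
actual     <- test_tbl$count

# Mean Absolute Error on the held-out rows
mae <- mean(abs(prediction - actual))
mae
```

Because the rows are split by position rather than sampled at random, every training date precedes every testing date, which is what keeps the time-series sequence intact.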
We will use the anomalize workflow of decomposing (time_decompose()) and identifying anomalies (anomalize()). We then use the function clean_anomalies() to add a new column called "observed_cleaned", which is repaired by replacing all anomalies with the trend + seasonal components from the decompose operation. We can now experiment to see the improvement in forecasting performance by comparing a forecast made with "observed" versus "observed_cleaned".
lubridate_anomalized_tbl <- lubridate_tbl %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  # Function to clean & repair anomalous data
  clean_anomalies()

lubridate_anomalized_tbl
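To make the repair rule concrete, here is a minimal base R sketch of the replacement described above, where flagged points are swapped for season + trend. The data frame decomposed_tbl and its values are hypothetical. The column names are assumed to mirror the decomposed and anomalized output, and this is an illustration of the idea rather than the package's implementation.

```r
# Hypothetical decomposed series; column names are assumed to mirror
# the output of time_decompose() followed by anomalize()
decomposed_tbl <- data.frame(
  observed = c(100, 105, 900, 110, 95),
  season   = c(-2, 3, 1, 4, -6),
  trend    = c(101, 102, 103, 104, 105),
  anomaly  = c("No", "No", "Yes", "No", "No")
)

# Replace flagged points with season + trend; keep all other points as observed
decomposed_tbl$observed_cleaned <- ifelse(
  decomposed_tbl$anomaly == "Yes",
  decomposed_tbl$season + decomposed_tbl$trend,
  decomposed_tbl$observed
)

decomposed_tbl
```

In this toy series only the third row is flagged, so its spike of 900 is replaced by 1 + 103 = 104 while the other rows are left unchanged.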
lubridate_anomalized_tbl %>% forecast_mae(col_train = observed, col_test = observed, prop = 0.8)
lubridate_anomalized_tbl %>% forecast_mae(col_train = observed_cleaned, col_test = observed, prop = 0.8)
This is approximately a 32% reduction in forecast error as measured by Mean Absolute Error (MAE).
(2755 - 4054) / 4054
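The same arithmetic can be written with named values so the direction of the change is explicit. This sketch assumes that 4054 and 2755 are the rounded MAE values from training on "observed" and on "observed_cleaned" respectively. The variable names are illustrative.

```r
# Rounded MAE values, assumed to come from the two forecast_mae() runs
mae_observed <- 4054  # trained on `observed`
mae_cleaned  <- 2755  # trained on `observed_cleaned`

# Relative change in error: a negative value means the error went down
relative_change <- (mae_cleaned - mae_observed) / mae_observed

# Express as a percentage, rounded to one decimal place
round(100 * relative_change, 1)
```

A negative result of about 32% corresponds to the reduction in forecast error quoted above.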
Business Science offers two 1-hour courses on Anomaly Detection:
Learning Lab 18 - Time Series Anomaly Detection with anomalize
Learning Lab 17 - Anomaly Detection with H2O Machine Learning