This script demonstrates how you can use a reconstruction convolutional autoencoder model to detect anomalies in timeseries data.
knitr::opts_chunk$set(fig.width = 9, fig.height = 6)
library(dplyr, warn.conflicts = FALSE) library(ggplot2) theme_set(theme_minimal()) library(listarrays) library(tfdatasets, exclude = c("shape")) library(keras3)
We will use the Numenta Anomaly Benchmark(NAB) dataset. It provides artificial timeseries data containing labeled anomalous periods of behavior. Data are ordered, timestamped, single-valued metrics.
We will use the art_daily_small_noise.csv
file for training and the
art_daily_jumpsup.csv
file for testing. The simplicity of this dataset
allows us to demonstrate anomaly detection.
get_data <- function(url_suffix) { url_root <- "https://raw.githubusercontent.com/numenta/NAB/master/data/" url <- paste0(url_root, url_suffix) file <- get_file(origin = url) # cache file locally # parse csv; 2 columns with types: datetime (T), double (d) readr::read_csv(file, col_types = "Td") } df_small_noise <- get_data("artificialNoAnomaly/art_daily_small_noise.csv") df_daily_jumpsup <- get_data("artificialWithAnomaly/art_daily_jumpsup.csv")
df_small_noise df_daily_jumpsup
We will use the following data for training.
plot_ts <- function(df) { ggplot(df, aes(x = timestamp, y = value)) + geom_line() + scale_x_datetime(date_breaks = "1 day", date_labels = "%b-%d") } plot_ts(df_small_noise) + ggtitle("Without Anomaly")
We will use the following data for testing and see if the sudden jump up in the data is detected as an anomaly.
plot_ts(df_daily_jumpsup) + ggtitle("With Anomaly")
Get data values from the training timeseries data file and normalize the
value
data. We have a value
for every 5 mins for 14 days.
df_train <- df_small_noise |> mutate(value = (value - mean(value)) / sd(value)) cat("Number of training samples:", nrow(df_train), "\n")
Create sequences combining TIME_STEPS
contiguous data values from the
training data.
TIME_STEPS <- 288 as_dataset <- function(df) { x <- as.matrix(df$value) ds <- timeseries_dataset_from_array(x, NULL, sequence_length = TIME_STEPS) # Because the dataset is small, cast TF Dataset to an R array for convenience. ds |> as_array_iterator() |> iterate() |> bind_on_rows() } x_train <- as_dataset(df_train) writeLines(sprintf("Training input shape: (%s)", toString(dim(x_train))))
We will build a convolutional reconstruction autoencoder model. The model will
take input of shape (batch_size, sequence_length, num_features)
and return
output of the same shape. In this case, sequence_length
is 288 and
num_features
is 1.
model <- keras_model_sequential(input_shape = c(TIME_STEPS, 1)) |> layer_conv_1d( filters = 32, kernel_size = 7, padding = "same", strides = 2, activation = "relu" ) |> layer_dropout(rate = 0.2) |> layer_conv_1d( filters = 16, kernel_size = 7, padding = "same", strides = 2, activation = "relu" ) |> layer_conv_1d_transpose( filters = 16, kernel_size = 7, padding = "same", strides = 2, activation = "relu" ) |> layer_dropout(rate = 0.2) |> layer_conv_1d_transpose( filters = 32, kernel_size = 7, padding = "same", strides = 2, activation = "relu" ) |> layer_conv_1d_transpose(filters = 1, kernel_size = 7, padding = "same") model |> compile(optimizer=optimizer_adam(learning_rate=0.001), loss="mse") model
Please note that we are using x_train
as both the input and the target
since this is a reconstruction model.
history = model |> fit( x_train, x_train, epochs = 50, validation_split = 0.1, callbacks = c( callback_early_stopping( monitor = "val_loss", patience = 5, mode = "min" ) ) )
Let's plot training and validation loss to see how the training went.
plot(history)
We will detect anomalies by determining how well our model can reconstruct the input data.
threshold
for anomaly
detection.threshold
value then we can infer that the model is seeing a pattern that it isn't
familiar with. We will label this sample as an anomaly
.# Get train MAE loss. x_train_pred <- model |> predict(x_train) train_mae_loss <- apply(abs(x_train_pred - x_train), 1, mean) hist(train_mae_loss, breaks = 50) # Get reconstruction loss threshold. threshold <- max(train_mae_loss) cat("Reconstruction error threshold: ", threshold, "\n")
Just for fun, let's see how our model has recontructed the first sample. This is the 288 timesteps from day 1 of our training dataset.
# Checking how the first sequence is learnt plot(NULL, NULL, ylab = 'Value', xlim = c(0, TIME_STEPS), ylim = range(c(x_train[1,,], x_train_pred[1,,]))) lines(x_train[1,,]) lines(x_train_pred[1,,], col = 'red') legend("topleft", lty = 1, legend = c("actual", "predicted"), col = c("black", "red"))
df_test <- df_daily_jumpsup |> mutate(value = (value - mean(df_small_noise$value)) / sd(df_small_noise$value)) df_test |> head() plot_ts(df_test) # Create sequences from test values. x_test <- as_dataset(df_test) # Get test MAE loss. x_test_pred <- model |> predict(x_test) test_mae_loss <- apply(abs(x_test_pred - x_test), 1, mean) hist(test_mae_loss, breaks = 50, xlab = "test MAE loss", ylab = "No of samples") # Detect all the samples which are anomalies. anomalies <- test_mae_loss > threshold cat("Number of anomaly samples:", sum(anomalies), "\n") cat("Indices of anomaly samples:", which(anomalies), "\n", fill = TRUE)
We now know the samples of the data which are anomalies. With this, we will
find the corresponding timestamps
from the original test data. We will be
using the following method to do that:
Let's say time_steps = 3 and we have 10 training values. Our x_train
will
look like this:
All except the initial and the final time_steps-1 data values, will appear in
time_steps
number of samples. So, if we know that the samples
[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point
5 is an anomaly.
Let's overlay the anomalies on the original test data plot.
is_anomaly <- test_mae_loss > threshold is_anomaly <- is_anomaly & zoo::rollsum(is_anomaly, TIME_STEPS, align = "right", na.pad = TRUE) >= TIME_STEPS with(df_test, { plot(value ~ timestamp, type = 'l', xaxt = 'n', las = 2) axis.POSIXct(1, at = seq(timestamp[1], tail(timestamp, 1), by = "days"), format = "%b-%d") }) with(df_test[which(is_anomaly),], { points(value ~ timestamp, col = "red") })
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.