In Dorialexander/UmbertoR: Detection of anomalous events in web logs data.

knitr::opts_chunk$set(echo = F,
                      results = "hide")

As a demonstration you can load the sample dataset "log_data". It contains the aggregated logs at the day-level of 200 articles from Open Edition from January 1st, 2017.

library(tidyverse)
library(UmbeRto)

data("log_data")
log_data

The dataset records three variable: an alias of the documents (Document n°1, Document n°2, etc.), the day of the recording in a date format and the number of unique visitors.

Cleaning the dataset

The function um_clean transform any set of logs aggregated by documents and by a time variable into a dataset that can be used by um_model. The function uses non-standard evaluation, so any column name can be used as an input. All alternative naming schemes are ok so long as they match existing relevant columns in the dataset.

#Cleaning the dataset using the current columns of log_data
um_clean(log_data, document, time_variable, count_log)

#Using an alternative naming scheme.
log_data_alt = log_data %>% rename(name = document, server_time = time_variable, count_visitor = count_log)

#This works as well:
um_clean(log_data_alt, name, server_time, count_visitor)

um_clean outputs a normalized dataset that can be used by um_model. You may notice that the name of the column at the output remains identical regardless of the naming scheme at the input. Besides, the time variable has been supplemented with a standardized time sequency that starts at the first apparition of the document in the logs.

UmbeRto has been developed for day-level values but should normally perform with other time units.

UmbeRto works usually better with unique visitors since multiple visits from the same visitor can be wrongly accounted as a sudden "surge" in traffic.

Modeling the anomalies

The function um_model takes a dataset normalized by um_clean to identify the anomalous event for each time series associated to the document. The underlying RCS model can be customized by adding more "knots", that is more precision to the model at the expense of added complexity and more risks of overfitting and missing anomalous events. It is also possible to change the min number of observations necessary to integrate a time sequence into the model.

cleaned_logs = um_clean(log_data, document, time_variable, count_log)

anomalies = um_model(cleaned_logs, knots = 3, min_observations = 7, min_anomaly_ratio = 5)
anomalies

um_model will generate a database of anomalous events with their estimated anomalous rate (from 0-100). By default, all anomalous event with a rate inferior to 5 are removed but this value can be changed using min_anomaly_ratio. To get the most extreme anomalous events, you can simply reorder the dataset using arrange.

anomalies %>% arrange(-anomaly_ratio)

UmbeRto is compatible with the pipe writing system of the tidyverse. It's possible to chain the cleaning and modeling operations:

anomalies = log_data %>%
  um_clean(document, time_variable, count_log) %>%
  um_model(knots = 3, min_observations = 7, min_anomaly_ratio = 5)

anomalies