```r
library(rulsif.ts)
knitr::opts_chunk$set(fig.width = 6, fig.height = 4, fig.align = 'center')
```
This vignette assumes basic familiarity with time series data. It will start with a high-level overview of the RelULSIF algorithm followed by an application of this package to analyzing synthetic data.
Change point detection is the task of determining the points in time at which the statistical properties (e.g. mean, standard deviation) of a time series or stochastic process change. Put another way, we look for time points at which the probability distribution of the series changes. The problem covers both the question of whether or not a change has occurred and, if so, at which time point it occurred. In more mathematical terms, suppose we have an ordered sequence of data $y_{1:N} = (y_1, \ldots, y_N)$. We say a change point has occurred if there exists a time point $t^* \in \{1, 2, \ldots, N - 1\}$ such that the statistical properties of $(y_1, \ldots, y_{t^*})$ and $(y_{t^*+1}, \ldots, y_N)$ differ in some way. RelULSIF solves this problem by computing the *ratio* of the probability densities on either side of a time point.
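To fix ideas before turning to RelULSIF itself, here is a deliberately naive baseline (not part of this package): scan candidate split points and score each by the difference in a single statistic, the mean, on either side. RelULSIF generalizes this idea by comparing the full distributions via a density ratio rather than a single summary statistic.

```r
# Naive illustration only -- this is not the RelULSIF algorithm.
set.seed(1)
toy <- c(rnorm(100, mean = 0), rnorm(100, mean = 2))
# score each candidate split point t by |mean(left) - mean(right)|
score <- sapply(2:(length(toy) - 1), function(t) {
  abs(mean(toy[1:t]) - mean(toy[(t + 1):length(toy)]))
})
which.max(score) + 1  # lands near the true change point at t = 100
```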
Let's begin by inventing a time series with some "obvious" change points (in mean) and a not-so-obvious one (a change in standard deviation with the mean held fixed).
```r
set.seed(2883)
series <- c(
  rnorm(50,  mean = 0,    sd = 0.3),  # t = 1..50
  rnorm(25,  mean = 8,    sd = 1),    # t = 51..75
  rnorm(75,  mean = 3,    sd = 0.6),  # t = 76..150
  rnorm(25,  mean = 1,    sd = 0.8),  # t = 151..175
  rnorm(100, mean = -5,   sd = 1.5),  # t = 176..275
  rnorm(100, mean = -5,   sd = 0.2),  # t = 276..375 (sd-only change)
  rnorm(50,  mean = -2.5, sd = 0.4),  # t = 376..425
  rnorm(50,  mean = 2,    sd = 1.2)   # t = 426..475
)
```
And now visualize the series:
```r
plot(1:length(series), series, type = 'l', xlab = 'time', ylab = 'data')
```
According to the way we constructed the object `series`, the true change points occur at $t = 50, 75, 150, 175, 275, 375,$ and $425$. Let's see how the RelULSIF algorithm performs with the default parameters.
```r
# run the algorithm across the time series and plot the results
detect <- ts_detect(series, make_plot = TRUE)
```
Not too bad, but clearly not all change points are detected. Adjusting the algorithm's hyperparameters may help us out.
The simplest way to detect change points that may have been missed is to increase the sensitivity of detection, that is, to decrease the parameter `thresh`. By default it is set to 0.9, which means that for a time point to be classified as a change point, its corresponding rPE score must exceed the 90th percentile of rPE scores across the time series. Decreasing this parameter naturally increases the number of time points classified as change points.
```r
detect_lower_thresh <- ts_detect(series, thresh = 0.65, make_plot = TRUE)
```
Lowering the threshold seems to have picked up a few more change points. In practice, when we don't know how many true change points there are, the threshold parameter simply encodes how sensitive we would like the detection to be.
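For intuition, the classification rule described above amounts to a simple percentile cutoff. The sketch below assumes a hypothetical vector `scores` of rPE values; it does not rely on the actual structure of the object returned by `ts_detect`.

```r
# Hypothetical illustration of the thresholding rule, given rPE scores.
scores <- runif(475)                      # stand-in for real rPE scores
thresh <- 0.9
cutoff <- quantile(scores, probs = thresh)
change_points <- which(scores > cutoff)   # time points flagged as changes
```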
RelULSIF "looks" back and forward in the view of multiple subsequences of the data rather than just simply the previous and future $W$ time points. By increasing the window size $W$, we tell RelULSIF to gather more subsequences of the same length in the past and future to estimate the probability ratio. Making the window size too short may limit how much information is available while making it too long may mean that more subtle change points are overlooked.
The default window size is 5, so let's first see what happens when we make it a tad longer, say 20.
```r
detect_longer_window <- ts_detect(series, window_size = 20, make_plot = TRUE)
```
So that seems to have made things worse. The detected change points are shifted relative to before, and the results seem "sloppier".
This should make sense: several of the segments drawn from different parameter pairs aren't very long, so increasing the window size "mixes" some of them together (e.g. it makes the third and fourth segments look as if they come from the same distribution, since their means are rather close).
However, there is one positive here: there are more peaks in the rPE scores around the region where the mean is held at -5 but the standard deviation changes from 1.5 to 0.2.
Now let's try decreasing the window size to 3.
```r
detect_shorter_window <- ts_detect(series, window_size = 3, make_plot = TRUE)
```
Again it seems we are getting some peaks at that standard deviation change, but this time the results are a little neater and more accurate. We're still missing some of the true change points, but we could just reduce the threshold to pick those up.
In some cases, the ratio of the probability densities $p(Y) / p'(Y)$ may be unbounded, which is a problem since standard ULSIF relies on the sup-norm. RelULSIF was introduced precisely to address this issue. The parameter `alpha` "smooths" the density ratio by estimating the $\alpha$-relative ratio $p(Y) / (\alpha\, p(Y) + (1 - \alpha)\, p'(Y))$ instead, which is bounded above by $1/\alpha$ whenever $\alpha > 0$. When `alpha = 0`, we recover standard ULSIF; as `alpha` tends to 1, the density ratio gets "smoother".
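To see the smoothing effect concretely, here is a self-contained sketch with two known normal densities (chosen arbitrarily for illustration; this is not package code):

```r
# alpha-relative density ratio for two known normals (illustration only)
rel_ratio <- function(y, alpha) {
  p  <- dnorm(y, mean = 0, sd = 1)   # "before" density
  pp <- dnorm(y, mean = 1, sd = 1)   # "after" density
  p / (alpha * p + (1 - alpha) * pp)
}
# the plain ratio (alpha = 0) blows up in the left tail,
# while alpha = 0.05 caps the ratio at 1 / 0.05 = 20
max(rel_ratio(seq(-10, 10, by = 0.01), alpha = 0))
max(rel_ratio(seq(-10, 10, by = 0.01), alpha = 0.05))
```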
The default setting is `alpha = 0.05`. Let's first set it to 0 to compare the results to standard ULSIF.
```r
detect_alpha_0 <- ts_detect(series, alpha = 0, make_plot = TRUE)
```
Okay, not so different, but that's expected since the default is 0.05. Now let's try a more dramatic difference with `alpha = 0.5`, `0.75`, and, towards the extreme, `0.99`.
```r
detect_alpha_0.5 <- ts_detect(series, alpha = 0.5, make_plot = TRUE)
```
Pretty nasty, right?
```r
detect_alpha_0.75 <- ts_detect(series, alpha = 0.75, make_plot = TRUE)
```
Still just as bad.
```r
detect_alpha_0.99 <- ts_detect(series, alpha = 0.99, make_plot = TRUE)
```
Alright, so increasing the alpha parameter seems to have made things worse in
the sense that we get back less and less useful information on the probability
density ratios. Keep in mind that the original implementation stuck with 0.01
as the default setting.
The last hyperparameter we'll look at is `step`, which determines where in the series of subsequences we begin our search. The default is 10% of the length of the time series. Pulling the step back has the advantage of detecting earlier (and later, since it "widens" the search in both directions) time points, but the disadvantage of higher computation time in cases where that may be an issue.
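On our reading of that description (an assumption on our part, not documented package internals), `step` acts roughly as a margin from each edge of the series inside which no change points are searched for:

```r
# Hedged sketch of how `step` plausibly bounds the search range;
# the exact internals of ts_detect may differ.
N <- length(series)                  # 475 points in our synthetic series
default_step <- round(0.1 * N)       # default: 10% of the series length
range(seq(default_step, N - default_step))  # search range with the default
range(seq(10, N - 10))                      # wider range with step = 10
```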
```r
detect_step_10 <- ts_detect(series, step = 10, make_plot = TRUE)
```
Now we're really getting somewhere. Not only do we get much more distinct peaks at the true change points, we get more of a signal at that standard deviation change.
```r
detect_step_100 <- ts_detect(series, step = 100, make_plot = TRUE)
```
There really is no true disadvantage to reducing the step size besides the computation time; you should try to make it as small as possible if you can afford it.
It seems that the adjustments that helped us the most were the ones to `window_size` and `step`. Let's try a shorter window size, an earlier step, and a slightly lower threshold.
```r
detect_final <- ts_detect(series, window_size = 2, step = 10, thresh = 0.6, make_plot = TRUE)
```
So not only have we picked up more of the true change points, but the detection bands are narrower, making the results more accurate. We'd have "perfect" results were it not for that pesky standard deviation change, but in practice the analyst would be wise to rely not just on the detections but also on the scores themselves to identify peaks. Comparing those peaks with the time series itself will prove just as informative.