To exemplify how `segmentr` is used in real situations, this chapter provides an example with weather data obtained from [@dwd].
`segmentr` is a package that implements a handful of algorithms to segment a given data set by finding the change points that minimize the total cost of the segments according to an arbitrary cost function. The user of the package therefore has to define a cost function adequate for the segmentation problem at hand, possibly penalizing it to avoid an over-parameterized or under-parameterized model, i.e. one with too many or too few change points, respectively. It is also important to consider the computation time of each algorithm and its trade-offs. This chapter walks through the main concepts of the package using historical temperature data from Berlin.
```{r}
library(segmentr)
library(tidyr)
library(tibble)
library(dplyr)
library(lubridate)
library(magrittr)
library(purrr)
library(knitr)
library(kableExtra)
```
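As a quick orientation before diving into the data, the sketch below illustrates the general shape of a `segment()` call as described above; `my_data` and `my_cost` are hypothetical placeholders, and the chunk is not evaluated.

```{r, eval = FALSE}
# Schematic sketch only (not evaluated): `my_data` stands for a matrix to be
# segmented by columns and `my_cost` for an arbitrary segment cost function.
my_cost <- function(segment_matrix) {
  # return a numeric cost; lower values indicate a better-fitting segment
  sum(segment_matrix ^ 2)
}

my_results <- segment(
  my_data,                    # matrix whose columns are to be segmented
  cost = my_cost,             # arbitrary cost function over candidate segments
  algorithm = "hierarchical"  # or "exact", at a higher computational cost
)

my_results$changepoints       # indices of the estimated change points
```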
The `berlin` data set, provided in this package, contains daily temperature measurements from seven weather stations in Berlin for every day of the years 2010 and 2011, i.e. a total of `r ncol(berlin)` days. Each element of the data set is a temperature measurement in degrees Celsius, such that each of the `r ncol(berlin)` columns corresponds to a date and each of the `r nrow(berlin)` rows corresponds to the weather station at which the data point was measured. Table \@ref(tab:berlin-sample) shows the first three and the last three columns, together with their respective stations.
```{r berlin-sample}
data(berlin)

as_tibble(berlin, rownames = "station") %>%
  mutate(`..` = "..") %>%
  select(station, `2010-01-01`:`2010-01-03`, `..`, `2011-12-29`:`2011-12-31`) %>%
  kable(caption = "First and last columns of the `berlin` dataset") %>%
  column_spec(1, width = "5em")
```
To grasp the behavior of the weather data, Figure \@ref(fig:berlin-curve) shows the daily average temperature over all weather stations, i.e. the mean value of each column in the data set, plotted as a time series.
```{r berlin-curve}
berlin %>%
  colMeans() %>%
  enframe("time", "temperature") %>%
  mutate_at(vars(time), ymd) %>%
  with(plot(time, temperature, cex = 0.2))
```
In the graph, the daily temperatures alternate between upward and downward trends, which suggests fitting a separate linear regression to each trend segment. So, a cost function based on linear regression is proposed, which should be minimized when a linear regression fits a given segment well. Intuitively, we expect the cost function to segment the data set approximately where the hand-picked vertical lines are placed in Figure \@ref(fig:berlin-ideal), with the indices described in Table \@ref(tab:berlinexpectedtable).
```{r berlin-ideal}
expected_berlin <- list(
  changepoints = c(200, 360, 570),
  segments = list(1:199, 200:359, 360:569, 570:ncol(berlin))
)

plot_results(expected_berlin, berlin)
```
```{r berlinexpectedtable}
print_results_table(
  expected_berlin,
  caption = "Expected values to be found when segmenting the Berlin dataset"
)
```
An adequate cost function should be able to rank a given set of possible segments and pick the best one according to the evaluation criteria. Since the goal is to select segments with a good linear regression fit, and since the log-likelihood of a linear regression is, up to constants, the negative of the sum of squared residuals, the mean of the squared residuals is a good candidate for the segment cost function. The cost function $K$ is shown in \@ref(eq:squaredresidualcost), where $X$ is the set of $n$ points that belong to the segment, $(x_i, y_i)$ are the points of $X$ for each index $i$, and $f$ is the best linear regression fitted to the segment $X$.
\begin{equation} K(X)=\sum_{i=1}^n\frac{1}{n}(y_i - f(x_i))^2 (#eq:squaredresidualcost) \end{equation}
The cost argument of a `segment()` call requires a function that accepts a candidate segment matrix, i.e. a subset of the columns in the data set, and returns the cost of that segment. Therefore, equation \@ref(eq:squaredresidualcost) is implemented as an R function called `residual_cost`, which obeys this contract by taking a matrix as argument and returning the mean of the squared residuals of a linear regression fitted over the candidate segment. To get a sense of how the function behaves with the `berlin` data set, sample costs for a small, a medium, and a large segment are provided in Table \@ref(tab:cost-samples).
```{r cost-samples}
residual_cost <- function(data) {
  fit <- t(data) %>%
    as_tibble() %>%
    rowid_to_column() %>%
    gather(station, temperature, -rowid) %>%
    with(lm(temperature ~ rowid))

  mean(fit$residuals ^ 2)
}

tibble(
  `Small Size` = residual_cost(berlin[, 2:3]),
  `Medium Size` = residual_cost(berlin[, 1:150]),
  `Large Size` = residual_cost(berlin)
) %>%
  kable(
    caption = "Sample values of the squared residual cost function for different sizes of segment candidates"
  )
```
With the cost function defined, it can now be passed to `segment()` to compute the segments of the data set. Since the time complexity of the exact algorithm is $O(n^2)$ and the number of points in the data set is large, the execution time required for the computation is prohibitive. So, for demonstration purposes, we use the hierarchical algorithm, due to its $O(n \log n)$ complexity. We point out that the hierarchical algorithm (generally suitable for cost functions based on a multivariate likelihood) assumes the segments are structured hierarchically, with a combination of two neighboring segments being selected as an intermediate step before the ideal change points of the data set are evaluated. The segmentation results can be seen in Table \@ref(tab:resultsnonpenalized), and they are also plotted together with the weather data in Figure \@ref(fig:berlin-non-penalized).
```{r resultsnonpenalized}
results_non_penalized <- segment(
  berlin,
  cost = residual_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results_non_penalized,
  caption = "Results of the segmentation algorithm using a non-penalized cost function"
)
```
```{r berlin-non-penalized}
plot_results(results_non_penalized, berlin)
```
In Figure \@ref(fig:berlin-non-penalized), it's possible to see many very short segments and one very large last segment. This is a result of the algorithm used, as well as of the fact that the `residual_cost` function tends to favor very short segments, as they usually have a smaller residual error. So, to avoid segments that are too short or too long, it's necessary to penalize the cost function for extremely short or extremely long lengths.
To penalize a cost function in the `segmentr` context is to increase its return value whenever unwanted segments are provided as input. Typically, this involves penalizing the cost function whenever a very short or a very long segment is provided. One such method is to add to the output of the cost function a penalty term that is a function of the length of the segment. We propose such a penalty function in \@ref(eq:penalty-function).
\begin{equation} p(l) = C_1e^{s_1(l - \frac{L}{2})} + C_2e^{s_2(-l + \frac{L}{2})} (#eq:penalty-function) \end{equation}
In equation \@ref(eq:penalty-function), the penalty $p(l)$ is a function of the segment's length $l$ and, for parametrization values $C_1 > 0$, $s_1 > 0$, $C_2 > 0$ and $s_2 > 0$, the penalty is high for values of $l$ close to $0$, as well as for values of $l$ on the order of the total length $L$ of the data set. However, the penalty is close to its minimum for values of $l$ near $\frac{L}{2}$. To visualize this, consider a sample penalty function, with $C_1 = C_2 = 1$, $s_1 = s_2 = 0.3$ and $L = 100$, plotted in Figure \@ref(fig:penalty-curve).
```{r penalty-curve}
plot_curve(
  ~ exp(0.3 * (. - 50)) + exp(0.3 * (-. + 50)),
  from = 0, to = 100, type = "l"
)
```
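For illustration only, a penalized cost could be assembled by hand as sketched below, adding the penalty of equation \@ref(eq:penalty-function) to `residual_cost`; the default parameter values are arbitrary placeholders, and the `auto_penalize()` function presented next chooses them in a more principled way.

```{r}
# Sketch of a manually penalized cost: residual_cost plus the penalty of the
# equation above. The default parameter values below are arbitrary placeholders.
manually_penalized_cost <- function(data, C1 = 1, s1 = 0.01, C2 = 1, s2 = 0.01) {
  l <- ncol(data)    # length of the candidate segment
  L <- ncol(berlin)  # total length of the data set
  residual_cost(data) + C1 * exp(s1 * (l - L / 2)) + C2 * exp(s2 * (-l + L / 2))
}
```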
Given the general formula of the penalty function, it's necessary to adjust its parameters so that the penalty has a scale compatible with the cost function. The `auto_penalize()` function, provided in the `segmentr` package, builds a penalized version of the cost function by estimating the parametrization values from sample cost values of big and small segments of the provided data set. The estimated parameters can be tuned through the `small_segment_penalty` and `big_segment_penalty` arguments, depending on how much small or big segments, respectively, should be penalized: the higher the parameter, the more the related type of segment size is penalized. Let $P_s$ be the `small_segment_penalty`, $P_b$ be the `big_segment_penalty`, $\mu_s$ be the average cost of the sampled small segments and $\mu_b$ be the average cost of the sampled big segments. The relationship between the parameters, as defined by the `auto_penalize()` function, is given in \@ref(eq:penalty-parametrization).
\begin{equation}
\begin{aligned}
C_1 & = \frac{\mu_b}{P_b} \\
s_1 & = \frac{4 \log(P_b)}{L} \\
C_2 & = \frac{\mu_s}{P_s} \\
s_2 & = \frac{4 \log(P_s)}{L}
\end{aligned}
(#eq:penalty-parametrization)
\end{equation}
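To make the relationship concrete, the chunk below spells out the arithmetic of equation \@ref(eq:penalty-parametrization) with made-up sample averages; $\mu_s$ and $\mu_b$ are placeholders here, whereas `auto_penalize()` estimates them by sampling segments of the data set.

```{r}
# Illustrative arithmetic for the parametrization above; mu_b and mu_s are
# hypothetical sample averages, and the penalty factors use their defaults.
L_total <- ncol(berlin)  # total number of columns in the data set
P_b <- 10                # big_segment_penalty (default)
P_s <- 10                # small_segment_penalty (default)
mu_b <- 500              # hypothetical average cost of sampled big segments
mu_s <- 50               # hypothetical average cost of sampled small segments

c(
  C_1 = mu_b / P_b,
  s_1 = 4 * log(P_b) / L_total,
  C_2 = mu_s / P_s,
  s_2 = 4 * log(P_s) / L_total
)
```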
So, a penalized version of the cost function is created with `auto_penalize()` and then used with `segment()`. The results of the segmentation can be seen in Table \@ref(tab:resultsautopenalized) and Figure \@ref(fig:berlin-auto-penalized).
```{r resultsautopenalized}
penalized_cost <- auto_penalize(berlin, cost = residual_cost)

results_auto_penalized <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results_auto_penalized,
  caption = "Results of the segmentation algorithm using an auto-penalized cost function"
)
```
```{r berlin-auto-penalized}
plot_results(results_auto_penalized, berlin)
```
We now get a reduced number of segments from the new penalized cost function in Figure \@ref(fig:berlin-auto-penalized), though it is still over-segmenting some of the downward trends. It seems the `big_segment_penalty` argument needs to be decreased to allow larger segments to be estimated. So, decreasing its value from the default of $10$ down to $2$, we run `segment()` again. The results can be seen in Table \@ref(tab:resultsadjustedpenalized) and Figure \@ref(fig:berlin-adjusted-penalized).
```{r resultsadjustedpenalized}
penalized_cost <- auto_penalize(
  berlin,
  cost = residual_cost,
  big_segment_penalty = 2
)

results_adjusted_penalized <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results_adjusted_penalized,
  caption = "Results of the segmentation algorithm using an adjusted penalized cost function"
)
```
```{r berlin-adjusted-penalized}
plot_results(results_adjusted_penalized, berlin)
```
Though we got a reduced number of segments with the adjusted `penalized_cost` in Figure \@ref(fig:berlin-adjusted-penalized), the segments found are still far from the expected segments in Figure \@ref(fig:berlin-ideal). The reason, as discussed earlier, is the incorrect assumption that the data is structured hierarchically. So, to segment the data ideally, it's necessary to evaluate all of the possibilities.
The exact algorithm does evaluate all of the possibilities, but its $O(n^2)$ time complexity makes it prohibitive to run on the entire data set. We can make the computation tolerable by reducing the granularity of the data, taking the monthly averages for each of the measurement stations.
The data set is therefore resampled to monthly averages by computing the mean temperature for each combination of month and weather station. With this granularity reduction, the data set has 24 columns, one for each month of the two years covered. The monthly average temperatures are shown in Figure \@ref(fig:downsampled-curve).
```{r downsampled-curve}
monthly_berlin <- berlin %>%
  as_tibble(rownames = "station") %>%
  gather(time, temperature, -station) %>%
  mutate(month = floor_date(ymd(time), "month")) %>%
  group_by(station, month) %>%
  summarize(temperature = mean(temperature)) %>%
  spread(month, temperature) %>%
  {
    stations <- .$station
    result <- as.matrix(.[, -1])
    rownames(result) <- stations
    result
  }

monthly_berlin %>%
  colMeans() %>%
  enframe("time", "temperature") %>%
  mutate_at(vars(time), ymd) %>%
  with(plot(time, temperature, cex = 0.2))
```
A new `penalized_cost` function is then built and applied to the monthly data set using `segment()` with the exact algorithm, with the results shown in Table \@ref(tab:downresautopenalized) and Figure \@ref(fig:downsampled-auto-penalized).
```{r downresautopenalized}
penalized_cost <- auto_penalize(
  monthly_berlin,
  cost = residual_cost,
  small_segment_penalty = 100
)

results_downscaled <- segment(
  monthly_berlin,
  cost = penalized_cost,
  algorithm = "exact"
)

print_results_table(
  results_downscaled,
  caption = "Results of the segmentation algorithm using an auto-penalized cost function over downsampled berlin data"
)
```
```{r downsampled-auto-penalized}
plot_results(results_downscaled, monthly_berlin)
```
To make the comparison of solutions clearer, it's possible to rescale the solution obtained on the monthly data set back to the daily data set by multiplying each change point by the number of columns in the daily data set and dividing it by the number of columns in the monthly data set, i.e. each rescaled change point $C_i$ relates to the original change point $c_i$ as described in equation \@ref(eq:rescaled-change-points), in which $i$ is the index of each change point estimated from the monthly data set, $N_M$ is the number of columns in the monthly Berlin weather data set and $N_B$ is the number of columns in the daily `berlin` data set, in the original form provided by the `segmentr` package.
\begin{equation} C_i = \left \lfloor{c_i \frac{N_B}{N_M}} \right \rfloor (#eq:rescaled-change-points) \end{equation}
After applying that transformation, the results can be seen in Table \@ref(tab:resultsrescaled), as well as plotted together with the daily data in Figure \@ref(fig:plot-results-rescaled).
```{r resultsrescaled}
rescaled_changepoints <- round(
  results_downscaled$changepoints * ncol(berlin) / ncol(monthly_berlin)
)

results_rescaled <- with_segments(
  changepoints = rescaled_changepoints,
  len = ncol(berlin)
)

print_results_table(
  results_rescaled,
  caption = "Results of the segmentation algorithm using an auto-penalized cost function over downsampled berlin data, but rescaled to fit the original data dimensions"
)
```
```{r plot-results-rescaled}
plot_results(results_rescaled, berlin)
```
With the exact solution, made possible by the granularity reduction, the data set is segmented closer to what we expect in Figure \@ref(fig:berlin-ideal), as the algorithm was able to evaluate all of the possibilities.
## Remarks on the `berlin` Data Set

In this chapter, the data set of weather temperatures in Berlin was presented, along with a walkthrough of different ways to use `segmentr` to find a good estimate of its segments. Table \@ref(tab:berlinexpectedtable) lists the solution considered ideal for the problem, while different estimated solutions were found in Tables \@ref(tab:resultsnonpenalized), \@ref(tab:resultsautopenalized) and \@ref(tab:downresautopenalized). Considering this, the Hausdorff distance can be used to measure how far each estimate is from the ideal solution described in Table \@ref(tab:berlinexpectedtable). Table \@ref(tab:hausdorff-real-data-comparison) shows the comparison, with the distance of each estimate to the expected solution.
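As a rough intuition for the metric, the chunk below computes a plain Hausdorff distance between two sets of change points in base R; it is only a sketch of the general idea, while the actual comparison uses `segmentr`'s `segment_distance()`, which may differ in detail.

```{r}
# Sketch of the Hausdorff distance between two sets of change points: the
# largest distance from any point in one set to its nearest point in the other.
hausdorff_sketch <- function(a, b) {
  d_ab <- max(vapply(a, function(x) min(abs(x - b)), numeric(1)))
  d_ba <- max(vapply(b, function(x) min(abs(x - a)), numeric(1)))
  max(d_ab, d_ba)
}

hausdorff_sketch(results_non_penalized$changepoints, expected_berlin$changepoints)
```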
(ref:table-berlin-expected) \@ref(tab:berlinexpectedtable)
(ref:table-non-penalized) \@ref(tab:resultsnonpenalized)
(ref:table-auto-penalized) \@ref(tab:resultsautopenalized)
(ref:table-adjusted-penalized) \@ref(tab:resultsadjustedpenalized)
(ref:table-rescaled) \@ref(tab:resultsrescaled)
```{r hausdorff-real-data-comparison}
deviation <- partial(
  segment_distance,
  changepoints2 = expected_berlin$changepoints
)

tribble(
  ~`Results referenced`, ~`Estimated Changepoints`, ~`Hausdorff Distance to Ideal`,
  "Table (ref:table-berlin-expected)", comma_format(expected_berlin$changepoints), deviation(expected_berlin$changepoints),
  "Table (ref:table-non-penalized)", comma_format(results_non_penalized$changepoints), deviation(results_non_penalized$changepoints),
  "Table (ref:table-auto-penalized)", comma_format(results_auto_penalized$changepoints), deviation(results_auto_penalized$changepoints),
  "Table (ref:table-adjusted-penalized)", comma_format(results_adjusted_penalized$changepoints), deviation(results_adjusted_penalized$changepoints),
  "Table (ref:table-rescaled)", comma_format(results_rescaled$changepoints), deviation(results_rescaled$changepoints)
) %>%
  kable(
    caption = "Hausdorff distance for different segmentation attempts over the Berlin weather data set"
  ) %>%
  column_spec(2, width = "15em")
```
In Table \@ref(tab:hausdorff-real-data-comparison), we can see the distance to the ideal solution gradually decreases as more adequate approaches and cost functions are used to compute the change point estimates. This improvement reflects the gradual progress that can be observed in Figures \@ref(fig:berlin-non-penalized), \@ref(fig:berlin-auto-penalized), \@ref(fig:berlin-adjusted-penalized) and \@ref(fig:plot-results-rescaled).
Notice the only requirement on the cost function is that it ranks segments in the desired manner. Therefore, there's freedom to pick a non-conventional cost function. For example, the R-squared statistic of the linear regression in each segment is conventionally used to infer how well the linear model fits the points in the data. It ranges from zero to one, and the closer it is to one, the better the model predicts the points in the data. Therefore, we propose an `rsquared_cost` implementation, which fits a linear regression to the data set received as input and then returns the negative of the R-squared statistic as the "cost" value of the segment. The function is applied to different segment sizes of the `berlin` data set, analogous to Table \@ref(tab:cost-samples), and the results are shown in Table \@ref(tab:rsquared-samples).
```{r rsquared-samples}
rsquared_cost <- function(data) {
  as_tibble(t(data)) %>%
    rowid_to_column() %>%
    gather(station, temperature, -rowid) %>%
    with(lm(temperature ~ rowid)) %>%
    summary() %>%
    .$adj.r.squared %>%
    { -. }
}

tibble(
  `Small Size` = rsquared_cost(berlin[, 2:3]),
  `Medium Size` = rsquared_cost(berlin[, 1:150]),
  `Large Size` = rsquared_cost(berlin)
) %>%
  kable(
    caption = "Sample values of the R-squared cost function for different sizes of segment candidates"
  )
```
As observed in Table \@ref(tab:rsquared-samples), and similar to the previous case, the new `rsquared_cost` has the lowest values for small segments. Therefore, it needs to be penalized with the `auto_penalize()` function. We then apply the penalized function and show the results in Table \@ref(tab:rsquaredres-auto-penalized). The plot with the weather data can be seen in Figure \@ref(fig:rsquaredplot-auto-penalized).
```{r rsquaredres-auto-penalized}
penalized_cost <- auto_penalize(
  berlin,
  cost = rsquared_cost
)

results <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results,
  caption = "Results of the segmentation algorithm using an auto-penalized R-squared cost function over Berlin data"
)
```
```{r rsquaredplot-auto-penalized}
plot_results(results, berlin)
```
The default penalized `rsquared_cost` splits the data set into three segments of roughly the same size. This makes sense because the penalty applied by the default `auto_penalize()` function is lowest for segments of about half the total length. So, increasing the `big_segment_penalty` argument will not affect the estimated solution any further. In contrast, smaller segments are being over-penalized. Because of this, it's necessary to reduce the `small_segment_penalty` argument, which we decrease from the default of $10$ down to $1.5$. The results of the new segmentation can be seen in Table \@ref(tab:rsquaredres-adjusted-penalized), and the plot in Figure \@ref(fig:rsquaredplot-adjusted-penalized).
```{r rsquaredres-adjusted-penalized}
penalized_cost <- auto_penalize(
  berlin,
  cost = rsquared_cost,
  small_segment_penalty = 1.5
)

results <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results,
  caption = "Results of the segmentation algorithm using an adjusted penalized R-squared cost function over Berlin data"
)
```
```{r rsquaredplot-adjusted-penalized}
plot_results(results, berlin)
```
With the adjusted parameters, the penalized `rsquared_cost` was able to segment the data in Figure \@ref(fig:rsquaredplot-adjusted-penalized) in a seemingly accurate manner, despite the nature of the hierarchical algorithm. As discussed previously, the hierarchical algorithm is not well suited to the squared residual cost, because grouping two neighboring segments into a macro-segment, an intermediate state of the hierarchical algorithm, yields an unattractive cost value due to the presence of alternating trends. Because of this, the macro-segment is never picked by the algorithm during the computation, and so the change points are never placed at their ideal positions. To see this more clearly, take the first 547 data points of the `berlin` data set, roughly one year and a half. The results of segmenting that portion of the `berlin` data set can be seen in Table \@ref(tab:subberlinres-penalized), and the plot in Figure \@ref(fig:subberlinplot-penalized).
```{r subberlinres-penalized}
sub_berlin <- berlin[, 1:547]

penalized_cost <- auto_penalize(
  sub_berlin,
  cost = rsquared_cost
)

results <- segment(
  sub_berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results,
  caption = "Results of the segmentation algorithm using an auto-penalized R-squared cost function over a subset of Berlin data"
)
```
```{r subberlinplot-penalized}
plot_results(results, sub_berlin)
```
In Figure \@ref(fig:subberlinplot-penalized), we can see the segments do not match our expectations, as the algorithm is unable to first select the macro-segment that would later be bisected into the ideal segments in a subsequent iteration.