Real Data Examples


To exemplify how segmentr is used in real situations, this paper works through an example with weather data obtained from [@dwd].

segmentr is a package that implements a handful of algorithms to segment a given data set by finding the change points that minimize the total cost of the segments according to an arbitrary cost function. The user of this package therefore has to define a cost function adequate for the segmentation problem at hand, possibly penalizing it to avoid either an over-parameterized or an under-parameterized model, i.e. one with too many or too few change points, respectively. It's also important to consider the computation time of each algorithm and its trade-offs. This example walks through the main concepts of the package's usage, with historical temperature data from Berlin.

library(segmentr)
library(tidyr)
library(tibble)
library(dplyr)
library(lubridate)
library(magrittr)
library(purrr)
library(knitr)
library(kableExtra)
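
To make that workflow concrete before turning to the weather data, the sketch below shows the general shape of a segment() call. It is only an illustration: toy_cost and toy_data are placeholders invented here, not objects used in the rest of this example, and in practice the cost usually needs to be penalized, as discussed below.

# Illustrative sketch of the basic API (placeholders only, not used later):
# a cost function takes the matrix of columns in a candidate segment and
# returns a single numeric cost; lower values indicate a better segment
toy_cost <- function (segment_matrix) {
  sum((segment_matrix - rowMeans(segment_matrix)) ^ 2)
}
toy_data <- matrix(rnorm(7 * 50), nrow = 7)
toy_results <- segment(toy_data, cost = toy_cost, algorithm = "exact")
toy_results$changepoints  # estimated change point indices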

Understanding the Data

The berlin data set, provided in this package, contains daily temperature measurements from seven weather stations in Berlin for every day in the years 2010 and 2011, i.e. a total of `r ncol(berlin)` days. Every element in the data set is therefore a temperature measurement in degrees Celsius, such that each of the `r ncol(berlin)` columns corresponds to a date and each of the `r nrow(berlin)` rows corresponds to the weather station at which the data point was measured. In Table \@ref(tab:berlin-sample), it's possible to see the first three and the last three columns, together with their respective stations.

data(berlin)
as_tibble(berlin, rownames="station") %>%
  mutate(`..`="..") %>%
  select(station, `2010-01-01`:`2010-01-03`,
         `..`, `2011-12-29`:`2011-12-31`) %>%
  kable(caption="
    First and last columns of the `berlin` dataset
  ") %>%
  column_spec(1, width="5em")

To grasp the behavior of the weather data, Figure \@ref(fig:berlin-curve) shows the daily average temperature across all weather stations, i.e. the mean of each column in the data set, plotted as a time series to observe how the average temperature of Berlin behaves over time.

berlin %>%
  colMeans() %>%
  enframe("time", "temperature") %>%
  mutate_at(vars(time), ymd) %>%
  with(plot(time, temperature, cex=0.2))

In the graph, the daily temperatures alternate in upward and downward trends, which suggests it's possible to fit a linear regression to each of the upward or downward trend segments. So, a cost function that favors segments well described by a linear regression is proposed, i.e. one that is minimized when a linear regression fits a given segment well. Intuitively, we expect such a cost function to segment the data set approximately the way the hand-picked vertical lines are placed in Figure \@ref(fig:berlin-ideal), with indices described in Table \@ref(tab:berlinexpectedtable).

expected_berlin <- list(
  changepoints=c(200, 360, 570),
  segments=list(1:199, 200:359, 360:569, 570:ncol(berlin))
)

plot_results(expected_berlin, berlin)
print_results_table(
  expected_berlin,
  caption="Expected values to be found when segmenting
           the Berlin dataset"
)

Building the Cost Function

An adequate cost function should be able to rank a given set of possible segments and pick the best one according to the evaluation criteria. Since the goal is to select segments with a good linear regression fit, and since the log-likelihood of a linear regression is, up to constants, the negative of the sum of squared residuals, the mean of the squared residuals is a good candidate for our segment cost function. The cost function $K$ is shown in \@ref(eq:squaredresidualcost), where $X$ is the set of points that belong to the segment, $(x_i, y_i)$ are the points of $X$ indexed by $i$, and $f$ is the linear regression fitted to the segment $X$.

\begin{equation} K(X)=\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 (#eq:squaredresidualcost) \end{equation}

The cost argument of a segment() call requires a function that accepts a candidate segment matrix, i.e. a subset of the columns of the data set, and returns the cost of that segment. Therefore, equation \@ref(eq:squaredresidualcost) is implemented as an R function called residual_cost, which obeys this contract by taking a matrix as argument and returning the mean of the squared residuals of a linear regression over the candidate segment. To get a sense of how the function behaves with the berlin data set, sample costs for a small, a medium, and a large segment are provided in Table \@ref(tab:cost-samples).

residual_cost <- function (data) {
  # Fit a single linear regression of temperature against the column index,
  # pooling the measurements of all stations in the candidate segment
  fit <- t(data) %>%
    as_tibble() %>%
    rowid_to_column() %>%
    gather(station, temperature, -rowid) %>%
    with(lm(temperature ~ rowid))

  # Mean of the squared residuals, as in equation (#eq:squaredresidualcost)
  mean(fit$residuals ^ 2)
}

tibble(
  `Small Size`=residual_cost(berlin[, 2:3]),
  `Medium Size`=residual_cost(berlin[, 1:150]),
  `Large Size`=residual_cost(berlin)
) %>%
  kable(
    caption="Sample values of the squared residual cost
    function for different sizes of segment
    candidates"
  )

With the cost function defined, it can now be passed to segment() to obtain the segments of the data set. Since the time complexity of the exact algorithm is $O(n^2)$ and the number of points in the data set is high, its execution time here is quite prohibitive. So, for demonstration purposes, we use the hierarchical algorithm, due to its $O(n \log(n))$ complexity. We point out that the hierarchical algorithm (generally suitable for cost functions based on a multivariate likelihood) assumes the segments are structured hierarchically, with the combination of two neighboring segments being selected as an intermediate step before the ideal change points of the data set are evaluated. The segmentation results can be seen in Table \@ref(tab:resultsnonpenalized), and they are also plotted together with the weather data in Figure \@ref(fig:berlin-non-penalized).
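
To get a feel for that trade-off, the hedged sketch below times both algorithms on a small slice of the data; the absolute numbers are machine-dependent and only the relative difference is of interest, and the slice of 60 columns is an arbitrary choice made for illustration.

# Rough timing comparison on a small slice of the data (illustrative only;
# timings depend on the machine, but the gap widens quickly with more columns)
small_slice <- berlin[, 1:60]
system.time(segment(small_slice, cost = residual_cost, algorithm = "exact"))
system.time(segment(small_slice, cost = residual_cost, algorithm = "hierarchical"))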

results_non_penalized <- segment(
  berlin,
  cost = residual_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results_non_penalized,
  caption="Results of the segmentation algorithm
  using a non-penalized cost function"
)
plot_results(results_non_penalized, berlin)

In Figure \@ref(fig:berlin-non-penalized), it's possible to see many very short segments and one very large last segment. This is a result of the algorithm used, as well as of the fact that the residual_cost function tends to favor very short segments, as they usually have a smaller residual error. So, to avoid segments that are too short or too long, it's necessary to penalize the cost function for extreme lengths.

Penalizing the Cost Function

To penalize a cost function, in the segmentr context, is to increase its return value whenever unwanted segments are provided as input. Typically, this means penalizing the cost function whenever a very short or a very long segment is provided.
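
As a hedged illustration of the idea, the snippet below wraps residual_cost with a crude hand-written penalty; the threshold of 30 columns and the constant 1000 are arbitrary values chosen only for illustration.

# Crude, hand-written penalization (illustrative only): add a large constant
# whenever the candidate segment is shorter than an arbitrary threshold
manually_penalized_cost <- function (data) {
  short_penalty <- if (ncol(data) < 30) 1000 else 0
  residual_cost(data) + short_penalty
}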

A smoother method is to add to the output of the cost function a penalty term that depends on the length of the segment. We propose such a penalty function in \@ref(eq:penalty-function).

\begin{equation} p(l) = C_1e^{s_1(l - \frac{L}{2})} + C_2e^{s_2(-l + \frac{L}{2})} (#eq:penalty-function) \end{equation}

In equation \@ref(eq:penalty-function), the penalty $p(l)$ is a function of the segment's length $l$ and, for parametrization values $C_1 > 0$, $s_1 > 0$, $C_2 > 0$ and $s_2 > 0$, the penalty is high both for values of $l$ close to $0$ and for values of $l$ of the order of the total length $L$ of the data set, while it is close to its minimum for values of $l$ around $\frac{L}{2}$. To visualize this, consider a sample penalty function, with $C_1 = C_2 = 1$, $s_1 = s_2 = 0.3$ and $L = 100$, plotted in Figure \@ref(fig:penalty-curve).

plot_curve(~ exp(0.3*(. - 50)) + exp(0.3 * (-. + 50)),
           from = 0, to = 100, type="l")
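
The same curve can also be written as an explicit R function, shown below as a hedged sketch; penalty() is defined here only for illustration and is not part of segmentr.

# Direct translation of equation (#eq:penalty-function), with the sample
# parameters used in the figure above
penalty <- function (l, L, C1 = 1, s1 = 0.3, C2 = 1, s2 = 0.3) {
  C1 * exp(s1 * (l - L / 2)) + C2 * exp(s2 * (-l + L / 2))
}

penalty(50, L = 100)  # near the minimum, at half the total length
penalty(5, L = 100)   # very short segment, heavily penalized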

Given the general formula of the penalty function, it's necessary to adjust the parameters such that the penalty has a scale compatible with the cost function. The auto_penalize() function, provided in the segmentr package, builds a penalized version of the cost function by estimating the parametrization values from sample costs of big and small segments of the provided data set. The estimated parameters can be tuned through the small_segment_penalty and big_segment_penalty arguments, depending on how much small or big segments, respectively, should be penalized: the higher the argument, the more the corresponding type of segment size is penalized.

Let $P_s$ be the small_segment_penalty, $P_b$ the big_segment_penalty, $\mu_s$ the average cost of the sampled small segments and $\mu_b$ the average cost of the sampled big segments. The relationship between the parameters, as defined by the auto_penalize() function, is given in \@ref(eq:penalty-parametrization).

\begin{equation} \begin{aligned} C_1 & = \frac{\mu_b}{P_b} \\ s_1 & = \frac{4 \log(P_b)}{L} \\ C_2 & = \frac{\mu_s}{P_s} \\ s_2 & = \frac{4 \log(P_s)}{L} \end{aligned} (#eq:penalty-parametrization) \end{equation}
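
As a hedged sketch, the parametrization in \@ref(eq:penalty-parametrization) can be reproduced by hand; here a single large and a single small segment cost are used as rough stand-ins for the averages $\mu_b$ and $\mu_s$ that auto_penalize() estimates from several sampled segments, and the penalty values of $10$ are the defaults mentioned later in the text.

# Hand-computed parametrization following equation (#eq:penalty-parametrization);
# single segment costs stand in for the sampled averages, for illustration only
L   <- ncol(berlin)
P_b <- 10                              # default big_segment_penalty
P_s <- 10                              # default small_segment_penalty
mu_b <- residual_cost(berlin)          # stand-in for the big segment average
mu_s <- residual_cost(berlin[, 1:3])   # stand-in for the small segment average
C_1 <- mu_b / P_b
s_1 <- 4 * log(P_b) / L
C_2 <- mu_s / P_s
s_2 <- 4 * log(P_s) / L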

So, a penalized version of the cost function is created with auto_penalize() and then used with segment(). The results of the segmentation can be seen in Table \@ref(tab:resultsautopenalized) and Figure \@ref(fig:berlin-auto-penalized).

penalized_cost <- auto_penalize(berlin,
                                cost = residual_cost)
results_auto_penalized <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results_auto_penalized,
  caption="Results of the segmentation algorithm
  using an auto-penalized cost function"
)
plot_results(results_auto_penalized, berlin)

We now get a reduced number of segments from the new penalized cost function in Figure \@ref(fig:berlin-auto-penalized), though it still over-segments some of the downward trends. It seems the big_segment_penalty argument needs to be decreased to allow larger segments to be estimated by the solution. So, by decreasing its value from the default of $10$ down to $2$, we run segment() again. The results can be seen in Table \@ref(tab:resultsadjustedpenalized) and Figure \@ref(fig:berlin-adjusted-penalized).

penalized_cost <- auto_penalize(
  berlin,
  cost = residual_cost,
  big_segment_penalty = 2
)
results_adjusted_penalized <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)
print_results_table(
  results_adjusted_penalized,
  caption="Results of the segmentation algorithm
  using an adjusted penalized cost function"
)
plot_results(results_adjusted_penalized, berlin)

Though we got a reduced number of segments with the adjusted penalized_cost in Figure \@ref(fig:berlin-adjusted-penalized), the segments found are still far from the expected segments in Figure \@ref(fig:berlin-ideal). The reason, as discussed earlier, is the incorrect assumption that the segments are structured hierarchically. So, to segment the data ideally, it's necessary to evaluate all of the possibilities.

The exact algorithm does evaluate all of the possibilities, but its $O(n^2)$ time complexity makes it prohibitive to run on the entire data set. We can, however, make the computation tractable by reducing the granularity of the data, i.e. by working with monthly averages for each of the measurement stations.

Reducing Granularity

The data set needs to be resampled to monthly weather averages. This is done by computing the average temperature for each combination of month and weather station. With the granularity reduction, the data set has 24 columns, one for each month in the two years covered. The monthly average temperatures are shown in Figure \@ref(fig:downsampled-curve).

monthly_berlin <- berlin %>%
  as_tibble(rownames = "station") %>%
  gather(time, temperature, -station) %>%
  mutate(month = floor_date(ymd(time), "month")) %>%
  group_by(station, month) %>%
  summarize(temperature = mean(temperature)) %>%
  spread(month, temperature) %>% {
    # Convert the tibble back to a matrix with stations as row names,
    # matching the layout of the original `berlin` data set
    stations <- .$station
    result <- as.matrix(.[, -1])
    rownames(result) <- stations
    result
  }

monthly_berlin %>%
  colMeans() %>%
  enframe("time", "temperature") %>%
  mutate_at(vars(time), ymd) %>%
  with(plot(time, temperature, cex=0.2))

A new penalized_cost function, with a stricter small_segment_penalty, is then built and applied to the monthly data set using segment() with the exact algorithm, with the results shown in Table \@ref(tab:downresautopenalized) and Figure \@ref(fig:downsampled-auto-penalized).

penalized_cost <- auto_penalize(
  monthly_berlin,
  cost = residual_cost,
  small_segment_penalty = 100
)

results_downscaled <- segment(
  monthly_berlin,
  cost = penalized_cost,
  algorithm = "exact"
)

print_results_table(
  results_downscaled,
  caption="Results of the segmentation algorithm
    using an auto-penalized cost function over
    downsampled berlin data"
)
plot_results(results_downscaled, monthly_berlin)

To make the comparison of solutions clearer, it's possible to rescale the solution obtained on the monthly data set back to the daily data set by multiplying each change point by the number of columns in the daily data set and dividing it by the number of columns in the monthly data set. That is, each rescaled change point $C_i$ relates to the original change point $c_i$ as described in equation \@ref(eq:rescaled-change-points), in which $i$ is the index of each change point estimated with the monthly data set, $N_M$ is the number of columns in the monthly Berlin weather data set and $N_B$ is the number of columns in the daily berlin data set, in the original form provided by the segmentr package.

\begin{equation} C_i = \left \lfloor{c_i \frac{N_B}{N_M}} \right \rfloor (#eq:rescaled-change-points) \end{equation}

After applying that transformation, the results can be seen in Table \@ref(tab:resultsrescaled), as well as plotted along with the daily data in Figure \@ref(fig:plot-results-rescaled).

rescaled_changepoints <- round(
  results_downscaled$changepoints
  * ncol(berlin) / ncol(monthly_berlin)
)

results_rescaled <- with_segments(
  changepoints=rescaled_changepoints,
  len=ncol(berlin)
)

print_results_table(
  results_rescaled,
  caption="Results of the segmentation algorithm
    using an auto-penalized cost function over
    downsampled berlin data, but rescaled to fit the
    original data dimensions"
)
plot_results(results_rescaled, berlin)

With the exact solution, made possible by the granularity reduction, we notice the data set is segmented much closer to what we expect in Figure \@ref(fig:berlin-ideal), as the algorithm was able to evaluate all the possibilities.

Accuracy of Estimates for the berlin Data Set

In this chapter, the data set of weather temperatures in Berlin was presented, along with a walkthrough of different ways to use segmentr to find a good estimate of its segments. Table \@ref(tab:berlinexpectedtable) lists the solution considered ideal for the problem, while the estimated solutions were presented in Tables \@ref(tab:resultsnonpenalized), \@ref(tab:resultsautopenalized), \@ref(tab:resultsadjustedpenalized) and \@ref(tab:resultsrescaled). The Hausdorff distance can therefore be used to measure how far each estimate is from the ideal solution described in Table \@ref(tab:berlinexpectedtable). Table \@ref(tab:hausdorff-real-data-comparison) shows this comparison, with the distance of each estimate to the expected solution.
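
As a hedged toy illustration of how the metric behaves (assuming the same segment_distance() signature used in the comparison code below), two change point sets that nearly coincide should yield a small distance:

# Toy illustration of the distance between two nearby change point sets
# (illustrative values only; the result is expected to be small since the
# sets differ by at most a few positions)
segment_distance(c(200, 360, 570), changepoints2 = c(210, 350, 565))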

(ref:table-berlin-expected) \@ref(tab:berlinexpectedtable)

(ref:table-non-penalized) \@ref(tab:resultsnonpenalized)

(ref:table-auto-penalized) \@ref(tab:resultsautopenalized)

(ref:table-adjusted-penalized) \@ref(tab:resultsadjustedpenalized)

(ref:table-rescaled) \@ref(tab:resultsrescaled)

deviation <- partial(
  segment_distance,
  changepoints2=expected_berlin$changepoints
)

tribble(
  ~`Results referenced`,
    ~`Estimated Changepoints`,
      ~`Hausdorff Distance to Ideal`,
  "Table (ref:table-berlin-expected)",
    comma_format(expected_berlin$changepoints),
      deviation(expected_berlin$changepoints),
  "Table (ref:table-non-penalized)",
    comma_format(results_non_penalized$changepoints),
      deviation(results_non_penalized$changepoints),
  "Table (ref:table-auto-penalized)",
    comma_format(results_auto_penalized$changepoints),
      deviation(results_auto_penalized$changepoints),
  "Table (ref:table-adjusted-penalized)",
    comma_format(results_adjusted_penalized$changepoints),
      deviation(results_adjusted_penalized$changepoints),
  "Table (ref:table-rescaled)",
    comma_format(results_rescaled$changepoints),
      deviation(results_rescaled$changepoints)
) %>% kable(
  caption="Hausdorff distance for different
  segmentation attempts over the Berlin weather data set"
) %>% column_spec(2, width="15em")

In Table \@ref(tab:hausdorff-real-data-comparison), we can see the distance to the ideal solution gradually decreases as more adequate approaches and cost functions are used to compute the change point estimates. This improvement reflects the gradual progress that can be observed in Figures \@ref(fig:berlin-non-penalized), \@ref(fig:berlin-auto-penalized), \@ref(fig:berlin-adjusted-penalized) and \@ref(fig:plot-results-rescaled).

An Unusual Cost Function

Notice the only requirement on the cost function is that it ranks segments in the desired manner. Therefore, there's freedom to pick a non-conventional cost function. For example, the R-squared statistic of the linear regression in each segment is conventionally used to infer how well the linear model fits the points in the data: it ranges from zero to one, and the closer it is to one, the better the model predicts the points. Therefore, we propose an rsquared_cost implementation that fits a linear regression against the candidate segment received as input and returns the negative of the (adjusted) R-squared statistic as the "cost" of the segment. The function is applied to different segment sizes of the berlin data set, analogously to Table \@ref(tab:cost-samples), and the results are shown in Table \@ref(tab:rsquared-samples).

rsquared_cost <- function (data) {
  # Fit a pooled linear regression over the candidate segment and return the
  # negative of its adjusted R-squared, so that better fits have lower cost
  as_tibble(t(data)) %>%
    rowid_to_column() %>%
    gather(station, temperature, -rowid) %>%
    with(lm(temperature ~ rowid)) %>%
    summary %>%
    .$adj.r.squared %>%
    { -. }
}

tibble(
  `Small Size`=rsquared_cost(berlin[, 2:3]),
  `Medium Size`=rsquared_cost(berlin[, 1:150]),
  `Large Size`=rsquared_cost(berlin)
) %>%
  kable(
    caption="Sample values of the R-squared
    cost function for different sizes of segment
    candidates"
  )

As observed in Table \@ref(tab:rsquared-samples), and similarly to the previous case, the new rsquared_cost has its lowest values for small segments. Therefore, it also needs to be penalized with the auto_penalize() function. We then apply segment() with the penalized function and show the results in Table \@ref(tab:rsquaredres-auto-penalized). The plot with the weather data can be seen in Figure \@ref(fig:rsquaredplot-auto-penalized).

penalized_cost <- auto_penalize(
  berlin,
  cost = rsquared_cost
)
results <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)

print_results_table(
  results,
  caption="Results of the segmentation algorithm
    using an auto-penalized R-squared cost function over
    Berlin data"
)
plot_results(results, berlin)

The default penalized rsquared_cost splits the data set in three segments of roughly the same size. This makes sense because the penalty applied by the default auto_penalize() parametrization is lowest for segments close to about half the total length. So, increasing the big_segment_penalty argument will no longer affect the estimated solution; in contrast, smaller segments are being over-penalized. Because of this, it's necessary to reduce the small_segment_penalty argument, which we decrease to $1.5$, from the default of $10$. The results of the new segmentation can be seen in Table \@ref(tab:rsquaredres-adjusted-penalized), and the plot in Figure \@ref(fig:rsquaredplot-adjusted-penalized).

penalized_cost <- auto_penalize(
  berlin,
  cost = rsquared_cost,
  small_segment_penalty = 1.5
)
results <- segment(
  berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)
print_results_table(
  results,
  caption="Results of the segmentation algorithm
    using an adjusted penalized R-squared cost
    function over Berlin data"
)
plot_results(results, berlin)

With the adjusted parameters, we see in Figure \@ref(fig:rsquaredplot-adjusted-penalized) that the penalized rsquared_cost was able to segment the data in a seemingly accurate manner, despite the nature of the hierarchical algorithm. As discussed previously, we pointed out that the hierarchical algorithm is not well suited to this kind of regression-fit cost, as the grouping of two neighboring segments into a macro-segment, an intermediate state of the hierarchical algorithm, has an unattractive cost value due to the presence of alternating trends. Because of this, the macro-segment is never picked during the computation, and so the change points are never placed at their ideal positions. To see this more clearly, take the first 547 data points of the berlin data set, roughly one year and a half. The results of the segmentation on that portion of the berlin data set can be seen in Table \@ref(tab:subberlinres-penalized), and the plot in Figure \@ref(fig:subberlinplot-penalized).
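
As a hedged illustration of that effect, the costs below compare the first two expected trend segments from Table \@ref(tab:berlinexpectedtable) with the macro-segment formed by joining them; since the joined span mixes an upward and a downward trend, a single linear fit describes it poorly and its (un-penalized) cost is comparatively unattractive.

# Comparing two expected neighboring trend segments with their macro-segment
# (illustrative; indices taken from the expected segmentation above)
rsquared_cost(berlin[, 1:199])    # first expected trend segment
rsquared_cost(berlin[, 200:359])  # second expected trend segment
rsquared_cost(berlin[, 1:359])    # macro-segment joining the two trends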

sub_berlin <- berlin[, 1:547]
penalized_cost <- auto_penalize(
  sub_berlin,
  cost = rsquared_cost
)
results <- segment(
  sub_berlin,
  cost = penalized_cost,
  algorithm = "hierarchical"
)
print_results_table(
  results,
  caption="Results of the segmentation algorithm
    using an auto-penalized R-squared cost
    function over a subset of Berlin data"
)
plot_results(results, sub_berlin)

In Figure \@ref(fig:subberlinplot-penalized), we can see the segments do not match our expectations, as the algorithm is not able to select and then bisect the unattractive macro-segment that would lead to the ideal segments in a later iteration.


