In thalesmello/segmentr: Segment Data Minimizing a Cost Function

Solution Estimation

In this chapter, we discuss the process to solve \@ref(eq:changepointsestimation) and explore a solution involving dynamic programming. Because the exact solution to this problem has quadratic complexity, we also discuss some alternatives that simplify the search path and provide an approximate solution.

Exact Algorithm

The first attempt to solve the equation would be to search for all the possible alternatives to find the correct minimal set of intervals that minimize the total cost of the system. It's done iteratively by:

let $D$ be the current data set we want to segment
for each column index $i \in {1, \dots, m}$
for each column index $j \in {i, \dots, m}$
- compute $K_{i:j}=K(D_{i:j})$
- store the cost $K_{i:j}$ and the index $j$ for the lowest value of $K_{i:j}$ for later comparison

Finally, after computing all of the possible combinations of segment costs, the procedure to find the optimal solution is:

let $j = m$
while $j > 1$
search the stored values for $i = \underset{i}{\arg\min} (K_{i:j})$
store $i$ in the set of change points $C$ if $i \ne 1$
let $j = i$ and repeat until $j = 1$

With the execution of the procedure described above, the estimated set of change points is stored in $C$. The solution will be exact because it analyses all the possible combinations. However, that has a time complexity of $O(m^2)$ in the Big-O notation, for the number of columns $m$ in the data set. Because the number of columns in data segmentation problems can be very large, computation time can be very prohibitive. For those situations, approximate solutions are described.

Hierarchical Algorithm

To simplify the search path of the algorithm, [@base-paper] also proposes a technique that relies on the assumption of the data hierarchical, i.e. a segment can be sub-divided by reapplying the segmentation function recursively. The algorithm is described as:

let $D$ be the current data set we want to segment
for each column $i$ in the data set
compute the total cost $K_i = K(D_{1:i-1})+K(D_{i:m})$ if $i \ne 1$
find $i$ for which $K_i$ is minimum
if $K_i \gt K(D_{1:m})$
return the empty set as the result of the current function call
recursively find the set of change points $C_L$ by calling the algorithm on the left segment $D_{1:i-1}$
recursively find the set of change points $C_R$ by calling the algorithm on the right segment $D_{i:m}$
return $C=C_L \cup C_R \cup {i}$ as result of current function call

Implementing the algorithm described above will estimate the set of change points $C$ with time complexity of $O(m\log(m))$, for the number of columns $m$ in the data set. The reduction in time complexity is only possible because of the strong assumption the data set has a hierarchical cost behavior. However, that is not true for all types of segmentation problems, as it will be seen in more detail in the next chapters. So, the usage of this algorithm should be considered with care.

Hybrid Algorithm

The hybrid algorithm is a modified version of the hierarchical algorithm, in which it will start searching the segments using the hierarchical algorithm, and if the segments become smaller than a certain number of columns threshold, it will switch to the exact algorithm. The procedure would be as follows:

let $D$ be the current data set we want to segment
if the number of columns of $D$ is $m \lt k$, in which $k$ is a predefined threshold
return the set of change points calculated by the exact algorithm.
for each column $i$ in the data set
compute the total cost $K_i = K(D_{1:i-1})+K(D_{i:m})$ if $i \ne 1$
find $i$ for which $K_i$ is minimum
if $K_i \lt K(X_{1:p})$
return empty set as the result of the current function call
recursively find the set of change points $C_L$ by calling the algorithm on the left segment $K_{1:i-1}$
recursively find the set of change points $C_R$ by calling the algorithm on the right segment $K_{i:m}$
return $C=C_L \cup C_R \cup {i}$ as result of current function call

The only difference between this and the hierarchical procedure is the presence of the conditional step is the beginning, that tests whether it should switch over to the exact algorithm. As will be shown in later chapters, the accuracy of this algorithm is shown not to be more advantageous than using the hierarchical or the exact algorithms directly, depending on the situation.

thalesmello/segmentr documentation built on March 4, 2020, 1 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com