In thalesmello/segmentr: Segment Data Minimizing a Cost Function

Main Definition

In this chapter, we explain the problem by providing definitions and equations that gradually build the path to the proposed solution.

A Sequence of Random Variables and Its Segments

As shown in \@ref(eq:definerandomvector), let $X=(X_1, \dots , X_m)$, be a vector of random variables whose elements $X_j$ belong to a set $S$. The equation also shows all random variables $X_j$ have probability distribution $F$, each with parameter $\Theta_j$, such that $X_j \sim F(\Theta_j)$. A segment is defined as a sequence of indices for which $\Theta_a = \Theta_{a+1} = \dots = \Theta_b$ such that $1 \le a \le b \le m$, for $a$ and $b$ as the first and last indices of the segment, respectively. We also define a change point $c$ as being an index for which there is a change in the probability distribution parameters, i.e. $\Theta_{c-1} \ne \Theta_c$, for $c \ge 2$.

\begin{equation} \begin{gathered} X= (X_1, \dots , X_m) \ X_j\in S, \text{ for } 1 \le j \le m \text{ and } j \in \Z \ X_j\sim F(\Theta_j) \end{gathered} (#eq:definerandomvector) \end{equation}

Putting it all together in \@ref(eq:definesegments), a data set $X$ with $t$ change points has each segment represented by the sequence $S_k = X_{c_k}, \dots, X_{c_{k+1}-1}$, with $c_k$ each representing a change point when $k \le 1$. In order to simplify equations, let $c_0 = 1$ and $c_{t+1} = m + 1$, even though they are not change points.

\begin{equation} \begin{gathered} c_0 = 1\ c_{t+1} = m + 1 \ 1 \lt c_{1} \lt \dots \lt c_k \lt \dots \lt c_t \lt m + 1 \, \text{ for } 0 \le k \le t \text{ and } k \in \Z \ \Theta_{c_k} = \Theta_{c_k+1} = \dots = \Theta_{c_{k+1}-1} \ \Theta_{c_k - 1} \ne \Theta_{c_k}, \text{ for } 1 \le k \le t \text{ and } k \in \Z \ S_k = X_{c_k}, \dots, X_{c_{k+1}-1} \end{gathered} (#eq:definesegments) \end{equation}

The Segmentation Problem

As defined in \@ref(eq:definematrixd), let $D$ be a $n \times m$ matrix with elements $x_{i,j}$, with row and column indices $i$ and $j$, respectively. Each of the $n$ rows in the data matrix $D$ is a sample of the vector of random variables $X$, i.e. any row element with column index $j$ is a sample of the random variable $X_j$. In \@ref(eq:definechangepointset), let $C$ is the set of change points in the data set. Considering all random variables within a segment have the same probability function parameter, let $K$ be a cost function that calculates the cost value for a segment $S_k$ with estimated parameter $\Theta_k$, i.e. the calculated cost is minimal when the estimated $\Theta_k$ is optimal for the segment $S_k$. Therefore, a common definition for $K$ is the opposite of the maximum value of a log-likelihood function $L$, illustrated in \@ref(eq:commoncostfunction).

\begin{equation} D = \begin{bmatrix} x_{11} & \dots & x_{1j} & \dots & x_{1m} \ \dots & \dots & \dots & \dots & \dots \ x_{i1} & \dots & x_{ij} & \dots & x_{im} \ \dots & \dots & \dots & \dots & \dots \ x_{n1} & \dots & x_{nj} & \dots & x_{nm} \end{bmatrix}_{n \times m} (#eq:definematrixd) \end{equation}

\begin{equation} C = { c_1, \dots, c_t } (#eq:definechangepointset) \end{equation}

\begin{equation} K(S_k) = -max\left{L(\Theta|S_k)\right} (#eq:commoncostfunction) \end{equation}

Let $T$ be the total cost of the segments, i.e. the sum of all segment costs, as illustrated in \@ref(eq:totalcost). So, the segmentation problem is defined as estimating the optimal set of change points $C$ that minimize the total cost $T$, as described in \@ref(eq:changepointsestimation).

Notice that, in \@ref(eq:totalcost) and in \@ref(eq:changepointsestimation), the cost function $K$ and the data set $D$ are parameters to the segmentation problem, which will ultimately be used to search for the set of change points $C$ using dynamic programming. So, let $s(K, D) = C$ be the segmentation function whose output is the estimation of change points, as illustrated in \@ref(eq:changepointsestimation). The segmentr package implements a few different segmentation function algorithms, which are described in this work.

\begin{equation} T = \sum_{k=0}^t K(S_k) (#eq:totalcost) \end{equation}

\begin{equation} C = arg\,min\left{ T \right} = arg\,min\left{ \sum_{k=0}^t K(S_k) \right} (#eq:changepointsestimation) \end{equation}

In summary, segmentr is a generalization of the work described in [@base-paper], in which the segments of finite alphabet letters are picked by maximizing the discrete multivariate likelihood function. In this work, with the difference that we minimize a cost function rather than maximize a likelihood function, we solve the same problem proposed by [@base-paper], with the difference the cost function concept can be applied to many other use cases as well, e.g. segments with changes in the mean of columns, and segments with different linear regression parameters. These cases are discussed in Chapters \@ref(simulations) and \@ref(real-data-examples) respectively.

The Hausdorff Distance

As the solution estimated by the segmentation function is a set of points, a measure is made necessary to compare the estimations provided by the different estimation algorithms that will be presented in Chapter \@ref(solution-estimation). The Hausdorff distance is commonly used as a measure of distance between two distinct sets, and will be used to compare change point sets with differing numbers of elements each.

[@topology] defines the Hausdorff distance as a measure of how far two subsets from a metric space are from one another. It's defined as the biggest of all the distances from a point in one set to the closest point in the other set, which is expressed mathematically in \@ref(eq:hausdorffformula), for sets $X$ and $Y$ given as input to the function and $d(x, y)$ a distance function defined in the metric space $S$ such that $x, y \in S$.

\begin{equation} d_H(X,Y) = \max\left{\,\sup_{x \in X} \inf_{y \in Y} d(x,y)\, ,\, \sup_{y \in Y} \inf_{x \in X} d(x,y)\,\right} (#eq:hausdorffformula) \end{equation}

Since a set of change points is a subset of the column indices in a data set, we define the distance $d$ as the absolute value of two given numbers for the use-cases analyzed in this paper.

thalesmello/segmentr documentation built on March 4, 2020, 1 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com