In this chapter, we explain the problem by providing definitions and equations that gradually build the path to the proposed solution.
As shown in \@ref(eq:definerandomvector), let $X=(X_1, \dots , X_m)$, be a vector of random variables whose elements $X_j$ belong to a set $S$. The equation also shows all random variables $X_j$ have probability distribution $F$, each with parameter $\Theta_j$, such that $X_j \sim F(\Theta_j)$. A segment is defined as a sequence of indices for which $\Theta_a = \Theta_{a+1} = \dots = \Theta_b$ such that $1 \le a \le b \le m$, for $a$ and $b$ as the first and last indices of the segment, respectively. We also define a change point $c$ as being an index for which there is a change in the probability distribution parameters, i.e. $\Theta_{c-1} \ne \Theta_c$, for $c \ge 2$.
\begin{equation} \begin{gathered} X= (X_1, \dots , X_m) \ X_j\in S, \text{ for } 1 \le j \le m \text{ and } j \in \Z \ X_j\sim F(\Theta_j) \end{gathered} (#eq:definerandomvector) \end{equation}
Putting it all together in \@ref(eq:definesegments), a data set $X$ with $t$ change points has each segment represented by the sequence $S_k = X_{c_k}, \dots, X_{c_{k+1}-1}$, with $c_k$ each representing a change point when $k \le 1$. In order to simplify equations, let $c_0 = 1$ and $c_{t+1} = m + 1$, even though they are not change points.
\begin{equation} \begin{gathered} c_0 = 1\ c_{t+1} = m + 1 \ 1 \lt c_{1} \lt \dots \lt c_k \lt \dots \lt c_t \lt m + 1 \, \text{ for } 0 \le k \le t \text{ and } k \in \Z \ \Theta_{c_k} = \Theta_{c_k+1} = \dots = \Theta_{c_{k+1}-1} \ \Theta_{c_k - 1} \ne \Theta_{c_k}, \text{ for } 1 \le k \le t \text{ and } k \in \Z \ S_k = X_{c_k}, \dots, X_{c_{k+1}-1} \end{gathered} (#eq:definesegments) \end{equation}
As defined in \@ref(eq:definematrixd), let $D$ be a $n \times m$ matrix with elements $x_{i,j}$, with row and column indices $i$ and $j$, respectively. Each of the $n$ rows in the data matrix $D$ is a sample of the vector of random variables $X$, i.e. any row element with column index $j$ is a sample of the random variable $X_j$. In \@ref(eq:definechangepointset), let $C$ is the set of change points in the data set. Considering all random variables within a segment have the same probability function parameter, let $K$ be a cost function that calculates the cost value for a segment $S_k$ with estimated parameter $\Theta_k$, i.e. the calculated cost is minimal when the estimated $\Theta_k$ is optimal for the segment $S_k$. Therefore, a common definition for $K$ is the opposite of the maximum value of a log-likelihood function $L$, illustrated in \@ref(eq:commoncostfunction).
\begin{equation} D = \begin{bmatrix} x_{11} & \dots & x_{1j} & \dots & x_{1m} \ \dots & \dots & \dots & \dots & \dots \ x_{i1} & \dots & x_{ij} & \dots & x_{im} \ \dots & \dots & \dots & \dots & \dots \ x_{n1} & \dots & x_{nj} & \dots & x_{nm} \end{bmatrix}_{n \times m} (#eq:definematrixd) \end{equation}
\begin{equation} C = { c_1, \dots, c_t } (#eq:definechangepointset) \end{equation}
\begin{equation} K(S_k) = -max\left{L(\Theta|S_k)\right} (#eq:commoncostfunction) \end{equation}
Let $T$ be the total cost of the segments, i.e. the sum of all segment costs, as illustrated in \@ref(eq:totalcost). So, the segmentation problem is defined as estimating the optimal set of change points $C$ that minimize the total cost $T$, as described in \@ref(eq:changepointsestimation).
Notice that, in \@ref(eq:totalcost) and in \@ref(eq:changepointsestimation),
the cost function $K$ and the data set $D$ are parameters to the segmentation
problem, which will ultimately be used to search for the set of change points
$C$ using dynamic programming. So, let $s(K, D) = C$ be the segmentation
function whose output is the estimation of change points, as illustrated in
\@ref(eq:changepointsestimation). The segmentr
package implements a few different
segmentation function algorithms, which are described in this work.
\begin{equation} T = \sum_{k=0}^t K(S_k) (#eq:totalcost) \end{equation}
\begin{equation} C = arg\,min\left{ T \right} = arg\,min\left{ \sum_{k=0}^t K(S_k) \right} (#eq:changepointsestimation) \end{equation}
In summary, segmentr
is a generalization of the work described in
[@base-paper], in which the segments of finite alphabet letters are picked by
maximizing the discrete multivariate likelihood function. In this
work, with the difference that we minimize a cost function rather than maximize
a likelihood function, we solve the same problem proposed by [@base-paper], with the
difference the cost function concept can be applied to many other use cases as
well, e.g. segments with changes in the mean of columns, and segments with
different linear regression parameters. These cases are discussed in
Chapters \@ref(simulations) and \@ref(real-data-examples) respectively.
As the solution estimated by the segmentation function is a set of points, a measure is made necessary to compare the estimations provided by the different estimation algorithms that will be presented in Chapter \@ref(solution-estimation). The Hausdorff distance is commonly used as a measure of distance between two distinct sets, and will be used to compare change point sets with differing numbers of elements each.
[@topology] defines the Hausdorff distance as a measure of how far two subsets from a metric space are from one another. It's defined as the biggest of all the distances from a point in one set to the closest point in the other set, which is expressed mathematically in \@ref(eq:hausdorffformula), for sets $X$ and $Y$ given as input to the function and $d(x, y)$ a distance function defined in the metric space $S$ such that $x, y \in S$.
\begin{equation} d_H(X,Y) = \max\left{\,\sup_{x \in X} \inf_{y \in Y} d(x,y)\, ,\, \sup_{y \in Y} \inf_{x \in X} d(x,y)\,\right} (#eq:hausdorffformula) \end{equation}
Since a set of change points is a subset of the column indices in a data set, we define the distance $d$ as the absolute value of two given numbers for the use-cases analyzed in this paper.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.