In thalesmello/segmentr: Segment Data Minimizing a Cost Function

Simulations

source("helper.R")

To exemplify the utility of this package, a handful of hypothetical data are presented, together with proposals of cost functions that are expected to segment the data adequately.

Segments of Independent Variables

To show compatibility with the work presented in [@base-paper], we analyze the same problem presented in their work.

Let $B = (B_1, \dots , B_6)$ be a sequence of independent random variables with Bernoulli distributions of probability $0.5$ and let $X = (X_1, \dots , X_15)$ be another sequence of random variables dependent on $B$, with relationships described in \@ref(eq:examplecolumns). From the relationships, it's possible to see the first segment $X_1, ..., X_5$ depend on $B_1, B_2$, the second segment $X_6, ..., X_{10}$ depend on $B_3, B_4$ and the third and last segment $X_{11}, ..., X_{15}$ depend on $X_4, X_5$. Therefore, the set of change points for $X$ is $C = {6, 11}$.

\begin{equation} \begin{aligned} X_1 & = B_1 \ X_2 & = B_1 - B_2 \ X_3 & = B_2 \ X_4 & = B_1 + B_2 \ X_5 & = B_1 \ X_6 & = B_3 \ X_7 & = B_3 - B_4 \ X_8 & = B_4 \ X_9 & = B_3 + B_4 \ X_{10} & = B_3 \ X_{11} & = B_5 \ X_{12} & = B_5 - B_6 \ X_{13} & = B_6 \ X_{14} & = B_5 + B_6 \ X_{15} & = B_5 \end{aligned} (#eq:examplecolumns) \end{equation}

Given $D$, illustrated in \@ref(tab:sample-columns), a sample data set of the sequence of random variables $X$, the multivariate likelihood function $L$ [@park2017fundamentals] can be used to estimate the likelihood of a given segment in the discrete data set, as described in \@ref(eq:multivariatelikelihood). The multivariate() is a fast native-code implementation of that function that is available in the segmentr package. We then use that likelihood function to define a penalized cost function $K$ in \@ref(eq:costforexamplecolumns) with an extra term to penalize segments that are too large, based on the number of columns $\text{ncol}(S_k)$ of the segment. Notice it's important to use a penalized cost function because the original likelihood function tends to favor bigger segments.

\begin{equation} \log(L(X|S_k)) = \sum_{k=0}^t P(X_{c_k:c_{k+1}-1} = D_{c_k:c_{k+1}-1}) (#eq:multivariatelikelihood) \end{equation}

\begin{equation} K(S_k) = - \log(L(S_k)) + \text{ncol}(S_k) (#eq:costforexamplecolumns) \end{equation}

Therefore, the result of segmenting the data set $D$ with the cost function $K$ is shown in Table \@ref(tab:results-multivariate-penalized), matching the expected set of change points $C$. For contrast, an estimate solution of the change points using an unpenalized cost is shown in Table \@ref(tab:results-multivariate-non-penalized).

(ref:sample-columns-caption) Sample values of the correlated random variables defined in \@ref(eq:examplecolumns).

n <- 100
B1 <- sample(1:2, n, replace = TRUE)
B2 <- sample(1:2, n, replace = TRUE)
B3 <- sample(1:2, n, replace = TRUE)
B4 <- sample(1:2, n, replace = TRUE)
B5 <- sample(1:2, n, replace = TRUE)
B6 <- sample(1:2, n, replace = TRUE)

X1 <- B1
X2 <- B1 - B2
X3 <- B2
X4 <- B1 + B2
X5 <- B1
X6 <- B3
X7 <- B3 - B4
X8 <- B4
X9 <- B3 + B4
X10 <- B3
X11 <- B5
X12 <- B5 - B6
X13 <- B6
X14 <- B5 + B6
X15 <- B5
D_example <- cbind(X1, X2, X3, X4, X5, X6, X7, X8,
                   X9, X10, X11, X12, X13, X14, X15)
head(D_example) %>%
  kable(caption="Sample values of the correlated random
        variables defined in (ref:eq-examplecolumns).")

(ref:eq-examplecolumns) \@ref(eq:examplecolumns)

multivariate_cost = function(X) -multivariate(X) + 2 ^ ncol(X)

results_multivariate_penalized <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "exact"
)

print_results_table(
  results_multivariate_penalized,
  caption="Segment tables as defined in
  (ref:eq-examplecolumns)."
)

segment(D_example,
        cost = function(X) -multivariate(X),
        algorithm = "exact") %>%
  print_results_table(
    caption="Results of segmentation using a non-penalized
      multivariate cost")

Segments with Similar Averages

[@homozygosity] describes a process on how to find windows of contiguous homozygosity, i.e. segments in the genetic data in which the alleles are of the same type. This is of interest to a researcher investigating diseases. In this scenario, segmentr can be used to segment random variables $X_1, ..., X_m$ that represent genetic data, encoded as zero for homozygosity and one for heterozygosity, i.e. $0$ when the alleles are the same and $1$ when different.

So, to use segmentr to solve this problem, it's necessary to find a cost function that favors the homogeneity of a given segment. That problem is approached by proposing a heterogeneity cost function, i.e. a function that penalizes segments whose elements are far from the segment average. One such function is defined in \@ref(eq:heterogeneitycost). Notice, however, the cost function proposed favors single column segments, as it's the size for which all the elements approximate the segment average the most. To counter this undesirable behavior, we penalized the function by adding a constant to it, as described in \@ref(eq:penalizedheterogeneitycost). The constant factor has the effect of adding up when too many segments are considered in the estimation process, making it so wider segments are picked up in the estimated solution.

In order to observe how the proposed function behaves, consider the example defined in \@ref(eq:geneticexample), in which $X_i$ for $i \in {1, \dots, 20}$ represent each a column indexed by $i$ of a data set $D$, and $\text{Bern(p)}$ represents the Bernoulli distribution with probability $p$. The result of segmenting a data set sampled from \@ref(eq:geneticexample) using the penalized heterogeneity cost described in \@ref(eq:penalizedheterogeneitycost) is shown in Table \@ref(tab:results-penalized-mean).

\begin{equation} K(S_k)=\sum_i(x_i-E[S_k])^2 (#eq:heterogeneitycost) \end{equation}

\begin{equation} K_p(S_k)=K(S_k) + 1 (#eq:penalizedheterogeneitycost) \end{equation}

\begin{equation} \begin{aligned} X_{1} & \sim \text{Bern}(0.9) \ X_{2} & \sim \text{Bern}(0.9) \ X_{3} & \sim \text{Bern}(0.9) \ X_{4} & \sim \text{Bern}(0.9) \ X_{5} & \sim \text{Bern}(0.9) \ X_{6} & \sim \text{Bern}(0.1) \ X_{7} & \sim \text{Bern}(0.1) \ X_{8} & \sim \text{Bern}(0.1) \ X_{9} & \sim \text{Bern}(0.1) \ X_{10} & \sim \text{Bern}(0.1) \ X_{11} & \sim \text{Bern}(0.1) \ X_{12} & \sim \text{Bern}(0.1) \ X_{13} & \sim \text{Bern}(0.1) \ X_{14} & \sim \text{Bern}(0.1) \ X_{15} & \sim \text{Bern}(0.1) \ X_{16} & \sim \text{Bern}(0.9) \ X_{17} & \sim \text{Bern}(0.9) \ X_{18} & \sim \text{Bern}(0.9) \ X_{19} & \sim \text{Bern}(0.9) \ X_{20} & \sim \text{Bern}(0.9) \end{aligned} (#eq:geneticexample) \end{equation}

(ref:eq-penalizedheterogeneitycost) \@ref(eq:penalizedheterogeneitycost)

(ref:eq-geneticexample) \@ref(eq:geneticexample)

heterogeneity_cost <- function(X) {
  mean_value <- mean(X, na.rm = T)
  if (is.na(mean_value)) {
    0
  } else {
    sum((X - mean_value)^2)
  }
}

penalized_heterogeneity_cost <- function(X) heterogeneity_cost(X) + 1

make_segment <- function(n, p) {
  matrix(rbinom(100 * n, 1, p), nrow = 100)
}

D_genetic <- cbind(
  make_segment(5, 0.9),
  make_segment(10, 0.1),
  make_segment(5, 0.9)
)

segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hieralg"
) %>%
  print_results_table(
    caption="Results of segmentation by applying the
      cost function defined in
      (ref:eq-penalizedheterogeneitycost) to a sample of size
      10 of the model defined in (ref:eq-geneticexample).
    ")

Accuracy of Algorithms Using Different Algorithms

Given different solutions obtained by the segment function, it's often necessary to compare them with each other. For that, the Hausdorff distance, defined in Chapter \@ref(the-hausdorff-distance), can be used to measure a distance between two sets of estimated change points.

We want to measure how well the different algorithms provided in the segmentr package find the segments in the example correlated columns simulated data set described in \@ref(eq:examplecolumns), for which we know what the exact solution to the simulation is.

expected_multivariate_example <- c(6, 11)

deviation <- partial(
  segment_distance,
  changepoints2=expected_multivariate_example
)

exact_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "exact"
)$changepoints

hieralg_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hierarchical"
)$changepoints

hybrid_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hybrid"
)$changepoints

hybrid_multivariate_changepoints_threshold <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hybrid",
  threshold = 4
)$changepoints

tribble(
  ~`Description`,
    ~`Changepoints`,
      ~`Hausdorff Distance`,
  "Expected solution",
    comma_format(expected_multivariate_example),
      deviation(expected_multivariate_example),
  "Exact algorithm estimate",
    comma_format(exact_multivariate_changepoints),
      deviation(exact_multivariate_changepoints),
  "Hierarchical algorithm estimate",
    comma_format(hieralg_multivariate_changepoints),
      deviation(hieralg_multivariate_changepoints),
  "Hybrid algorithm estimate with threshold 50",
    comma_format(hybrid_multivariate_changepoints),
      deviation(hybrid_multivariate_changepoints),
  "Hybrid algorithm estimate with threshold 4",
    comma_format(hybrid_multivariate_changepoints_threshold),
      deviation(hybrid_multivariate_changepoints_threshold)
) %>%
  kable(caption="Comparison of solutions of different algorithms
to segment the data set described in (ref:eq-examplecolumns),
measuring how far each solution is from the ideal
solution using the Hausdorff distance.")

Therefore, in Table \@ref(tab:hausdorff-multivariate-comparison) we can see the comparison of different algorithms with different parameters. We notice the "exact" algorithm manages to properly match the expected change points solution, whereas the "hierarchical" algorithm finds an extra segment, which causes the Hausdorff distance to be bigger than zero. The two "hybrid" algorithm cases are interesting in the sense it that it shows how to algorithm works by either adopting the exact or the hierarchical algorithm depending on the threshold and the size of the segment being analyzed. When the threshold argument is large, it applies the exact algorithm to the data set, whereas when the threshold is small, it applies the hierarchical algorithm instead.

Consider now the simulated data set of segments with similar averages described in equation (ref:eq-geneticexample). We do the same algorithm comparison with this data set in Table \@ref(tab:hausdorff-mean-comparison). In contrast with the previous experiment, all the algorithms manage to find the expected solution to the problem.

expected_mean_example <- c(6, 16)

deviation <- partial(
  segment_distance,
  changepoints2=expected_mean_example
)

exact_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "exact"
)$changepoints

hieralg_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hierarchical"
)$changepoints

hybrid_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hybrid"
)$changepoints

hybrid_mean_changepoints_threshold <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hybrid",
  threshold = 4
)$changepoints

tribble(
  ~`Description`,
    ~`Changepoints`,
      ~`Hausdorff Distance`,
  "Expected solution",
    comma_format(expected_mean_example),
      deviation(expected_mean_example),
  "Exact algorithm estimate",
    comma_format(exact_mean_changepoints),
      deviation(exact_mean_changepoints),
  "Hierarchical algorithm estimate",
    comma_format(hieralg_mean_changepoints),
      deviation(hieralg_mean_changepoints),
  "Hybrid algorithm estimate with threshold 50",
    comma_format(hybrid_mean_changepoints),
      deviation(hybrid_mean_changepoints),
  "Hybrid algorithm estimate with threshold 4",
    comma_format(hybrid_mean_changepoints_threshold),
      deviation(hybrid_mean_changepoints_threshold)
) %>%
  kable(caption="Comparison of solutions of different algorithms
to segment the data set described in (ref:eq-geneticexample),
measuring how far each solution is from the ideal
solution using the Hausdorff distance.")