source("helper.R")
To exemplify the utility of this package, we present a handful of hypothetical data sets, together with proposed cost functions that are expected to segment each of them adequately.
To show compatibility with the work presented in [@base-paper], we analyze the same problem studied there.
Let $B = (B_1, \dots, B_6)$ be a sequence of independent random variables with Bernoulli distributions of probability $0.5$, and let $X = (X_1, \dots, X_{15})$ be another sequence of random variables, dependent on $B$, with the relationships described in \@ref(eq:examplecolumns). From the relationships, it is possible to see that the first segment $X_1, \dots, X_5$ depends on $B_1, B_2$, the second segment $X_6, \dots, X_{10}$ depends on $B_3, B_4$, and the third and last segment $X_{11}, \dots, X_{15}$ depends on $B_5, B_6$. Therefore, the set of change points for $X$ is $C = \{6, 11\}$.
\begin{equation}
\begin{aligned}
X_1 & = B_1 & X_6 & = B_3 & X_{11} & = B_5 \\
X_2 & = B_1 - B_2 & X_7 & = B_3 - B_4 & X_{12} & = B_5 - B_6 \\
X_3 & = B_2 & X_8 & = B_4 & X_{13} & = B_6 \\
X_4 & = B_1 + B_2 & X_9 & = B_3 + B_4 & X_{14} & = B_5 + B_6 \\
X_5 & = B_1 & X_{10} & = B_3 & X_{15} & = B_5
\end{aligned}
(\#eq:examplecolumns)
\end{equation}
Given $D$, illustrated in Table \@ref(tab:sample-columns), a sample data set of the
sequence of random variables $X$, the multivariate likelihood function $L$
[@park2017fundamentals] can be used to estimate the likelihood of a given
segment in the discrete data set, as described in
\@ref(eq:multivariatelikelihood), in which $c_1, \dots, c_t$ denote the change
points that delimit the segments. The `multivariate()` function, available in
the segmentr package, is a fast native-code implementation of that likelihood.
We then use it to define a penalized cost function $K$ in
\@ref(eq:costforexamplecolumns), with an extra term $2^{\text{ncol}(S_k)}$,
based on the number of columns $\text{ncol}(S_k)$ of the segment, that
penalizes segments that are too large. Note that it is important to use a
penalized cost function, because the unpenalized likelihood function tends to
favor bigger segments.
\begin{equation}
\log(L(X \mid S)) = \sum_{k=0}^{t} \log P(X_{c_k:c_{k+1}-1} = D_{c_k:c_{k+1}-1})
(\#eq:multivariatelikelihood)
\end{equation}
\begin{equation}
K(S_k) = -\log(L(S_k)) + 2^{\text{ncol}(S_k)}
(\#eq:costforexamplecolumns)
\end{equation}
Therefore, the result of segmenting the data set $D$ with the cost function $K$ is shown in Table \@ref(tab:results-multivariate-penalized), matching the expected set of change points $C$. For contrast, an estimated solution of the change points using a non-penalized cost is shown in Table \@ref(tab:results-multivariate-non-penalized).
(ref:sample-columns-caption) Sample values of the correlated random variables defined in \@ref(eq:examplecolumns).
```{r sample-columns}
n <- 100

# Sample the independent variables B1, ..., B6: coin flips taking
# the values 1 or 2 with equal probability
B1 <- sample(1:2, n, replace = TRUE)
B2 <- sample(1:2, n, replace = TRUE)
B3 <- sample(1:2, n, replace = TRUE)
B4 <- sample(1:2, n, replace = TRUE)
B5 <- sample(1:2, n, replace = TRUE)
B6 <- sample(1:2, n, replace = TRUE)

# Build the dependent variables X1, ..., X15 from the relationships
# defined in the equation above
X1 <- B1
X2 <- B1 - B2
X3 <- B2
X4 <- B1 + B2
X5 <- B1
X6 <- B3
X7 <- B3 - B4
X8 <- B4
X9 <- B3 + B4
X10 <- B3
X11 <- B5
X12 <- B5 - B6
X13 <- B6
X14 <- B5 + B6
X15 <- B5

D_example <- cbind(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15)

head(D_example) %>%
  kable(caption = "(ref:sample-columns-caption)")
```
(ref:eq-examplecolumns) \@ref(eq:examplecolumns)
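Before defining a cost function, it can help to probe the likelihood of candidate segments directly. The sketch below evaluates `multivariate()` on two sub-matrices of `D_example`; the column ranges are arbitrary choices for illustration, not package output.

```r
# Exploratory sketch: compare the multivariate log-likelihood of a candidate
# segment that matches the true structure against a wider candidate that
# spans two true segments. Column ranges are illustrative choices.
multivariate(D_example[, 1:5])   # columns driven by B1, B2 only
multivariate(D_example[, 1:10])  # columns spanning two true segments
```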
```{r results-multivariate-penalized}
# Penalized cost: negative log-likelihood plus an exponential penalty
# on the number of columns in the segment
multivariate_cost <- function(X) -multivariate(X) + 2^ncol(X)

results_multivariate_penalized <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "exact"
)

print_results_table(
  results_multivariate_penalized,
  caption = "Results of segmentation using the penalized multivariate cost applied to the data set defined in (ref:eq-examplecolumns)."
)
```
```{r results-multivariate-non-penalized}
segment(D_example, cost = function(X) -multivariate(X), algorithm = "exact") %>%
  print_results_table(
    caption = "Results of segmentation using a non-penalized multivariate cost."
  )
```
[@homozygosity] describes a process for finding windows of contiguous
homozygosity, i.e. segments in the genetic data in which the alleles are of the
same type, which is of interest to researchers investigating diseases.
In this scenario, segmentr can be used to segment random
variables $X_1, \dots, X_m$ that represent genetic data, encoded as zero for
homozygosity and one for heterozygosity, i.e. $0$ when the alleles are the
same and $1$ when they differ.
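As a brief illustration of this encoding (the helper below is hypothetical and not part of segmentr), two allele vectors can be converted into the zero/one representation as follows:

```r
# Hypothetical helper: returns 0 at loci where the two alleles match
# (homozygous) and 1 where they differ (heterozygous).
encode_zygosity <- function(allele1, allele2) {
  as.integer(allele1 != allele2)
}

encode_zygosity(c("A", "A", "G", "T"), c("A", "G", "G", "C"))
#> [1] 0 1 0 1
```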
So, to use segmentr to solve this problem, it is necessary to find a cost
function that favors the homogeneity of a given segment. We approach this by
proposing a heterogeneity cost function, i.e. a function that penalizes
segments whose elements are far from the segment average. One such function is
defined in \@ref(eq:heterogeneitycost). Notice, however, that the proposed cost
function favors single-column segments, as that is the size for which all
elements are closest to the segment average. To counter this undesirable
behavior, we penalize the function by adding a constant to it, as described in
\@ref(eq:penalizedheterogeneitycost). The constant adds up when too many
segments are considered in the estimation process, so wider segments are picked
in the estimated solution.
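To make that effect concrete, consider a minimal sketch with made-up cost values: splitting a region into five single-column segments pays the constant five times, so the wide segment wins whenever the heterogeneity it accumulates is smaller than the extra constants.

```r
# Sketch with illustrative numbers, not package output: one wide segment
# with hypothetical heterogeneity 0.4 versus five perfectly homogeneous
# single-column segments (heterogeneity 0 each).
cost_wide  <- 0.4 + 1      # K_p of one wide segment
cost_split <- 5 * (0 + 1)  # K_p summed over five single-column segments
cost_wide < cost_split     # TRUE: the wide segment is preferred
```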
To observe how the proposed function behaves, consider the example defined in \@ref(eq:geneticexample), in which $X_i$, for $i \in \{1, \dots, 20\}$, each represent a column indexed by $i$ of a data set $D$, and $\text{Bern}(p)$ represents the Bernoulli distribution with probability $p$. The result of segmenting a data set sampled from \@ref(eq:geneticexample) using the penalized heterogeneity cost described in \@ref(eq:penalizedheterogeneitycost) is shown in Table \@ref(tab:results-penalized-mean).
\begin{equation}
K(S_k) = \sum_i (x_i - E[S_k])^2
(\#eq:heterogeneitycost)
\end{equation}
\begin{equation}
K_p(S_k) = K(S_k) + 1
(\#eq:penalizedheterogeneitycost)
\end{equation}
\begin{equation}
\begin{aligned}
X_{1} & \sim \text{Bern}(0.9) & X_{6} & \sim \text{Bern}(0.1) & X_{11} & \sim \text{Bern}(0.1) & X_{16} & \sim \text{Bern}(0.9) \\
X_{2} & \sim \text{Bern}(0.9) & X_{7} & \sim \text{Bern}(0.1) & X_{12} & \sim \text{Bern}(0.1) & X_{17} & \sim \text{Bern}(0.9) \\
X_{3} & \sim \text{Bern}(0.9) & X_{8} & \sim \text{Bern}(0.1) & X_{13} & \sim \text{Bern}(0.1) & X_{18} & \sim \text{Bern}(0.9) \\
X_{4} & \sim \text{Bern}(0.9) & X_{9} & \sim \text{Bern}(0.1) & X_{14} & \sim \text{Bern}(0.1) & X_{19} & \sim \text{Bern}(0.9) \\
X_{5} & \sim \text{Bern}(0.9) & X_{10} & \sim \text{Bern}(0.1) & X_{15} & \sim \text{Bern}(0.1) & X_{20} & \sim \text{Bern}(0.9)
\end{aligned}
(\#eq:geneticexample)
\end{equation}
(ref:eq-penalizedheterogeneitycost) \@ref(eq:penalizedheterogeneitycost)
(ref:eq-geneticexample) \@ref(eq:geneticexample)
```{r results-penalized-mean}
# Heterogeneity cost: sum of squared deviations from the segment average
heterogeneity_cost <- function(X) {
  mean_value <- mean(X, na.rm = TRUE)
  if (is.na(mean_value)) {
    0
  } else {
    sum((X - mean_value)^2)
  }
}

# Add a constant to penalize solutions with too many segments
penalized_heterogeneity_cost <- function(X) heterogeneity_cost(X) + 1

# Sample a block of n columns and 100 rows from Bern(p)
make_segment <- function(n, p) {
  matrix(rbinom(100 * n, 1, p), nrow = 100)
}

D_genetic <- cbind(
  make_segment(5, 0.9),
  make_segment(10, 0.1),
  make_segment(5, 0.9)
)

segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hieralg"
) %>%
  print_results_table(
    caption = "Results of segmentation by applying the cost function defined in (ref:eq-penalizedheterogeneitycost) to a sample of size 100 of the model defined in (ref:eq-geneticexample)."
  )
```
Given different solutions obtained by the `segment()` function, it is often
necessary to compare them with each other. For that, the Hausdorff distance,
defined in Chapter \@ref(the-hausdorff-distance), can be used to measure the
distance between two sets of estimated change points.
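A minimal usage sketch, assuming the `segment_distance()` helper used in the chunks below takes two change point vectors and returns the Hausdorff distance between them:

```r
# Minimal sketch with hand-picked change point vectors.
segment_distance(c(6, 11), c(6, 11))  # identical sets yield distance zero
segment_distance(c(6, 11), c(6, 14))  # a displaced change point yields a positive distance
```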
We want to measure how well the different algorithms provided in the segmentr
package find the segments in the correlated-columns simulated data set
described in \@ref(eq:examplecolumns), for which we know the exact solution to
the simulation.
```{r hausdorff-multivariate-comparison}
expected_multivariate_example <- c(6, 11)

# Distance from a candidate solution to the expected change points
deviation <- partial(
  segment_distance,
  changepoints2 = expected_multivariate_example
)

exact_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "exact"
)$changepoints

hieralg_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hierarchical"
)$changepoints

hybrid_multivariate_changepoints <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hybrid"
)$changepoints

hybrid_multivariate_changepoints_threshold <- segment(
  D_example,
  cost = multivariate_cost,
  algorithm = "hybrid",
  threshold = 4
)$changepoints

tribble(
  ~`Description`, ~`Changepoints`, ~`Hausdorff Distance`,
  "Expected solution", comma_format(expected_multivariate_example), deviation(expected_multivariate_example),
  "Exact algorithm estimate", comma_format(exact_multivariate_changepoints), deviation(exact_multivariate_changepoints),
  "Hierarchical algorithm estimate", comma_format(hieralg_multivariate_changepoints), deviation(hieralg_multivariate_changepoints),
  "Hybrid algorithm estimate with threshold 50", comma_format(hybrid_multivariate_changepoints), deviation(hybrid_multivariate_changepoints),
  "Hybrid algorithm estimate with threshold 4", comma_format(hybrid_multivariate_changepoints_threshold), deviation(hybrid_multivariate_changepoints_threshold)
) %>%
  kable(caption = "Comparison of solutions of different algorithms to segment the data set described in (ref:eq-examplecolumns), measuring how far each solution is from the ideal solution using the Hausdorff distance.")
```
In Table \@ref(tab:hausdorff-multivariate-comparison) we can see the comparison of the different algorithms with different parameters. We notice the "exact" algorithm manages to match the expected change point solution, whereas the "hierarchical" algorithm finds an extra segment, which causes its Hausdorff distance to be greater than zero. The two "hybrid" cases are interesting in that they show how the algorithm works by adopting either the exact or the hierarchical approach depending on the threshold and the size of the segment being analyzed: when the threshold argument is large, it applies the exact algorithm to the data set, whereas when the threshold is small, it applies the hierarchical algorithm instead.
Consider now the simulated data set of segments with similar averages described in \@ref(eq:geneticexample). We perform the same algorithm comparison with this data set in Table \@ref(tab:hausdorff-mean-comparison). In contrast with the previous experiment, all the algorithms manage to find the expected solution to the problem.
```{r hausdorff-mean-comparison}
expected_mean_example <- c(6, 16)

# Distance from a candidate solution to the expected change points
deviation <- partial(
  segment_distance,
  changepoints2 = expected_mean_example
)

exact_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "exact"
)$changepoints

hieralg_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hierarchical"
)$changepoints

hybrid_mean_changepoints <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hybrid"
)$changepoints

hybrid_mean_changepoints_threshold <- segment(
  D_genetic,
  cost = penalized_heterogeneity_cost,
  algorithm = "hybrid",
  threshold = 4
)$changepoints

tribble(
  ~`Description`, ~`Changepoints`, ~`Hausdorff Distance`,
  "Expected solution", comma_format(expected_mean_example), deviation(expected_mean_example),
  "Exact algorithm estimate", comma_format(exact_mean_changepoints), deviation(exact_mean_changepoints),
  "Hierarchical algorithm estimate", comma_format(hieralg_mean_changepoints), deviation(hieralg_mean_changepoints),
  "Hybrid algorithm estimate with threshold 50", comma_format(hybrid_mean_changepoints), deviation(hybrid_mean_changepoints),
  "Hybrid algorithm estimate with threshold 4", comma_format(hybrid_mean_changepoints_threshold), deviation(hybrid_mean_changepoints_threshold)
) %>%
  kable(caption = "Comparison of solutions of different algorithms to segment the data set described in (ref:eq-geneticexample), measuring how far each solution is from the ideal solution using the Hausdorff distance.")
```