The test statistics utilized in the maRkov package are drawn from the paper "Exact Goodness-of-Fit Tests for Markov Chains" [@BIOM:BIOM12009]. This vignette will attempt to give a brief introduction so that the user of the maRkov package is better equipped to understand the guts of the included programs.

Introduction to Markov chains

Markov chains were developed over a hundred years ago by Andrey Markov in an attempt to analyze the distribution of vowels and consonants in Russian poetry. Over the years, their use has expanded into a wide array of sciences, including genetics, biology, and physics.

Despite being extremely powerful tools, Markov chains are very simple processes at their most basic level. The chains are composed of a set of states, called a state space, as well as a set of probabilities for transitioning from each state to each other state, usually encoded in a transition matrix. Because this package is primarily concerned with binary (two state) Markov chains, we'll be using those as examples.

The defining feature of Markov chains is that they are memoryless. That is to say, the probability of transitioning from one state to another is not affected in any way by the history of the chain. The only determining factor for the transition probabilities is the state that the chain is currently in.

+-------+-----+-----+
| State |  0  |  1  |
+-------+-----+-----+
|   0   | .6  | .4  |
+-------+-----+-----+
|   1   | .3  | .7  |
+-------+-----+-----+

The above table is a transition matrix. The first column represents the state that the Markov chain is currently in, and the first row represents the state that it is transitioning to. In this case, the probability of transitioning from state 0 to 0 is .6, from 0 to 1 is .4, from 1 to 0 is .3, and from 1 to 1 is .7.
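To make the table concrete, here is a minimal R sketch (not part of maRkov) that encodes this transition matrix and simulates a short binary chain from it; the helper name simulate_chain() is ours, chosen for illustration.

```r
# Encode the transition matrix above; rows are the current state, columns the next.
P <- matrix(c(0.6, 0.4,
              0.3, 0.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(from = c("0", "1"), to = c("0", "1")))

# Simulate a binary chain of length n by repeatedly sampling the next state
# from the row of P that corresponds to the current state.
simulate_chain <- function(P, n, start = "0") {
  x <- character(n)
  x[1] <- start
  for (i in 2:n) {
    x[i] <- sample(colnames(P), 1, prob = P[x[i - 1], ])
  }
  as.integer(x)
}

set.seed(1)
simulate_chain(P, 10)
```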

The Markov chain described in the above table is a first order Markov chain. This means that each state consists of the state of the process at a single time step. But it doesn't have to be this way. We can expand our state space so that it includes all possible ordered pairs of individual states, thus capturing two time steps at once. An example of a second order Markov chain is given in the table below.

+-------+-------+-------+-------+-------+
| State | (0,0) | (0,1) | (1,0) | (1,1) |
+-------+-------+-------+-------+-------+
| (0,0) |  .6   |  .4   |   0   |   0   |
+-------+-------+-------+-------+-------+
| (0,1) |   0   |   0   |  .3   |  .7   |
+-------+-------+-------+-------+-------+
| (1,0) |  .6   |  .4   |   0   |   0   |
+-------+-------+-------+-------+-------+
| (1,1) |   0   |   0   |  .3   |  .7   |
+-------+-------+-------+-------+-------+

This table is read in the same way as the first. The first thing that pops out is that it is impossible to transition to certain states from any given state. This is because, despite the fact that each state captures two time steps, it is still only possible to take one step at a time. Therefore, if the process is in, say, state (0,0), when time moves forward one step, the 0 in the first position will be "pushed out" of the state, and the 0 in the second position will be pushed into its place. Hence, if the process is in state (0,0), the next state it is in will necessarily have a first element of 0.

While second order Markov chains are certainly useful, and may in many cases have more power to describe real life processes, they come with a serious cost in computational efficiency. As you can see, each increase in the order of a Markov chain multiplies the number of states by the number of states in the first order chain. Thus, to reduce computational demand, it is in our interest to keep the total number of states to a minimum.
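As a quick illustration of that growth, the following sketch enumerates the state spaces of higher order binary chains; an order $m$ chain built on $k$ basic states has $k^m$ states.

```r
# Each additional order multiplies the state count by the number of basic states.
states <- c(0, 1)
nrow(expand.grid(states, states))          # 4 states for a second order binary chain
nrow(expand.grid(states, states, states))  # 8 states for a third order binary chain
```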

Questions of interest & test statistics

The general principle underpinning the package is that single or multiple real sequences of binary data can be used to produce a set of exchangeable sequences, generated so as to preserve a certain property of the original sequence. This property is the count of transitions between states of the chain. For example, the sequence of data a = { 1, 0, 0, 0, 1, 0, 1} has two transitions from 1 to 0, two transitions from 0 to 0, and two transitions from 0 to 1. An example of a different sequence with the same transition counts would be b = { 1, 0, 0, 1, 0, 0, 1}.
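The following sketch (our own helper, not a maRkov function) tabulates first order transition counts and confirms that the two example sequences share them.

```r
# Count transitions by pairing each entry with its successor.
transition_counts <- function(x) {
  table(from = head(x, -1), to = tail(x, -1))
}

a <- c(1, 0, 0, 0, 1, 0, 1)
b <- c(1, 0, 0, 1, 0, 0, 1)
transition_counts(a)
transition_counts(b)
```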

In order to generate a large number of such sequences that are independent of the original one, the procedure attempts to swap entries in the supplied sequence some number of times. For each attempted swap, the sequence is checked to make sure that the transition counts are unchanged. If they are, the swap is made and the new sequence is kept. If not, another swap is attempted. The swaps that are attempted are determined by randomly generating indices of the sequence(s).

Once all the swaps have been attempted, the resulting sequence of data is saved. Then, using that new sequence as the starting point, the procedure attempts more swaps, after which it saves the resulting sequence. As many generated sequences as are desired are created in this fashion, and all are independent of the originally supplied sequence.
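The sketch below illustrates the idea in plain R; it is not the package's internal implementation, just a toy version of the accept-or-discard swapping step, reusing the transition_counts() helper from above.

```r
# Attempt random swaps, keeping only those that leave the transition counts unchanged.
swap_sequence <- function(x, attempts = 100) {
  target <- transition_counts(x)      # the counts that must be preserved
  for (k in seq_len(attempts)) {
    idx <- sample(length(x), 2)       # two randomly chosen positions
    y <- x
    y[idx] <- y[rev(idx)]             # tentative swap
    if (identical(transition_counts(y), target)) {
      x <- y                          # accept; otherwise the swap is discarded
    }
  }
  x
}

set.seed(2)
swap_sequence(a, attempts = 200)
```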

Single sequences

For single sequences of data, the primary question of interest addressed by the maRkov package is: is a first order binary Markov chain adequate to describe the given sequence, or is a second order chain needed? The function used to answer this question is single_binary_test(), which gives three test statistics: the likelihood ratio test statistic, the Pearson's chi-squared ($\chi^2$) test statistic, and the run test statistic for a run of a given length. The likelihood ratio test statistic addresses the above question of interest. The $\chi^2$ test statistic functions as an approximation of the likelihood ratio test statistic, because they have similar distributions. Lastly, the run test statistic is useful for judging the homogeneity of the sequence, although it does not correspond to a formal question of interest presented here.

Once single_binary_test() has generated a sufficient number of exchangeable sequences, it goes about computing the aforementioned test statistics. Since the test statistics in the maRkov package are concerned with determining the appropriateness of first order chains against second order ones, a table of second order transition counts is compiled for each sequence. From this table, the program computes the test statistics.
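A sketch of that tabulation (again with our own helper names): the second order transition count $n_{ijk}$ is just the number of times the triple $(i, j, k)$ occurs at consecutive positions of the sequence.

```r
# Tally n_ijk, the number of occurrences of each consecutive triple (i, j, k).
second_order_counts <- function(x) {
  n <- length(x)
  table(i = x[1:(n - 2)], j = x[2:(n - 1)], k = x[3:n])
}

second_order_counts(a)   # using the example sequence a from the earlier sketch
```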

$$ LRT = 2 \sum_{i,j,k}^{} n_{ijk} \log\Bigg(\frac{\frac{n_{ijk}}{n_{ij+}}}{\frac{n_{+jk}}{n_{+j+}}}\Bigg) $$

$$ \chi^2 = \sum_{i,j,k}^{} \frac{\bigg(n_{ijk}-\frac{n_{ij+}n_{+jk}}{n_{+j+}}\bigg)^2}{\frac{n_{ij+}n_{+jk}}{n_{+j+}}} $$

In the equations above, $n_{ijk}$ represents the frequency of the transition from state $ij$ to state $jk$ in a second order chain. An $n$ with a $+$ in the subscript indicates summation over the corresponding index or indices among $i$, $j$, and $k$. It should also be noted that $i$, $j$, and $k$ each take values in $\{0,1\}$.
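These formulas translate into R fairly directly; the sketch below (not the package code) computes both statistics from the $n_{ijk}$ table built earlier, treating $0 \log 0$ terms as zero.

```r
# Compute the LRT and chi-squared statistics from the second order counts.
second_order_stats <- function(x) {
  n_ijk <- second_order_counts(x)         # helper from the previous sketch
  n_ij  <- apply(n_ijk, c(1, 2), sum)     # n_{ij+}
  n_jk  <- apply(n_ijk, c(2, 3), sum)     # n_{+jk}
  n_j   <- apply(n_ijk, 2, sum)           # n_{+j+}

  lrt <- 0
  chisq <- 0
  for (i in 1:2) for (j in 1:2) for (k in 1:2) {
    expected <- n_ij[i, j] * n_jk[j, k] / n_j[j]
    if (n_ijk[i, j, k] > 0) {
      lrt <- lrt + 2 * n_ijk[i, j, k] * log(n_ijk[i, j, k] / expected)
    }
    if (is.finite(expected) && expected > 0) {
      chisq <- chisq + (n_ijk[i, j, k] - expected)^2 / expected
    }
  }
  c(LRT = lrt, chi.squared = chisq)
}

second_order_stats(c(1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1))
```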

Run test statistics simply represent the number of times that a run of a certain specified length occurs within the sequence. The exact formula is shown below for a run of length $p$, where $x_i$ is the $i^{th}$ index of a sequence of data $x$ of length $l$.

$$ Run = \sum_{i=1}^{l-p} 1_{x_i =1 \& x_{i+1} =1 \& \cdots \& x_{i+p} =1} $$
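A direct translation of the indicator sum above into R might look like the following sketch (the helper name is ours):

```r
# Count windows x_i, ..., x_{i+p} that consist entirely of ones.
run_statistic <- function(x, p) {
  l <- length(x)
  sum(sapply(1:(l - p), function(i) all(x[i:(i + p)] == 1)))
}

run_statistic(c(1, 1, 1, 0, 1, 1, 0), p = 2)   # one qualifying window in this sequence
```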

Multiple sequences

For multiple binary chains, the question of interest is different. Now, instead of asking whether to use a first or second order chain, we ask whether a single binary Markov chain is adequate to describe a given set of sequences of data, or whether multiple binary Markov chains are required. The likelihood ratio test statistic is used to answer this question. The $\chi^2$ distribution is once again an approximation of the distribution of the likelihood ratio test statistic.

Unlike single_binary_test(), multiple_binary_test() isn't at all concerned with second order chains, so this time a table of first order transitions is compiled.
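As a sketch, the per sequence first order tables can be built with the same transition_counts() helper used earlier; the set of sequences xs below is made up purely for illustration.

```r
# One first order transition count table per chain in the set.
xs <- list(c(1, 0, 0, 0, 1, 0, 1),
           c(1, 1, 0, 1, 0, 0, 1),
           c(0, 0, 1, 1, 0, 1, 1))
lapply(xs, transition_counts)
```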

$$ LRT = 2 \sum_{i,j,t}^{} n_{ij}^{(t)} \log\Bigg(\frac{\frac{n_{ij}^{(t)}}{n_{i+}^{(t)}}}{\frac{n_{ij}^{(+)}}{n_{i+}^{(+)}}}\Bigg) $$

$$ \chi^2 = \sum_{i,j,t}^{} \frac{\Bigg(n_{ij}^{(t)}-\frac{n_{i+}^{(t)}\,n_{ij}^{(+)}}{n_{i+}^{(+)}}\Bigg)^2}{\frac{n_{i+}^{(t)}\,n_{ij}^{(+)}}{n_{i+}^{(+)}}} $$

Here $n_{ij}^{(t)}$ represents the frequency of the transition from state $i$ to state $j$ in the $t^{th}$ chain in the set. A $+$ in a subscript or superscript indicates summation over that index, as in the single sequence case, so a superscript $(+)$ pools the counts across all chains. $i$ and $j$ take values in $\{0,1\}$, and $t$ ranges over the chains in the supplied set of sequences.
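A sketch of the pooled computation, reusing the transition_counts() helper and the made-up set xs from above (the function and object names are ours, not the package's); the expected counts pool the transitions across every chain in the set.

```r
# Compute the multiple sequence LRT and chi-squared statistics from the formulas above.
multiple_stats <- function(xs) {
  counts <- lapply(xs, transition_counts)   # n_{ij}^{(t)} for each chain t
  pooled <- Reduce(`+`, counts)             # n_{ij}^{(+)}
  lrt <- 0
  chisq <- 0
  for (tab in counts) {
    for (i in 1:2) for (j in 1:2) {
      expected <- sum(tab[i, ]) * pooled[i, j] / sum(pooled[i, ])
      if (tab[i, j] > 0) {
        lrt <- lrt + 2 * tab[i, j] *
          log((tab[i, j] / sum(tab[i, ])) / (pooled[i, j] / sum(pooled[i, ])))
      }
      if (expected > 0) {
        chisq <- chisq + (tab[i, j] - expected)^2 / expected
      }
    }
  }
  c(LRT = lrt, chi.squared = chisq)
}

multiple_stats(xs)
```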

The run test statistic is calculated in the same fashion as the run test statistic in the single sequence case, but the run test statistics from each sequence are added together. The exact formula is shown below for a run of length $p$, where $x_i^{(j)}$ is the $i^{th}$ index of a sequence of data $x^{(j)}$ within a set of $n$ sequences $x$, all of length $l$.

$$ Run = \sum_{j=1}^{n} \sum_{i=1}^{l-p} 1_{ x_i^{(j)} =1 \& x_{i+1}^{(j)} =1 \& \cdots \& x_{i+p}^{(j)} =1} $$
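Matching the formula, a sketch of the multiple sequence run statistic simply sums the run_statistic() helper from the single sequence sketch over the chains in the set:

```r
# Sum the single sequence run statistic over every chain in the set.
multiple_run_statistic <- function(xs, p) {
  sum(sapply(xs, run_statistic, p = p))
}

multiple_run_statistic(xs, p = 2)
```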

References


