In cwcartmell/maRkov: Goodness-of-fit Tests for Binary Markov Chains

It rains a lot in Washington. This will come as no surprise to anyone who has been to the Evergreen State, but the specific patterns of the rainfall in the region are important to scientists. For this example, we will be investigating whether precipitation at Snoqualmie Falls Washington during the month of January can be modeled with a single binary Markov chains. To determine the answer to this question, we will be testing the null hypothesis of a simple binary chain against the alternative of a second order binary chain for every year of rainfall data that we have from Snoqualmie Falls, which runs from 1948 to 1983, and then test the null hypothesis that we only need one chain for all of the years against the alternative that we need different chains for different years. The data is sourced from Stochastic Modeling of Scientific Data [@guttorp1995stochastic].

To do this, we will be using the single_binary_test() and multiple_binar_test() functions. But first we need to encode the rainfall data in the proper way. The data is below. Note that a "1" denotes that there was more than 0.01 inches of rain on a given day, and a "0" denotes that there was less rainfall than that. Each row is a separate year and each column is a day of the month of January.

library(maRkov)
knitr::kable(maRkov::snoqualmie)

We'll need to encode this data in a way that single_binary_test and multiple_binary_test() can understand, so we're going to turn it into a matrix with rows representing each single year chain, and columns representing days of January. Luckily, the as.matrix function makes it easy to do this.

snoqualmie_matrix <- as.matrix(maRkov::snoqualmie)

Before we can examine whether a simple binary Markov chain is appropriate for describing rainfall in the month of January in general, we need to ask ourselves if each January that we have is appropriately modeled by a simple binary Markov chain. To determine this, we will generate a vector of likelihood ratio p-values for each year, and a histogram:

lrt_p_values <- apply(snoqualmie_matrix, 1,
                      function(X)(single_binary_test(binary_chain = X,
                                                     swaps = 10000, n = 10000))$p_value_lrt)
hist(lrt_p_values, breaks = 10)

This is quick and dirty analysis, but it is clear from looking at the histogram of p-values that very low ones are generally uncommon. Given a p-value cutoff of .1, we would expect to see $36*.1=3.6$ p-values below .1. In the above histogram, you see about 3 results with p-values below .1. This gives us no reason to doubt that simple binary Markov chains can be used to model the month of January in each year, as the years with low p-values may very well be the result of having data from a great many years.

We're finally ready to call multiple_binary_test(), which will produce an object of class multiple_binary_test. Afterwards, we'll look at some plots and a summary of our data. Specifically, we'll be taking a look at the distribution of test statistics and the p-value for the likelihood ratio test (LRT). This test will give us a answer to the question of whether or not the null hypothesis of a single binary chain is acceptable against the alternative of a second order binary chain.

snoqualmie_test <- multiple_binary_test(binary_chains = snoqualmie_matrix, swaps = 10000,
                                        n = 10000)
plot(snoqualmie_test)

By looking at these plots, we notice a few things. First, plotting a object of class multiple_binary_test gives us a whole lot of information, much of which isn't particularly interesting to us. We'll just be looking at the tile plot and the histogram of likelihood ratio test statistics. The tile plot lets us get a general feel for the makeup of our set of chains, and the histogram of likelihood ratio statistics shows us that the test statistic of the real set of chains lies somewhere in the meaty part of the distribution, making it unlikely that it is an unusual observation. For more detailed and precise analysis, let's turn our attention to the summary of snoqualmie_test.

summary(snoqualmie_test)

While the output of this code is a little jumbled due to the formatting that knitr coerces it into, you can see the general structure of the output. The likelihood ratio test statistics are broken down by quantile, and there is a p-value printed next to them as well. In this case, the p-value of the likelihood ratio test indicates that a single binary chain is probably acceptable for modeling this data.