In cwcartmell/maRkov: Goodness-of-fit Tests for Binary Markov Chains

In "A Markov Chain Model of Tornadic Activity" [@Caren_amarkov], the authors propose that tornadoes in the United States can be modeled with Markov chains. In the paper, they encode their data in a matrix with each row representing a year from 1953 to 1998, and each column representing a day of the year. A 1 in an entry in the matrix indicates that at least one tornado occurred on the corresponding year and date and 0 in an entry indicates that no tornadoes occurred on the corresponding day.

If the presence of tornadoes in the greater United States can be modeled with a binary Markov chain, it stands to reason that tornadoes in Oklahoma could be modeled in the same fashion. The data encoded in the included data set ok_tornado is encoded in the same way as the data in the Drton et al. paper, but it covers only tornadoes in Oklahoma, and instead of including only the years 1953 to 1998, it includes all years between 1950 and 2015. It's time to get started.

The first thing that we are going to investigate is whether every year is adequately modeled by a first order Markov chain. We'll run single_binary_test() on every single year that we have data, and compile a vector of likelihood ratio test statistic p-values to see if there are issues with using binary Markov chains for each one.

We need to take our data and transform it into forms that single_binary_test() and multiple_binary_test() can understand:

library(maRkov)
tornadoes <- as.matrix(ok_tornado)

Now that we have a vector representing the days with tornadoes in 2009 and a matrix representing days with tornadoes from 1950-2015, we'll run single_binary_test() on the chain from 2009 and look at a summary of the results:

lrt_p_values <- apply(tornadoes, 1,
                      function(X)(single_binary_test(binary_chain = X,
                                                     swaps = 10000, n = 10000))$p_value_lrt)
hist(lrt_p_values)

This is not good. As you can see from the histogram of p-values, we have ~14 p-values that indicate significance at the .1 level. We would expect to see $66*.1=6.6$ if all the sequences could be adequately modeled by first order Markov chains. None the less, we'll forge on our investigation to see if we can identify what is going on.

tornadoes_test <- multiple_binary_test(tornadoes, swaps = 10000, n = 10000)

And then we'll plot the results:

plot(tornadoes_test)

As you can see from the likelihood ratio test statistic histogram, our data seems to be extremely unusual. Let's take a look at the summary of tornadoes.test:

summary(tornadoes_test)

From this summary, we can see that our p-value for the likelihood ratio test statistics is quite small, so we can conclude that a simple first order Markov chain is probably not appropriate to model this data. Why could this be? if you look at the first plot generated by plot(tornadoes_test), you will notice that there is a remarkable clustering effect in roughly the second quarter of the year. If you go to the website of Oklahoma City, and navigate to their tornado safety page, found at [http://www.okc.gov/safety/tornado/], you will find the following:

When is tornado season?

Tornado season is generally March through August, although tornadoes can occur at any time of the year.

Which suggests that days with tornadoes are going to be found predominantly between March 1st and August 31st, or between roughly the 60th and 244th days of the year. Let's see what our tile plot of Oklahoma tornadoes looks like when we superimpose those two dates upon it.

torn_data <- tornadoes_test$data[[1]]
reverse <- nrow(torn_data):1
torn_data <- torn_data[reverse, ]

image(c(1:ncol(torn_data)), c(1:nrow(torn_data)), t(torn_data), axes = FALSE, ylab = " ", xlab = "Visual representation of the actual Markov Chain.", col = c("aquamarine3", "chocolate1"))
segments(60, 0, 60, 67)
segments(244, 0, 244, 67)
text(35, 63, "March 1st")
text(274, 63, "August 31st")

As you can see, there does seem to be marked difference between the distribution of tornadoes between these two dates and the distribution of tornadoes in the rest of the year. This fact begs the question: is it possible that we could model tornado season in Oklahoma with a Markov chain? Let's find out. We'll need to truncate our data to include only the days between March and August.

tornadoes_season <- tornadoes[ , 60 : 244]
tornadoes_season_test <- multiple_binary_test(tornadoes_season, swaps = 10000, n = 10000)

Now we'll look at the summary of the data, since we've already produced the tile plot.

summary(tornadoes_season_test)

Again, it turns out that a simple, first order binary Markov chain does is not sufficient for modeling this data. Why is this?

It wasn't mentioned at the beginning of this vignette for the sake of constructing an interesting problem, but the authors of the Drton et al. paper did not actually use a first order binary Markov chain to describe their data. In their case, they used what are called non-homogeneous Markov chains to describe their data. These are Markov chains for which the transition probabilities change as the chain moves through states. Given a little bit of thought, the requirement for this kind of chain makes perfect sense, as there are climate factors that fluctuate throughout the year that affect tornado formation. Another issue is that tornadoes in an area limited to the size of Oklahoma are affected by far more local weather patterns than tornadoes across the entire United States, which may make them harder to model with Markov chains. Unfortunately, this is the limit of the functionality of the package, so all we can conclude is that simple first order binary Markov chains are not sufficient to represent this data.