In rrrlw/ripserr: Calculate Persistent Homology with Ripser-Based Engines

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Background

In 2015, the Zika virus (ZIKV) was first introduced and spread across Brazil due to a plethora of Aedes aegypti mosquitoes.[^Gatherer2016] This species transmits ZIKV to humans, which can cause mild symptoms or even severe health conditions such as Guillain-Barré Syndrome and microcephaly.[^Winer2014][^Johansson2016] Lo and Park (2018) modeled the spread of ZIKV using linear regression, while comparing results between two models — one with predictors calculated from persistent homology (PH) and one without.[^Lo2018] They show that using the number of 0-dimensional and 1-dimensional homological features and the maximum lifetime (persistence) of any 1-dimensional feature can better predict the number of ZIKV cases in Brazil.

Here, we demonstrate how to use topological features to model disease transmission by A. aegypti, similar to the ZIKV model built by Lo and Park.

Methods

To imitate the model from Lo and Park with available data, we would predict the number of ZIKV cases confirmed in each state using linear regression. However, data points of A. aegypti mosquitoes in Brazil are only available for the year 2013, while documented ZIKV cases are available only since 2015. Instead of using the number of ZIKV cases as the response variable in our regression model, we obtained the number of Dengue cases in Brazil for the year 2013,[^dengue] since Dengue is also transmitted by A. aegypti mosquitoes. Predictions of number of cases in each state will be made using average temperatures, precipitation levels,[^weather] and human population densities.[^population][^area] Notice that the source for temperature data provided by Lo and Park is no longer available. The accessible dataset for temperatures and precipitations are averaged throughout multiple years, thus it may not be completely accurate for 2013.

Besides these state-level attributes, we will also utilize spatial information on A. aegypti occurrences to model the disease.[^aegypti] We include the number of 0-dimensional features (H0) as a predictor since a higher value of H0 indicates that a state harbors more mosquitoes. We also include the number of 1-dimensional features (H1) in our model, as we interpret a larger value of H1 to mean that A. aegypti mosquitoes are more spread out. In addition, extending the analysis of Lo and Park, we include the maximum lifetime (H1ML) and the median lifetime (H1MD) of the 1-dimensinoal features, which we expect give additional insight into the statewide connectivity of A. aegypti distributions. A high H1ML indicates mosquitoes have a wider spread and are less clustered at a small number of specific regions. We are also curious about whether the median of H1 lifetime can improve upon the effect of H1ML demonstrated by Lo and Park.

Datasets

First, we load the aegypti dataset containing geographic locations of A. aegypti mosquitoes. Also, we load the case_predictors dataset to obtain human population density, average temperature, precipitation level, and the number of cases in each state.

We will use state abbreviations to join elements from these two data sets and from the PH calculations. The vector stateAbbSort, de-duplicated from the A. aegypti dataset, provides a convenient tool for these tasks.

devtools::load_all()
data(aegypti)
data(case_predictors)

stateAbbSort <- sort(aegypti$state_code[!duplicated(aegypti$state_code)])

We will incrementally add the predictive variables to the data frame caseModel.

caseModel <- data.frame()

Persistent homology

Before building the model, the following code provides an example of using Vietoris-Rips filtration on the mosquito occurrence patterns for a single state. First, we retrieve occurrence coordinates in the state of Acre (AC) from the A. aegypti dataset. Applying the vietoris_rips function to these coordinates, we receive a data frame that contains the birth and death information of every 0-dimensional (H0) and 1-dimensional (H1) feature. We plot these persistence data in the following graph, where the horizontal and vertical axes represent the birth and death of each feature, respectively.

AC_coord <- aegypti[aegypti$state_code == "AC", c("x", "y"), drop = FALSE]
AC_rips <- vietoris_rips(AC_coord) ##filtration

plot.new()
plot.window(
  xlim = c(0, max(AC_rips$death)),
  ylim = c(0, max(AC_rips$death)),
  asp = 1
)
axis(1L)
axis(2L)
abline(a = 0, b = 1)
points(AC_rips[AC_rips$dimension == 0L, c("birth", "death")], pch = 16L)
points(AC_rips[AC_rips$dimension == 1L, c("birth", "death")], pch = 17L)

For reference, we can also plot the mosquito occurrences based on their latitudes and longitudes in the aegypti dataset. Note, however, that the plot is not exactly faithful to the geography, since the distances between longitude lines vary from location to location. This also means that our PH calculations are not exactly right. The same limitation applies to the analysis of Lo and Park.

plot(x = AC_coord$x, y = AC_coord$y, asp = 1, xlab = "X", ylab = "Y",
     main = "Aedes Aegypti Occurrences in AC")

The following function takes in a single state's persistence data and returns a vector consisting of the values H0, H1, H1ML, and H1MD.

topologicalFeatures <- function(state_rips){
  numH0 <- length(which(state_rips$dimension == 0))
  numH1 <- length(which(state_rips$dimension == 1))
  if (numH1 != 0) {
    maxPerst <- max(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])
    medianPerst <- median(state_rips$persistence[(numH0 + 1) : (numH0 + numH1)])

  } else {
    maxPerst <- 0
    medianPerst <- 0
  }
  return(c(numH0, numH1, medianPerst, maxPerst))
}

Iterating over all the states in Brazil, we bind every state's topological features to caseModel.

for(val in stateAbbSort) {
  state_coord <- aegypti[aegypti$state_code == val, c("x", "y"), drop = FALSE]
  state_rips <- vietoris_rips(state_coord)
  state_rips$persistence<-(state_rips$death - state_rips$birth)

  features <- topologicalFeatures(state_rips)
  caseModel <- rbind(
    caseModel,
    c(features[1], features[2], features[3], features[4])
  )
}
colnames(caseModel) <-  c("H0", "H1", "H1MD","H1ML")
rownames(caseModel) <- stateAbbSort

We merge caseModel with case_predictors, matched by state code. We have now loaded all potential predictors to our model. The following code prints out first few lines of the caseModel to demonstrate how states and their variables lie within this data frame.

caseModel <- merge(caseModel, case_predictors, by = "row.names")[, -1]
rownames(caseModel) <- stateAbbSort
head(caseModel)

Cases Visualization

The plots below visualize the number of cases within each state. To reproduce the model by Lo and Park, we will log-transform the number of cases to reduce data skew and the influence of outliers.

par(mfrow = c(2L, 1L))

hist(caseModel$CASE, main = "Distribution of Case Counts")

hist(log(caseModel$CASE), 
     main = "Distribution of Log-Transformed Case Counts")

We use the following plot to check if there is any correlation between predictors. We see that H0 and H1 are linearly related to each other, so H1 may not be a necessary predictor. During the model selection process, there will be a comparison testing inclusion of H1 as a predictor and decisions will be made at that point.

pairs(caseModel, pch = 16L)

First fit: Model without Topological Features

First, we fit a linear model to log-transformed case counts predicted by population, temperature, precipitation, and H0 (the number of aegypti occurrences). We reject the null hypothesis of no predictive value ($p = 0.0199$).

fit.1 <- lm(log(CASE) ~ POP + TEMP + PRECIP + H0, data = caseModel)
summary(fit.1)

To check error assumptions for linear regression, we plot residuals versus fitted values and theoretical quantiles versus standardized residuals. We observe that residuals for each fitted values are evenly distributed above and below the mean 0, and the relationship between residuals and fitted values seem to have a linear relation. In the quantile-quantile plot, standardized residuals closely line up with normal values, with a few outliers. Therefore, the normality assumption on the error term is satisfied.

par(mfrow = c(1L, 2L))
plot(fit.1, which = 1)
plot(fit.1, which = 2)

Next, we use the likelihood ratio test (LRT) to assess if we should drop the precipitation predictor PRECIP. In the LRT, $p = 0.8135$, which fails to reject the nested model. This result suggests that the simpler model without PRECIP is better. We therefore replace the original model with this more parsimonious one. Even though TEMP is not statistically significant, we preserve it in the model as temperature is an effectual predictor in most epidemiology modeling, and one whose point estimate we may want to be able to report.

fit.nested <- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
lmtest::lrtest(fit.nested, fit.1)
fit.1 <- lm(log(CASE) ~ POP + TEMP + H0, data = caseModel)
summary(fit.1)

Second fit: Model with Topological Features

Let's add H1, H1ML, and H1MD to our current regression model and name it fit.2. From the summary output of the second model, $F = 3.268$ and $p = 0.01878$. We know that coefficients of these predictors are not all 0. However, we might overfit the data as many predictors are not statistically significant. Thus, we prefer to exclude irrelevant variables.

fit.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML + H1MD,
            data = caseModel)
summary(fit.2)

Similarly, the two plots below exhibit that error terms from the regression model are normally distributed and centered at zero. There are a few outliers from the Q-Q plot, but their effects are trivial in our regression model.

par(mfrow = c(1L, 2L))
plot(fit.2, which = 1)
plot(fit.2, which = 2)

Using LRT to see if H1MD is a useful predictor, we fail to reject the nested model when H1MD is absent ($p = 0.4126$). Therefore, we leave out H1MD in our model.

fit.test0 <- lm(log(CASE) ~ POP + TEMP + H0 + H1 + H1ML,
                data = caseModel)
summary(fit.test0)
lmtest::lrtest(fit.test0, fit.2)

We observe that coefficients of both H0 and H1 are not statistically significant. Also, it is concerning because the coefficient of H0 in the nested model above is negative, which means more mosquito occurrences tend to decrease the number of cases. We decide to test the model without H1, as it seems to correlate with H0 and cause confounding.

fit.test1 <- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)
summary(fit.test1)
lmtest::lrtest(fit.test1, fit.test0)

The likelihood ratio test suggests that we need to choose the simpler model without H1 ($p = 0.2161$). In this case, we only need H1ML as our topological predictor.

fit.2 <- lm(log(CASE) ~ POP + TEMP + H0 + H1ML, data = caseModel)

Results

We compare regression summaries of our final first and second models and observe an improvement of the adjusted R-squared when adding H1ML as an additional predictor.

summary(fit.1)
summary(fit.2)

H0 and H1ML are both positively related to the number of cases in each state. A higher value of H0 indicates more A. aegypti occurrences within the given state, so it leads to more confirmed cases. Since we include both H0 and H1ML in the model, given a fixed number of A. aegypti occurrences, a more persistent cycle would predict more cases as disease outbreaks can involve more locations. To interpret this, we select three states with similar H0 but different H1ML and compare the number of cases in each.

We choose states Pernambuco, Ceará, and Goiás for this example and print out H0, H1ML, POP, and CASE for each state. We observe that Goiás has the largest H1ML and the most cases, while Pernambuco has the smallest H1ML and the least number of cases. To reduce the effect of H0 on case numbers and H1ML, H0 is fixed since H0 for three selected states all fall into the range around 200.

Example_States <- c("PE", "CE", "GO")
Coord_Example <- data.frame()
for(val in Example_States){
  targetRow <- caseModel[rownames(caseModel) == val,]
  Coord_Example <- rbind(Coord_Example,
                         c(targetRow$H0, targetRow$H1ML, targetRow$CASE))
}
rownames(Coord_Example) <- Example_States
colnames(Coord_Example) <- c("H0", "H1ML", "CASE");Coord_Example

While controlling H0, we want to give a visualized interpretation of how H1ML positively relates to the number of cases. To achieve this, we plot A. aegypti occurrences on an xy-axis. We also limit the range of the x-axis to 10 in all three plots to fix the geographic dimension, in order to have a fair comparison between aegypti distributions.

par(mfrow = c(1L, 3L))

PE_coord <- aegypti[aegypti$state_code == "PE", c("x", "y"), drop = FALSE]
PE_rips <- vietoris_rips(PE_coord)
plot(x = PE_coord$x, y = PE_coord$y, asp = 1, col = "green",
     xlim = c(-42, -32), xlab = "X", ylab = "Y", main = "Pernambuco")

CE_coord <- aegypti[aegypti$state_code == "CE", c("x", "y"), drop = FALSE]
CE_rips <- vietoris_rips(CE_coord)
plot(x = CE_coord$x, y = CE_coord$y, asp = 1, col = "blue",
     xlim = c(-43, -33), xlab = "X", ylab = "Y", main = "Ceará")

GO_coord <- aegypti[aegypti$state_code == "GO", c("x", "y"), drop = FALSE]
GO_rips <- vietoris_rips(GO_coord) 
plot(x = GO_coord$x, y = GO_coord$y, asp = 1, col = "red",
     xlim = c(-54, - 44), xlab = "X", ylab = "Y", main = "Goiás")

par(mfrow = c(1L, 1L))

Distributions of aegypti populations serve as a guidance map for interpreting H1ML: a bigger H1ML comes from sparse and wider range of occurrences where infections can extend to more locations. In Pernambuco, which has the tiniest H1ML, most occurrences appear to form a single cluster. Hence, the spread of disease in Pernambuco may be easier to prevent and contain in the population close to this aegypti cluster, such as by applying pesticides and increasing predators. This may be part of the reason that Pernambuco, with the tiniest H1ML, has a small number of cases. In Ceará, occurrences are not clustered in one chunk but are separated into two clusters, one to the north and the other to the south. Based on the aegypti distribution in Goiás, we observe that multiple aegypti points are sparsely located instead of tightly attached to one another, which can lead to a higher value of H1ML. Comparing the shape of distributions among state Ceará and Goiás, occurrences in Goiás have a larger spread and are more uniformly distributed across the state, so the disease in Goiás can be transmitted across a wider range to reach more distant locations. More uniformed occurrences of aegypti species can increase the number of infections as it bridges more neighboring communities. Overall, predictor H1ML effectively detects patterns among clusters of occurrences and explains that the increase in cases may be due to a wider range of disease spread.

Discussion

We have shown that using topological features from A. aegypti occurrences in conjunction with standard epidemiology variables could better use the spatial information of existing data to benefit the prediction of ZIKV cases. Compared to using different years' datasets by Lo and Park, we have kept all our datasets to 2013 so our model leads to a different result. Also, improvements in our regression model after using topological features are not extraordinarily significant. In addition to variables included in our model, it is possible to further involve interaction terms, quadratic predictors, and log transformation of topological features when they are contextually and statistically appropriate. We have tested our model while incorporating the interaction between H1 and H1ML as an additional predictor, in order to follow the approach by Lo and Park. However, this interaction term is not significant and does not contribute to the overall fitness of our model. Even though variable selections and results deviate somewhat from the regression model built by Lo and Park, our model successfully demonstrates how to take advantage of topological features generated by the Vietoris-Rips filtration.

References

[^Gatherer2016]: Gatherer D and Kohl A. Zika virus, a previously slow pandemic spreads rapidly through the Americas. Journal of General Virology. 2016 97: 269–273. https://doi.org/10.1099/jgv.0.000381 PMID: 26684466

[^Winer2014]: Winer J. An update in Guillain-Barre´ Syndrome. Autoimmune Diseases. 2014. https://doi.org/10.1155/ 2014/793024 PMID: 24511391

[^Johansson2016]: Johansson MA, Mier-y-Teran-Romero L, Reefhuis J, Gilboa S and Hills S. Zika and the risk of microcephaly. N Engl J Med. 2016 375: 1–4. https://doi.org/10.1056/NEJMp1605367 PMID: 27222919

[^Lo2018]: Lo D, Park B (2018) Modeling the spread of the Zika virus using topological data analysis. PLoS ONE 13(2): e0192120. https://doi.org/10.1371/journal.pone.0192120

[^aegypti]: Kraemer, Moritz U. G. et al. 2017. Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence, Dryad, Dataset, https://doi.org/10.5061/dryad.47v3c

[^dengue]: Brazil Ministry of Health. 2014. Boletim Epidemiológico. ISSN 2358-9450. https://web.archive.org/web/20210209122713/https://www.gov.br/saude/pt-br/assuntos/boletins-epidemiologicos-1/por-assunto

[^weather]: Ipeadata. http://www.ipeadata.gov.br/Default.aspx

[^population]: Brazilian Institute of Geography and Statistics. 2015. Retrieved from https://ftp.ibge.gov.br/Estimativas_de_Populacao/

[^area]: Brazilian Institute of Geography and Statistics. 2020. https://www.ibge.Goiásv.br/geociencias/organizacao-do-territorio/estrutura-territorial/15761-areas-dos-municipios.html?edicao=30133&t=acesso-ao-produto

rrrlw/ripserr documentation built on July 12, 2022, 11:13 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com