knitr::opts_chunk$set(echo = FALSE) # By default, hide code; set to TRUE to see code
knitr::opts_chunk$set(fig.pos = "p") # Places figures on their own pages
knitr::opts_chunk$set(out.width = "100%", dpi = 300) # Figure resolution and size
knitr::opts_chunk$set(fig.env = "figure") # Latex figure environment
knitr::opts_knit$set(eval.after = "fig.cap")

library(xtable) # Creates tables that follow OUP guidelines; other options, such as kable, may also be workable
library(tidyverse)
library(ggrepel)
library(scales)
library(corrplot)
library(ggcorrplot)
library(NBZIMM)
theme_set(theme_bw())

Introduction

In the mist of a virus pandemic, unraveling the constant flow of epidemiological data is of paramount importance, not only to guide the evaluation and implementation of non-pharmacological interventions (NPI), but also to optimize drug development.

For example, in early phases of the pandemic, analysis and modeling of COVID-19 confirmed cases and deaths has been employed to assess the effects of NPI in China and Europe [@flaxman2020; @prem2020tlph]. More recently, the increased flow of COVID-19 data, and the integration of different sources of information (seasonality of other coronoviruses, U.S. clinical care) has allowed even more long-term predictions of the feasibility and effectiveness of possible containment strategies [@kissler2020s].

Similarly, early evidences of the correlation between Bacille Calmette-Guérin vaccination and COVID-19 outcomes spur several clinical investigations [@miller2020m; @shet2020m; WHO]. However, the implications and conclusions of that initial observation were curtailed by subsequent models that included more factors [e.s. age; @fukui2020m]. Overall, these few examples underscore the importance in general, of providing public access to ongoing COVID-19 metrics and in particular, of including multiple heterogeneous collections of data in modeling and analysis in order to improve predictions and ultimately, to address the many challenges of the current pandemic emergency.

The current R package provides tools to rapidly extract United States and Italy COVID-19 epidemiological metrics (at county and regional level, respectively) from different sources, and to combine them with other demographic and health related datasets. The goal of the package is to facilitate multifactorial analysis and modeling of COVID-19 data by the scientific community. Specific effort was made to provide a detailed documentation for each of the variables returned by the functions, and to list external sources and methodology of their collection, with the objective of promoting appropriate analysis.

Alghorithm and Sources

A family of get functions is employed by the R package to dynamically extract updated time-series data from different on-line sources, combine them, and to return a dataframe.

For U.S the prefix of the functions to extract data is getus_, and it is followed by the specific metric of interest:

Data regarding the household composition, population sex, race, age and poverty levels (2018), were retrieved from the American Community Survey. Medical conditions, tobacco use, cancer and, data relative to the number of medical and emergency visits (2017) of medicare beneficiaries were obtained from the Mapping Medicare Disparities. The number of hospital beds per county (2020) was calculated from data of the Homeland Infrastructure Foundation. Emissions of particulate 2.5 (2016) were reported by the Atmoshpheric Composition Analysis Group.

For Italy, the prefix of the function is getit_ followed by covid or all.

Age and sex of the population (2019), first aid and medical guard visits (2018), smoking status (2018), prevalence of chronic conditions (2018), annual-household income (2017), household crowding index (2018) and body-mass index were collect from ISTAT. Prevalence of types of cancer patients (2016), influenza-vaccination coverage (2019) and the number of hospital beds per 1000 people (2017) were obtained from Ministero della Salute. Data of particulate 2.5 (2017) was obtained from the Istituto Superiore Per La protezione Ambientale.

The package documentation reports and describes each variable (colnames) and lists all relative data sources. Because of the large amount of variables and in order to facilitate exploration of the documentation, it was deemed more practical to create separate functions with separate documentation for each of the country.

Static U.S and Italy datasets can be accessed directly using data(). The country that data refer to is specified in the first 2 letters of the object name. For example us_dem contains demographic information (sex and age) of U.S counties, whereas it_dem of regions of Italy.

The package is current available on github. The following code launch the functions and assign the returned dataframes to different names.

\bigskip

library(covid19census)
dat_us <- getus_all(repo = "jhu")
dat_it <- getit_all()

unlist(lapply(list(dat_it, dat_us), class))

\bigskip

Information of the dataframes generated by the two functions are reported in the table below [table \ref{tab:tab_dat}].

\bigskip

tab1 <- as.data.frame(
  tribble(
    ~getus_all,
    ~getit_all,
    as.character(ncol(dat_us)),
    as.character(ncol(dat_it)),
    as.character(length(unique(dat_us$fips))),
    as.character(length(unique(dat_it$region))),
    "7",
    "4",
    "2020-01-21",
    "2020-02-24"
  )
)

rownames(tab1) <-
  c("columns", "counties-regions", "sources", "from")


print(
  xtable(
    tab1,
    caption = "Information regarding the dataframes retuned by the functions.
    The table reports, for the dataframes returned by each of the functions: i. number of columns; ii.number of unique regions (Italy) and counties (U.S.);  number of unique data sources; earliest date of COVID-19 metric. Note that some of 
    the U.S variables are at the state level (e.s tests)
    ",
    label = "tab:tab_dat",
    align = c("c", "c", "c")
  ),
  comment = FALSE
)

Example of use

Data exploration and modeling can be conveniently performed on a (single) dataframe that contains COVID-19 as well as many other metrics retrieved from multiple sources. In the following example, the package covid19census was employed to replicate the findings of Wu et al. [-@wu2020m]. In that study, the authors investigated the impact of fine particulate matter on COVID-19 mortality rates in U.S, using a model that took in account several possible confounding factors.

In the current example, an additional confounder, percentage of people suffering of hypertension, was added to the model. Update data of deaths, cases and tests as well as other demographic indexes were retrieved with the function getus_all, processed. A total of 19 variables were selected for analysis. Figure \ref{fig:fig_corr} displays a correlation analysis of pairs of selected variables.

library(dplyr)
library(covid19census)

dat_us <- getus_all(repo = "jhu")

us_last <-
  dat_us %>%
  filter(date == max(date)) %>%
  # 65 and over by summing all the intervals of age
  mutate(age65_over = rowSums(dplyr::select(., matches("perc_[6][5-9]|[7|8][0-9]")))) %>%
  # calculate percentage of black or afroamerican, white, hispanic or latinos
  mutate(perc_black = total_nlat_blackaa_alone / total_pop * 100) %>%
  mutate(perc_white = total_nlat_white_alone / total_pop * 100) %>%
  mutate(perc_lat = total_lat / total_pop * 100) %>%
  # calculate total number of tests
  mutate(total_tests = positive + negative + pending) %>%
  # variables that are percent are divided by 100
  mutate_at(vars(starts_with("perc")), function(x) x / 100) %>%
  dplyr::select(
    deaths,
    cases,
    pm2.5,
    median_income,
    perc_lat,
    perc_black,
    perc_edu_somecollege,
    age65_over,
    total_tests,
    perc_obesity,
    perc_hypertension,
    perc_tobacco_use,
    total_beds,
    summer_temp,
    winter_temp,
    summer_hum,
    winter_hum,
    total_pop,
    state
  )
fig_cap <- "Correlation Matrix of selected U.S metrics. An example of exploratory analysis on data returned by the function `getus\\_all`. Colours indicates Pearson's correlation between pairs of variables."

new_name <- expression("pm2.5 (" ~ mu ~ g / m^3 ~ ")")

us_last %>%
  dplyr::select(-state) %>%
  rename(
    "median income" = median_income,
    "hispanic or latinos" = perc_lat,
    "black or afroamerican" = perc_black,
    "college education" = perc_edu_somecollege,
    "age 65 and over" =  age65_over,
    "number of tests" =  total_tests,
    "obesity" =  perc_obesity,
    "hypertension" =  perc_hypertension,
    "tabacco use" = perc_tobacco_use,
    "hospital beds" = total_beds,
    "summer temperature" = summer_temp,
    "winter temperature" = winter_temp,
    "summer humidity" = summer_hum,
    "winter humidity" = winter_hum,
    "total_pop" = total_pop
  ) %>%
  na.omit() %>%
  cor() %>%
  ggcorrplot(tl.cex = 6) +
  theme(legend.title = element_blank()) +
  scale_y_discrete(labels = c("pm2.5" = new_name)) +
  scale_x_discrete(labels = c("pm2.5" = new_name))

The main analysis of Wu et al. [-@wu2020m], a zero-inflated negative binomial mixed models, was replicated using updated U.S data and by including an additional factor (percent of people suffering of hypertension) in the model.

# remotes::install_github("nyiuab/NBZIMM")
# library(NBZIMM)
pm2.5_model <- glmm.zinb(
  fixed =
    deaths ~
    pm2.5 +
      scale(median_income) +
      scale(perc_edu_somecollege) +
      scale(perc_lat) +
      scale(perc_black) +
      scale(age65_over) +
      scale(total_tests) +
      scale(total_beds) +
      scale(perc_obesity) +
      scale(perc_hypertension) +
      scale(perc_tobacco_use) +
      scale(summer_temp) +
      scale(winter_temp) +
      scale(summer_hum) +
      scale(winter_hum) +
      offset(log(total_pop)),
  random = ~ 1 | state,
  data = us_last
)

Results [table \ref{tab:tab_coef}] indicates, as in Wu et al. [-@wu2020m], a significant effect of fine particulate 2.5 on COVID-19 mortality.

tab2 <- as.data.frame(coef(summary(pm2.5_model)))

row.names(tab2) <- c(
  "(Intercept)",
  "$\\mu_g/m^3$",
  "median income",
  "college education",
  "hispanic or latinos",
  "black or afroamerican",
  "age 65 and over",
  "number of tests",
  "hospital beds",
  "obesity",
  "hypertension",
  "tabacco use",
  "summer temperature",
  "winter temperature",
  "summer humidity",
  "winter humidity"
)




caption_tab <- paste0("Coefficients of the model. A zero-inflated negative binomial mixed model was fitted of data of U.S counties ", "(", max(dat_us$date), ")")

print(
  xtable(
    tab2,
    caption = caption_tab,
    label = "tab:tab_coef",
    align = rep("c", 6)
  ),
  comment = FALSE,
  sanitize.rownames.function = identity
)

Conclusions

The R package covid19census extracts and integrates epidemiological COVID-19 data (Italy and U.S at the regional and county level, respectively) with several other demographic and health related indexes. Currently, the dataframes returned by the main functions getus_all and getit_all, consist of r ncol(dat_us) variables per r length(unique(dat_us$fips)) U.S counties, and of r ncol(dat_it) variables per 21 Italy regions (19 regions and autonomous provinces), respectively. By combining data form different sources, the package is aimed at promoting and simplifying the analysis and modeling of COVID-19 data by the scientific community.

References



c1au6i0/covid19census_dev documentation built on May 8, 2020, 1:01 a.m.