In cmu-delphi/epiprocess: Tools for basic signal processing in epidemiology

source("_common.R")

Overview

This vignette provides a brief introduction to the {epiprocess} package. We will do the following:

Get data into epi_df() format and plot the data
Perform basic signal processing tasks (lagged differences, rolling average, cumulative sum, etc.)
Detect outliers in the data and apply corrections
Calculate the growth rate of the data
Get data into epi_archive() format and perform similar signal processing tasks

Getting data into `epi_df` format

We'll start by getting data into epi_df() format, which is just a tibble with a bit of special structure. As an example, we will get COVID-19 confirmed cumulative case data from JHU CSSE for California, Florida, New York, and Texas, from March 1, 2020 to January 31, 2022. We have included this example data in the epidatasets::covid_confirmed_cumulative_num object, which we prepared by downloading the data using epidatr::pub_covidcast().

library(epidatr)
library(epiprocess)
library(dplyr)
library(tidyr)
library(withr)

covid_confirmed_cumulative_num
class(covid_confirmed_cumulative_num)
colnames(covid_confirmed_cumulative_num)

The same data can be downloaded with {epidatr} as follows:

covid_confirmed_cumulative_num <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_cumulative_num",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = epirange(20200301, 20220131),
)

The tibble returned has the columns required for an epi_df object, geo_value and time_value, so we can convert it directly to an epi_df object using as_epi_df().

edf <- covid_confirmed_cumulative_num %>%
  select(geo_value, time_value, cases_cumulative = value) %>%
  as_epi_df() %>%
  group_by(geo_value) %>%
  mutate(cases_daily = cases_cumulative - lag(cases_cumulative, default = 0))
edf

In brief, we can think of an epi_df object as snapshot of an epidemiological data set as it was at a particular point in time (recorded in the as_of attribute). We can easily plot the data using the autoplot() method (which is a convenience wrapper to ggplot2).

edf %>%
  autoplot(cases_cumulative)

We can compute the 7 day moving average of the confirmed daily cases for each geo_value by using the epi_slide_mean() function. For a more in-depth guide to sliding, see vignette("epi_df").

edf %>%
  group_by(geo_value) %>%
  epi_slide_mean(cases_daily, .window_size = 7, na.rm = TRUE)

We can compute the growth rate of the confirmed cumulative cases for each geo_value. For a more in-depth guide to growth rates, see vignette("growth_rate").

edf %>%
  group_by(geo_value) %>%
  mutate(cases_growth = growth_rate(x = time_value, y = cases_cumulative, method = "rel_change", h = 7))

Detect outliers in daily reported cases for each geo_value. For a more in-depth guide to outlier detection, see vignette("outliers").

edf %>%
  group_by(geo_value) %>%
  mutate(outlier_info = detect_outlr(x = time_value, y = cases_daily)) %>%
  ungroup()

Add a column to the epi_df object with the daily deaths for each geo_value and compute the correlations between cases and deaths for each geo_value. For a more in-depth guide to correlations, see vignette("correlation").

df <- pub_covidcast(
  source = "jhu-csse",
  signals = "deaths_incidence_num",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = epirange(20200301, 20220131),
) %>%
  select(geo_value, time_value, deaths_daily = value) %>%
  as_epi_df() %>%
  arrange_canonical()
edf <- inner_join(edf, df, by = c("geo_value", "time_value"))
edf %>%
  group_by(geo_value) %>%
  epi_slide_mean(deaths_daily, .window_size = 7, na.rm = TRUE) %>%
  epi_cor(cases_daily, deaths_daily)

Note that if an epi_df object loses its geo_value or time_value columns, it will decay to a regular tibble.

edf %>% select(-time_value)

Getting data into `epi_archive` format

We can also get data into epi_archive() format, which can be thought of as an aggregation of many epi_df snapshots. We can perform similar signal processing tasks on epi_archive objects as we did on epi_df objects, though the interface is a bit different.

library(epidatr)
library(epiprocess)
library(data.table)
library(dplyr)
library(purrr)
library(ggplot2)

dv <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = epirange(20200601, 20211201),
  issues = epirange(20200601, 20211201)
) %>%
  select(geo_value, time_value, issue, percent_cli = value) %>%
  as_epi_archive()
dv

library(epidatr)
library(epiprocess)
library(data.table)
library(dplyr)
library(purrr)
library(ggplot2)
dv <- archive_cases_dv_subset$DT %>%
  select(-case_rate_7d_av) %>%
  tidyr::drop_na() %>%
  as_epi_archive()
dv