source("_common.R")
This vignette provides a brief introduction to the {epiprocess}
package. We will
do the following:
epi_df()
format and plot the dataepi_archive()
format and perform similar signal processing
tasksepi_df
formatWe'll start by getting data into epi_df()
format, which is just a tibble with
a bit of special structure. As an example, we will get COVID-19 confirmed
cumulative case data from JHU CSSE for California, Florida, New York, and Texas,
from March 1, 2020 to January 31, 2022. We have
included this example data in the epidatasets::covid_confirmed_cumulative_num
object, which we prepared by downloading the data using
epidatr::pub_covidcast()
.
library(epidatasets) library(epidatr) library(epiprocess) library(dplyr) library(tidyr) library(withr) covid_confirmed_cumulative_num class(covid_confirmed_cumulative_num) colnames(covid_confirmed_cumulative_num)
The same data can be downloaded with {epidatr}
as follows:
covid_confirmed_cumulative_num <- pub_covidcast( source = "jhu-csse", signals = "confirmed_cumulative_num", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx", time_values = epirange(20200301, 20220131), )
The tibble returned has the columns required for an epi_df
object, geo_value
and time_value
, so we can convert it directly to an epi_df
object using
as_epi_df()
.
edf <- covid_confirmed_cumulative_num %>% select(geo_value, time_value, cases_cumulative = value) %>% as_epi_df() %>% group_by(geo_value) %>% mutate(cases_daily = cases_cumulative - lag(cases_cumulative, default = 0)) edf
In brief, we can think of an epi_df
object as snapshot of an epidemiological
data set as it was at a particular point in time (recorded in the as_of
attribute). We can easily plot the data using the autoplot()
method (which is
a convenience wrapper to ggplot2
).
edf %>% autoplot(cases_cumulative)
We can compute the 7 day moving average of the confirmed daily cases for each
geo_value by using the epi_slide_mean()
function. For a more in-depth guide to
sliding, see vignette("epi_df")
.
edf %>% group_by(geo_value) %>% epi_slide_mean(cases_daily, .window_size = 7, na.rm = TRUE)
We can compute the growth rate of the confirmed cumulative cases for each
geo_value. For a more in-depth guide to growth rates, see vignette("growth_rate")
.
edf %>% group_by(geo_value) %>% mutate(cases_growth = growth_rate(x = time_value, y = cases_cumulative, method = "rel_change", h = 7))
Detect outliers in daily reported cases for each geo_value. For a more in-depth
guide to outlier detection, see vignette("outliers")
.
edf %>% group_by(geo_value) %>% mutate(outlier_info = detect_outlr(x = time_value, y = cases_daily)) %>% ungroup()
Add a column to the epi_df object with the daily deaths for each geo_value and
compute the correlations between cases and deaths for each geo_value. For a more
in-depth guide to correlations, see vignette("correlation")
.
df <- pub_covidcast( source = "jhu-csse", signals = "deaths_incidence_num", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx", time_values = epirange(20200301, 20220131), ) %>% select(geo_value, time_value, deaths_daily = value) %>% as_epi_df() %>% arrange_canonical() edf <- inner_join(edf, df, by = c("geo_value", "time_value")) edf %>% group_by(geo_value) %>% epi_slide_mean(deaths_daily, .window_size = 7, na.rm = TRUE) %>% epi_cor(cases_daily, deaths_daily)
Note that if an epi_df object loses its geo_value
or time_value
columns, it
will decay to a regular tibble.
edf %>% select(-time_value)
epi_archive
formatWe can also get data into epi_archive()
format, which can be thought of as an
aggregation of many epi_df
snapshots. We can perform similar signal processing
tasks on epi_archive
objects as we did on epi_df
objects, though the
interface is a bit different.
library(epidatr) library(epiprocess) library(data.table) library(dplyr) library(purrr) library(ggplot2) dv <- pub_covidcast( source = "doctor-visits", signals = "smoothed_adj_cli", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx", time_values = epirange(20200601, 20211201), issues = epirange(20200601, 20211201) ) %>% select(geo_value, time_value, issue, percent_cli = value) %>% as_epi_archive(compactify = TRUE) dv
library(epidatr) library(epiprocess) library(data.table) library(dplyr) library(purrr) library(ggplot2) dv <- archive_cases_dv_subset$DT %>% select(-case_rate_7d_av) %>% tidyr::drop_na() %>% as_epi_archive(compactify = TRUE) dv
See vignette("epi_archive")
for a more in-depth guide to epi_archive
objects.
This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
From the COVIDcast Epidata API: These signals are taken directly from the JHU CSSE COVID-19 GitHub repository without changes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.