knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 )
This document provides an under-the-hood walk-through of the steps used to
produce an output for the confirmed case metrics for ingestion into Tableau
using the functions in the {ohiCovidMetrics
} package. These functions were
written up to the specifications in the Confirmed Case Metric Reproducibility
Document
on the OHI Surveillance Team's SharePoint site.
library(tidyverse) library(ohiCovidMetrics) library(readxl)
Eventually, this package will contain multiple ways to pull the data depending
on the user's level of access. In this vignette, we will grab data from the
public facing COVID-19 Historical Data Table via
an API call with the pull_histTable()
function which takes no arguments. Note,
I stopped the time series at 17 June 2020 to be able to comparet the results to the
official file prepared by Jeff Bond for the dashboard.
hdt_raw <- pull_histTable(end_date = "2020-06-17")
This function does six things:
st_read()
function from the {sf
} package.clean_reversals()
function. Note: this is done at the county level and
then aggregated up to the State level.[^1]: It pulls down the new deaths and tests time series as well for more context.
The result is a data.frame where there is one row per county/HERC region/state per day.
The confirmed case metrics are all based on 7 day bins looking back from the current date, e.g., For data ending on 10 June 2020, the 7 day bins are:
and so on. To make things easier, we need to aggregate the daily counts
within geographies, or regions, and 7 day bins. This can be done with
the shape_case_data()
function:
hdt_clean <- shape_case_data(hdt_raw)
This function:
rolling_week()
.Thus the data are in a format that is ready for Tableau ingestion and that facilitates the calculations of each metric in its own column.
Finally, we can calculate the metrics. For production purposes, I have
wrapped all of these calculations into the process_confirmed_cases()
function, which takes the cleaned case time-series data as it's only
argument:
hdt_out <- process_confirmed_cases(hdt_clean)
Since this is the meat of the analysis, I want to go through them each, step-by-step.
See the top panel of Table 1 in the Confirmed Case Metric Reproducibility
Document
for the definitions. Following the CDC's State Indicator Report, we calculate
Burden based on the number of confirmed cases over the past 14 days per 100,000
population in the region. The numerical value of burden is calculated with the
score_burden()
function and mapped onto the 4 categories using class_burden()
as shown below.
burden <- score_burden(curr = hdt_clean$case_weekly_1, prev = hdt_clean$case_weekly_2, pop = hdt_clean$pop_2018) burden_c <- class_burden(burden)
The definitions for the confirmed case metrics are shown in the middle panel
of Table 1 and the methods are described in the text that follows it. In short,
we want to pay attention both to the statistical significance of the changing
case counts as well as the substantive significance, or magnitude, of the
change. Thus, we calculate the trajectory as the ratio of cases in the current
7 day period to the previous 7 day period using the score_trajectory()
function. We calculate the statistical significance using a two-sided exact
test for equality of these two counts via the poisson.test()
function from
the {stats
} package with the pval_trajectory()
function. Once we have these
two quantities, we can classify the trajectory using class_trajectory()
The final trajectory based metric that we calculate is the Benjemini and Hochman's
False Discovery Rate (with the p.adjust(method = 'fdr')
function from the {stats
}
package). This is of most use to us at DHS since by calculating 80 p-values we should
expect 4 statistically significant trajectories to occur each week just by chance.
trajectory <- score_trajectory(curr = hdt_clean$case_weekly_1, prev = hdt_clean$case_weekly_2) trajectory_p <- pval_trajectory(curr = hdt_clean$case_weekly_1, prev = hdt_clean$case_weekly_2) trajectory_c <- class_trajectory(traj = trajectory, pval = trajectory_p) trajectory_f <- fdr_trajectory(pval = trajectory_p)
The bottom panel of Table 1 illustrates how we summarize a regions
confirmed case burden and trajectory into a single composite status.
This status is based on the cross-classification of the burden classification
and the trajectory classification and takes on one of the following three
values: Low (light blue), Moderate (blue), High (dark blue). This
cross-classification is done with the confirmed_case_composite()
function.
composite <- confirmed_case_composite(traj_class = trajectory_c, burd_class = burden_c)
The process_confirmed_cases()
function also does some final
cleaning: expressing trajectory in percentage point changes, rounding
burden and trajectory, renaming columns, and reordering columns to fit the
Tableau team's requirements.
The final product is:
knitr::kable(hdt_out, caption = "Confirmed case metrics output for the period ending 17 June 2020")
The real test is to see how the estimates in these functions compare to the ones that Jeff produced, For reference, I am using the 17 June 2020 file that Jeff produced (link here). First we will pull it in and then compare the our results to his. Note: I am not comparing the Count or Rate columns because the Tableau team just removes them.
Note: I have updated the output file to meet the Tableau Team's Specifications
metrics_jeff <- read_xlsx("../inst/file_for_vignette/OHI_BBB_Surveillance_CaseMetric_20200617.xlsx") %>% mutate( Region = ifelse(Region == "Saint Croix", "St. Croix", Region), Trajectory_Class = ifelse(Trajectory_Class == "No statistically significant change", "No significant change", Trajectory_Class), Composite_Class = ifelse(Composite_Class == "Moderate", "Medium", Composite_Class) ) #Identical Date identical(as.Date(unique(metrics_jeff$Date)), unique(hdt_out$Date)) #Identical Region_ID identical(sort(unique(metrics_jeff$Region_ID)), sort(unique(hdt_out$Region_ID))) #Identical Region identical(sort(unique(metrics_jeff$Region)), sort(unique(hdt_out$Region))) #Burden inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Burden = Burden), select(hdt_out, Region_ID, Region, My_Burden = Burden), by = c("Region_ID", "Region")) %>% mutate(diff = Jeff_Burden - My_Burden) %>% filter(diff != 0) #Trajectory #re-create original trajectory calculation before cleaning hdt_out$Trajectory_Orig <- signif(trajectory, 2) inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Trajectory = Trajectory), select(hdt_out, Region_ID, Region, My_Trajectory = Trajectory_Orig), by = c("Region_ID", "Region")) %>% mutate(diff = Jeff_Trajectory - My_Trajectory) %>% filter(diff != 0) #Burden_Class inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Burden_Class = Burden_Class), select(hdt_out, Region_ID, Region, My_Burden_Class = Burden_Class), by = c("Region_ID", "Region")) %>% mutate(same = Jeff_Burden_Class == My_Burden_Class) %>% filter(same == FALSE) #Trajectory_Class #Change Jeff's labels to new wording inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Trajectory_Class = Trajectory_Class), select(hdt_out, Region_ID, Region, My_Trajectory_Class = Trajectory_Class), by = c("Region_ID", "Region")) %>% mutate(same = Jeff_Trajectory_Class == My_Trajectory_Class) %>% filter(same == FALSE) #Trajectory P-value inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Trajectory_P = Trajectory_P), select(hdt_out, Region_ID, Region, My_Trajectory_P = Trajectory_P), by = c("Region_ID", "Region")) %>% mutate(diff = Jeff_Trajectory_P - My_Trajectory_P) %>% filter(diff > 1e-15) #Trajectory FDR inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Trajectory_FDR = Trajectory_FDR), select(hdt_out, Region_ID, Region, My_Trajectory_FDR = Trajectory_FDR), by = c("Region_ID", "Region")) %>% mutate(diff = Jeff_Trajectory_FDR - My_Trajectory_FDR) %>% filter(diff > 1e-15) #Composite inner_join(select(metrics_jeff, Region_ID, Region, Jeff_Composite_Class = Composite_Class), select(hdt_out, Region_ID, Region, My_Composite_Class = Composite_Class), by = c("Region_ID", "Region")) %>% mutate(diff = Jeff_Composite_Class != My_Composite_Class) %>% filter(diff == TRUE)
This information is not included in the Tableau reports, but are used to accumulate information from the entire time series and is especially useful for smaller counties or counties with smaller counts. Since these cases use case data from more than the past two weeks, we need to start with the raw data and re-aggregate it into weekly counts within regions.
last_date <- max(hdt_raw$post_date) #Prep Data for CUSUM functions hdt_cusum <- hdt_raw %>% group_by(fips, geo_name) %>% mutate( week_num = rolling_week(date_vector = post_date, end_date = last_date), ) %>% group_by(fips, geo_name, week_num) %>% summarize( case_weekly = as.integer(sum(case_daily)), week_end = max(post_date), .groups = "keep" ) %>% group_by(fips, geo_name) %>% arrange(fips, geo_name, week_num) %>% filter(!is.na(week_num)) %>% mutate( weeks_ago = as.integer(week_num - 1), S_N = cumsum(case_weekly) - case_weekly[weeks_ago == 0], S_N_z = S_N / weeks_ago - case_weekly[weeks_ago == 0], WC_W = case_weekly[weeks_ago == 0], lower_control_limit = map2_dbl(WC_W, weeks_ago, rev_cusum_lcl), upper_control_limit = map2_dbl(WC_W, weeks_ago, rev_cusum_ucl) )
plot_rev_cusum <- function(rev_cusum_df, region_name) { rev_cusum_df %>% filter(geo_name == region_name, weeks_ago > 0) %>% ggplot(aes(x = as.character(week_end), group = geo_name, y = S_N_z, ymax = upper_control_limit, ymin = lower_control_limit)) + geom_hline(yintercept = 0, color = "red") + geom_point(aes(y = S_N_z)) + geom_ribbon(fill = NA, color = "black") + theme_minimal() + theme(axis.text.x = element_text(angle = 90)) + labs(x = "End date for 7 Day bins", y = "", title = paste("Reverse CUSUM Chart for", region_name)) } plot_rev_cusum(hdt_cusum, "Wisconsin")
plot_rev_cusum(hdt_cusum, "La Crosse")
plot_rev_cusum(hdt_cusum, "Dane")
plot_rev_cusum(hdt_cusum, "Iron")
plot_rev_cusum(hdt_cusum, "Fox Valley Area")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.