An introduction to incidence2

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
data.table::setDTthreads(2)

What does it do?

incidence2 is an R package that implements functions to compute, handle and visualise incidence data. It aims to be intuitive to use for both interactive data exploration and as part of more robust outbreak analytic pipelines.

The package is based around objects of the namesake class, incidence2. These objects are a tibble subclass with some additional invariants. That is, an incidence2 object must:

Functions at a glance

To create and work with incidence2 objects we provide a number of functions:

Basic Usage

Examples in the vignette utilise three different sets of data:

Computing incidence from a linelist

Broadly speaking, we refer to data with one row of observations (e.g. 'Sex', 'Date of symptom onset', 'Date of Hospitalisation') per individual as a linelist.

library(incidence2)

# linelist from the simulated ebola outbreak  (removing some missing entries)
ebola <- subset(outbreaks::ebola_sim_clean$linelist, !is.na(hospital))
str(ebola)

To compute daily incidence we pass to incidence() our linelist data frame as well as the name of a column in the data that we can use to index over time. Whilst we refer to this index as the date_index, there is no restriction on its type, save the requirement that it has an inherent ordering.

(daily_incidence <- incidence(ebola, date_index = "date_of_onset"))
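
Conceptually, for a linelist with a single date column, the daily count is a tabulation of that column. The base R sketch below illustrates the idea on a small made-up data frame (`toy` and its values are hypothetical). It is not how incidence() is implemented, and it omits the class and invariants that incidence2 objects carry.

```r
# Toy linelist: one row per case (hypothetical data, not the ebola set)
toy <- data.frame(
    id = 1:6,
    date_of_onset = as.Date(c(
        "2020-01-01", "2020-01-01", "2020-01-02",
        "2020-01-04", "2020-01-04", "2020-01-04"
    ))
)

# Count the number of rows for each distinct onset date
tab <- table(toy$date_of_onset)
daily_counts <- data.frame(
    date_index = as.Date(names(tab)),
    count = as.integer(tab)
)
daily_counts
```

Dates on which no case was observed (here 2020-01-03) do not appear in the result at all.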

incidence2 also provides a simple plot method (see help("plot.incidence2")) built upon ggplot2.

plot(daily_incidence)

The daily data is quite noisy, so it may be worth grouping the dates prior to calculating the incidence. One way to do this is to utilise functions from the grates package. incidence2 depends on the grates package so all of its functionality is available directly to users. Here we use the as_isoweek() function to convert the 'date of onset' to an isoweek (a week starting on a Monday) before proceeding to calculate the incidence:

(weekly_incidence <- 
    ebola |>
    mutate(date_of_onset = as_isoweek(date_of_onset)) |> 
    incidence(date_index = "date_of_onset"))
plot(weekly_incidence, border_colour = "white")
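
To make the ISO week grouping concrete, the following sketch labels a few hypothetical dates with their ISO year and week using base R's format(). This is only a stand-in for illustration: grates represents weeks with its own classes, and nothing here reflects how as_isoweek() works internally.

```r
# Hypothetical onset dates straddling a year boundary
onset <- as.Date(c("2019-12-29", "2019-12-30", "2020-01-01", "2020-01-06"))

# %G is the ISO 8601 week-based year, %V the ISO 8601 week number
iso_label <- format(onset, "%G-W%V")

# Number of cases falling in each ISO week
weekly_counts <- table(iso_label)
weekly_counts
```

Note that Sunday 2019-12-29 falls in the last ISO week of 2019, whilst Monday 2019-12-30 already belongs to the first ISO week of 2020.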

As this sort of date grouping is often required we have chosen to integrate this within the incidence() function via the interval parameter. interval can take any of the following values:

As an example, the following is equivalent to the weekly_incidence output above:

(dat <- incidence(ebola, date_index = "date_of_onset", interval = "isoweek"))
# check equivalent
identical(dat, weekly_incidence)

If we wish to aggregate by specified groups we can use the groups argument. For instance, to compute the weekly incidence by gender:

(weekly_incidence_gender <- incidence(
    ebola,
    date_index = "date_of_onset",
    groups = "gender",
    interval = "isoweek"
))

For grouped data, the plot method will create a faceted plot across groups unless a fill variable is specified:

plot(weekly_incidence_gender, border_colour = "white", angle = 45)
plot(weekly_incidence_gender, border_colour = "white", angle = 45, fill = "gender")

incidence() also supports multiple date inputs and allows renaming via the use of named vectors:

(weekly_multi_dates <- incidence(
    ebola,
    date_index = c(
        onset = "date_of_onset",
        infection = "date_of_infection"
    ), 
    interval = "isoweek",
    groups = "gender"
))
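
One way to picture the result of passing several date columns is as a long table in which each date column contributes its own set of counts, labelled by the name given in date_index. The base R sketch below mimics that shape on hypothetical data. The column names date_index, count_variable and count are chosen to mirror the output printed above, and the approach is an illustration rather than the package's implementation.

```r
# Hypothetical linelist with two date columns
toy <- data.frame(
    date_of_infection = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
    date_of_onset     = as.Date(c("2020-01-03", "2020-01-04", "2020-01-04"))
)

# Stack the two date columns, recording which one each row came from
long <- rbind(
    data.frame(date_index = toy$date_of_onset, count_variable = "onset"),
    data.frame(date_index = toy$date_of_infection, count_variable = "infection")
)
long$count <- 1L

# One count per (date, variable) combination
counts <- aggregate(
    long["count"],
    by = long[c("date_index", "count_variable")],
    FUN = sum
)
counts
```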

For a quick, high-level, overview of grouped data we can use the summary() method:

summary(weekly_multi_dates)

When multiple date indices are given, each is plotted on its own row of the resulting plot, unless the count_variable column is used to fill:

plot(weekly_multi_dates, angle = 45, border_colour = "white")
plot(weekly_multi_dates, angle = 45, border_colour = "white", fill = "count_variable")

Computing incidence from pre-aggregated data

In terms of this package, pre-aggregated data is data where we have a single column representing time and associated counts linked to those times (still optionally split by characteristics). The included Covid data set is in this wide format, with multiple count values given for each day.

covid <- subset(
    covidregionaldataUK,
    !region %in% c("England", "Scotland", "Northern Ireland", "Wales")
)
str(covid)

As with our linelist data, incidence() requires us to specify a date_index column and optionally our groups and/or interval. In addition, we must now provide the counts variable(s) that we are interested in.

Before continuing, take note of the missing values in the output above. Where these occur in one of the count variables, incidence() warns users:

monthly_covid <- incidence(
    covid,
    date_index = "date",
    groups = "region",
    counts = "cases_new",
    interval = "yearmonth"
)
monthly_covid

Whilst we could have let incidence() ignore missing values (equivalent to using sum(..., na.rm = TRUE)), we prefer that users make an explicit choice on how these should (or should not) be imputed. For example, to treat missing values as zero counts we can simply replace them in the data prior to calling incidence():

(monthly_covid <-
     covid |>
     tidyr::replace_na(list(cases_new = 0)) |> 
     incidence(
         date_index = "date",
         groups = "region",
         counts = "cases_new",
         interval = "yearmonth"
     ))
plot(monthly_covid, nrow = 3, angle = 45, border_colour = "white")
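
The effect of this choice on a single aggregated total can be seen with a small base R example. The vector `cases` below is hypothetical and stands in for the daily counts that fall within one month.

```r
# Hypothetical daily counts within one month, one of them missing
cases <- c(3L, NA, 5L)

# Summing as-is propagates the missing value to the total
total_with_na <- sum(cases)

# Treating the missing value as a zero count gives a numeric total
cases_imputed <- replace(cases, is.na(cases), 0L)
total_imputed <- sum(cases_imputed)

c(with_na = total_with_na, imputed = total_imputed)
```

Replacing with zero is only one possible choice. Whether it is appropriate depends on why the values are missing.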

Plotting in the style of the European Programme for Intervention Epidemiology Training (EPIET)

For small datasets it is an EPIET convention to display individual cases as rectangles. We can do this by setting show_cases = TRUE in the call to plot(), which will display each case as an individual square with a white border.

dat <- ebola[160:180, ]

incidence(
    dat,
    date_index = "date_of_onset",
    date_names_to = "date"
) |> 
plot(color = "white", show_cases = TRUE, angle = 45, n_breaks = 10)
incidence(
    dat,
    date_index = "date_of_onset",
    groups = "gender",
    date_names_to = "date"
) |> 
plot(show_cases = TRUE, color = "white", angle = 45, n_breaks = 10, fill = "gender")

Support for tidy-select semantics

When working interactively it can feel a little onerous constantly having to quote inputs for column names. To alleviate this we include the functions incidence_() and regroup_() which both support tidy-select semantics in their column arguments (i.e. date_index, groups and counts).

For now we have chosen to distinguish the functions via the post-fix underscore and have a preference for the standard version for non-interactive (e.g. programmatic) usage. This could change over time if users feel having two similar functions is confusing.
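
As a brief illustration, the two calls below are intended to describe the same aggregation, once with quoted column names and once with tidy-select semantics. The data frame `dat` is hypothetical, and the final comparison is left for the reader to run rather than asserted here.

```r
library(incidence2)

# Hypothetical linelist
dat <- data.frame(
    dates = as.Date(c("2020-01-01", "2020-01-01", "2020-01-04")),
    gender = c("male", "female", "female")
)

# Standard version: column names are passed as strings
quoted <- incidence(dat, date_index = "dates", groups = "gender")

# Tidy-select version: column names are passed unquoted
unquoted <- incidence_(dat, date_index = dates, groups = gender)

# Compare the two results
all.equal(quoted, unquoted)
```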

Working with incidence objects

On top of the incidence constructor function and the basic plotting, printing and summary we provide a number of other useful functions and integrations for working with incidence2 objects.

Note: The following sections utilise tidy-select semantics and hence use the post-fix version of the incidence function (incidence_())

regroup()

If you've created a grouped incidence object but now want to change the internal grouping, you can regroup() to the desired aggregation:

# generate an incidence object with 3 groups
(x <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = c(gender, hospital, outcome),
    interval = "isoweek"
))
# regroup to just two groups
regroup_(x, c(gender, outcome))
# standard (non-tidy-select) version
regroup(x, c("gender", "outcome"))
# drop all groups
regroup(x)

complete_dates()

Sometimes your incidence data does not span consecutive units of time, or different groupings may cover different periods. To this end we provide a complete_dates() function which ensures a complete and identical range of dates is given counts for every grouping (by default filling with a 0 value).

dat <- data.frame(
    dates = as.Date(c("2020-01-01", "2020-01-04")),
    gender = c("male", "female")
)

(incidence <- incidence_(dat, date_index = dates, groups = gender))
complete_dates(incidence)
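
The idea behind completing dates can be sketched in base R: build every combination of date and group over the observed range, then fill combinations without observations with zero. The names `obs`, `grid` and `completed` are hypothetical, and this is an illustration of the concept rather than the method complete_dates() uses.

```r
# Hypothetical daily counts with gaps and non-overlapping groups
obs <- data.frame(
    date_index = as.Date(c("2020-01-01", "2020-01-04")),
    gender = c("male", "female"),
    count = c(1L, 1L)
)

# Every date in the observed range, crossed with every group
full_dates <- seq(min(obs$date_index), max(obs$date_index), by = "day")
grid <- expand.grid(
    date_index = full_dates,
    gender = unique(obs$gender),
    stringsAsFactors = FALSE
)

# Left join the observations and fill unobserved combinations with zero
completed <- merge(grid, obs, all.x = TRUE)
completed$count[is.na(completed$count)] <- 0L
completed
```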

keep_first(), keep_last() and keep_peaks()

Once your data is grouped by date, you can select the first or last few entries based on a particular date grouping using keep_first() and keep_last():

weekly_incidence <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = hospital,
    interval = "isoweek"
)

keep_first(weekly_incidence, 3)
keep_last(weekly_incidence, 3)

Similarly, keep_peaks() returns the rows corresponding to the maximum count value for each grouping of an incidence2 object:

keep_peaks(weekly_incidence)
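
In base R terms, selecting the peak row per group amounts to splitting by group and keeping the row with the largest count. The sketch below uses a hypothetical data frame `counts`. It resolves ties by taking the first maximum, which is an assumption of this example and not a statement about how keep_peaks() handles ties.

```r
# Hypothetical weekly counts for two hospitals
counts <- data.frame(
    hospital = c("A", "A", "A", "B", "B"),
    week = c(1L, 2L, 3L, 1L, 2L),
    count = c(2L, 5L, 3L, 4L, 1L)
)

# For each hospital keep the row with the largest count
# (which.max() returns the first maximum if there are ties)
peaks <- do.call(
    rbind,
    lapply(split(counts, counts$hospital), function(d) d[which.max(d$count), ])
)
peaks
```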

Bootstrapping and estimating peaks

estimate_peak() returns an estimate of the peak of an epidemic curve using bootstrapped samples of the available data. It is a wrapper around two functions:

Note that the bootstrapping approach used for estimating the peak time makes the following assumptions:

influenza <- incidence_(
    outbreaks::fluH7N9_china_2013,
    date_index = date_of_onset,
    groups = province
)

# across provinces (we suspend progress bar for markdown)
estimate_peak(influenza, progress = FALSE) |> 
    subset(select = -count_variable)
# regrouping for overall peak
plot(regroup(influenza))
estimate_peak(regroup(influenza), progress = FALSE) |> 
    subset(select = -count_variable)
# return the first peak of the grouped and ungrouped data
first_peak(influenza)
first_peak(regroup(influenza))
# bootstrap a single sample
bootstrap_incidence(influenza)
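
For intuition, a generic bootstrap of the peak date can be sketched in base R: resample the case dates with replacement, find the date with the most cases in each resample, and look at the spread of those dates. The helper `peak_of()`, the data and the number of replicates are all hypothetical, and the procedure is a simplified stand-in that may differ from what estimate_peak() does.

```r
set.seed(1)

# Hypothetical onset dates, one element per case
onset <- as.Date("2020-01-01") + c(0, 1, 1, 2, 2, 2, 3, 3, 4)

# Date with the highest number of cases (first one if tied)
peak_of <- function(dates) {
    tab <- table(dates)
    as.Date(names(tab)[which.max(tab)])
}

observed_peak <- peak_of(onset)

# Resample cases with replacement and recompute the peak each time
boot_peaks <- as.Date(
    vapply(
        seq_len(200),
        function(i) as.numeric(peak_of(sample(onset, replace = TRUE))),
        numeric(1)
    ),
    origin = "1970-01-01"
)

observed_peak
table(boot_peaks)
```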

cumulate()

You can use cumulate() to easily generate cumulative incidences:

(y <- cumulate(weekly_incidence))
plot(y, angle = 45, nrow = 3)
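
A cumulative incidence is a running total of counts within each group. In base R this can be sketched with ave() and cumsum() on a hypothetical data frame `counts`, assuming the rows are already ordered by time within each group. This illustrates the concept only and is not the implementation of cumulate().

```r
# Hypothetical weekly counts, ordered by week within each hospital
counts <- data.frame(
    hospital = c("A", "A", "A", "B", "B"),
    week = c(1L, 2L, 3L, 1L, 2L),
    count = c(2L, 5L, 3L, 4L, 1L)
)

# Running total of counts, restarted for each hospital
counts$cumulative <- ave(counts$count, counts$hospital, FUN = cumsum)
counts
```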

Building on incidence2

The benefit incidence2 brings is not in the functionality it provides (which is predominantly wrapping around the functionality of other packages) but in the guarantees incidence2 objects give to a user about the underlying object structure and invariants that must hold.

To make these objects easier to build upon we give sensible behaviour when the invariants are broken, an interface to access the variables underlying the incidence2 object, and methods for popular group-aware generics that implicitly utilise the underlying grouping structure.

Class preservation

As mentioned at the beginning of the vignette, by definition, incidence2 objects must:

Due to these requirements it is important that these objects preserve (or drop) their structure appropriately under the range of different operations that can be applied to data frames. By this we mean that if an operation is applied to an incidence2 object then, as long as the invariants of the object are preserved (i.e. required columns and uniqueness of rows), the object will retain its incidence2 class. If the invariants are not preserved then a tibble will be returned instead.

# create a weekly incidence object
weekly_incidence <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = c(gender, hospital),
    interval = "isoweek"
)

# filtering preserves class
weekly_incidence |> 
    subset(gender == "f" & hospital == "Rokupa Hospital") |> 
    class()

class(weekly_incidence[c(1L, 3L, 5L), ])

# adding columns preserves class
weekly_incidence$future <- weekly_incidence$date_index + 999L
class(weekly_incidence)
weekly_incidence |> 
    mutate(past = date_index - 999L) |> 
    class()

# renaming preserves class
names(weekly_incidence)[names(weekly_incidence) == "date_index"] <- "isoweek"
str(weekly_incidence)

# select returns a data frame unless all date, count and group variables are
# preserved in the output
str(weekly_incidence[,-1L])
str(weekly_incidence[, -6L])

# binding rows drops the class, but only if it results in duplicate rows
class(rbind(weekly_incidence, weekly_incidence))
class(rbind(weekly_incidence[1:5, ], weekly_incidence[6:10, ]))
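
The kind of check described above (required columns present, no duplicated date and group combinations) can be sketched as a plain function. The helper `has_invariants()` and the data frame `toy` are hypothetical, and the check is simplified: it is not the validation incidence2 performs and it ignores the count variable column.

```r
# Hypothetical invariant check: required columns exist and
# (date, group) combinations are unique
has_invariants <- function(df, date = "date_index", groups = character(),
                           count = "count") {
    required <- c(date, groups, count)
    all(required %in% names(df)) &&
        anyDuplicated(df[c(date, groups)]) == 0L
}

toy <- data.frame(
    date_index = as.Date("2020-01-01") + c(0, 0, 1, 1),
    gender = c("f", "m", "f", "m"),
    count = c(1L, 2L, 3L, 4L)
)

ok_full    <- has_invariants(toy, groups = "gender")              # TRUE
ok_rows    <- has_invariants(toy[c(1, 3), ], groups = "gender")   # TRUE: row subset
ok_no_date <- has_invariants(toy[, -1], groups = "gender")        # FALSE: date column dropped
ok_dupes   <- has_invariants(rbind(toy, toy), groups = "gender")  # FALSE: duplicated rows
c(ok_full, ok_rows, ok_no_date, ok_dupes)
```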

Accessing variable information

We provide multiple accessors to easily access information about an incidence2 object's structure:

# the name of the date_index variable of x
get_date_index_name(weekly_incidence)
# alias for `get_date_index_name()`
get_dates_name(weekly_incidence)
# the name of the count variable of x
get_count_variable_name(weekly_incidence)
# the name of the count value of x
get_count_value_name(weekly_incidence)
# the name(s) of the group variable(s) of x
get_group_names(weekly_incidence)
# the date_index variable of x
str(get_date_index(weekly_incidence))
# alias for get_date_index
str(get_dates(weekly_incidence))
# the count variable of x
str(get_count_variable(weekly_incidence))
# the count value of x
str(get_count_value(weekly_incidence))
# list of the group variable(s) of x
str(get_groups(weekly_incidence))

Grouping aware methods

incidence2 provides methods for popular group-aware generics from both base R and the wider package ecosystem:

When called on incidence2 objects, these methods will utilise the underlying grouping structure without the user needing to explicitly state what it is. This makes them easy to use in analysis pipelines.

Example - Adding a rolling average

weekly_incidence |>
    regroup_(hospital) |> 
    mutate(rolling_average = data.table::frollmean(count, n = 3L, align = "right")) |> 
    plot(border_colour = "white", angle = 45) +
    ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average))
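
Without a group-aware method, the same calculation needs the grouping spelled out explicitly. The base R sketch below computes a right-aligned three-point rolling mean within each group of a hypothetical data frame `counts`, using stats::filter() in place of data.table::frollmean(). The helper `trailing_mean()` is introduced here for illustration only.

```r
# Hypothetical weekly counts, ordered by week within each hospital
counts <- data.frame(
    hospital = c("A", "A", "A", "A", "B", "B", "B"),
    week = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
    count = c(2, 5, 3, 6, 4, 1, 1)
)

# Mean of the current and previous n - 1 values (NA until n values exist);
# each group must contain at least n observations
trailing_mean <- function(x, n = 3L) {
    as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# The grouping variable has to be passed explicitly
counts$rolling_average <- ave(counts$count, counts$hospital, FUN = trailing_mean)
counts
```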


incidence2 documentation built on June 22, 2024, 11:05 a.m.