options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(intervalaverage)
Note: This package and vignette makes extensive use of data.table
. If
you're unfamiliar with the data.table
syntax, a brief review of that
package's introductory
vignette
may be useful.
intervalaverage
do?The intervalaverage
package is intended to do a specific job very
efficiently. Namely, it averages data measured non-continuously over
arbitrary intervals. It is implemented in C++ and data.table
in order
to do this job as fast and in as memory-efficient a way as possible.
The motivation for this package was to efficiently create time- and location-weighted averages of air pollution exposure for participants in epidemiological studies. In these studies, air pollution exposure is modeled on a weekly, monthly, or annual basis for study participants' home addresses. In order to create a long-term average of each study participant's air pollution exposure, it is necessary to calculate an average of the modeled exposures for each address for the period during which they lived there, which may not align nicely with the exposure periods produced by the models.
There are likely many other applications for averaging non-continuous measurements over arbitrary intervals, but this documentation will demonstrate the application of calculating long-term air pollution exposures for individuals over multiple locations.
We will use two data sets (included in this package) to demonstrate
intervalaverage
's main functionality.
no2
Nitrogen Dioxide (NO~2~) modeled annually at 62 home addresses
(identified by location_id
)
address_history
The time periods for which 25 people (identified
by person_id
) lived at each of these home address.
library(intervalaverage) data("no2") data("address_history") setDT(no2) setDT(address_history) head(no2[]) head(address_history[])
Note that each of these data sets is already a data.table
and the date columns
are IDate
s. However, due to a quirk in data.table
, we need to run setDT
on data loaded with the data
function.
Inspecting the no2
data set above, we see that each location has a
single NO~2~ measurement for each calendar year. It would be simple to
find the average measurement over, for example, 2002-01-01 through
2005-12-31, because that period aligns with our measurements. However,
what if we wanted to find the average NO~2~ level at each of the periods
in our address_history
? Not only does each address have a different
date range to average over, but that range doesn't align nicely with the
measurements! For each address, we need to take a mean of the
measurements, weighted by how much of the measurement was overlapped by
the address history.
This is straightforward with intervalaverage
's intervalaverage
function!
The y
argument of intervalaverage
tells it what date range to
average for every group in the values data set (in this case, every
location_id
in no2
). For now, we don't care about who lived at each
address, only the unique addresses and date ranges. Since some people in
our address history shared homes, we begin by creating a set of unique
addresses and their associated dates.
unique_addresses <- unique(address_history[ , .(location_id, start_date, end_date)])
Now we can use the intervalaverage
function:
averaged_exposures <- intervalaverage( x = no2, y = unique_addresses, interval_vars = c("start_date", "end_date"), value_vars = "no2", group_vars = "location_id" ) head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
You'll notice that there are many missing values. By default,
intervalaverage
returns NA
unless the interval specified in y
is
completely covered by the values in x
, and many of our addresses were
occupied outside the time period covered by the NO~2~ model. To relax
this requirement and return a value even when not all of the interval
had NO~2~ values, use the required_percentage
argument. For example,
here we accept an average if at least 80% of the interval has values.
averaged_exposures <- intervalaverage( x = no2, y = unique_addresses, interval_vars = c("start_date", "end_date"), value_vars = "no2", group_vars = "location_id", required_percentage = 80 ) head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
intervalaverage
also includes some diagnostic columns in its output
that can be useful in determining why NA
s are being returned (this is
why we selected only location_id
, start_date
, end_date
, and no2
to display above). See the "Value" section of help(intervalaverage)
for an explanation of what these columns mean.
If you look closely at our address history, you'll see that some
locations and dates are shared among multiple person_id
s. This makes
sense, since more than one person often lives in the same house! You'll
also notice that people move (so each person_id
may be associated with
multiple location_id
s).
To create an average for each person (rather than location), we'll need
to link the no2
measurement at each location to the person who lived
there at the time. Remember, multiple people might have lived at the
same location at the same or different times.
The intervalaverage
package does this with the intervalintersect
function. It looks very similar to the intervalaverage
function:
no2_by_person <- intervalintersect( x = no2, y = address_history, interval_vars = c('start_date', 'end_date'), group_vars = 'location_id' ) head(no2_by_person)
To understand what's happening here, let's look at person 5.
This person moved from location 40 to location 41 in November of 1996.
address_history[person_id == 5]
If you look at the result of intervalintersect
, you'll see that person
5 has two no2
values for 1996, and that each one is associated with
the part of the year when they lived where that value applied.
no2_by_person[person_id == 5 & year(start) %in% 1995:1997]
Now that we have a set of NO~2~ values associated with the person_id
rather than location_id
, we can use intervalaverage
to find the
average for any period of time. That could be a period that is unique to each participant.
Or it could be consistent for all people in the data set, such as a yearly average
(where each year is a time-weighted average of the places the person
lived that year).
We will work through both, and introduce one more function along the way.
The package includes another data.table with each person's enrollment date in our study. We will use it to calculate the average exposure for each participant in the year before they joined our study.
data(enrollment) setDT(enrollment) head(enrollment)
Again, the y
argument of intervalaverage
tells it what date range to
average for every group in the values data set (x
). So we first need to create a
data.table
that contains the beginning and end of our desired averaging period for each participant from the enrollment
data set. We'll define the "year prior" to enrollment in our study as the 365 days preceding the enrollment date
(i.e. the period from 365 days before enrollment to 1 day before enrollment).
pre_enrollment <- enrollment[ , .(person_id, start = enroll_date - 365L, end = enroll_date - 1L)] head(pre_enrollment)
You may notice I've used start
and end
here rather than start_date
and end_date
as I did previously.
This is because of a quirk of intervalintersect
where it returns start
and end
as the interval column names
regardless of the input names, and in the next step, the interval column names need to match. This will be fixed
in future versions of the package.
Now we simply apply intervalaverage
, again allowing for 20% missingness in no2
values:
no2_pre_enrollment <- intervalaverage( x = no2_by_person, y = pre_enrollment, interval_vars = c('start', 'end'), group_vars = 'person_id', value_vars = 'no2', required_percentage = 80 ) head(no2_pre_enrollment[, .(person_id, start, end, no2)])
Finally, let's calculate annual (calendar-year) averages for 2000-2004. We'll need to build a y
argument
for the intervalaverage
function that has a start date, end date for each year. However, even though
the periods we are averaging over are the same for each participant, we also need to repeat each
year once for every participant. This is because because y
must contain the same grouping variable(s) as x
.
This gives us an opportunity to try a convenient helper function in intervalaverage
: CJ.dt
. This is a more
convenient version of data.table::CJ
(CJ
as in "Cross Join") that can take data.tables
as arguments instead of just vectors.
By cross joining our unique person_id
s with the years 2000-2004, we can construct a y
argument that will work
in the intervalaverage
function.
# Build a data.table with start and end dates for each year years <- data.table(year = 2000:2004) years[ , start := as.IDate(paste(year, "01", "01", sep = "-"))] years[ , end := as.IDate(paste(year, "12", "31", sep = "-"))] years[]
# Cross join with the unique person_ids in our data years <- CJ.dt(years, unique(address_history[ , .(person_id)])) head(years[order(year, person_id)])
Now we can apply the intervalaverage
function again to get a yearly average for each person. Recall that the
NO~2~ data is also annual, so the only rows that will actually be averaged will be those for years where a person
moved (since a person will have had an NO~2~ value for each address they lived in that year).
no2_annual <- intervalaverage( x = no2_by_person, y = years, interval_vars = c('start', 'end'), group_vars = 'person_id', value_vars = 'no2', required_percentage = 80 ) head(no2_annual[, .(person_id, start, end, no2)])
For a more detailed and advanced tutorial of how to use the intervalaverage
package, see vignette("intervalaverage-advanced")
.
For information about the inner workings of the package (intended for maintainers), see vignette("intervalaverage-technicaloverview")
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.