```r
knitr::opts_chunk$set(echo = TRUE)
```
An essential piece of analysis for large data sets is efficient granularization: computing aggregations like `sum`, `mean`, `sd`, `min`, and `max`, in which a single number gives insight into the nature of a larger population of measurements.
Time series aggregation is the aggregation of all data points over a specified period. Within the AirSensor package, this is achieved with `pat_aggregate()`, which applies an aggregating function, like those mentioned above, over a temporal subset of the data. By default, time series data is broken up into 1-hour periods. The result of the aggregation is a new dataset in which each data point reflects a statistical summary of the measurements collected during that hour.
To demonstrate this feature, we'll load a 24-hour period of PurpleAir data and compare the raw data with the aggregated data.
```r
# AirSensor setup
library(AirSensor)
setArchiveBaseUrl("http://data.mazamascience.com/PurpleAir/v1")

# Load the PurpleAir sensor data
pas <- pas_load(archival = TRUE)
pat <- pat_load(
  label = 'SCSC_33',
  pas = pas,
  startdate = 20200501,
  enddate = 20200502
)
```
A standard 24-hour period of non-aggregated data typically consists of 720 data entries -- one record every 2 minutes.
```r
nrow(pat$data)
```
In the multi-plot below we can see the high temporal resolution of the raw data.
```r
pat_multiPlot(pat, sampleSize = NULL)
```
Using `pat_aggregate()` we can aggregate the `pat` object to an hourly average of the data. Hourly reporting is the standard for most regulatory air quality monitoring and is the recommended period to use. It is also the default.
```r
hourly_pat <- pat_aggregate(pat)
nrow(hourly_pat$data)
```
As we'd expect, an hourly aggregated `pat` contains 24 records, one for each hour. Spikes seen in the raw data contribute to each hourly average, but the overall effect is a much smoother time series.
```r
pat_multiPlot(hourly_pat)
```
With care, we can extend the use of `pat_aggregate()` to summarize time series `pat` data over nearly any period. Sub-hour aggregation may be useful in creating custom QC functions.
You can create different aggregation periods by explicitly providing `unit`, a string describing the period to split by (`unit = 'hours'`, `'minutes'`, `'weeks'`, `'months'`, etc.), and `count`, the number of units to aggregate in each bin. For example, a 15-minute standard deviation (`sd`) aggregation would look like this:
```r
# Aggregate each 15-minute period using the standard deviation
sd_fifteen_minute_pat <- pat_aggregate(
  pat,
  function(x) { sd(x, na.rm = TRUE) },
  unit = 'minutes',
  count = 15
)

# View the first few rows of the aggregated data
head(sd_fifteen_minute_pat$data)
```
In order to write custom aggregation functions for use with `pat_aggregate()`, we must first familiarize ourselves with its underlying algorithm.
When executed, `pat_aggregate(pat, FUN)` utilizes the `datetime` axis of a PurpleAir Timeseries object (`pat`) to split the data into time-granular bins. For each column of numeric data within `pat$data`, `pat_aggregate()` applies `FUN` to the binned data to produce an hourly (by default) vector of values.
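To make the binning step concrete, here is a simplified, self-contained sketch of the split-apply-combine idea in base R. This is only an illustration, not the package's actual implementation, and the synthetic data frame and column names are invented for the example.

```r
# Synthetic 2-minute measurements standing in for pat$data
# (column names here are invented for illustration)
datetime <- seq(
  from = as.POSIXct("2020-05-01 00:00:00", tz = "UTC"),
  by = "2 min",
  length.out = 720
)
fake_data <- data.frame(
  pm25_A = runif(720, min = 0, max = 50),
  pm25_B = runif(720, min = 0, max = 50)
)

# Split the datetime axis into 1-hour bins, then apply FUN to each
# numeric column within each bin
bins <- cut(datetime, breaks = "1 hour")
FUN <- function(x) mean(x, na.rm = TRUE)
hourly <- aggregate(fake_data, by = list(datetime = bins), FUN = FUN)

nrow(hourly)  # 24 rows, one per hour
```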
`FUN` may be assigned to any valid R function! (Caveat: with great power comes great responsibility.) The only requirements are that `FUN` must operate on univariate numeric data and return a scalar value (think `sum` or `mean`). The last step in `pat_aggregate()` is to combine the transformed bins along a similarly binned `datetime` axis and return a data object of the same `pat` class (`pa_timeseries`).
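As a quick illustration of these requirements, the sketch below passes a custom percentile function to `pat_aggregate()`. It assumes the `pat` object loaded earlier in this vignette; the 95th-percentile statistic is an arbitrary choice made for demonstration purposes.

```r
# A custom aggregation function: the 95th percentile of each bin.
# It accepts univariate numeric data and returns a single scalar value.
p95 <- function(x) {
  quantile(x, probs = 0.95, na.rm = TRUE, names = FALSE)
}

p95_pat <- pat_aggregate(pat, p95)
head(p95_pat$data)
```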
The ability to create custom functions for use in aggregation opens the door wide for exploratory data analysis and QC design.
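For instance, a QC-oriented aggregation might report the fraction of valid (non-missing) readings in each bin. The `valid_fraction` helper below is a hypothetical example written for this sketch; it is not part of the AirSensor package.

```r
# Hypothetical QC metric: fraction of non-missing readings per bin
valid_fraction <- function(x) {
  sum(!is.na(x)) / length(x)
}

qc_pat <- pat_aggregate(pat, valid_fraction)
head(qc_pat$data)
```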