knitr::opts_chunk$set(fig.width = 7, fig.height = 5, comment = NA) options(width = 105)
This vignette documents the monitor_nowcast()
function, which converts a
ws_monitor
object's values to NowCast values. We provide details on the
NowCast algorithm and our implementation thereof. We also provide examples to
highlight specific attributes and potential points of confusion in the algorithm.
This vignette also briefly covers the monitor_aqi()
function, which uses the
monitor_nowcast()
function to convert raw PM2.5 data into NowCast-based AQI
values.
NowCast is an air quality data smoothing algorithm that puts an emphasis on recent values when measurements are unstable, and approaches a long-term (e.g. 12-hour) average when measurements are stable.
The original algorithm, known as the Conroy method, was developed in 2003 to make real-time air quality measurements roughly comparable to established regulatory air quality health thresholds (e.g. 24‐hour PM2.5 standards). However, that method was shown to be slow to respond to rapidly changing air quality conditions, which, at best, reduced public confidence in disseminated AQI values, and at worst, had the potential to adversely affect the health of those in high impact areas (e.g. near wildfires).
In response, EPA developed a new method -- the Reff method -- in 2013 to be more responsive to rapidly changing air quality conditions. We provide technical support for applying the new NowCast algorithm to hourly PM2.5, PM10, and O3 data, though theoretically it could be applied to regular-interval time series data of any type, including other criteria pollutants.
We provide algorithm specifics for PM2.5, PM10, and O3 below.
Sources:
https://en.wikipedia.org/wiki/NowCast_(air_quality_index)
https://forum.airnowtech.org/t/the-nowcast-for-ozone-and-pm/172
The NowCast value for a given hour can be calculated as follows:
$$ NowCast = \frac{\sum_{i=1}^{N}{w^{i-1}c_i}}{\sum_{i=1}^{N}w^{i-1}} $$
where $i$, $N$, $w$ and $c$ are as defined in the sections below.
Sources:
https://en.wikipedia.org/wiki/NowCast_(air_quality_index)
The NowCast algorithm uses hourly averages from the prior $N$ clock hours, where the value of $N$ depends on the pollutant being processed:
The hourly averages are denoted below by $c_i$, where $i$ represents the number of hours before present. For example, $c_1$, $c_2$, $c_3$, $...$, $c_N$ represent the hourly averages for the most recent 1, 2, 3, $...$, $N$ hours.
Source:
https://forum.airnowtech.org/t/the-nowcast-for-ozone-and-pm/172
Using the $N$ data points specified above, define the weight factor $w^*$ as follows:
$$ w^* = 1- \frac{c_{max}-c_{min}}{c_{max}} = \frac{c_{min}}{c_{max}} $$
where
NowCast-related literature usually gives the equation for $w^*$ in one of two basic forms, both of which are included above for reference. Note that both forms are equivalent.
Sources:
https://en.wikipedia.org/wiki/NowCast_(air_quality_index)
https://forum.airnowtech.org/t/the-nowcast-for-ozone-and-pm/172
Before plugging into the NowCast equation, $w*$ is updated to $w$ for PM2.5 and PM10 as follows:
$$ w = \begin{cases} w^ & \text{if} & w^>\frac{1}{2} \ \frac{1}{2} & \text{if} & w^*\leq \frac{1}{2} \ \end{cases} $$
For O3, there is no minimum weight factor, so we define $w = w^*$
Source:
https://forum.airnowtech.org/t/the-nowcast-for-ozone-and-pm/172
Final NowCast values are truncated based on the type of data being processed:
Source:
https://forum.airnowtech.org/t/the-nowcast-for-ozone-and-pm/172
For PM2.5, which uses 12 hours of data, the NowCast equation can be expanded as follows:
$$ NowCast = \frac {w^0c_1 + w^1c_2 + w^2c_3 + w^3c_4 + w^4c_5 + w^5c_6 + w^6c_7 + w^7c_8 + w^8c_9 + w^9c_{10} + w^{10}c_{11} + w^{11}c_{12}} {w^0 + w^1 + w^2 + w^3 + w^4 + w^5 + w^6 + w^7 + w^8 + w^9 + w^{10} + w^{11}} $$
Note that:
When written this way, it is easy to see that in the extreme case where $w = 1$ (i.e. if $c_{min} = c_{max}$) the equation above reduces to:
$$ NowCast = \frac{\sum_{i=1}^{12}{c_i}}{12} $$
which is just a simple 12-hour arithmetic average. Incidentally, all 12 hourly averages and the 12-hour average itself would all be equivalent in this case.
In the case of highly variable PM2.5 data, $w$ would be set to the minimum value of $1/2$, and the most recent data would carry the majority of the weight in the equation above.
The NowCast algorithm ignores terms corresponding to hours for which a valid observation is not available. For example, suppose PM2.5 is invalid for all but the first three and last three hours of a 12-hour period. Then the PM2.5 NowCast equation takes the following form:
$$ NowCast = \frac {w^0c_1 + w^1c_2 + w^2c_3 + \color{gray}{\text{[note: middle values ignored]}} + w^9c_{10} + w^{10}c_{11} + w^{11}c_{12}} {w^0 + w^1 + w^2 + \color{gray}{\text{[note: middle values ignored]}} + w^9 + w^{10} + w^{11}} $$
Minimum data availability requirements do apply, however. See the following section for details.
To get a valid NowCast value, the NowCast algorithm simply requires valid data for at least two of the three most recent clock hours. This means that a valid NowCast value can be calculated from as little as two valid hours, even if 12 hours are typically used in the calculation.
However, note that https://en.wikipedia.org/wiki/NowCast_(air_quality_index) says:
Because the most recent hours of data are weighted so heavily in the NowCast when PM levels are changing, EPA does not report the NowCast when data is missing for c~1~ or c~2~.
While this seems like a reasonable approach, we could not find a source for
this statement (in the Wikipedia page sources or elsewhere). We take a
compromise approach and still allow valid NowCast values to be calculated when
c~2~ is invalid but return NA
when c~1~ is invalid.
includeShortTerm
As mentioned above, the NowCast algorithm only requires two valid hours to
calculate valid values. Does this mean that the monitor_nowcast()
function can
begin reporting valid values after the second hour in the data (assuming the
first two hours are both valid)?
We assert that it would be inappropriate to do so, usually.
In most cases a user will have created a ws_monitor object using one of the
~_createMonitorObject()
or ~_load()
, functions, which return data for a
specific period of time. Data before this period is not necessarily invalid, it
was simply not retrieved. But the function itself has no way of knowing whether
such earlier data exists, so it has no choice but to consider earlier hours
"invalid". This means that, if followed by-the-book, the NowCast algorithm could
return different values for a given hour depending on whether or not the earlier
data had been retrieved. This is not a desirable behavior, so by default the
monitor_nowcast()
returns invalid NowCast values until the $N$^th^ hour of data.
However, we provide a manual override in includeShortTerm=TRUE
which causes
the function to return valid values as per the bare-bones data availability
requirements described above, treating the hours before the beginning of the
data as invalid. Thus, it can return "valid" NowCast values as early as the
second (valid) hour in the data.
This argument may be useful (or even appropriate) for datasets where the beginning of the data truly corresponds to the beginning of the monitoring, such as when a new monitor has just been installed. In this case, data prior to the first hour is truly invalid since it does not exist. The argument may also be useful for field personnel looking to ensure that their monitor has been successfully plugged in to the data processing pipeline, even if the values themselves are not truly representative of the data for the past $N$ hours.
In the NowCast literature we find no mention of negative values, which, while
aphysical, are common in air quality monitoring data. Thus,
we do not adjust negative values up to zero in the monitor_nowcast()
function
itself. However, note that this may be done when creating a ws_monitor
object
in the first place.
Negative values are handled prior to converting NowCast or other values to AQI.
While PM2.5 and PM10 have minimum weight factors of $1/2$, there is no minimum
weight factor for O3. Does this mean O3 could have negative weight factors? We
find no mention of this in the NowCast literature; however, we assume $w_{min}$
cannot be negative since otherwise the calculation could become overwhelmed by
outrageous weight factors, even as large as negative infinity
(e.g. if $c_{min}<0$ and $c_{max}=0$). This assumption is built into our
monitor_nowcast()
function.
version
The version
argument sets defaults for the number of hours in the lookback $N$
(numHrs
), the minimum weight factor $c_{min}$ (weightFactorMin
), and the
number of digits to which the final data is truncated (digits
).
version='pm25'
(default)numHrs <- 12
weightFactorMin <- 0.5
digits <- 1
version='ozone'
numHrs <- 8
weightFactorMin <- NA
digits <- 3
version='pmAsian'
numHrs <- 3
weightFactorMin <- 0.1
digits <- 1
The default setting is version='pm25'
since this is the parameter most
commonly stored in ws_monitor
objects.
version='ozone'
supports the O3 NowCast as described above.
version='pmAsian'
supports an alternative shorter-term NowCast as proposed here:
https://aqicn.org/faq/2015-03-15/air-quality-nowcast-a-beginners-guide/
Although the NowCast algorithm itself supports PM10, we do not currently provide
functionality for this parameter in the monitor_nowcast()
function.
In the future we may allow manual override of the settings described above to allow for custom NowCast-type algorithms.
The EPA uses an Air Quality Index to put different pollutants on the same scale. From their site:
Think of the AQI as a yardstick that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 represents good air quality with little potential to affect public health, while an AQI value over 300 represents hazardous air quality.
We provide the monitor_aqi()
function to convert the PM2.5 data in a
ws_monitor
object into
NowCast-based AQI values in the 0-500 range.
The following examples demonstrate the functionality of monitor_nowcast()
and
specifics of its implementation.
For the following examples we will use the Northwest Megafires data from the PWFSLSmoke package. In particular, we will look at PM2.5 data from Omak, WA, which was heavily impacted by smoke from wildfires during the second half of August, 2015:
suppressPackageStartupMessages(library(PWFSLSmoke)) N_M <- monitor_subset(Northwest_Megafires, tlim=c(20150801,20150831), timezone="America/Los_Angeles") Omak <- monitor_subset(N_M, monitorIDs='530470013_01') Omak_nowcast <- monitor_nowcast(Omak)
monitor_timeseriesPlot(Omak, type='l', lwd=2) monitor_timeseriesPlot(Omak_nowcast, add=TRUE, type='l', col='purple', lwd=2) addAQILines() addAQILegend(lwd=1, pch=NULL, 'topleft') legend("topright", lwd=2, col=c('black','purple'), legend=c('hourly','nowcast')) title("Hourly and Nowcast PM2.5 Values\nOmak, Washington; August, 2015")
In the code above we used the monitor_nowcast()
function to calculate PM2.5
NowCast values for the Omak ws_monitor
object. Let's see if we can verify the
function's output for a single hour.
Below is the Omak PM2.5 data for the first 12 hours of 8/21/15. Let's see if we can verify the accuracy of the NowCast value for the last hour in this series, 8/21/15 Hour 11.
Omak_2015_08_21 <- monitor_subset(Omak, tlim=c(2015082100, 2015082111)) (example1_df <- Omak_2015_08_21$data)
First we'll pull out just the values themselves, and reverse them so the most recent values come first (i.e. so the vector represents $c_1$, $c_2$, $...$, $c_N$).
(example1_values <- rev(example1_df$`530470013_01`))
Per the NowCast algorithm, we define $w*$ as $\frac{c_{min}}{c_{max}}$:
(w_star <- min(example1_values)/max(example1_values))
We now define $w$ based on $w^*$ and the minimum weight factor; recall that $w_{min}=\frac{1}{2}$ for PM2.5:
(w <- max(1/2, w_star))
Thus, the numerator of the NowCast equation for Hour 11
$$ w^0c_1 + w^1c_2 + w^2c_3 + w^3c_4 + w^4c_5 + w^5c_6 + w^6c_7 + w^7c_8 + w^8c_9 + w^9c_{10} + w^{10}c_{11} + w^{11}c_{12} $$
becomes the following:
$0.5^0 \times 46.3 + 0.5^1 \times 27.4 + 0.5^2 \times 59.8 + 0.5^3 \times 129.2 + 0.5^4 \times 130.6 + 0.5^5 \times 215.4$ +
$0.5^6 \times 143.2 + 0.5^7 \times 93.7 + 0.5^8 \times 101.8 + 0.5^9 \times 49.3 + 0.5^{10} \times 80.2 + 0.5^{11} \times 123.3$
which we can calculate in R as follows
(numer <- sum(w^(0:11)*example1_values))
Meanwhile, the denominator
$$ w^0 + w^1 + w^2 + w^3 + w^4 + w^5 + w^6 + w^7 + w^8 + w^9 + w^{10} + w^{11} $$
becomes the following:
$0.5^0 + 0.5^1 + 0.5^2 + 0.5^3 + 0.5^4 + 0.5^5 + 0.5^6 + 0.5^7 + 0.5^8 + 0.5^9 + 0.5^{10} + 0.5^{11}$
which we can calculate in R as follows
(denom <- sum(w^(0:11)))
Dividing the numerator by the denominator gives a value of
numer/denom
which we truncate to one decimal place for a final NowCast value of 54.8 for 8/21/15 Hour 11.
So how does this compare to our monitor_nowcast()
output for the same hour?
monitor_subset(Omak_nowcast, tlim=rep(2015082111, 2))$data
Right on!
So we have verified the calculation for a period with 12 valid hours. But how does the function handle missing data periods? Let's take a look at a short period of missing data to find out.
Omak_2015_08_24 <- monitor_subset(Omak, tlim=c(2015082412, 2015082423)) (example2_df <- Omak_2015_08_24$data)
Here we see that the monitor data is missing for Hour 18. We want to verify that this hour is properly excluded from the NowCast calculation for subsequent hours. We also want to ensure that NowCast returns valid values for all hours in the vicinity, since all hours meet the minimum data availability requirements (i.e. two of three most recent hours valid).
Let's see if we can verify the accuracy of the NowCast value for Hour 23.
We again begin by pulling out our vector of values $c_1$, $c_2$, $...$, $c_N$
(example2_values <- rev(example2_df$`530470013_01`))
and calculating $w*$ and $w$
w_star <- min(example2_values, na.rm=TRUE)/max(example2_values, na.rm=TRUE) (w <- max(1/2, w_star))
This time our numerator and denominator should both exclude the 6th term, since $c_6$ is invalid. So we have
validIndexes <- which(!is.na(example2_values)) numer <- sum(w^(validIndexes-1)*example2_values[validIndexes]) denom <- sum(w^(validIndexes-1)) numer/denom
We truncate the value above to one decimal place for a final NowCast value of 543.8 for 8/24/15 Hour 23.
So how does this compare to our monitor_nowcast()
output for the same hour?
monitor_subset(Omak_nowcast, tlim=rep(2015082423, 2))$data
Again, right on!
A quick look at the monitored data alongside the NowCast data shows that all NowCast hours are valid except for the c~1~ moment when the monitoring data is also invalid.
example2_df$nowcast <- monitor_subset(Omak_nowcast, tlim=c(2015082412, 2015082423))$data$`530470013_01` colnames(example2_df) <- c("datetime", "monitored", "nowcast") example2_df
We now look at a longer period of missing data, from the day prior on 8/23/15.
Omak_2015_08_23 <- monitor_subset(Omak, tlim=c(2015082312, 2015082323)) (example3_df <- Omak_2015_08_23$data)
Here we see that the monitored data is missing for Hours 16-20. We again want to verify that these hours are properly excluded from the NowCast calculation for subsequent hours. This time we also want to ensure that NowCast returns invalid values for hours in which the minimum data availability requirements are not met (i.e. two of three most recent hours valid).
Let's see if we can verify the accuracy of the NowCast value for Hour 23.
We again begin by pulling out our vector of values $c_1$, $c_2$, $...$, $c_N$
(example3_values <- rev(example3_df$`530470013_01`))
and calculating $w*$ and $w$
w_star <- min(example3_values, na.rm=TRUE)/max(example3_values, na.rm=TRUE) (w <- max(1/2, w_star))
This time our numerator and denominator should both exclude the 4th-8th terms, since $c_4$, $c_5$, $c_6$, $c_7$ and $c_8$ are invalid. So we have
validIndexes <- which(!is.na(example3_values)) numer <- sum(w^(validIndexes-1)*example3_values[validIndexes]) denom <- sum(w^(validIndexes-1)) numer/denom
We truncate the value above to one decimal place for a final NowCast value of 164.7 for 8/23/15 Hour 23.
So how does this compare to our monitor_nowcast()
output for the same hour?
monitor_subset(Omak_nowcast, tlim=rep(2015082323, 2))$data
Again, right on!
Now we'll look at the monitored data alongside the NowCast data to see how the validity of the NowCast data was affected by the missing monitored hours.
example3_df$nowcast <- monitor_subset(Omak_nowcast, tlim=c(2015082312, 2015082323))$data$`530470013_01` colnames(example3_df) <- c("datetime", "monitored", "nowcast") example3_df
There's a lot going on here so let's walk through the data one step at a time.
Hours 12-15: NowCast data is valid because the monitored data is valid for
all of the three most recent hours.
Status: GOOD
Hour 16: This is the first hour for which the monitored data is invalid.
NowCast should also return invalid for this c~1~ timestep.
Status: GOOD
Hour 17: This NowCast value is invalid because the monitored data is only
valid for one of the three most recent hours (Hour 15).
Status: GOOD
Hours 18-20: NowCast is invalid since the monitored data is valid for none
of the three most recent hours.
Status: GOOD
Hour 21: This is the first hour for which the monitored data is valid again.
However, NowCast still returns an invalid value this hour since the monitored
data is only valid for one of the three most recent hours (Hour 21).
Status: GOOD
Hour 22: This is the first hour for which the NowCast data is valid again.
This is because the monitored data is again valid for two of the three most
recent hours (Hours 21 and 22).
Status: GOOD
Hour 23: NowCast data is valid because the monitored data is valid for all of the three most recent hours. Status: GOOD
So, it appears the monitor_nowcast()
returns data that has been correctly
validated according to the NowCast algorithm and associated data availability requirements.
A subtle point about the monitor_nowcast()
function which falls out of the
example above: the "current" hour is considered to be a part of the three
"most recent clock hours". This may seem strange at first, but to understand
why we chose this approach, one must think about how the data is captured and
how measurement timestamps correspond to the period they actually represent.
For particle pollution, measurements typically represent the mass of particles that accumulate on a filter during a set period of time, e.g. the past hour. Measurements come in at the end of the hour, but these measurements aren't necessarily representative of conditions at the exact times of the measurements. Instead, they are actually representative of concentrations during the previous clock hour. For example, if an hourly measurement comes in at 12:00, the measurement is actually representative of data during Hour 11, NOT Hour 12. So we call the 12:00 measurement the Hour 11 data point.
This is really just a labeling convention, and it is no different from how we treat other datasets. For example, suppose you wanted to take an average of the low temperatures for every day in July. You would have to wait until August 1st to do so, since the temperature might still be dropping at the end of the day on July 31st. But does this mean you would call this value the average low temperature for August, since it wasn't calculated until August? Of course not: the data belongs to July, even if it couldn't be calculated until August. It is no different with particle pollution: the Hour 11 data belongs to Hour 11, even if it wasn't calculated until 12:00. As a result of this convention, timestamps are usually an entire hour (or more) earlier than the time the measurements were actually taken (exact differences depend on several factors).
Another reason for including the "current" hour in the NowCast "three most recent hours" is for speed of updates. Suppose it is 12:04, and a measurement just came in at 12:00 (the Hour 11 measurement). It would be inappropriate to wait until 13:00 to calculate the updated NowCast value. For this reason, we calculate NowCast values using the monitored data for the "current" hour and the $N-1$ prior hours.
includeShortTerm
argumentFor our final NowCast example we will explore the includeShortTerm
argument,
including a demonstration of why it defaults to FALSE
.
Suppose we wanted to know the NowCast values for Omak for the first half of
8/25/15, since concentrations were extremely high at the time. A normal course
of action might be to create a ws_monitor
object for Omak for 8/25/15:
tlim <- c(2015082500,2015082523) Omak2 <- monitor_subset(Northwest_Megafires, tlim=tlim, monitorIDs = '530470013_01') Omak2_nowcast <- monitor_nowcast(Omak2) Omak2_nowcast$data
Unfortunately, the first 12 hours are invalid. But you read in the documentation
that includeShortTerm=TRUE
will return values for the second valid hour
onwards. So you try again, this time setting includeShortTerm=TRUE
. Let's
check it out.
Omak2_nowcast <- monitor_nowcast(Omak2, includeShortTerm = TRUE) Omak2_nowcast$data
Nice. We were able to get NowCast values for all but the first hour!
But there's a problem lurking. Setting includeShortTerm=TRUE
caused the
monitor_nowcast()
function to treat hours prior to Hour 0 as invalid.
It has no idea that valid data is available for this period.
As an experiment, let's see how the NowCast values would look if we calculated them from a larger dataset that includes the data prior to 8/25/15 Hour 0.
example4_df <- monitor_subset(Omak_nowcast, tlim=tlim)$data example4_df$shortTerm_T <- Omak2_nowcast$data$`530470013_01` colnames(example4_df) <- c("datetime", "fullDataset", "shortTerm") example4_df
While not too extreme, we see differences in the NowCast values calculated by the different approaches, even as late as the 11th hour of the period. As expected, the values match from the 12th hour on.
The discrepancies displayed above are why, by default, includeShortTerm=FALSE
.
It should be used only when necessary, and with an understanding that the first
$N$ hours' values might not necessarily be true NowCast values.
If it exists, we recommend always grabbing an extra day of data at the beginning of the period if you think you might want to calculate NowCast values on a dataset.
Finally, we'll perform a brief demonstration of the monitor_aqi()
function
using data from a prior example.
aqi <- monitor_aqi(Omak) example5_df <- data.frame("datetime"=Omak$data$datetime, "monitored"=Omak$data$`530470013_01`, "aqi"=aqi$data$`530470013_01`) example5_df <- example5_df[500:650,] plot(example5_df$datetime, example5_df$monitored, xlab='Date', ylab='') lines(example5_df$datetime, example5_df$aqi, col="blue")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.