knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(123)
library(deweather)
Meteorology plays a central role in affecting the concentrations of pollutants in the atmosphere. When considering trends in air pollutants it can be very difficult to know whether a change in concentration is due to emissions or meteorology.
The deweather package uses a powerful statistical technique based on boosted regression trees using the {gbm}
package ^[Ridgeway G, Developers G (2024). gbm: Generalized Boosted Regression Models. R package version 2.2.2, https://CRAN.R-project.org/package=gbm]. Statistical models are developed to explain concentrations using meteorological and other variables. These models can be tested on randomly withheld data with the aim of developing the most appropriate model.
The deweather package comes with a comprehensive data set of air quality and meteorological data. The air quality data is from Marylebone Road in central London (obtained from the {openair}
package) and the meteorological data from Heathrow Airport (obtained from the {worldmet}
package).
The road_data
data frame contains various pollutants such a NO~x~, NO~2~, ethane and isoprene as well as meteorological data including wind speed, wind direction, relative humidity, ambient temperature and cloud cover. Code to obtain this data directly can be found here.
#| label: "load-pkg" #| eval: TRUE head(road_data)
To speed up the examples in this article we'll randomly sample some of road_data
.
The testMod()
function is used to build and test various models to help derive the most appropriate.
In this example, we will restrict the data to model to 4 years. Note that variables such as "hour"
and "weekday"
are used as variables that can be used to explain some of the variation. "hour"
for example very usefully acts as a proxy for the diurnal variation in emissions. These temporal emission proxies are also important to include to help the model differentiate between emission versus weather-related changes. For example, emissions tend to change throughout a day and so do variables such as wind speed and ambient temperature.
#| label: "testMod" #| eval: TRUE #| message: TRUE #| fig.height: 4.5 #| fig.cap: "A statistical summary of a `{deweather}` model test." #| fig.alt: "A scatter plot showing predicted nitrogen dioxide on the #| x-axis and measured nitrogen dioxide on the y-axis. Alongside is a table #| of statistical values describing the model performance, including R, #| RMSE, NMGE, NMB, MGE, MB, FAC2 and n." library(openair) # select only part of the data set dat_part <- selectByDate(road_data, year = 2001:2004) # to speed up the example, sample rows in `dat_part` # not needed in reality dat_part <- dplyr::slice_sample(dat_part, prop = 1/10) # test a model with commonly used covariates mod_test <- testMod( dat_part, vars = c("trend", "ws", "wd", "hour", "weekday", "air_temp", "week"), pollutant = "no2" )
The output shows by default the performance of the model when applied to a withheld random 20% (by default) of the data, i.e., the model is evaluated against data not used to build the model. Common model evaluation metrics are also given.
Assuming that a good model can be developed, it can now be explored in more detail using the optimum number of trees from testMod()
.
#| label: "buildMod" #| eval: TRUE mod_no2 <- buildMod( dat_part, vars = c("trend", "ws", "wd", "hour", "weekday", "air_temp", "week"), pollutant = "no2", n.trees = mod_test$optimum_trees, n.core = 6 )
This function returns a deweather
object that can be interrogated as shown below.
One of the benefits of the boosted regression tree approach is that the partial dependencies can be explored. In simple terms, the partial dependencies show the relationship between the pollutant of interest and the covariates used in the model while holding the value of other covariates at their mean level.
#| label: "plotAll" #| eval: TRUE #| fig.width: 7 #| fig.height: 11 #| fig.cap: "The 7 partial dependencies of the deweather model." #| fig.alt: "Seven line charts showing the partial dependencies of the #| deweather model. In order of influence: wind direction, hour of day, #| long-term trend, weekday, wind speed, week of the year, and finally air #| temperature." plotPD(mod_no2, nrow = 4)
It can be very useful to plot important two-way interactions. In this example the interaction between "ws"
and "air_temp"
is considered. The plot shows that NO~2~ tends to be high when the wind speed is low and the temperature is low, i.e., stable atmospheric conditions. Also NO~2~ tends to be high when the temperature is high, which is most likely due to more O~3~ available to convert NO to NO~2~. In fact, background O~3~ would probably be a useful covariate to add to the model.
#| label: "plot2way" #| eval: TRUE #| fig.width: 5 #| fig.height: 4.5 #| out.width: "50%" #| fig.cap: "A two-way interaction plot showing the interaction between #| wind speed and air temperature" #| fig.alt: "A heatmap showing the interaction between air temperature and #| wind speed in the deweather model. Nitrogen dioixde is shown to be high #| when wind speed is low and temperature is either very low or above around #| 25 degrees Celcius." plot2Way(mod_no2, variable = c("ws", "air_temp"))
An indication of the meteorologically-averaged trend is given by the plotPD()
function above, and can even be isolated using the variable
argument.
#| label: "pd-trend" #| fig.width: 7 #| fig.height: 3.5 #| fig.cap: "The partial dependence plot of the 'trend' component." #| fig.alt: "Patial dependence plot of the 'trend' component of the deweather model #| with date on the x-axis and NO2 on the y-axis. The trend is jagged, but shows an increase in 2003." plotPD(mod_no2, variable = "trend")
A much better indication is given by using the model to predict many times with random sampling of meteorological conditions. This sampling is carried out by the metSim()
function. Note that in this case there is no need to supply the "trend"
component because it is calculated using metSim()
.
#| label: "metSim" #| eval: TRUE demet <- metSim(mod_no2, newdata = dat_part, metVars = c("ws", "wd", "air_temp") )
Now it is possible to plot the resulting trend.
#| label: "plotTrend" #| eval: TRUE #| fig.width: 7 #| fig.height: 3.5 #| fig.cap: "A deweathered nitrogen dioxide trend." #| fig.alt: "A line chart with date on the x-axis and deweathered NO2 on #| the y-axis. The trend is very noisy, but shows an increase in #| concentrations in 2003." openair::timePlot(demet, "no2", ylab = "Deweathered no2 (ug/m3)")
The plot shows the trend in NO~2~ controlling for the main weather variables. The plot now reveals the strong diurnal and weekly cycle in NO~2~ that is driven by variations in the sources of NO~2~ (NO~x~) rather than meteorology i.e. road traffic which has strong hourly and daily variations throughout the year. It can be useful to simply average the results to provide a better indication of the overall trend. For example:
#| label: "plotTrendAve" #| eval: TRUE #| fig.width: 7 #| fig.height: 3.5 #| fig.cap: "A time-averaged deweathered nitrogen dioxide trend." #| fig.alt: "A line chart with date on the x-axis and deweathered NO2 on #| the y-axis. The trend has been time averaged to show weekly mean #| concentrations, clearly illustrating a sharp increase in 2003." openair::timePlot(demet, "no2", avg.time = "week", ylab = "Deweathered no2 (ug/m3)")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.