In bcgov/rcaaqs: Tools to implement CAAQS analysis for air quality

knitr::opts_chunk$set(
  dpi = 150,
  collapse = TRUE,
  comment = "#>",
  out.width = "100%",
  fig.width = 8,
  warning = FALSE,
  message = FALSE
)

The rcaaqs package strives to automate the calculation of CAAQS metrics outlined by the Canadian Council of Ministers of the Environment. All calculations are derived from the Guidance Document on Achievement Determination Canadian Ambient Air Quality Standards for Fine Particulate Matter and Ozone and Methods for SO~2~ and NO~2~ outlined by Environmental Reporting BC

See the section "Under the hood" for a better idea of how rcaaqs performs the underlying calculations.

CAAQS Metrics

There are 7 CAAQS metrics for 4 different pollutants, in the rcaaqs package each has a corresponding function:

pm_24h_caaqs() - Fine Particulate Matter (PM2.5) over 24 hours (3-yr rolling average)
pm_annual_caaqs() - Fine Particulate Matter (PM2.5) over a year (3-yr rolling average)
o3_caaqs() - Ozone over 8 hours (3-yr rolling average)
so2_3yr_caaqs() - SO~2~ over 1 hour (3-yr rolling average)
so2_1yr_caaqs() - SO~2~ over a year
no2_3yr_caaqs() - NO~2~ over 1 hour (3-yr rolling average)
no2_1yr_caaqs() - NO~2~ over a year

While the underlying calculations are different, in general the procedure to calculate any given metric is similar. Therefore we will focus on Annual PM2.5.

Let's start by loading some helpful packages

# For plotting
library(ggplot2)  

# For data manipulation
library(dplyr)    
library(tidyr)

# For calculations
library(rcaaqs)

The rcaaqs package contains several sample data sets, let's take a look at the Ozone (O~3~) sample data:

#head(o3_sample_data)
head(pm25_sample_data)

Data Preparation

The clean_neg() and date_fill() are useful helper functions for ensuring all the pollutant values are positive or zero, and that you are starting with a complete dataframe (i.e. no missing hourly readings):

pm25_sample_data_clean <- pm25_sample_data %>% 
  mutate(value = clean_neg(value, type = "pm25"))

pm25_sample_data_clean <- pm25_sample_data_clean %>% 
  group_by(ems_id, site) %>% 
  do(., date_fill(., date_col = "date_time",
                  fill_cols = c("ems_id", "site"),
                  interval = "1 hour")) %>% 
  ungroup()

head(pm25_sample_data_clean)

Calculate Ambient CAAQS Metric & CAAQS Achievement

Calculating a CAAQS metric is done using one of the 7 the caaqs_metrics functions. Note you must include grouping variables (i.e. station information in our case here):

pm <- pm_annual_caaqs(pm25_sample_data_clean, 
                     by = c("ems_id", "site"))
pm

The resulting object is of class caaqs plus an additional class denoting the parameter analyzed:

class(pm)

The object contains all of the intermediate analysis results as well as the final caaqs results. There are accessor/extractor functions to get the different components. See ?get_caaqs for the different functions. We can get the caaqs results using the get_caaqs() function:

get_caaqs(pm)

Besides the actual CAAQS metric values (metric_value), there are several important columns in the caaqs results dataframe.

CAAQS details

caaqs_year - The year for which the CAAQS metric is compiled
min_year & max_year - The first year and last years included in the rolling average (in this example)
n_years - The total number of years included in a metric. Note that metric_values are NA for any year that has < 2 years included

CAAQS values

metric - The name of the metric calculated
metric_value - The actual CAAQS metric, calculated and rounded according to the documentation
caaqs - The 'achievement' level of the CAAQS metric

Data completeness

flag_daily_incomplete - Whether any of the daily summaries were kept, but flagged as incomplete
flag_yearly_incomplete - Whether any of the yearly summaries were kept, but flagged as incomplete
flag_two_of_three_years - Whether the 3-yr rolling average is based on only 2 years

Note that in all cases values of NA indicate that flagging didn't apply to this metric (see CAAQS Documentation for more details)

Calculate Management CAAQS Metric Values and Management Levels

Prior to determining management levels, jurisdictions have the option of adjusting their air zone metric values to remove external influences over which jurisdictions have little or no control, such as those related to transboundary flows and exceptional events. These arrangements aim to ensure that jurisdictions are responsible for managing only the emissions sources they can control.

Excluding Dates based on Exceptional Events or Transboundary Flows

First let's take a look at the daily averages. To do this, we need to get the intermediate daily analysis from the CAAQS results useing the extractor functions:

pm_daily_avg <- get_daily(pm)
pm_daily_avg

ggplot(data = pm_daily_avg, aes(x = date, y = avg_24h)) +
  geom_line() +
  facet_wrap(~ site)

We can also plot the time series with CAAQS results using the built-in plot_ts function, providing an id (and the column in which to look) of the station we wish to plot:

plot_ts(pm, id = "0310172", id_col = "ems_id", rep_yr = 2013)

Let's say, for illustration purposes, that all high values in the summer of 2012 were due to forest fires and we want to exclude them.

We'll filter() the daily data to just these dates and select() only the columns that reflect the grouping of our data and the dates.

high_dates <- pm_daily_avg %>%
  filter(date >= as.Date("2012-05-01"),
         date < as.Date("2012-10-01"),
         avg_24h > 10) %>%
  select(ems_id, site, date)
high_dates

Determine Air Quality Management System Management Levels Now we can pass this excluded data on to the caaqs_management()function via the exclude_df argument, which specifies the data frame, and the exclude_df_dt which specifies the name of the column holding the dates.

pm_excl <- caaqs_management(pm, exclude_df = high_dates, exclude_df_dt = "date")
pm_excl
pm_excl_caaqs <- get_caaqs(pm_excl)
pm_excl_caaqs

Note that there are some new columns in the caaqs plus managament results data frame:

excluded - Whether any days were excluded due to Exceptional Events or Transboundary Flows (see below). The TRUE values reflect the fact that some dates in that metric have been excluded. (Note that excluding data in this manner does not influence the evaluation of data completeness).
mgmt_metric_value - The recalculated "management" CAAQS metric, calculated and rounded with excluded days according to the documentation
mgmt_level - The 'management' level of the management CAAQS metric

We can also define excluded dates with date ranges, but here we must specify both start and end column names:

high_dates <- data.frame(ems_id = "0310162",
                         site = "Port Moody Rocky Point Park",
                         start = as.Date(c("2012-06-11", "2012-07-01")),
                         end = as.Date(c("2012-06-25", "2012-08-14")))
high_dates

pm_excl <- caaqs_management(pm, exclude_df = high_dates,
                            exclude_df_dt = c("start", "end"))

pm_excl_caaqs <- get_caaqs(pm_excl)
pm_excl_caaqs

Under the hood

To understand what's happening within the CAAQS functions we can take a look at this function flow chart:

knitr::include_graphics(system.file('function_overview.png', package = 'rcaaqs'))

Each step from the documentation has a corresponding function (or functions) in the rcaaqs package, but most of these are hidden. The green boxes reflect the wrapper CAAQS functions that perform all the underlying steps. For example, the o3_caaqs() function performs all the steps outlined in the second green column.

Airzones

We can match our data to an airzone by using the assign_airzones() function but you must have the following:

A map file reflecting airzones, specifically an sf POLYGON (or MULTIPOLYGON) object
CAAQS data with latitude and longitude for each station/site

For the following examples, we'll use our pm CAAQS data, but only for 2013. To get the management levels, we must run caaqs_management() on the caaqs object, even if we are not adding a data frame of dates to exclude.

pm <- caaqs_management(pm)
pm

pm_caaqs <- get_caaqs(pm) %>%
  filter(caaqs_year == 2013)
pm_caaqs

We will get our map file from the bcmaps package (note that you must also install the bcmaps.rdata package).

library(bcmaps)
az <- bcmaps::airzones()
class(az)
az

For the purposes of this vignette, we'll get our stations coordinates from a file included in the package. You can obtain the most up to date stations data from the BC Data Catalogue.

stations_file <- system.file("air_stations_2018-11-07.csv", package = "rcaaqs")

stations <- readr::read_csv(stations_file, na = "N/A") %>%  # Read csv file
  rename_all(tolower) %>%                                   # Rename columns to lower case
  filter(ems_id %in% unique(pm_caaqs$ems_id)) %>%           # Filter to ids in our pm data
  select(ems_id, latitude, longitude) %>%                   # Select only the important columns
  mutate(latitude = as.numeric(latitude),                   # Transform lat and lon to numeric
         longitude = as.numeric(longitude)) %>%
  distinct()                                                # Remove any duplicates

stations

Now we'll merge them into our CAAQS data

pm_caaqs <- left_join(pm_caaqs, stations, by = "ems_id")
pm_caaqs

Now we have our airzones map and our station coordinates, we can match stations to airzones:

pm_az <- assign_airzone(pm_caaqs, airzones = az, coords = c("longitude", "latitude"))
pm_az

pm_az2 <- airzone_metric(pm_az)
pm_az2

Mapping airzones

library(sf)
library(gridExtra)

az_plot <- st_transform(az, crs = "+proj=longlat") %>%
  left_join(pm_az2, by = c("Airzone" = "airzone"))

g1 <- ggplot() +
  geom_sf(data = az_plot, aes(fill = caaqs_ambient)) +
  geom_sf(data = st_as_sf(pm_az, coords = c("lon", "lat"), crs = "+proj=longlat"),
          shape = 21, colour = "black", aes(fill = caaqs_ambient)) +
  scale_fill_manual(values = get_colours(type = "achievement"), drop = FALSE) +
  labs(title = "Achievement Status: PM2.5 Annual CAAQS") +
  theme(legend.position = c(0.85, 0.85))

g2 <- ggplot() +
  geom_sf(data = az_plot, aes(fill = mgmt_level)) +
  geom_sf(data = st_as_sf(pm_az, coords = c("lon", "lat"), crs = "+proj=longlat"),
          shape = 21, colour = "black", aes(fill = mgmt_level)) +
  scale_fill_manual(values = get_colours(type = "management"), drop = FALSE) +
  labs(title = "Management Status: PM2.5 Annual CAAQS") + 
  theme(legend.position = c(0.85, 0.85))

grid.arrange(g1, g2, ncol = 1)