In MazamaScience/PWFSLSmoke: Utilities for Working with Air Quality Monitoring Data

knitr::opts_chunk$set(fig.width=7, fig.height=5)

This vignette explores how the PWFSLSmoke package can be used to retrieve and explore raw, ambient air quality data from multiple types of AIRSIS monitoring stations. Examples demonstrate using real monitoring data to answer a variety of questions.

Although this vignette focuses on monitors whose data is made available through AIRSIS, the following examples would also support exploration of raw data made available through WRCC for which similar functionality exists, e.g. wrcc_createRawDataframe().

Package Setup

\textcolor{red}{NOTE: This is all mentioned in the readme on the github page -- may be able to just point folks there for installation and setup.}

First things first, we need to load the PWFSLSmoke package and set up logging. You can pass arguments to the logger.setup() function to send logging information to log files. In the following example we do not use log files but instead use logger.setLevel() to send any ERROR output to the console.

suppressPackageStartupMessages(library(PWFSLSmoke))
initializeMazamaSpatialUtils()
logger.setLevel(ERROR)

If you are still having trouble setting things up, refer to the full package documentation and README at the following link: https://github.com/MazamaScience/PWFSLSmoke

Data Acquisition

The first step in any data exploration is data acquisition. We'll use the airsis_createRawDataframe() function for this task.

The airsis_createRawDataframe() function creates a dataframe of raw monitoring data from a single monitor over a specific time period. For example, below we retrieve data from September 1 through October 15, 2016 for monitor IDs 1033 and 1049. Both monitors' data is provided by the U.S. Forest Service.

plain_AIRSIS_EBAM <- airsis_createRawDataframe(20160906, 20161006, provider='USFS', unitID=1033)
nile_AIRSIS_ESAM <- airsis_createRawDataframe(20160906, 20161006, provider='USFS', unitID=1049)

These two monitors were part of a prescribed burn pilot program in Washington state which also involved the following monitors:

Monitor ID | Location (Alias) | Type ----------- | --------------- | ---------- 1012 | Kettle Falls, WA (Fish Hatchery) | EBAM 1013 | Manson, WA (Kenned Meadows) | EBAM 1031 | Liberty, WA | EBAM 1033 | Plain, WA | EBAM 1034 | Curlew, WA | EBAM 1049 | Nile/Pinecliff, WA | E-Sampler 1050 | Kettle Falls, WA | E-Sampler

Note that in our example above we have pulled data for two different monitor types: the Plain monitor is an EBAM monitor, whereas the Nile/Pinecliff monitor is an E-Sampler. We'll explore the differences in data structures between the two monitor types a bit later on.

The airsis_createRawDataframe() function is nearly identical to the airsis_createMonitorObject() function, which is explained in depth in the AIRSIS_Data_Handling vignette within this package. The process is briefly summarized below.

Data retrieval: airsis_downloadData() downloads data from the specified provider for the monitor and time period requested. The data is ingested as a single, long text string with comma separators and end-line characters (e.g. \r\n).
Parsing: airsis_parseData() converts the long text string into a dataframe.
Quality Control: airsis_qualityControl() performs some basic data cleanup and integrity checks. A uniform UTC timestamp field, 'datetime', is added with data assigned to unique hours and duplicates removed as needed.
Clustering: addClustering() identifies unique monitor "deployments"" (aka locations) and assigns a unique monitorID, a latitude and a longitude to each deployment. Unique monitor deployments are identified using k-means clustering based on the 'clusterDiameter' argument passed to the addClustering() function via the same argument in the airsis_createRawDataframe() function.

Up to this point this is the exact same process that the airsis_createMonitorObject() function uses to create a ws_monitor object. However, the processes deviate at the end with respect to what data is retained, and how it is stored. Specifically:

the airsis_createRawDataframe() function retains all data fields (including "engineering"" data), whereas the ws_monitor object retains only the meta data and the time-stamped PM2.5 data.
the airsis_createRawDataframe() function stores all meta and monitoring data in a single flat dataframe, whereas the ws_monitor object separates the meta data from the monitoring data to comply with the PWFSLSmoke ws_monitor data model.

The return value from the airsis_createRawDataframe() function is a single dataframe with time-stamped meta-, monitoring-, and engineering-data.

Data Handling

Before we can begin exploring our data we should ensure that this data adheres to a consistent format regardless of the source. Otherwise, we would need complicated data handling procedures in all of our downstream functions

Returning to our example data from above, we created two datasets from different sources, so naturally we expect that there may be some differences between them. Let's take a look at what fields exist in the raw dataframes for each monitor type.

sort(names(plain_AIRSIS_EBAM))
sort(names(nile_AIRSIS_ESAM))

We observe that the fields don't match up one-for-one. Careful inspection, or better, a few simple lines of code, allows for a quick evaluation of which fields are or are not shared between the two. First let's check which fields are shared between the types:

names(plain_AIRSIS_EBAM)[names(plain_AIRSIS_EBAM) %in% names(nile_AIRSIS_ESAM)]

Only about half of the fields in each vector are shared, of which nearly half were created by the Clustering and Quality Control processes (i.e. ,"deploymentID" and "monitorName", "monitorType", "datetime", "mediodLon", "mediodLat").

Now we'll check which fields are in the EBAM data but not the E-Sampler data.

names(plain_AIRSIS_EBAM)[!(names(plain_AIRSIS_EBAM) %in% names(nile_AIRSIS_ESAM))]

And, for completeness, we'll check the other direction as well.

names(nile_AIRSIS_ESAM)[!(names(nile_AIRSIS_ESAM) %in% names(plain_AIRSIS_EBAM))]

From the latter two sets we can see that most of the differences in fields between the two monitor types come down to naming conventions. For example, both monitor types include data on concentration ("COncRT" and "Conc.mg.m3"), flow ("Flow" and "Flow.l.m"), air temperature ("AT" and "AT.C"), as well as several other parameters. However, we need to be careful about units of measure. Even if we had a perfect match of field names it would be unreasonable to simply assume that two different monitor types output data in the same units of measure.

Additionally, we have some fields that appear in one set, but not the other. For example, we see FT (filter temperature) in the EBAM fields, but no such parameter presents itself in the E-Sampler fields. Indeed, no such parameter exists for E-Samplers because, unlike EBAM monitors, E-Samplers do not use filters directly to monitor PM~2.5~ (see brief overview of sampling techniques for the two monitors, and others, here). Regardless, an engineer may be interested in plotting filter temperature for an EBAM monitor, so our raw data structure needs retain all data fields to allow for such analysis.

All this inconsistency calls for some form of harmonization to allow for simple and elegant use and exploration of the data, independent of its source. For this, we turn to the raw_enhance() function.

Raw Enhance

The raw_enhance() function accepts a raw dataframe from any of a variety of types/sources and standardizes the data for downstream data exploration. Specifically, the function ensures that the enhanced raw dataframe includes at least the following fields:

datetime: unique 1-hour time stamps (UTC)
temperature: air temperature (C)
humidity: relative humidity (%)
windSpeed: wind speed (m/s)
windDir: wind direction (degrees)
pm25: PM~2.5~ concentration (ug/m3)
longitude: longitude (degrees East)
latitude: latitude (degrees North)
pressure: atmospheric pressure (hPa)
dataSource: data source (e.g. 'AIRSIS')
timezone: time zone (e.g. 'America/Los_Angeles')

You might notice that some of the fields above were already present in our raw dataframes before we ran them through the raw_enhance() function, albeit with perhaps slightly different names (e.g. Latitude vs. latitude). You might also notice that some fields (e.g. filter temperature) are not included in the list above. Both of these observations are related and highlight two critical aspects of how the raw_enhance() function works to ensure maximum utility of our enhanced raw dataframes. Specifically, the function:

Guarantees that the fields listed above will be present in all enhanced raw dataframes, with the exact names as shown above. The benefit of this consistency is two-fold:
- It ensures the functionality of all downstream rawPlot_*() functions that depend on an enhanced raw dataframe with exact names as listed above.
- It frees the analyst from the burden of having to remember the exact name of each parameter for each specific monitor/source type, in which there are often subtle differences. For example, wind direction is "W.D" for EBAMs and "WD.Deg." for E-Samplers; with the raw_enhance() output the analyst only needs to remember (or refer to this vignette) the small set of guaranteed names listed above -- in this case 'windDir' would be the parameter of interest.
Maintains all raw data. Unlike the ws_monitor object, which reduces the dataset to meta data and time-stamped PM2.5 data, the raw_enhance() function returns a dataframe that retains all of the original raw data, plus additional data that has been set to the guaranteed names and units of measure as specified above. This allows the analyst to refer to the enhanced raw dataframe not only for the guaranteed names listed above, but for original and/or additional parameters as he or she so pleases. For example, filter temperature is only available for EBAM monitors so we have not included it as one of the guaranteed parameters; however, should an analyst be working with EBAM data, the filter temperature data would be available for reference if he or she so pleased.

A few caveats:

The parameters listed above are simply guaranteed to exist; there is no guarantee that they will necessarily contain data. For example, atmospheric pressure is included as a guaranteed parameter since it is a key indicator of ambient conditions; however, EBAM monitors do not report pressure. Users will find pressure field populated with NAs for EBAM monitors. Missing data for any monitor type (e.g. due to sensor failure) would result in a similar situation.
The framework above does result in quite a bit of redundancy; however, we have not found raw dataframe size to be large enough for this to matter. Thus, any grief with such redundancy is a matter of principle only, and should be far outweighed by the flexibility the framework allows for end users to quickly and easily access both standard and engineering parameters for on-the-fly analysis.

The end result of the raw_enhance() function is a dataframe with a subset of consistent data columns which enables rapid exploration of data regardless of its original source.

To continue our prior example, let's create enhanced raw dataframes for the same two monitors as before:

plain_enhanced <- raw_enhance(plain_AIRSIS_EBAM)
nile_enhanced <- raw_enhance(nile_AIRSIS_ESAM)

For demonstration purposes, we'll show the common parameters between the two dataframes now that they have been enhanced.

names(plain_enhanced)[names(plain_enhanced) %in% names(nile_enhanced)]

Now that we have enhanced raw data dataframes we are ready to begin the data exploration process!

Data Exploration

In our example above we created enhanced raw dataframes for two monitors in Washington State during late summer/early fall 2016. We'll use the PWFSLSmoke package rawPlot_*() functions to explore data from Plain, as this was the more interesting location of the two during the period.

Plain is a small town in the Cascade Mountains of Washington State, about 80 miles east northeast of Seattle as the crow flies. The city is nestled in a quiet mountain valley; its unique geography often subjecting the location to poor air quality, often the result of woodstove combustion and/or temperature inversions in fall and winter and wildfires and/or prescribed burns in spring, summer and fall.

knitr::opts_chunk$set(fig.width=5, fig.height=5)

plain <- airsis_createMonitorObject(20160914, 20160914, provider='USFS', unitID=1033)
monitor_esriMap(plain, zoom=8)
monitor_esriMap(plain, zoom=12, maptype='worldTopoMap')

knitr::opts_chunk$set(fig.width=7, fig.height=5)

Late summer of 2016 saw frequent air quality episodes in Plain. We'll explore this period below.

rawPlot_timeseries

First, we'll use the rawPlot_timeseries() function to get a high-level view of the PM2.5 data during the period.

rawPlot_timeseries(plain_enhanced)

Here we see a time-series plot of hourly PM2.5 concentrations. The alternating pattern of light and dark in the background represents periods of day and night, respectively (this feature can be disabled by setting shadedNight=FALSE). We observe multiple days with elevated PM2.5 concentrations; the days with elevated concentrations look to be grouped into 4 or 5 separate episodes with an apparent diurnality to the concentrations (many of the highest peaks are observed near dawn).

Note that the only argument passed to the function above is the dataframe itself. This is a good example of how our rawPlot*() functions depend on the existence of the guaranteed parameters as listed above; in this case 'pm25' is hard coded into the function as the default parameter to plot. A different parameter (either from the list above, or otherwise) can be plotted instead by passing the field's name into the "parameter" argument.

We can also utilize the tlim argument to subset the plot by time. Let's 'zoom in' on the multi-day episode in mid-September.

rawPlot_timeseries(plain_enhanced, tlim=c(20160912,20160916), dayLwd=1)

Here we see a clear diurnality to the observations -- growing concentrations during nighttime hours, and lower concentrations during daylight hours. What could explain such a pattern?

Our very same rawPlot_timeseries() function can help us answer this question. We'll visualize PM2.5 during the same period as before, but we'll add a line for temperature, and we'll also replace the nighttime shading with a background shading that is based on another parameter -- let's try wind speed.

rawPlot_timeseries(plain_enhanced, tlim=c(20160912,20160916),
                   shadedNight=FALSE, shadedBackground="windSpeed",
                   sbLwd=4, dayLwd=1,
                   ylab="PM2.5 (ug/m3) and Temperature (deg C)",
                   main="PM2.5, Temperature, and Wind Speed")
rawPlot_timeseries(plain_enhanced, parameter='temperature',
                   add=TRUE, lty='dashed', col='brown')

From this plot it is very clear what is going on -- smoke from a persistent source (or sources) is slowly building up during the calm, quiet nighttime hours before being washed out by the wind which picks up with the advent of convection due to surface heating every morning. So, one can see how even a single plotting function, paired with a guaranteed data structure, can prove to be very powerful.

A few other arguments not demonstrated here are:

localTime: logical specifying whether to plot tlim in local time or UTC; applies to tlim argument as well. Defaults to plot local time (i.e. TRUE).
gridPos: position of optional grid lines; 'over' = on top of all other plot shapes, 'under' = beneath all other plot shapes. Defaults to '' (i.e. no grid lines).
gridCol: color for optional grid lines. Defaults to 'black'.
gridLwd: width for optional grid lines. Defaults to 1.
gridLty: type for optional grid lines. Defaults to 'solid'.
hourInterval: interval for optional hour interval lines. Defaults to 6; max is 12. Line positions are locked to start of day.
hourLwd: width for optional hour interval lines. Defaults to 0 (i.e. no lines).
...: additional arguments to change parameters associated with plotted parameter. For example, setting lty=2 will change main line from 'solid' to 'dashed'. Additional arguments are passed to the lines() function.

rawPlot_timeOfDaySpaghetti

We saw in our previous example that there was a strong diurnality to the PM2.5 data in Plain from September 12-16, 2016. We can demonstrate this temporal relationship in a more compact manner using the rawPlot_timeOfDaySpaghetti() function. Below we plot the hourly PM2.5 data for the same period -- September 12-16, 2016 -- but with each day plotted as a separate line on a time axis that represents the hours in each day.

rawPlot_timeOfDaySpaghetti(plain_enhanced, tlim=c(20160912,20160916),
                           main="PM2.5 9/12/16-9/16/16", highlightDates = 20160913)

Note that we used the highlightDates argument to color in blue the line representing the day with the highest PM2.5 concentrations, i.e. 9/13/2016. This will come in handy momentarily. The thick black line represents the average for each hour of the day over the period plotted.

We leverage the rawPlot_timeOfDaySpaghetti() function again to similarly visualize the daily pattern for temperature and wind speed -- see below. Again, we use the highlightDates argument to color in blue the line representing the day with the highest PM2.5 concentrations.

oldpar <- par(mfrow=c(1,2),mar=c(2.5,1.5,1.5,1))
rawPlot_timeOfDaySpaghetti(plain_enhanced,parameter = 'temperature',
                           tlim = c(20160912,20160916),main="Temperature",
                           highlightDates = 20160913,xlab='')
rawPlot_timeOfDaySpaghetti(plain_enhanced,parameter = 'windSpeed',
                           tlim = c(20160912,20160916),main="Wind Speed",
                           highlightDates = 20160913,xlab='')
par(oldpar)

The patterns above are not terribly useful by themselves -- similar patterns are observed at almost all weather stations on a near-daily basis. However, the higlightDates feature allows us to look at the patterns in context to gain an understanding of what may have contributed to the elevated PM2.5 values on 9/13/2016.

Specifically, we notice that the blue line tends to be on the cold and calm end of the spectrum within this period. A peek back at the plot above reorients us to focus on the morning hours, as this was the time with the highest PM2.5 concentrations on all days in the period. And indeed, we see that that the morning of 9/13/16 -- which had by far the highest PM2.5 concentrations -- was also the coldest by quite a bit! Thus, we may conclude that the surface-level PM2.5 concentration may have been inflated on the morning of 9/13/2016 by stronger subsidence, and perhaps more residential wood smoke as well associated with the colder temperatures.

Of course, 9/13/2016 was the first day in the period, and we might expect that the first day in any episode of elevated PM2.5 -- especially if the result of a nearby fire -- would be the highest. And indeed, such was the case here: smoke from a nearby prescribed burn which commenced on 9/12/2016 affected the area for a number of days. Regardless, this exercise has demonstrated the functionality of the rawPlot_timeseries() and rawPlot_timeOfDaySpaghetti() functions and how they can be used for raw monitor data exploration.

We now turn our attention from time-series-based plots to two other handy raw data-oriented plots: rawPlot_windRose() and rawPlot_pollutionRose().

rawPlot_windRose

While the time series plots demonstrated above are useful for evaluating temporal and day-to-day variations in PM2.5 concentrations or other parameters, they aren't able to tell us much about the source of air pollution. We experienced this first-hand in the last example above in which we concluded that the high PM2.5 values on 9/13/16 were due to a cold night, only to be told after the fact that a large fire burned nearby the day before.

One key element that can be used to identify a source of air pollution is wind direction. Of course, air pollution must originate somewhere, and in most cases it is bound to drift wherever the wind wishes to take it. So, naturally, for a rough idea of where air pollution originates, we must simply look upwind! To this end, we have leveraged open source code from the openair package to create a wind rose plot (based on the openair::windRose() function) that is compatible with our enhanced raw dataframe structure. We demonstrate the plot below for Plain, WA over the same period as above -- September 12-16, 2016.

rawPlot_windRose(plain_enhanced,tlim = c(20160912,20160916))

This is effectively a polar stacked histogram -- the height of each color of each wedge is based on the fraction of hours in the period that were observed at each direction and speed. So, the longer the wedge, the more the wind came from that direction.

Here we see that most of the wind was from the North and Northwest, and most of the wind was below 2 m/s in speed. This would seem to indicate that the source of the air pollution was from the North and Northwest; however, we must be careful about making assumptions such as this. As we saw above the highest concentrations were observed during the early morning hours when the wind speeds were the lowest, so it is certainly possible that the worst air quality could be associated with a wind direction that is completely different from the prevailing wind direction during the day. Thus, we should take care when making statements about the source of air pollution based on wind direction frequencies alone.

What we really need is an approach that also takes into account the concentrations observed by wind direction. For this task, we again turn to the openair package, from which we have modified the openair::pollutionRose() function to work with our enhanced raw dataframe structure. It is demonstrated below, again for September 12-16, 2016.

rawPlot_pollutionRose

rawPlot_pollutionRose(plain_enhanced,tlim = c(20160912,20160916))

Again, this plot represents a polar stacked histogram. The length of the wedges still represents the overall frequency of wind by direction. However, in this case the colors represent the concentrations observed from each wind direction, rather than the wind speeds by direction.

So, in the plot above we again see that most of the wind came from the North and the Northwest, and we observe that a large number of hours with high PM2.5 concentrations also had wind from these same directions. This is beginning to paint a picture -- some major source of air pollution to the North and/or Northwest of the monitoring station.

While the plot above is helpful, the varying lengths of the wedges can make it difficult to compare the overall frequencies of concentrations across different directions. To this end we can add the normalize=TRUE argument to the function above.

rawPlot_pollutionRose(plain_enhanced,tlim = c(20160912,20160916),normalize=TRUE)

As shown, this argument normalizes the lengths of the wedges in the pollutionRose plot, and a black wedge outline is added to represent the overall frequencies of wind direction. Thus, setting normalize=TRUE effectively combines the windRose and pollutionRose plots into one, and it allows for a very thorough analysis of the frequencies and magnitudes of air pollution concentrations by direction. For example, in the plot above we can now more easily see that a large proportion of hours with wind from the North had high PM2.5 concentrations. Similarly, we see relatively high proportions of high PM2.5 concentrations in winds from the east; however, we temper this observation by also noting that the overall frequency of wind from the east was low. In other words, we are more easily able to compare the proportions of different concentrations by direction, while the black wedge outlines help us keep our bearings on the overall frequencies of wind directions.

As a final experiment, let's create the same plot as above, but only for the single day that had the highest PM2.5 concentrations (i.e. 9/16/2016).

rawPlot_pollutionRose(plain_enhanced,tlim = c(20160913,20160913),normalize = TRUE)

This plot produces an interesting result -- we see that the wind on this day was primarily from the North, South, and Northwest. We see the highest concentrations were associated with wind from the North and the East. However, only a small fraction of hours had wind from the East, whereas a large number of hours had wind from the North. The air from the South was mostly clean. Thus, we are able to confidently assert that the source of air pollution -- at least on 9/13/2016 -- was to the North of the monitoring station.

As an aside, note that we reach the same conclusion -- albeit easier -- for the day with the highest PM2.5 spike, observed on 9/29/2016.

rawPlot_pollutionRose(plain_enhanced,tlim = c(20160929,20160929),normalize = TRUE)

In both cases the city of Plain was impacted by fires North of town.

Conclusion

We have outlined the PWFSLSmoke R package and many of the ways it can be used to retrieve and explore raw air quality monitoring data. Specifically, we covered the retrieval of raw data, the conversion to our enhanced raw dataframe, time series plots using the rawPlot_timeseries() and rawPlot_timeOfDaySpaghetti() functions, and polar plots using the rawPlot_windRose() and rawPlot_pollutionRose() functions. We demonstrated the use of each of these functions by exploring periods with elevated PM2.5 concentrations in Plain, Washington during the late Summer/early Fall of 2016.

We hope this vignette and the tools presented herein are helpful in your work. Feel free to reach out to us if you have any suggestions or improvements.

Good luck in your data exploration!

MazamaScience/PWFSLSmoke documentation built on July 3, 2023, 11:03 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com