```r
knitr::opts_chunk$set( echo = TRUE, results = 'hide', collapse = TRUE, comment = "#>")
```
Here we provide an example workflow for running the HyADS model using hyspdisp. This vignette provides instructions on how to run the model and verify that the model runs completed. Subsequent vignettes will describe combining, manipulating, and interpreting results.
Begin by installing the package from GitHub. It is recommended to update all packages when prompted (especially SplitR).
```r
devtools::install_github("lhenneman/hyspdisp@dev2")
library( hyspdisp)
```
Several files are required to run the functions in hyspdisp. This section presents procedures to download them.
The ZIP code linkage procedure requires a ZCTA-to-ZIP code Crosswalk file. ZCTAs are not exact geographic matches to ZIP codes, and multiple groups compile and maintain Crosswalk files. One example is the Crosswalk maintained by UDS Mapper. It can be retrieved and its names changed for consistency with hyspdisp functions with the following commands:
```r
library( openxlsx)
library( data.table)

crosswalkin <- data.table( read.xlsx( xlsxFile = "https://www.udsmapper.org/docs/zip_to_zcta_2017.xlsx"))
setnames( crosswalkin, old = "ZIP_CODE", new = "ZIP")
```
While not necessary for the HYSPLIT model or the processing of its outputs, population-weighted exposure metrics allow for direct comparisons between power plants. In anticipation of this, we load census data packaged by American FactFinder and merge it with the crosswalk data. The variable ZCTA must first be added from the included Id2 variable before merging.
```r
data( ZCTApop2017)
ZCTApop2017[, ZCTA := formatC( Id2, format = 'd', flag = '0', width = 5)]
crosswalkin <- merge( ZCTApop2017, crosswalkin, by = 'ZCTA')
```
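As a quick sanity check after the merge, we can count the matched ZIP codes and summarize the ACS population column ('Estimate; Total' is the column name used for population weighting later in this vignette):

```r
# Count of matched ZIP codes and a summary of the ACS population column
nrow( crosswalkin)
summary( crosswalkin$`Estimate; Total`)
```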
Equally important is the ZCTA shapefile. These are available from multiple locations, including the US Census website. The code below downloads the file from the Census FTP site, which is more stable but slower. A second alternative is to use the tigris R package (a hedged sketch follows the download code below). Download the file, unzip it, and read in the shapefile (here I've unzipped it to a directory on my Desktop). Loading the file may take a few moments.
```r
# Define the storage directory, create it if it does not exist
zcta_dir <- file.path('~', 'Desktop', 'run_hyspdisp', 'zcta_500k')
zcta_file <- file.path( zcta_dir, 'cb_2017_us_zcta510_500k.zip')
dir.create( zcta_dir, recursive = T, showWarnings = F)

# Download the file if it is not present
zcta_file_url <- 'ftp://ftp2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip'
if( !file.exists( zcta_file)){
  download.file( url = zcta_file_url, destfile = zcta_file)
  unzip( zcta_file, exdir = zcta_dir)
}

# Load the ZCTA shapefile using the shapefile command from the raster package
library( raster)
zcta_shapefile <- file.path( zcta_dir, 'cb_2017_us_zcta510_500k.shp')
zcta <- shapefile( x = zcta_shapefile)
```
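For the tigris alternative mentioned above, here is a minimal sketch. It assumes a recent tigris version, which returns an sf object; sf::as_Spatial() converts it to the sp class used in the rest of this workflow.

```r
# Alternative: fetch the 2017 cartographic-boundary ZCTA file with tigris.
# Assumes a recent tigris version that returns an sf object.
library( tigris)
library( sf)
options( tigris_use_cache = TRUE)  # cache the download locally
zcta_tigris <- zctas( cb = TRUE, year = 2017)
zcta <- as_Spatial( zcta_tigris)   # convert to the sp class used below
```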
It is recommended to transform the ZCTA shapefile to a known projection to maintain consistency throughout the allocation process. Equal-area projections are preferred, such as the North American Albers Equal Area Conic:
```r
p4s <- "+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m"
zcta.trans <- spTransform( x = zcta, CRSobj = p4s)
```
The final input file contains the monthly mean boundary layer heights. For years up to and including 2012, these are available in a single file from NOAA's Earth System Research Laboratory. For 2013 and later (and, in fact, for 2008 onward), you may use the NCAR/UCAR research data archive; its annual files must be downloaded one by one. Below are examples of how to download the two types of files if they are not already in the file path.
```r
# Define a directory, create it if it does not exist
hpbl_dir <- file.path('~', 'Desktop', 'run_hyspdisp', 'hpbl')
dir.create( hpbl_dir, recursive = T, showWarnings = F)

# Download the PBL file from NOAA for years up to and including 2012
library( ncdf4)
hpbl_file_NOAA <- file.path( hpbl_dir, 'hpbl.mon.mean.nc')
url_NOAA <- "ftp://ftp.cdc.noaa.gov/Datasets/20thC_ReanV2/Monthlies/gaussian/monolevel/hpbl.mon.mean.nc"
if( !file.exists( hpbl_file_NOAA))
  download.file( url = url_NOAA, destfile = hpbl_file_NOAA)

# Download the PBL file from NCAR for 2013
hpbl_file_NCAR2013 <- file.path( hpbl_dir, 'hpbl.mon.mean.nc_2013.grb')
url_NCAR <- "https://rda.ucar.edu/thredds/catalog/files/e/ds630.1/e5.mnth.mean.an.sfc/2013/catalog.html?dataset=files/e/ds630.1/e5.mnth.mean.an.sfc/2013/e5.mnth.mean.an.sfc.128_159_blh.regn320sc.2013010100_2013120100.grb"
if( !file.exists( hpbl_file_NCAR2013))
  download.file( url = url_NCAR, destfile = hpbl_file_NCAR2013)
```
Both files are read in using the brick command, and rotated to match the lat-lon projection applied above. Before reading them in, it is necessary to set the system time zone to UTC so that the dates in the raster files are formatted correctly. For the NCAR boundary layer files, it is not necessary to set the varname argument in the brick function.
```r
Sys.setenv(TZ = 'UTC')
hpbl_rasterin <- brick( x = hpbl_file_NOAA, varname = 'hpbl' )
```
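For the NCAR GRIB file, here is a hedged sketch of the corresponding read; per the note above, no varname argument is needed, and raster::rotate() shifts a 0-360 degree longitude grid to the -180-180 convention:

```r
# Read the NCAR GRIB file (no varname needed) and rotate its longitudes
# from 0-360 degrees to the -180-180 convention used elsewhere here.
# Assumes the file downloaded above reads cleanly with brick().
hpbl_rasterin_NCAR <- brick( x = hpbl_file_NCAR2013)
hpbl_rasterin_NCAR <- rotate( hpbl_rasterin_NCAR)
```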
Functions in hyspdisp require specific inputs. Their setups are detailed in this section.
Select power plants to run. In this case, we'll use the two units with the greatest SOx emissions in 2005. This package contains annual emissions and stack height data from EPA's Air Markets Program Data and the [Energy Information Administration](https://www.eia.gov/electricity/data/eia860/) for 2005, 2006, and 2012.
Many of the units in the provided datasets do not have stack height data. In these cases, Henneman et al. (2019) suggests filling with the average stack height of all units; a hedged sketch of this fill follows the code below.
```r
data( units2005)
units2005 <- data.table( units2005)[, V1 := NULL]
units.run <- units2005[ order( -SOx)][1:2]
units.run
```
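A minimal sketch of the stack height fill, assuming the stack height column in units2005 is named Height (the column name is an assumption; check names( units2005) first):

```r
# Fill missing stack heights with the average across all units,
# per the suggestion in Henneman et al. (2019).
# NOTE: 'Height' is an assumed column name.
units2005[ is.na( Height), Height := mean( units2005$Height, na.rm = TRUE)]
```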
```r
knitr::kable( units.run)
```
hyspdisp creates multiple file types that store information important for interpreting the results. It is recommended to define their locations beforehand. In general, hyspdisp functions will create the directories if they do not already exist; if they are not defined, the directories will be created in the working directory. The following directories, as referenced in the functions below in this document, will contain:
- proc_dir: the overarching directory containing each of the subdirectories, and the working directory for HYSPLIT
- hysraw_dir: raw hyspdisp output (one file for each emissions event)
- ziplink_dir: files containing ZIP code linkages
- meteo_dir: (reanalysis) meteorology files
- rdata_dir: RData files containing HyADS source-receptor matrices
- exposure_dir: emissions-weighted exposure files created later by calc_zip_exposure

```r
## proc_dir defines where runs happen and output is saved
## up_dir is where input data is saved (upper file structure)
proc_dir     <- file.path( '~', 'Desktop', 'run_hyspdisp')
hysraw_dir   <- file.path( proc_dir, 'output_hysplit')
ziplink_dir  <- file.path( proc_dir, 'output_ziplinks')
meteo_dir    <- file.path( proc_dir, 'meteo')
rdata_dir    <- file.path( proc_dir, 'output_rdata')
exposure_dir <- file.path( proc_dir, 'exposure_data')
```
While SplitR (the R package that calls HYSPLIT) includes code that checks whether the appropriate meteorological files have been downloaded, it is recommended to download the needed files explicitly before the first run of hyspdisp_fac_model_parallel, because the parallel code does not handle downloads well when split over multiple cores. The code below tests for the meteorology files needed for the present run and downloads them if they are not already in the meteo_dir. The reanalysis met files are about 120 MB each.
```r
metfiles.vignette <- c( 'RP200412.gbl', 'RP200501.gbl', 'RP200502.gbl', 'RP200503.gbl',
                        'RP200504.gbl', 'RP200505.gbl', 'RP200506.gbl', 'RP200507.gbl')
dir.create( meteo_dir, recursive = T, showWarnings = F)

metfiles.vignette <- metfiles.vignette[!(metfiles.vignette %in% list.files( meteo_dir))]
if( length( metfiles.vignette) > 0)
  get_met_reanalysis( files = metfiles.vignette, path_met_files = meteo_dir)
```
HYSPLIT, as applied here, tracks air parcels emitted at certain times and locations; hyspdisp refers to these as emissions events. Once HYSPLIT has been run for each emissions event, the simulated parcel locations are aggregated by source, time, and location. The functions below are written to enable runs of many emissions events.
To define an object that includes all emissions events in a given time period, we can use the helper function define_inputs. This takes as inputs a starting and ending day and outputs a table of values whose rows will later correspond to inputs into the main hyspdisp worker functions. The following command combines the units defined above with four times a day for January-June 2005 (we show an example for half the year here to keep the computation manageable on a desktop computer). Four daily emissions events are defined by the start_hours variable, and duration = 240 denotes that the emitted air parcels are tracked for 240 hours (10 days). The resulting data.table's column names are printed in the table below.
```r
## combine dates and hours
input_refs <- define_inputs( units = units.run,
                             startday = '2005-01-01',
                             endday = '2005-06-30',
                             start_hours = c( 0, 6, 12, 18),
                             duration = 240)
names( input_refs)
```
```r
knitr::kable( names( input_refs))
```
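As a quick sanity check on the size of the run, assuming define_inputs creates one row per unit, day, and start hour:

```r
# 2 units x 181 days (2005-01-01 through 2005-06-30) x 4 daily start hours
# should yield 1448 rows if there is one row per unit-day-hour combination.
dim( input_refs)
```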
The following examples show how to run a small subset of emissions events (chosen to provide workable examples on a laptop), link them to ZIP codes, and plot the results.
The hyspdisp_fac_model_parallel function runs HYSPLIT for each emissions event described by the rows of run_ref_tab. Inputs defined above, including the projected ZCTA shapefile (zcta.trans), the ZIP-ZCTA crosswalk (crosswalkin), the planetary boundary layer raster (hpbl_rasterin), and the working directories are necessary in the function call. With link2zip = T you can link dispersion patterns from individual emissions events to ZIP codes; this, however, somewhat violates the spirit of HyADS, since the large number of emissions events is used as a check on some of the uncertainties introduced by the simplifying assumptions.

Here it is run for a sample of twenty (run_sample) rows of the input_refs table. The results will be saved in the hysraw_dir. mc.cores is an argument to the mclapply function that controls the number of cores used; the code below uses the detectCores function to determine how many cores are available, but this approach is less reliable on some machines.
hysp_raw, created below, is a list of outputs describing where the results are saved or the reasons that HYSPLIT was not run. The following code will return warnings from the projection transformation applied to the PBL rasters; these are expected and can safely be ignored.

If you get an error running HYSPLIT, it is likely because the meteorology files did not download correctly. Check your meteo_dir to ensure all files are present and at least 100 MB in size; a hedged check is sketched below.
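A minimal sketch of that check (the ~100 MB threshold mirrors the guidance above):

```r
# List the met files currently in meteo_dir with their sizes (in MB);
# anything missing from the set downloaded above, or well under ~100 MB,
# likely failed to download and should be deleted and re-fetched.
met_found <- list.files( meteo_dir, full.names = TRUE)
data.frame( file = basename( met_found),
            size_MB = round( file.size( met_found) / 1e6))
```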
```r
library( parallel)

run_sample <- round( seq( 1, dim( input_refs)[1], length.out = 20))
hysp_raw <- mclapply( X = run_sample,
                      FUN = hyspdisp_fac_model_parallel,
                      run_ref_tab = input_refs,
                      zcta2 = zcta.trans,
                      crosswalk = crosswalkin,
                      hpbl_raster = hpbl_rasterin,
                      prc_dir = proc_dir,
                      hyo_dir = hysraw_dir,
                      met_dir = meteo_dir,
                      mc.cores = detectCores() - 1)
hysp_raw[1]
```
```r
library( parallel)
knitr::kable( head( hysp_raw, 1))
```
Most current implementations of HyADS, instead of linking dispersion patterns from individual emissions events, link the patterns by month. With the hyspdisp_zip_link function, users can link all air parcels to ZIP codes by month for an individual unit. Here, we define the variable yearmons with combinations of years and months. hyspdisp_zip_link reads in all the relevant files (i.e., those that correspond to the provided month_YYYYMM and unit) produced by the hyspdisp_fac_model_parallel function and saved to the hyo_dir, then links them to ZIP codes. The result is a data.table of ZIP codes and relative contributions N, and an identical .csv file is saved to the zpc_dir. N is not weighted by the unit's emissions. hyspdisp_zip_link is parallelizable over months using mclapply or similar, and is run for a single unit at a time (we call it once for each unit in units.run).
linked_zips1 and linked_zips2 contain monthly ZIP code exposure attributable to the emissions events modeled by the hyspdisp_fac_model_parallel function above. Keep in mind that we only modeled a subset of all possible emissions events defined in input_refs; therefore, the following objects only contain ZIP code linkages from a subset of all emissions. Since we did not model any emissions events after June, the commands will return warnings for months 7-12.
```r
yearmons <- paste0( 2005, 1:12)

linked_zips1 <- mclapply( yearmons,
                          hyspdisp_zip_link,
                          unit = units.run[1],
                          hpbl_raster = hpbl_rasterin,
                          zcta2 = zcta.trans,
                          crosswalk = crosswalkin,
                          prc_dir = proc_dir,
                          zpc_dir = ziplink_dir,
                          hyo_dir = hysraw_dir,
                          mc.cores = detectCores() - 1)
linked_zips2 <- mclapply( yearmons,
                          hyspdisp_zip_link,
                          unit = units.run[2],
                          hpbl_raster = hpbl_rasterin,
                          zcta2 = zcta.trans,
                          crosswalk = crosswalkin,
                          prc_dir = proc_dir,
                          zpc_dir = ziplink_dir,
                          hyo_dir = hysraw_dir,
                          mc.cores = detectCores() - 1)
```
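A hedged check of which months actually produced linkages (months 7-12 should come back empty, since no emissions events were modeled for them); this assumes each list element is a data.table, possibly empty, or NULL:

```r
# Rows of linked ZIP codes per month for the first unit; zeros are expected
# for any month with no modeled emissions events.
sapply( linked_zips1, function( x) if( is.null( x)) 0L else nrow( x))
```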
At this point, it makes sense to take a look at the results to see which ZIP codes are impacted by emissions from the facility whose impacts we just modeled. To achieve this, we can use the plot_hyspdisp function, which takes as input a data.table (or data.frame) linked to simple features ZIP code data. We can use the st_read function from the sf package to read in the same ZCTA shapefile we used to link the HYSPLIT output to ZIP codes. Next, we merge the sf ZIP code object with both the crosswalk file and the linked_zips objects created above, then feed the merged dataset to the plot_hyspdisp function, which returns a ggplot object.
Keep in mind that, since we only modeled a subset of the potential model runs defined in input_refs, the impacts contained in zip_dataset_sf and plotted below include only results from a single emissions event. Further, the plotted values are "HyADS raw exposure" and are comparable only to a relative exposure of 0. So, a value of 0.01 represents 10x more exposure than a value of 0.001, but only in relation to unit 3136-1.
```r
library( sf)
library( viridis)
library( ggplot2)
library( scales)

zips.sf <- st_read( zcta_shapefile)
setnames( zips.sf, 'ZCTA5CE10', 'ZCTA')
zips_crosswalk.sf <- merge( zips.sf, crosswalkin, by = "ZCTA",
                            all = F, allow.cartesian = TRUE)

zip_dataset_sf <- data.table( units.run[1,],
                              merge( zips_crosswalk.sf, linked_zips1[[1]],
                                     by = c('ZIP'), all.y = T))

ziplink_plot <- plot_hyspdisp( hyspdisp_out.sf = zip_dataset_sf,
                               metric = 'N',
                               plot.title = paste( 'Partial Jan. 2005 HyADS raw exposure, uID:',
                                                   zip_dataset_sf[1, uID]),
                               legend.title = 'HyADS raw exposure',
                               facility.loc = data.table( x = zip_dataset_sf[1, Longitude],
                                                          y = zip_dataset_sf[1, Latitude]))
ziplink_plot
```
So far, we have run two worker functions important to the implementation of HyADS:

- hyspdisp_fac_model_parallel: ran HYSPLIT for unit-date-time emissions event combinations and saved an output file for each
- hyspdisp_zip_link: gathered output files from hyspdisp_fac_model_parallel, grouped them by month and unit, and linked the HYSPLIT parcel locations contained in the files to ZIP codes

In practice, these two worker functions may need to be run in parallel R sessions on a cluster to improve efficiency. It helps for future analyses to gather the relevant monthly files and save them as a single RData file. This is made possible by combine_monthly_ziplinks, which saves an RData file of annual monthly results to the rda_dir.
```r
combined_ziplinks <- combine_monthly_ziplinks( month_YYYYMMs = yearmons,
                                               zpc_dir = ziplink_dir,
                                               rda_dir = rdata_dir)
names( combined_ziplinks)
```
```r
knitr::kable( names( combined_ziplinks))
```
Up until now, this vignette has focused on producing unweighted HyADS results. Once the procedure above has been followed, it is fairly straightforward to weight the unweighted HyADS results by emissions (or another quantity) and aggregate the results by source, receptor, and/or time period.
The hyspdisp package includes monthly power plant emissions, load, and heat input data, accessible using the following command:
```r
data( PP.units.monthly1995_2017)
```
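A quick way to confirm which weighting variables are available; SO2.tons, used below, should appear among the column names:

```r
# Inspect the monthly unit-level dataset for available weighting variables
names( PP.units.monthly1995_2017)
head( PP.units.monthly1995_2017)
```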
The calc_zip_exposure function takes as input the .RData file created by combine_monthly_ziplinks and the monthly power plant emissions. The user can select the hyspdisp year (year.H), the emissions year (year.E), the weighting pollutant (although it does not necessarily have to be a pollutant), the source aggregation (source.agg can be designated 'total', 'facility', or 'unit'), and the time aggregation (time.agg can be set to 'month' or 'year'). Results are saved to the exp_dir and returned, but return.monthly.data must be set to TRUE to return data when time.agg = 'month' (the default is FALSE to save working memory).
Here, we first calculate the annual HyADS emissions-weighted exposure for all emissions modeled above and plot the results. The plots require a large amount of working memory and may require you to clear your image history (if using RStudio).
```r
zip_exp_ann <- calc_zip_exposure( rda_file = file.path( rdata_dir, "hyads_unwgted_2005.RData"),
                                  units.mo = PP.units.monthly1995_2017,
                                  year.E = 2005,
                                  year.H = 2005,
                                  pollutant = 'SO2.tons',
                                  source.agg = 'total',
                                  time.agg = 'year',
                                  exp_dir = exposure_dir)

zip_exp_ann_sf <- data.table( merge( zips_crosswalk.sf, zip_exp_ann,
                                     by = c('ZIP'), all.y = T))

zip_exp_ann_plot <- plot_hyspdisp( hyspdisp_out.sf = zip_exp_ann_sf,
                                   metric = 'hyads',
                                   plot.title = paste( '2005 HyADS exposure from',
                                                       units.run[1, ID], 'and', units.run[2, ID]),
                                   legend.title = 'HyADS exposure',
                                   facility.loc = data.table( x = units.run[, Longitude],
                                                              y = units.run[, Latitude]))
zip_exp_ann_plot
```
Next, we calculate the monthly emissions-weighted HyADS exposure (again, only from the select few runs we initiated above). The code below plots the monthly exposure contributions of the two units to ZIP code 21613, which is located in Maryland. Note that this ZIP code was not impacted by the facilities in every modeled month of 2005.
```r
zip_exp_unit_mo <- calc_zip_exposure( rda_file = file.path( rdata_dir, "hyads_unwgted_2005.RData"),
                                      units.mo = PP.units.monthly1995_2017,
                                      year.E = 2005,
                                      year.H = 2005,
                                      pollutant = 'SO2.tons',
                                      source.agg = 'unit',
                                      time.agg = 'month',
                                      exp_dir = exposure_dir,
                                      return.monthly.data = T)
zip_exp_unit_mo[, uID := as( uID, 'character')]

qplot( x = yearmonth, y = hyads, group = uID, color = uID,
       data = zip_exp_unit_mo[ZIP == 21613], geom = 'line')
```
In the examples provided above, we selected two coal units a priori to analyze. However, a benefit of HyADS over more complex models is the ability to easily rank-order emissions sources and identify those that contribute the most to exposure in a given area. The following code snippet applies the calc_zip_exposure function to calculate annual impacts by unit (here, only the two units selected above), then rank-orders their population-weighted exposures on Pennsylvania using the rankfacs_by_popwgt_location function. In this function:

- link.dt = zip_exp_ann_unit defines the input ZIP-code linkage dataset.
- census.dt takes the crosswalkin file employed above, this time including a column named year to ensure the appropriate population weighting is used.
- census.pop.name defines the population column in crosswalkin (recall we linked ACS ZIP-code populations with the crosswalk file above).
- rank.by defines the variables in zip_exp_ann_unit to rank by; it is possible to supply multiple ranking variables (e.g., rank.by = c( 'hyads', 'SOx')) if they are available.
- state.value, city.value, and zip.value can be defined to limit the area of influence (here, we only rank these units' influences on Pennsylvania with state.value = 'PA').
```r
zip_exp_ann_unit <- calc_zip_exposure( year.E = 2005,
                                       year.H = 2005,
                                       pollutant = 'SO2.tons',
                                       rda_file = file.path( rdata_dir, "hyads_unwgted_2005.RData"),
                                       exp_dir = exposure_dir,
                                       units.mo = PP.units.monthly1995_2017,
                                       source.agg = 'unit',
                                       time.agg = 'year')

crosswalkin2005 <- copy( crosswalkin)[, year := 2005]
zip_exp_ann_unit[, year := 2005]

unitRanks2005 <- rankfacs_by_popwgt_location( link.dt = zip_exp_ann_unit,
                                              census.dt = crosswalkin2005,
                                              census.pop.name = 'Estimate; Total',
                                              rank.by = c( 'hyads'),
                                              state.value = 'PA')
unitRanks2005
```
```r
knitr::kable( unitRanks2005)
```
Finally, we can make two plots: one showing the total 2005 emissions from each of the units, and one showing the total population-weighted exposure from these units. We start by linking unitRanks2005 back with the unit location and emissions information, then execute the plot_ranked_facs command twice: once to size the facility units by emissions, and a second time to size them by hyads.py.sum, the metric of population-weighted influence calculated by rankfacs_by_popwgt_location.
```r
library( ggsn)

unitRanks2005 <- merge( unitRanks2005, units2005, by = 'uID')

SOx_emiss_plot <- plot_ranked_facs( ranks.dt = unitRanks2005,
                                    size.var = 'SOx',
                                    size.name = "2005 SOx emissions",
                                    plot.title = NULL,
                                    xlims = c( -81, -75),
                                    ylims = c( 38.5, 42.5),
                                    dist.scalebar = 150)

SOx_emsposure_plot <- plot_ranked_facs( ranks.dt = unitRanks2005,
                                        size.var = 'hyads.py.sum',
                                        size.name = "2005 SOx HyADS exposure",
                                        plot.title = NULL,
                                        xlims = SOx_emiss_plot$latlonrange$xlim,
                                        ylims = SOx_emiss_plot$latlonrange$ylim,
                                        dist.scalebar = 150)
```
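To view the two maps side by side, here is a hedged sketch using gridExtra; note that plot_ranked_facs appears to return a list (the $latlonrange element is used above), so the name of the plot element is an assumption:

```r
library( gridExtra)
# 'plot' is an assumed element name; inspect names( SOx_emiss_plot) to confirm.
grid.arrange( SOx_emiss_plot$plot, SOx_emsposure_plot$plot, ncol = 2)
```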