In lhenneman/hyspdisp: Run HYSPLIT many times in parallel, aggregate to zip code level

knitr::opts_chunk$set(
echo = T, 
results = 'hide',
collapse = TRUE,
comment = "#>"
)

Here we provide an example workflow of how to run the HyADS model using hyspdisp. This vignette provides instruction on how to run the model and ensure the model runs completed. Subsequent vignettes will describe combining, manipulating, and interpreting results. TO DECIDE: split into multiple vignettes, or keep as one?

Begin by installing the package from Github. It is recommended to update all packages when prompted (expecially SplitR).

devtools::install_github("lhenneman/hyspdisp@dev2")
library( hyspdisp)

Getting started - downloading important files

Several files are required to run the functions in hyspdisp. This section presents procedures to download these.

ZIP code-ZCTA crosswalk file

The ZIP code linkage procedure requires a ZCTA-to-ZIP code Crosswalk file. ZCTAs are not exact geographic matches to ZIP codes, and multiple groups compile and maintain Crosswalk files. One example is the Crosswalk mainted by UDS Mapper. It can be retrieved and its names changed for consistency with hyspdisp functions with the following commands:

library( openxlsx)
crosswalkin <- data.table( read.xlsx( xlsxFile =
                                        "https://www.udsmapper.org/docs/zip_to_zcta_2017.xlsx"))
setnames( crosswalkin, 
          old = "ZIP_CODE", 
          new = "ZIP")

Census population of Zip Code Tabulation Areas

While not necessary for the HYSPLIT model or processing of its outputs, population-weighted exposure metrics allow for direct comparisons between power plants. In anticipation of this, we load census data packaged by American Fact Finder and merge it with the crosswalk data. The variable ZCTA must first be added from the included Id2 variable before merging.

data( ZCTApop2017)
ZCTApop2017[, ZCTA := formatC( Id2, format = 'd', flag = '0', width = 5)]
crosswalkin <- merge( ZCTApop2017, crosswalkin,
                      by = 'ZCTA')

ZCTA shapefile

Equally important is the ZCTA shapefile. These are available from multiple locations, including the US census website. Code below downloads the file from the FTP site, which is more stable but slower. A second alternative (not shown) is to use the tigris r library. Download the file, unzip it, and read in the shapefile (here I've unzipped it to a directory in my Desktop). Loading the file may take a few moments.

# Define the storage directory, create it if it does not exist
zcta_dir <- file.path('~', 'Desktop', 'run_hyspdisp', 'zcta_500k')
zcta_file <- file.path( zcta_dir, 'cb_2017_us_zcta510_500k.zip')
dir.create( zcta_dir, 
            recursive = T, 
            showWarnings = F)

# Download the file if it is not present 
zcta_file_url <- 'ftp://ftp2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_zcta510_500k.zip'
if( !file.exists( zcta_file)){
  download.file( url = zcta_file_url,
                 destfile = zcta_file)
  unzip( zcta_file, exdir = zcta_dir)
}

# Load the ZCTA shapefile using the shapefile command
zcta_shapefile <- file.path( zcta_dir, 'cb_2017_us_zcta510_500k.shp')
zcta <- shapefile( x = zcta_shapefile)

It is recommended to transform the ZCTA shapefile to a known projection to maintain consistency throughout the allocation process. Lat-lon projections are preferred, such as the North American Albers Equal Area Conic:

p4s <- "+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m"
zcta.trans <- spTransform( x = zcta, 
                           CRSobj = p4s)

Monthly global planetary boundary layer

The final input file is the monthly mean boundary layer heights. For years up to and including 2012, these are available in a single file from NOAA's Earth System Research Library. For years 2013 and later (and available for years 2008 and on), you may use the NCAR/UCAR data archive. The NCAR/UCAR annual files must be downloaded one by one. Below I've included examples for how to download the two types of files if they are not already in the file path.

# Define a directory, create it if it does not exist
hpbl_dir <- file.path('~', 'Desktop', 'run_hyspdisp', 'hpbl')
dir.create( hpbl_dir, 
            recursive = T, 
            showWarnings = F)

# Download the PBL file from NOAA for years up to and including 2012
library( ncdf4)
hpbl_file_NOAA <- file.path( hpbl_dir, 'hpbl.mon.mean.nc')
url_NOAA <- "ftp://ftp.cdc.noaa.gov/Datasets/20thC_ReanV2/Monthlies/gaussian/monolevel/hpbl.mon.mean.nc"
if( !file.exists( hpbl_file_NOAA))
  download.file( url = url_NOAA,
                 destfile = hpbl_file_NOAA)

# Download the PBL file from NCAR for 2013
hpbl_file_NCAR2013 <- file.path( hpbl_dir, 'hpbl.mon.mean.nc_2013.grb')
url_NCAR <- "https://rda.ucar.edu/thredds/catalog/files/e/ds630.1/e5.mnth.mean.an.sfc/2013/catalog.html?dataset=files/e/ds630.1/e5.mnth.mean.an.sfc/2013/e5.mnth.mean.an.sfc.128_159_blh.regn320sc.2013010100_2013120100.grb"
if( !file.exists( hpbl_file_NCAR2013))
  download.file( url = url_NCAR,
                 destfile = hpbl_file_NCAR2013)

Both files are read in using the brick command, and rotated to match the lat-lon projection applied above. Before reading in, it is necessary to set the system time zone to UTC so that the dates are formatted correctly in the raster files. For the NCAR boundary layer files, it is not neccesary to set the varname argument in the brick function.

Sys.setenv(TZ='UTC')
hpbl_rasterin <- brick( x = hpbl_file_NOAA, 
                                varname = 'hpbl' )

Setting up the inputs

Functions in hyspdisp require specific inputs. Their setups are detailed in this section.

Selecting units

Select power plants to run. In this case, we'll use the two units in 2005 with the greatest SOx emissions. This package contains annual emissions and stack height data from EPA's Air Markets Program Data and the [Energy Information Agency] (https://www.eia.gov/electricity/data/eia860/) for 2005, 2006, and 2012. [PLUG CHRISTINE's DATA PAPER].

Many of the units in the provided datasets do not have stack height data. In these cases, it is suggested in Henneman et al. (2019) to fill with the average stack height of all units.

data( units2005)
units2005 <- data.table( units2005)[, V1 := NULL]
units.run <- units2005[ order( -SOx)][1:2]
units.run

knitr::kable( units.run)

Designate directories for storing output files

hyspdisp creates multiple file types that store information important in interpreting the results. It is recommended to define the locations beforehand so that you:

can access the files later
can ensure there is enough storage space available

In general, hyspdisp functions will create the directories if they do not already exist. If they are not defined, they will be created in the working directory. The following directories, as referenced in the functions below in this document, will contain:

proc_dir: the overarching directory containing each of the subdirectories, and the working directory for HYSPLIT
hysraw_dir: raw hyspdisp output (one file for each emissions event)
ziplink_dir: files containing ZIP code linkages
meteo_dir: (reanalysis) meteorology files
rdata_dir: RData files containing HyADS source-receptor matrices

## proc_dir defines where runs happen and output is saved
## up_dir is where input data is saved (upper file structure)
proc_dir <- file.path( '~', 'Desktop', 'run_hyspdisp')
hysraw_dir <- file.path( proc_dir, 'output_hysplit')
ziplink_dir <- file.path( proc_dir, 'output_ziplinks')
meteo_dir <- file.path( proc_dir, 'meteo')
rdata_dir <- file.path( proc_dir, 'output_rdata')
exposure_dir <- file.path( proc_dir, 'exposure_data')

Download reanalysis meteorology files

While SplitR (the R package that calls HYSPLIT) includes code that checks if the appropriate metorological files are downloaded, it is recommended to download needed files explicitely before the first run of hyspdisp_fac_model_parallel because the parallel code does not handle the download well split over multiple cores. Below is shown code to test for the three meteorology files needed for the present run, and download them if they are not already in the meteo_dir. The reanalysis met files are about 120 MB each.

metfiles.vignette <- c( 'RP200412.gbl',
                        'RP200501.gbl',
                        'RP200502.gbl',
                        'RP200503.gbl',
                        'RP200504.gbl',
                        'RP200505.gbl',
                        'RP200506.gbl',
                        'RP200507.gbl')
dir.create( meteo_dir,
            recursive = T,
            showWarnings = F)
metfiles.vignette <- metfiles.vignette[!(metfiles.vignette %in% list.files(meteo_dir))]
if( length( metfiles.vignette) > 0)
    get_met_reanalysis(files = metfiles.vignette, 
                       path_met_files = meteo_dir)

Define time periods to be run

HYSPLIT, as applied here, tracks air parcels emitted at certain times and locations. hyspdisp refers to these as emissions events. Once HYSPLIT has been run for each emissions event, the simulated parcel locations are aggregated by source, time, and location. The functions below are written to enable runs of many emissions events.

To define an object that includes all emission events in a given time period, we can use the helper function define_inputtimes. This takes as inputs a starting and ending day, and outputs a table of value whose rows will later correspond to inputs into the main hyspdisp worker functions. The following command combines the units defined above with four times a day for January-June in 2005 (we show an example for half the year here to keep the computation manageable for a desktop computer). Four daily emissions events are defined by the start_hours variable, and duration = 240 denotes that the emitted air parces are tracked for 240 hours (10 days). The resulting data.table's column names are printed in the table below.

## combine dates and hours
input_refs <- define_inputs( units = units.run,
                             startday = '2005-01-01',
                             endday = '2005-06-30',
                             start_hours =  c( 0, 6, 12, 18),
                             duration = 240)

names( input_refs)

knitr::kable( names( input_refs))

Run hyspdisp worker functions

The following examples show how to run a small subset (chosen to provide workable examples on a laptop) of emissions events, link to ZIP codes, and plot the results.

Run HYSPLIT

The hyspdisp_fac_model_parallel function runs HYSPLIT for each emissions event described by the rows of the run_ref_tab. Inputs defined above, including the projected ZCTA shapefile (zcta.trans), the ZIP-ZCTA crosswalk (crosswalkin), the planetary boundary layer raster (hpbl_rasterin), and working directories are necessary in the function call. With link2zip = T you can link dispersion patterns from individual emissions events to ZIP codes; this, however, somewhat violates the spirit of HyADS, since the large number of emissions events is used as a check on some other uncertainties introduced by the simplifying assumptions.

Here it's run for a sample of twenty (run_sample) rows of the input_refs table. The results will be saved in the hysraw_dir. mc.cores is an option in the mcapply function that controls the number of cores should be used; the code below uses the detectCores function to determine how many cores are available, but this approach is less reliable on some machines.

hysp_raw, created below, is a list of output describing where the outputs are saved or the reasons that HYSPLIT was not run. The following code will return warnings from the projection transformation applied to the PBL layer rasters. These are expected and can safely be ignored.

If you get an error running HYSPLIT, it is likely because the meteorology files did not download correctly. Check your meteo_dir to ensure all files are present and at least 100MB in size.

library(parallel)
run_sample <- round(seq( 1, dim( input_refs)[1], length.out = 20))
hysp_raw <- mclapply( X = run_sample,
                      FUN = hyspdisp_fac_model_parallel,
                      run_ref_tab = input_refs,
                      zcta2 = zcta.trans,
                      crosswalk = crosswalkin,
                      hpbl_raster = hpbl_rasterin,
                      prc_dir = proc_dir,
                      hyo_dir = hysraw_dir,
                      met_dir = meteo_dir,
                      mc.cores = detectCores() - 1)
hysp_raw[1]

library(parallel)
knitr::kable( head( hysp_raw, 1))

Link results to ZIP codes

Most current implementations of HyADS, instead of linking dispersion patterns from individual emissions events, link the patterns by month. With the hysdisp_zip_link function, users can link all air parcels to ZIP codes by month for an individual unit. Here, we define the variables yearmons with combinations of years and months. hysdisp_zip_link reads in all the relevant files (i.e., those that correspond to the provided month_YYYYMM and unit) produced by the hyspdisp_fac_model_parallel function and saved to the hyo_dir, then links them to ZIP codes. The result is data.table of ZIP codes and relative contributions N, and an identical .csv file is saved to the zpc_dir. N is not weighted by the unit's emissions. hyspdisp_zip_link is parallelizable for different months using mclapply or similar, and is run for a single unit at a time (we define the unit as the each unit in unit.run.)

linked_zips1 and linked_zips2 contain monthly ZIP code exposure attributable to the emissions events modeled by the hyspdisp_fac_model_parallel function above. Keep in mind that we only modeled a subset of all possible emissions events defined in input_refs. Therefore, the following objects only contain ZIP code linkages from a subset of all emissions. Since we did not model any emissions events after June, the commands will return warnings for months 7-12.

yearmons <- paste0( 2005, 1:12)
linked_zips1 <- mclapply( yearmons,
                          hyspdisp_zip_link,
                          unit = units.run[1],
                          hpbl_raster = hpbl_rasterin,
                          zcta2 = zcta.trans,
                          crosswalk = crosswalkin,
                          prc_dir = proc_dir,
                          zpc_dir = ziplink_dir,
                          hyo_dir = hysraw_dir,
                          mc.cores = detectCores() - 1)

linked_zips2 <- mclapply( yearmons,
                          hyspdisp_zip_link,
                          unit = units.run[2],
                          hpbl_raster = hpbl_rasterin,
                          zcta2 = zcta.trans,
                          crosswalk = crosswalkin,
                          prc_dir = proc_dir,
                          zpc_dir = ziplink_dir,
                          hyo_dir = hysraw_dir,
                          mc.cores = detectCores() - 1)

Plot the results

At this point, it makes sense to take a look at the results to see which ZIP codes are impacted by emissions from the facility whose impacts we just modeled. To acheive this, we can use the plot_hyspdisp function, which takes as input a data.table (or data.frame) linked to simple features ZIP code data. We can use the st_read function from the sf package to read in the same ZCTA shapefile we used to link the HYSPLIT output to ZIP codes. Next, we merge the sf ZIP code object to both the crosswalk file and the linked_zips object created above, then feed the merged dataset to the plot_hyspdisp function, which returns a ggplot object.

Keep in mind, since we only modeled a subset of the potential model runs defined in input_refs, the impacts contained in zip_dataset_sf and plotted below include only results from a single emissions event. Further, the plotted values are "HyADS raw exposure", and are comparable only to a relative exposure of 0. So, a value of 0.01 represents 10x more exposure than a value of 0.001, but only in relation to unit 3136-1.

library(sf)
library(viridis)
library(ggplot2)
library(scales)

zips.sf <- st_read(zcta_shapefile)
setnames( zips.sf, 'ZCTA5CE10', 'ZCTA')

zips_crosswalk.sf <- merge( zips.sf, 
                            crosswalkin, 
                            by = "ZCTA", all = F, allow.cartesian = TRUE) 

zip_dataset_sf <- data.table( units.run[1,],
                              merge( zips_crosswalk.sf,
                                     linked_zips1[[1]],
                                     by = c('ZIP'), all.y = T))

ziplink_plot <- plot_hyspdisp(hyspdisp_out.sf = zip_dataset_sf,
                              metric = 'N',
                              plot.title = paste('Partial Jan. 2005 HyADS raw exposure, uID:',
                                                 zip_dataset_sf[1, uID]),
                              legend.title = 'HyADS raw exposure',
                              facility.loc = data.table( x = zip_dataset_sf[1, Longitude],
                                                         y = zip_dataset_sf[1, Latitude]))
ziplink_plot

Combine all results into RData file

So far, we have run two worker functions important to the implementation of HyADS:

hyspdisp_fac_model_parallel: ran HYSPLIT for unit-date-time emission event combinations and saved an output file for each
hyspdisp_zip_link: gathered output files from hyspdisp_fac_model_parallel, grouped them by month and unit, and linked the HYSPLIT parcel locations contained in the files to ZIP codes.

In practice, these two worker functions may need to be implemented in parallel R sessions on a cluster to improve efficiency. It helps for future analyses to gather the relevant monthly files and save them as a single RData file. This is made possible by combine_monthly_ziplinks, which saves an RData file of annual monthly results to the rda_dir.

combined_ziplinks <- combine_monthly_ziplinks( month_YYYYMMs = yearmons,
                                               zpc_dir = ziplink_dir,
                                               rda_dir = rdata_dir)
names( combined_ziplinks)

knitr::kable( names( combined_ziplinks))

Calculate and extract useful information from the results

Up until now, this vignette has focused on producing unweighted HyADS results. Once the procedure above has been followed, it is farily straightforward to weight the unweighted HyADS results by emissions (or other quantity) and aggregate the results by source, receptor, and/or time period.

Load power plant emissions data

The hyspdisp package include monthly power plant emissions, load, and heat input data, accessible using the following command:

data( PP.units.monthly1995_2017)

Weight the results by emissions

The calc_zip_exposure function takes as input the .RData file created by combined_ziplinks and monthly power plant emissions. The user can select the hyspdisp year (year.H), the emissions year (year.E), the weighting pollutant (although it does not necessarily have to be a pollutant), the source aggregation (source.agg can be designated 'total', 'facility', or 'unit'), the time aggregation (time.agg can be set to 'month' or year'). Results are saved to the exp_dir and returned, but return.monthly.data must be set to true to return data when time.agg = 'month' (the default was set to save working memory space).

Here, we first calculate the annual HyADS emissions weighted exposure for all emissions modeled above and plot the results. The plots require a large amount of working memory, and may require you to clear your image history (if using RStudio).

zip_exp_ann <- calc_zip_exposure(rda_file = "~/Desktop/run_hyspdisp/output_rdata/hyads_unwgted_2005.RData",
                                 units.mo = PP.units.monthly1995_2017,
                                 year.E = 2005,
                                 year.H = 2005,
                                 pollutant = 'SO2.tons',
                                 source.agg = 'total',
                                 time.agg = 'year',
                                 exp_dir = exposure_dir)

zip_exp_ann_sf <- data.table( merge( zips_crosswalk.sf,
                                     zip_exp_ann,
                                     by = c('ZIP'), all.y = T))

zip_exp_ann_plot <- plot_hyspdisp(hyspdisp_out.sf = zip_exp_ann_sf,
                              metric = 'hyads',
                              plot.title = paste('2005 HyADS exposure from',
                                                 units.run[1, ID], 'and', units.run[2, ID]),
                              legend.title = 'HyADS exposure',
                              facility.loc = data.table( x = units.run[, Longitude],
                                                         y = units.run[, Latitude]))

zip_exp_ann_plot

Plot unit-specific impacts over time

Next, we calculate the monthly emissions weighted HyADS exposure (again, only from the select few runs we initiated above). Code below plots the monthly exposure contributions of the two units in ZIP code 21613, which is located in Maryland. Note that this ZIP code was not impacted by the facilities in all the modeled months in 2005.

zip_exp_unit_mo <- calc_zip_exposure(rda_file = "~/Desktop/run_hyspdisp/output_rdata/hyads_unwgted_2005.RData",
                                     units.mo = PP.units.monthly1995_2017,
                                     year.E = 2005,
                                     year.H = 2005,
                                     pollutant = 'SO2.tons',
                                     source.agg = 'unit',
                                     time.agg = 'month',
                                     exp_dir = exposure_dir,
                                     return.monthly.data = T)

zip_exp_unit_mo[, uID := as( uID, 'character')]
qplot( x = yearmonth,
       y = hyads,
       group = uID,
       color = uID,
       data = zip_exp_unit_mo[ZIP == 21613],
       geom = 'line')

Rank facilities

In the examples provided above, we selected two coal units a priori to analyze. However, a benefit of HyADS over more complex models is the ability to easily rank-order emissions sources to identify the sources that contribute the most to exposure in a given area. The following code snippet applies the calc_zip_exposure function to calculate annual impacts by unit (here, only the two units selected above), then rank-orders their population-weighted exposures on Pennsylvania using the rankfacs_by_popwgt_location function. In this function, the input ZIP-code linkage dataset is defined by link.dt = zip_exp_ann_unit; the crosswalkin file employed above is used again, this time including a column named year to ensure the appropriate population-weighting is used; the census.pop.name defines the population column in crosswalkin (recall we linked ACS ZIP-code populations with the crosswalk file above); the rank.by column defines the variables in zip_exp_ann_unit to rank by---it is possible to define multiple values (e.g., rank.by = c( 'hyads', 'SOx')) to rank by multiple ranking variables if they are available; finally, state.value, city.value, and zip.value can be defined to limit the area of influence (here, we only rank these units' influences on Pennsylvania with state.value = 'PA').

zip_exp_ann_unit <- calc_zip_exposure(year.E = 2005,
                                      year.H = 2005,
                                      pollutant = 'SO2.tons',
                                      rda_file = file.path( rdata_dir, 
                                                            "hyads_unwgted_2005.RData"),
                                      exp_dir = exposure_dir,
                                      units.mo = PP.units.monthly1995_2017,
                                      source.agg = 'unit',
                                      time.agg = 'year')


crosswalkin2005 <- copy( crosswalkin)[, year := 2005]
zip_exp_ann_unit[, year := 2005]

unitRanks2005 <- rankfacs_by_popwgt_location( link.dt = zip_exp_ann_unit,
                                              census.dt = crosswalkin2005,
                                              census.pop.name = 'Estimate; Total',
                                              rank.by = c( 'hyads'),
                                              state.value = 'PA')
unitRanks2005

knitr::kable( unitRanks2005)

Plot ranked facilities

Finally, we can make two plots: one that shows the total 2005 emissions from each of the units in 2005, and one that shows the total population-weighted exposure from these units. We start by linking unitRanks2005 back with the unit location and emissions information, then execute the plot_ranked_facs command twice, once to size the facility units by emissions, and a second to size by hyads.py.sum, the metric of population-weighted influence calculated by rankfacs_by_popwgt_location.

library(ggsn)
unitRanks2005 <- merge( unitRanks2005, units2005, by = 'uID')
SOx_emiss_plot <- plot_ranked_facs( ranks.dt = unitRanks2005,
                                    size.var = 'SOx',
                                    size.name = "2005 SOx emissions",
                                    plot.title = NULL,
                                    xlims = c( -81, -75),
                                    ylims = c( 38.5, 42.5),
                                    dist.scalebar = 150)

SOx_emsposure_plot <- plot_ranked_facs( ranks.dt = unitRanks2005,
                                        size.var = 'hyads.py.sum',
                                        size.name = "2005 SOx HyADS exposure",
                                        plot.title = NULL,
                                        xlims = SOx_emiss_plot$latlonrange$xlim,
                                        ylims = SOx_emiss_plot$latlonrange$ylim,
                                        dist.scalebar = 150)

lhenneman/hyspdisp documentation built on Oct. 17, 2019, 6:29 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

lhenneman/hyspdisp
Run HYSPLIT many times in parallel, aggregate to zip code level

In lhenneman/hyspdisp: Run HYSPLIT many times in parallel, aggregate to zip code level

Getting started - downloading important files

ZIP code-ZCTA crosswalk file

Census population of Zip Code Tabulation Areas

ZCTA shapefile

Monthly global planetary boundary layer

Setting up the inputs

Selecting units

Designate directories for storing output files

Download reanalysis meteorology files

Define time periods to be run

Run hyspdisp worker functions

Run HYSPLIT

Link results to ZIP codes

Plot the results

Combine all results into RData file

Calculate and extract useful information from the results

Load power plant emissions data

Weight the results by emissions

Plot unit-specific impacts over time

Rank facilities

Plot ranked facilities

R Package Documentation

Browse R Packages

We want your feedback!

lhenneman/hyspdisp Run HYSPLIT many times in parallel, aggregate to zip code level

In lhenneman/hyspdisp: Run HYSPLIT many times in parallel, aggregate to zip code level

Getting started - downloading important files

ZIP code-ZCTA crosswalk file

Census population of Zip Code Tabulation Areas

ZCTA shapefile

Monthly global planetary boundary layer

Setting up the inputs

Selecting units

Designate directories for storing output files

Download reanalysis meteorology files

Define time periods to be run

Run hyspdisp worker functions

Run HYSPLIT

Link results to ZIP codes

Plot the results

Combine all results into RData file

Calculate and extract useful information from the results

Load power plant emissions data

Weight the results by emissions

Plot unit-specific impacts over time

Rank facilities

Plot ranked facilities

R Package Documentation

Browse R Packages

We want your feedback!

lhenneman/hyspdisp
Run HYSPLIT many times in parallel, aggregate to zip code level