knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In this vignette you will see how to use the disperseR
package. We will walk you through an example step by step. We will be using functions that will automatically download the needed data for you to be able to run a complete analysis for the United States. We also provide other vignettes that show you how we processed the data and how to load datasets one by one (optional).
We start by importing the packages that we will use in the analysis. Make sure you have them installed. The functions in disperseR
are based on the data.table
package for faster computation, but this vignette contains many functions from the tidyverse
family to illustrate another syntax. First, install a slightly modified version of SplitR
.
devtools::install_github( 'lhenneman/SplitR')
library(disperseR) # our package library(ncdf4) library(data.table) library(tidyverse) library(parallel) library(sf) library(viridis) library(ggplot2) library(scales) library(ggsn) library(gridExtra) library(ggmap) library(ggrepel) library(fst) library(USAboundaries)
Next, we will use the create_dirs()
function that creates a project folder with all the necessary sub folders. We recommend that you begin by setting up your project with this function. Provide the path to where you would like your project to live in the location
argument. By default, if you do not specify the path, this function will create a main directory on your desktop. It will also assign paths to string variables in your environment.
This will set up is the following folders and paths to them :
main
: the main folder where the project will be located. input
: the input that we need for calculations. zcta_500k
: ZCTA (A Zip Code Tabulation Area) shape fileshpbl
: monthly global planetary boundary layer files.meteo
: (reanalysis) meteorology filesoutput
hysplit
: disperseR output (one file for each emissions event)ziplink
: files containing ZIP code linkagesrdata
: RData files containing HyADS source-receptor matricesexp
: exposure per zipcode data graph
: graphs saved here as pdf when running functionsprocess
: temporary files that are created when the model is running and then deletedNote that not specifying the location
argument will lead to a default Desktop location for the project folder. You can specify another folder by typing: disperseR::create_dirs(location="/Users/username/projectlocation")
instead.
You should run this command before each disperseR session because the function defines global environment variables pointing to these directories. Don't worry, your previous data will not be overwritten.
disperseR::create_dirs()
The next step is to get the data necessary for the analysis. Let's start by loading data sets that are provided inside the disperseR
package.
get_data()
function.You can get most of the data required for the analysis by using the following function. This function will download the data necessary and for the data that is already attached with the package it will automatically assign it to variables in your R environment. If you want to load the data step by step check our vignette here. It also contains more information about the data and their sources.
The arguments start.year
, start.month
,end.year
, and end.month
are necessary to download the meteorology reanalysis files. They will be downloaded if they are not already in the meteo_dir
folder. The reanalysis met files are about 120 MB each.
If you, for example, you want to download files for Jan-March 2005, you just have to use the get_data()
function and set data = "all"
, start.year = "2005"
, start.month = "01"
, end.year = "2005"
, and end.month = "03"
. See below.
disperseR::get_data(data = "all", start.year = "2005", start.month = "11", end.year = "2006", end.month = "02")
The units data should be loaded separately so that you are able to select which units to process.
This package contains annual emissions and stack height data from EPA's Air Markets Program Data and the Energy Information Agency for years 2003-2012. Again, if you would like to know how these data were prepared please see the special vignette that we have attached to this package. Access it here
You can visualize the data like this in RStudio:
View(disperseR::units)
Now, we need to select the power plants whose impact we will analyse. In this case, we'll use the two units in 2005 with the greatest SOx emissions, and two with the greatest SOx emissions in 2006. You are free to choose your own units. You can see that unit "3136-1" is on top of the ranking in both years.
unitsrun2005 <- disperseR::units %>% dplyr::filter(year == 2005) %>% # only get data for 2005 dplyr::top_n(2, SOx) # sort and take the two rows with the biggest value for SOx unitsrun2006 <- disperseR::units %>% dplyr::filter(year == 2006) %>% # only get data for 2006 dplyr::top_n(2, SOx) # sort and take the two rows with the biggest value for SOx head(unitsrun2005) # append together and transform to data table unitsrun<-data.table::data.table(rbind(unitsrun2005, unitsrun2006))
HYSPLIT, as applied here, tracks air parcels emitted at certain times and locations. disperseR
refers to these as emissions events. Once HYSPLIT has been run for each emissions event, the simulated parcel locations are aggregated by source, time, and location. The functions below are written to enable runs of many emissions events.
To define an object that includes all emission events in a given time period, we can use the helper function define_inputs()
. This takes as inputs a starting and ending day, and outputs a table of values whose rows will later correspond to inputs into the main disperseR
functions. The following command combines the units defined above with four times a day for January-July in 2005 (we show an example for this short period of time here to keep the computation manageable for a desktop computer). Four daily emissions events are defined by the startday argument, and duration = 120
denotes that the emitted air parcels are tracked for 120 hours (5 days). Note that most application of HyADS to date have used duration = 240
(10 days); we use 5 days here to keep this example snappy.
The result of running this function is a data.table
that you can see below.
input_refs <- disperseR::define_inputs(units = unitsrun, startday = '2005-11-01', endday = '2006-02-28', start.hours = c(0, 6, 12, 18), duration = 120) head(input_refs)
The following examples show how to run a small subset (chosen to provide workable examples on a laptop) of emissions events, link to spatial domains---U.S. ZIP codes, U.S. counties, or a regular grid---and plot the results.
The run_disperser_parallel()
function runs HYSPLIT for each emissions event. Inputs defined above, including the projected ZCTA shape file (zcta
), the ZIP-ZCTA crosswalk (crosswalk
), the planetary boundary layer raster (pblheight
), are necessary in the function call.
If you get an error running HYSPLIT, it is likely because the meteorology files did not download correctly. Check your meteo_dir
to ensure all files are present and at least 100MB in size.
Below we subset the data to run the process on a local computer. We only choose the first day and first hour of each month for each of the units that we chose. You can choose the events you are interested in.
input_refs_subset <- input_refs[format(as.Date(input_refs$start_day, format = "%Y-%m-%d"), format = "%d") == "01" & start_hour == 0]
input.refs
should be the data table that is the result of the define_input()
function. Remember that here we are running the model just on a subset so we will set input.refs = input_refs_subset
The pbl.height
argument is NULL
by default but you should provide it for the function to run properly. We use the pblheight
data.
The package has the possibility to use two types of species. The default one is species = 'so2'
, but you can also use particulate sulfate species = 'so4p'
.
The parallel::detectCores()
function automatically detects the number of cores on your machine. If you would like to change on how many cores the model should be run please specify the number of cores. Here, for example we specified the number of cores to be the number of the cores on the machine. If you would like to run this model sequentially please set mc.cores = 1
.
It is possible that running the below code with output a warning "WARNING: map background file not found ../graphics/arlmap". It is safe to ignore it.
hysp_raw <- disperseR::run_disperser_parallel(input.refs = input_refs_subset, pbl.height = pblheight, species = 'so2', proc_dir = proc_dir, overwrite = FALSE, ## FALSE BY DEFAULT npart = 100, keep.hysplit.files = FALSE, ## FALSE BY DEFAULT mc.cores = parallel::detectCores())
Most current implementations of HyADS, instead of linking dispersion patterns from individual emissions events, link the patterns by month.
With the link_all_units()
function, users can link all air parcels to three spatial domains---U.S. ZIP codes, U.S. counties, or a regular grid---by month for specified units. Here, we define the variables yearmons
with combinations of years and months. link_all_units()
uses the support functions disperser_link_zips()
, disperser_link_counties()
, disperser_link_grids()
to read in raw HYSPLIT output files produced by the run_disperser_parallel()
and links the corresponding parcel locations to the appropriate spatial domain. The user selects the spatial domain through the link.to
argument---it can be set to one of "zips", "counties", or "grids".
link_all_units()
can be set to paralelize the runs by splitting different months on different cores, but you can make it serial just by setting the mc.cores = 1
. The result of link_all_units()
is a data.table of ZIP codes, counties, or grids and relative contributions N
, and an identical .csv
file is saved to the folder with a path specified by the ziplink_dir
variable. N
is not weighted by the unit
's emissions.
Below we present how to use link_all_units()
for chosen units. First, we create an argument (called yearmons
here) denoting the year and month combinations for which we would like to calculate unweighted exposure. Here we only choose July 2005 through June 2006 to cover all of the example runs specified above, and we create the variable using the get_yearmon()
function.
yearmons <- disperseR::get_yearmon(start.year = "2005", start.month = "07", end.year = "2006", end.month = "06")
Let's look again at our unitsrun
data set.
unitsrun
We only have three units in our unitsrun
data set. "3136-1", "3136-2" and "3149-1". We will run link_all_units()
for all of them (it is possible to pass link_all_units()
a single unit if desired). The function will return only the values for units and months that could have been linked to one another.
If you want to run the function for a specific start and end date you can specify the start.date
and the end.date
arguments. For example, start.date="2005-01-02"
for 2 January 2005. These arguments are set to NULL
by default and the function computes the start and the end dates using the year.mons
provided.
pbl.height = pblheight
by default but you can change it. The same applies to crosswalk. = crosswalk
(crosswalk.
is only required for linking to ZIP codes). counties.
is required for linking to counties, and is set as counties. = USAboundaries::us_counties( )
by default. If you'd only like to save time and only link to ZIP codes or counties in a subset of the country, provide crosswalk.
or counties.
that correspond to the domain of interest.
Changing duration.run.hours
to values less than that applied in run_disperser_parallel()
allows the exploration of air parcel locations in closer time proximity to each emissions event.
res.link
sets the grid resolution in meters (default is 12000 or 12km). Since parcel locations are first assigned to a grid before further spatial allocation, this setting has the potential to impact all results. A higher res.link
with link.to = "grids"
will result in faster run time and smaller output files.
linked_zips <- disperseR::link_all_units( units.run = unitsrun, link.to = 'zips', mc.cores = parallel::detectCores(), year.mons = yearmons, pbl.height = pblheight, crosswalk. = crosswalk, duration.run.hours = 20, res.link = 12000, overwrite = FALSE) # link all units to counties linked_counties <- disperseR::link_all_units( units.run=unitsrun, link.to = 'counties', mc.cores = parallel::detectCores(), year.mons = yearmons, pbl.height = pblheight, counties. = USAboundaries::us_counties( ), crosswalk. = NULL, duration.run.hours = 20, overwrite = FALSE) # link all units to grids linked_grids <- disperseR::link_all_units( units.run=unitsrun, link.to = 'grids', mc.cores = parallel::detectCores(), year.mons = yearmons, pbl.height = pblheight, crosswalk. = NULL, duration.run.hours = 20, overwrite = FALSE) head(linked_zips) head(linked_counties) head(linked_grids)
It could be the case that data could not have been linked for a specific month and unit. Below we show how to take a look at the combinations of units and months that were successfully linked. We can easily do it in the following way:
unique(linked_zips$comb)
At this point, it makes sense to take a look at the results to see which ZIP codes are impacted by emissions from the facility whose impacts we modeled above. We can see how the impact looks by looking at numbers and plots. We will first just look at the numbers. For this purpose, we created a function create_impact_table_single()
that generates a specific data.table.
And now we can use the create_impact_table_single()
function to see the computed impact for a single month and a single unit.
Keep in mind, since we only modeled a subset of the potential model runs defined in input_refs()
, the impacts contained in impact_table_single
and plotted below include only results from a single emissions event. Further, the plotted values are "HyADS raw exposure", i.e., not weighted by emissions. So, a value of 0.01 represents 10x more exposure than a value of 0.001, but only in relation to unit 3136-1.
We select the spatial link based on the link.to
argument. The function returns data.tables that include an sf
geometry column for easy plotting.
impact_table_zip_single <- disperseR::create_impact_table_single( data.linked=linked_zips, link.to = 'zips', data.units = unitsrun, zcta.dataset = zcta_dataset, map.unitID = "3136-1", map.month = "200511", metric = 'N') impact_table_county_single <- disperseR::create_impact_table_single( data.linked=linked_counties, link.to = 'counties', data.units = unitsrun, counties. = USAboundaries::us_counties( ), map.unitID = "3136-1", map.month = "200511", metric = 'N') impact_table_grid_single <- disperseR::create_impact_table_single( data.linked=linked_grids, link.to = 'grids', data.units = unitsrun, map.unitID = "3136-1", map.month = "200511", metric = 'N') head(impact_table_zip_single)
In a similar way we can just plot the results using the plot_impact_single()
function. A customized title can be set using the plot.name
argument. Additional inputs can be passed to ggplot2's theme
function to customize your plot.
zoom = T
by default and is authomatically adjusted depending on the plot values. You can set zoom = F
to display the map of the whole country. Note that the linked grids extend beyond the US boarders because any grid across the world may be impacted by the given facilities.
link_plot_zips <- disperseR::plot_impact_single( data.linked = linked_zips, link.to = 'zips', map.unitID = "3136-1", map.month = "20061", data.units = unitsrun, zcta.dataset = zcta_dataset, metric = 'N', graph.dir = graph_dir, zoom = T, # TRUE by default legend.name = 'HyADS raw exposure', # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) link_plot_grids <- disperseR::plot_impact_single( data.linked = linked_grids, link.to = 'grids', map.unitID = "3136-1", map.month = "20061", data.units = unitsrun, metric = 'N', graph.dir = graph_dir, zoom = F, # TRUE by default legend.name = 'HyADS raw exposure', # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) link_plot_counties <- disperseR::plot_impact_single( data.linked = linked_counties, link.to = 'counties', map.unitID = "3136-1", map.month = "20061", counties. = USAboundaries::us_counties( ), data.units = unitsrun, metric = 'N', graph.dir = graph_dir, zoom = T, # TRUE by default legend.name = 'HyADS raw exposure', # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) link_plot_zips link_plot_grids link_plot_counties
So far, we have run two worker functions important to the implementation of HyADS:
run_disperser_parallel()
ran HYSPLIT for unit-date-time emission event combinations and saved an output file for each gathered output files from run_disperser_parallel()
. link_all_units()
function grouped them by month and unit, and linked the HYSPLIT parcel locations contained in the files to various three different scales.In practice, these two worker functions may need to be implemented in parallel R sessions on a cluster to improve efficiency. It helps for future analyses to gather the relevant monthly files and save them as a single RData file. This is made possible by combine_monthly_links()
, which saves an RData file of annual monthly results to the rda_dir
. Names of the resulting objects give the year-month combinations in each object.
combined_ziplinks <- disperseR::combine_monthly_links( month_YYYYMMs = yearmons, link.to = 'zips', filename = 'hyads_vig_unwgted_zips.RData') combined_countylinks <- disperseR::combine_monthly_links( month_YYYYMMs = yearmons, link.to = 'counties', filename = 'hyads_vig_unwgted_counties.RData') combined_gridlinks <- disperseR::combine_monthly_links( month_YYYYMMs = yearmons, link.to = 'grids', filename = 'hyads_vig_unwgted_grids.RData') names(combined_ziplinks)
Up until now, this vignette has focused on producing unweighted HyADS results. Once the procedure above has been followed, it is fairly straightforward to weight the unweighted HyADS results by emissions (or other quantity) and aggregate the results by source, receptor, and/or time period.
The create_impact_table_weighted()
function takes as input the .RData file created by combine_monthly_links()
and monthly power plant emissions. The user can select the disperseR year (year.D
), the emissions year (year.E
), the weighting pollutant (although it does not necessarily have to be a pollutant), the source aggregation (you can choose source.agg = "total"
, source.agg = "facility"
, or source.agg = "unit"
), the time aggregation (time.agg = "month"
or time.agg = "year"
).
Here, we first calculate the annual HyADS emissions weighted exposure for all emissions modeled above and plot the results. The plots require a large amount of working memory, and may require you to clear your image history (if using RStudio).
exp_ann_unit_zip <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'zips', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_zips.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T) exp_ann_unit_grids <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'grids', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_grids.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T) exp_ann_unit_counties <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'counties', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_counties.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T)
We can plot the results as following using the plot_impact_weighted()
function. Again, a customized title can be set using the plot.name
argument, and addional arguments can be passed to ggplot2's theme
function to customize.
Same as before, zoom = T
by default and is authomatically adjusted depending on the plot values. You can set zoom = F
to display the map of the whole country.
Please note that the produced graph will be automatically saved in the folder specified in the graph.dir
argument. Here we set graph.dir = graph_dir
. graph_dir
should have been automatically created by the create_dirs()
function.
zip_exp_ann_plot <- disperseR::plot_impact_weighted( data.linked = exp_ann_unit_zip, data.units = unitsrun, link.to = 'zips', zcta.dataset = zcta_dataset, time.agg = 'year', metric = 'hyads', legend.name = 'Aggregate HyADS exposure', zoom = T, # TRUE by default graph.dir = graph_dir, map.month = NULL, # NULL by default change if time.agg = 'month' # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) # 0 by default counties_exp_ann_plot <- disperseR::plot_impact_weighted( data.linked = exp_ann_unit_counties, data.units = unitsrun, link.to = 'counties', counties. = USAboundaries::us_counties( ), time.agg = 'year', metric = 'hyads', legend.name = 'Aggregate HyADS exposure', zoom = T, # TRUE by default graph.dir = graph_dir, map.month = NULL, # NULL by default change if time.agg = 'month' # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) # 0 by default grids_exp_ann_plot <- disperseR::plot_impact_weighted( data.linked = exp_ann_unit_grids, data.units = unitsrun, link.to = 'grids', time.agg = 'year', metric = 'hyads', legend.name = 'Aggregate HyADS exposure', zoom = T, # TRUE by default graph.dir = graph_dir, map.month = NULL, # NULL by default change if time.agg = 'month' # other parameters passed to ggplot2::theme() axis.text = element_blank(), legend.position = c( .75, .15)) # 0 by default zip_exp_ann_plot counties_exp_ann_plot grids_exp_ann_plot
Let's now see how to plot a monthly exposure.
Results are saved to the exp_dir
and returned, but return.monthly.data
must be set to true to return data when time.agg = 'month'
(the default was set to save working memory space).
It is very important to provide a month for which we have available data
exp_mon_unit_zip <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'zips', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_zips.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T) exp_mon_unit_grids <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'grids', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_grids.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T) exp_mon_unit_counties <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'counties', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_counties.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'month', return.monthly.data = T)
zip_exp_mon_plot <- disperseR::plot_impact_weighted( data.linked = exp_mon_unit_zip, data.units = unitsrun, zcta.dataset = zcta_dataset, time.agg = 'month', map.month = "200511", metric = 'hyads', legend.name = 'Montly HyADS exposure', zoom = T, # TRUE by default graph.dir = graph_dir) zip_exp_mon_plot
Next, we calculate the monthly emissions weighted HyADS exposure (again, only from the select few runs we initiated above). The code below plots the monthly exposure contributions of the three units in ZIP codes 17815, 08210, 13849, 17229. The location of the zip codes can be seen on the small map on the right side of the plot. Please note that the axis values are different for different ZIP codes.
zip_exp_unit_mon2005 <- disperseR::calculate_exposure( rda_file = file.path(rdata_dir, "hyads_vig_unwgted_zips.RData"), units.mo = PP.units.monthly1995_2017, link.to = 'zips', year.E = 2005, year.D = 2005, pollutant = 'SO2.tons', source.agg = 'unit', # note! time.agg = 'month', exp_dir = exp_dir, return.monthly.data = T) zip_exp_unit_mon2006 <- disperseR::calculate_exposure( rda_file = file.path(rdata_dir, "hyads_vig_unwgted_zips.RData"), units.mo = PP.units.monthly1995_2017, link.to = 'zips', year.E = 2006, year.D = 2006, pollutant = 'SO2.tons', source.agg = 'unit', # note! time.agg = 'month', exp_dir = exp_dir, return.monthly.data = T) zip_exp_unit_mon <- rbind(zip_exp_unit_mon2005, zip_exp_unit_mon2006)
And now we can plot the results
zipcodes <- c("13039","21798", "03804") zip_exp_unit <- disperseR::plot_impact_unit( data.linked = zip_exp_unit_mon, zip.codes = zipcodes, graph.dir = graph_dir)
In the examples provided above, we selected coal units a priori to analyze. However, a benefit of HyADS over more complex models is the ability to easily rank-order emissions sources to identify the sources that contribute the most to exposure in a given area.
The following code snippet applies the calc_exposure()
function to calculate annual impacts by unit (here, only the two units selected above), then rank-orders their population-weighted exposures on Pennsylvania using the rankfacs_by_popwgt_location()
function.
zip_exp_ann_unit <- disperseR::calculate_exposure( year.E = 2005, year.D = 2005, link.to = 'zips', pollutant = 'SO2.tons', rda_file = file.path(rdata_dir, "hyads_vig_unwgted_zips.RData"), exp_dir = exp_dir, units.mo = PP.units.monthly1995_2017, source.agg = 'unit', time.agg = 'year') # add column year for proper merge zip_exp_ann_unit[, year := 2005]
In this function, the input ZIP-code linkage dataset is defined by zip_exp_ann_unit
; the rank.by
argument defines the variables in zip_exp_ann_unit
to rank by—it is possible to define multiple values (e.g., rank.by = c( 'hyads', 'SOx')) to rank by multiple ranking variables if they are available; finally, state.value
, city.value
, and zip.value
can be defined to limit the area of influence (here, we only rank these units’ influences on Pennsylvania with state.value = 'PA'). For now, this function is only available for ZIP codes.
unitRanks2005 <- disperseR::rankfacs_by_popwgt_location( data.linked = zip_exp_ann_unit, crosswalk. = crosswalk, rank.by = c('hyads'), state.value = 'PA', year = 2005) unitRanks2005
plotUnitsRanked()
works very similar to other plotting function in disperseR
. Choose the dataset of ranked facilities as well as year.
The function below results in two graphs. One is a map showing where the units are and showing their rank and the other one contains bar plots.
plotUnitsRanked <- disperseR::plot_units_ranked( data.units = unitsrun, data.ranked = unitRanks2005, year = 2005, graph.dir = graph_dir) plotUnitsRanked
Please note graphical output is authomatically saved to the graph_dir
by the plotting functions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.