```{r setup}
knitr::opts_chunk$set(echo = TRUE)

# globals
num_species <- 41
num_sources <- 20
sources <- c("AG", "AIRC", "BIOG", "COAL", "DUST", "FIRE", "FOIL", "MEAT",
             "METAL", "NG", "NRDIE", "NRGAS", "NROTH", "ORDIE", "ORGAS",
             "OT", "OTHCMB", "SOLVENT", "SSALT", "WOOD")

# load data for examples
library(readr)
library(magrittr)
```
This document illustrates how CMAQ-DDM values are pre-processed, used in hybrid source apportionment, and then post-processed for subsequent analysis and interpretation of results.
The order in which this overview covers the processing functionality is:

1. `convert_mat.R`
2. `data_prep.R`, `reorder_rds_files.R`

In particular, this document elaborates on what is performed in the first two 
parts of `new_rj_script.R`, the script used to produce Rj values for a single year. 
For an explanation of how the Rj values are calculated, and the post-processing 
of those values, see a subsequent vignette. 
It is helpful to keep in mind that the following structure for 
processing and manipulating the data was developed to facilitate computations with 
limited access to RAM. That is, with the following manipulations, all of the 
processing for a single year's data can be performed on a laptop with $8$ GB of RAM. This is not trivial: since any data 
loaded into R uses up RAM, there is a limit on how much data can 
be used at any one time. We also want to minimize how often different files must be read in 
during computations, and to use as much of the data loaded into working memory as possible. 
We begin with the matlab files.
## .mat files

The CMAQ-DDM data is originally stored in .mat files. This is because, although 
the raw DDM output is in netCDF format, some post-processing of those raw 
results is conducted with matlab.
Since there are `r num_species` chemical species being used, that is how 
many matlab files there initially are, one per species. Each matlab file consists of a list of 
sensitivity values, averaged by day. Each list contains `r num_sources + 1` elements, 
and each element of the list is a three-dimensional array. The dimensions of the 
array correspond to the x- and y-coordinates and time: the coordinates 
identify the cell's location in the grid, and each element of the time dimension 
corresponds to a different day.
Ultimately, each .mat file is converted into a list of arrays, stored as an .rds 
file. Using this native (to R) file format significantly reduces the time needed to 
read in the data from a single file, from about five minutes to around ten seconds. 
In turn, this simplifies subsequent computations with the data.
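As a rough sketch of this conversion step (the file name below is hypothetical, and the actual logic lives in `convert_mat.R`), reading a .mat file with the `R.matlab` package and re-saving it as .rds might look like:

```{r, eval = FALSE}
# Minimal sketch of the .mat -> .rds conversion; "sens_Ag.mat" is a
# hypothetical file name, and convert_mat.R may differ in the details.
library(R.matlab)

mat_contents <- readMat("sens_Ag.mat")
str(mat_contents, max.level = 1)  # a list of three-dimensional arrays

# write_rds() is from readr, loaded above; reading the .rds back in
# takes seconds rather than minutes.
write_rds(mat_contents, "sens_Ag.rds")
```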
No consistent naming scheme has (as of this writing) been adopted among the 
team members generating the DDM results. So, to at least have consistency in the 
R portion of the analysis, the names of the list elements (i.e., sources) are 
always a subset of the following: 

```{r}
sources
```

If the names in the .mat files are originally otherwise, they are converted to the 
corresponding source names above.
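A hypothetical version of this renaming step, using a made-up lookup vector (the actual name mapping is part of `convert_mat.R` and is not reproduced here):

```{r}
# Invented example entries; the real mapping depends on the names the
# DDM team used in a given year's .mat files.
name_map <- c("AGR" = "AG", "AIRCRAFT" = "AIRC")

# Replace any recognized non-canonical names with their canonical versions.
rename_sources <- function(x, map = name_map) {
  hit <- names(x) %in% names(map)
  names(x)[hit] <- unname(map[names(x)[hit]])
  x
}
```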
Along with renaming the sources, the list elements of the .mat files should 
be put in the same order (as of this writing, this is not yet implemented). This limits 
the number of changes that need to be made later on in the hybridSA processing to 
accommodate different years' data. For example, the order of the uncertainties 
$\sigma_{\ln(R_j)}$ for the Rj values wouldn't need to be changed in the hybrid 
optimization. A sketch of such a reordering step is given below.
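One possible implementation, assuming the canonical `sources` vector defined at the top of this document:

```{r}
# Reorder a list of source arrays into the canonical order, keeping only
# the sources actually present (e.g., only sixteen of them from 2007 on).
reorder_sources <- function(x, canonical = sources) {
  x[canonical[canonical %in% names(x)]]
}
```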
The way the sensitivity values are arranged in the .mat files is not conducive 
to generating hybridSA results efficiently with R. Also, given memory 
and time constraints, ideally all the data needed to generate Rj values for 
any single day could be contained in one file. 

Thus, the files are converted so that they are grouped temporally rather than by species (i.e., rather than one file of source sensitivities per species). That is, the data is re-arranged so that, in the end, thirty-seven "aggregate interval" files are generated, one file for each ten-day period in the year (except for the last, which is shorter). So, for example, the first aggregate interval file contains all the CMAQ data for days $1-10$ of the year, the second for days $11-20$, and so on.
Each aggregate interval file contains a list, where each element corresponds to a single day.
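The bookkeeping from day of year to interval file is simple arithmetic; a small helper (not part of the actual scripts, just for illustration) makes the grouping explicit:

```{r}
# Day d of the year belongs to aggregate interval ceiling(d / 10), so days
# 1-10 map to interval 1, ..., and days 361-365 map to interval 37.
intv_index <- function(day) ceiling(day / 10)
intv_index(c(1, 10, 11, 361, 365))
```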
```{r}
aa <- read_rds("/Volumes/My Passport for Mac/gpfs:pace1:/data_2005/Agg_intv_files/Intv_1_agg.rds")
names(aa)
```
In turn, the list element for a day is itself a list of forty-one elements (one for each species).
```{r}
aa[[1]] %>% str
```
And each array in that day-list has dimensions corresponding to: x-coordinate, y-coordinate, and source. (The third dimension here is one greater than the number of sources, because its first slice contains the simulated concentrations for that grid cell.)
Up to this stage, the original CMAQ-DDM results have only been re-arranged. Now, we want to take those values and organize them into a format that can readily be used for hybrid source apportionment.
The hybrid optimization requires two kinds of data as inputs: concentration data, and CMAQ sensitivities. In the following, we describe how to obtain each of these two inputs.
As for the concentration data, it consists of two types: observed values and simulated values. The observed values are obtained directly from the EPA website, while the simulated values are contained in the CMAQ-DDM output. So, to combine all the concentration data together, it is necessary to extract the simulated values from the aggregate interval files.
Since the simulated concentrations are contained in the first index of the "source" dimension 
of the array, for day 1 (shown above) we can obtain just the simulated 
concentrations as follows:
```{r}
library(plyr)

day_sim_values <- aa[[1]] %>% llply(function(x) x[, , 1])
str(day_sim_values)
```
Since we only took one "slice" from the third dimension, the resulting arrays 
are each two-dimensional. The interpretation of `day_sim_values` would then be: 
for the first day of the year, the simulated concentrations for Ag (silver) over the entire 
spatial domain are given by the first array, those for aluminum by the second array, and so on.
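For instance, the simulated concentration of silver on day 1 at grid cell $(10, 20)$ (an arbitrary cell, chosen here purely for illustration) would be:

```{r}
day_sim_values[["Ag"]][10, 20]
```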
There is already a script which downloads and combines all of the observed 
concentrations over the entire spatial and temporal domain. So, all that is needed after 
extracting the simulated values is to combine them with the previously-generated 
observed values. And that is exactly what the `aggregate_concentrations` function does.
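Conceptually, the combination amounts to looking up, for each observation, the simulated value at the matching grid cell. The sketch below is an illustration only: the toy data frame and its column names are invented, not the actual format used by `aggregate_concentrations`.

```{r, eval = FALSE}
library(dplyr)

# Toy observed data; the real values come from the EPA download script.
obs <- tibble::tibble(
  species  = c("Ag", "Al"),
  grid_x   = c(10, 25),
  grid_y   = c(20, 30),
  obs_conc = c(0.8, 1.2)
)

# Pair each observed value with the simulated value at its grid cell.
obs %>%
  rowwise() %>%
  mutate(sim_conc = day_sim_values[[species]][grid_x, grid_y]) %>%
  ungroup()
```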
While each day's simulated concentrations are contained in the first slice of the arrays, the sensitivity values of the species to the sources are contained in the rest. That is,
```{r}
day_sens_values <- aa[[1]] %>% llply(function(x) x[, , -1])
str(day_sens_values)
```
The interpretation of this is as follows: for the first day of the year, in list 
element Ag, the value at element $(x, y, z)$ is the sensitivity of silver to 
the $z$th source at grid cell $(x, y)$. Note that for $2005$, `r num_sources` sources are 
used for the hybrid optimization. From $2007$ onwards, though, only sixteen sources 
are used. 
From here, the sensitivities for each day of the year are extracted and combined 
into a single file. This is performed by the `gen_sens_mats` function, so named 
because it generates the sensitivity matrices for each grid cell location. 
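For a single grid cell, such a sensitivity matrix stacks one row per species and one column per source. A sketch of what `gen_sens_mats` assembles, again using the arbitrary cell $(10, 20)$:

```{r}
# One row per species, one column per source, for a single grid cell.
sens_mat <- day_sens_values %>% laply(function(x) x[10, 20, ])
dim(sens_mat)  # num_species x num_sources
```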
Up to this stage, we have (hopefully) shown how the original matlab files are converted 
to R objects, which are then re-arranged and manipulated into a format amenable 
to analysis with the hybridSA package. For details on how the Rj values 
are calculated, please see the next vignette in the series. 