NSAPH/airpred: A Framework For Building Air Pollution Estimation Models

knitr::opts_chunk$set(echo = TRUE)

Workflow Summary

The training step of airpred implements the ensemble training model used by Di et al. to predict values of PM$_{2.5}$ and other pollutants for use in analyzing health data. This is a two-step process.

Preparing the Config file

The key fields in the configuration file required for training the model are the following:

training_models: A list of the models to be used in training and used for the ensemble model.

monitor_list: The location of the file containing the coordinates of the monitors.

training_data: The file containing the transformed and imputed data to be used for training. Currently must be an RDS file.

training_output: The directory to be used for storing the output of the training models.

We'll create a temporary directory to hold the output for this vignette:

mkdir temp_training_output

The only requirement for the folder used to store training output is that it exists. A temporary one is used here to simplify vignette processing, but that is not standard practice.
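If you would rather stay inside R than drop to the shell, the same folder can be created with base R. This is a minimal sketch using only `dir.create()` and `dir.exists()`; the folder name matches the one used in this vignette.

```r
# Create the output folder from within R; harmless if it already exists.
out_dir <- "temp_training_output"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Fail early if the folder could not be created.
stopifnot(dir.exists(out_dir))
```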

We'll want to ensure that the configuration file is properly set, with all necessary paths specified and models selected.

library(airpred)
train_list <- list() # Create list to allow configuration file to be edited by this routine

## Point to clean data stored as RDS file distributed with package
train_list$training_data <- file.path(path.package("airpred"), "airpred_clean.RDS")

## List of US PM2.5 monitor locations included with package
train_list$monitor_list <- file.path(path.package("airpred"), "pm25_na_eqd_conic.csv")

## Tell package to store generated files in the folder we just created.
train_list$training_output <- "temp_training_output"

## Specify models to be used, in this case we'll run a random forest and a gradient boost model
train_list$training_models <- c("forest", "gradboost")

gen_config(in_list = train_list)

Our config file now looks like this:

display_config()
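Since training depends on every configured path being valid, it can help to check them before starting. The helper below is a hypothetical sketch, not part of airpred: `check_training_paths()` is a name invented for this example, and it assumes only that the configuration is available as a list with the three path fields described above.

```r
# Hypothetical helper (not part of airpred): report which configured paths exist.
check_training_paths <- function(cfg) {
  c(
    training_data   = file.exists(cfg$training_data),
    monitor_list    = file.exists(cfg$monitor_list),
    training_output = dir.exists(cfg$training_output)
  )
}

# Demonstration with throwaway files so the snippet runs on its own.
demo_cfg <- list(
  training_data   = tempfile(fileext = ".RDS"),
  monitor_list    = tempfile(fileext = ".csv"),
  training_output = tempdir()
)
saveRDS(data.frame(x = 1), demo_cfg$training_data)

# monitor_list is deliberately never created, to show a failed check.
path_status <- check_training_paths(demo_cfg)
print(path_status)
```

In this vignette the same check could be run as `check_training_paths(train_list)` before training starts.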

Edit Model Parameters

Additional configuration files are also used to adjust the parameters for the models used in the ensemble model. These parameters can be adjusted through the same methods used to adjust the main configuration file. See the vignette on using the configuration files for more information. If you wish to use the default parameters and don't want to worry about generating your own files, the training step will generate default files if no model configuration files are detected.

To generate the model configuration files, call edit_params(). There is currently no way to adjust the parameters from the console when the files are generated using this function; however, they can still be edited by hand.

edit_params()
display_config("forest_params.yml")
display_config("gradboost_params.yml")

Let's assume we want to adjust the number of trees in the random forest from 5 to 20. To do this we can either edit the file or execute the following:

param_list <- list()
param_list$ntrees <- 20
gen_model_config("forest", in_list = param_list)
display_config("forest_params.yml")

Kicking Off the Process

Once the setup is complete, the process can be kicked off by simply calling train().

train()

Despite the single line of code, a fairly complex process is kicked off here. First, an h2o cluster is initialized on the computer running the package. Then the data specified in the configuration file are loaded into memory and passed to the h2o cluster. Next, the models specified in the configuration file are run. The outputs of each of those models are fed into a GAM ensemble model. The outputs of this ensemble are spatially weighted to generate vectors of nearby values for each location, and then the process is repeated with those nearby values included.
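The two-stage flow described above can be illustrated with a toy example in base R. This is only a sketch of the idea and not airpred's implementation: plain `lm()` fits stand in for the h2o models and for the GAM ensemble, the data are simulated, and the inverse-distance weighting in `nearby_average()` is an assumed weighting scheme chosen for illustration. All function names here are hypothetical.

```r
set.seed(1)

# Toy data: 200 locations with coordinates, two predictors, and a response.
n   <- 200
dat <- data.frame(x = runif(n), y = runif(n), v1 = rnorm(n), v2 = rnorm(n))
dat$pm <- 10 + 2 * dat$v1 - dat$v2 + 3 * sin(4 * dat$x) + rnorm(n, sd = 0.5)

# Stand-ins for the base learners (airpred runs h2o models at this step).
fit_base <- function(d, extra = NULL) {
  list(
    m1 = lm(reformulate(c("v1", "v2", extra), response = "pm"), data = d),
    m2 = lm(reformulate(c("x", "y", "I(x^2)", "I(y^2)", extra),
                        response = "pm"), data = d)
  )
}

# Stand-in for the GAM ensemble: a linear blend of base-learner predictions.
fit_ensemble <- function(models, d) {
  ens_data <- data.frame(pm = d$pm,
                         p1 = predict(models$m1, d),
                         p2 = predict(models$m2, d))
  list(data = ens_data, model = lm(pm ~ p1 + p2, data = ens_data))
}

# Assumed weighting: inverse squared distance, excluding the location itself.
nearby_average <- function(values, coords) {
  d <- as.matrix(dist(coords))
  w <- 1 / d^2
  diag(w) <- 0
  as.vector(w %*% values / rowSums(w))
}

# Stage 1: base learners, then ensemble.
stage1 <- fit_base(dat)
ens1   <- fit_ensemble(stage1, dat)

# Spatially weight the stage 1 ensemble output to get a "nearby" term.
dat$nearby <- nearby_average(fitted(ens1$model), dat[, c("x", "y")])

# Stage 2: repeat the process with the nearby term added as a predictor.
stage2 <- fit_base(dat, extra = "nearby")
ens2   <- fit_ensemble(stage2, dat)

rmse <- function(m) sqrt(mean(residuals(m)^2))
print(c(stage1 = rmse(ens1$model), stage2 = rmse(ens2$model)))
```

The point of the sketch is the shape of the pipeline: the ensemble output from the first pass becomes an extra input to the second pass, which is why two sets of models and two ensemble inputs are written out during training.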

Files Produced

During each step of the training process, files containing each model are written to the output directory. These can then be loaded and used during the prediction stage. The h2o models are stored with complex names generated automatically by the cluster, so we create a separate directory with a regular name to house each model. The files created are as follows:

list.files("temp_training_output", recursive = TRUE)

Ensemble_data_1 and 2 are the inputs to the ensemble models from the first and second stages of the overall training process. initial_ensemble is the ensemble model from the first stage. initial_forest contains the h2o random forest model from the first stage of model training, and initial_gradboost is the equivalent for the gradient boosting model. nearby_data is the original dataset with the spatially weighted values added in. nearby_ensemble is the second-stage ensemble model, and the nearby_forest and nearby_gradboost directories contain the second-stage h2o models.

All of these file names are hardcoded in the package, which allows the prediction stage to be carried out with the same configuration file that was used in the training stage.
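Because the names are fixed, a quick check that training produced everything can be sketched in base R. The helper `missing_outputs()` is hypothetical and not part of airpred. The name stems come from the description above, but their exact spelling, capitalization, and file extensions on disk are assumptions, so the match is case-insensitive and by substring.

```r
# Name stems taken from the description above; exact on-disk names are assumed.
expected <- c("ensemble_data_1", "ensemble_data_2", "initial_ensemble",
              "initial_forest", "initial_gradboost", "nearby_data",
              "nearby_ensemble", "nearby_forest", "nearby_gradboost")

# Hypothetical helper: return the expected stems with no matching file or folder.
missing_outputs <- function(dir, expected) {
  found <- tolower(list.files(dir, recursive = TRUE, include.dirs = TRUE))
  present <- vapply(expected,
                    function(p) any(grepl(p, found, fixed = TRUE)),
                    logical(1))
  names(present)[!present]
}

# Demonstration on a throwaway folder holding only two of the outputs.
demo_dir <- file.path(tempdir(), "demo_training_output")
dir.create(file.path(demo_dir, "initial_forest"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "nearby_data.RDS")))

still_missing <- missing_outputs(demo_dir, expected)
print(still_missing)
```

For the folder used in this vignette, the call would be `missing_outputs("temp_training_output", expected)`.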