The Purpose of the Configuration File

Whereas most R packages pass a large number of optional parameters to their functions, airpred uses a series of configuration files to specify paths and parameters, as well as the overall workflow.

The configuration files used in conjunction with the airpred package serve two primary purposes. The first is to provide a single location where most of the parameters can be specified, rather than using optional parameters in the called functions. The second is to serve as documentation of past runs of airpred, recording the parameters and data used for each. The configuration file paradigm also allows a standard directory structure to be used, so that an implementation of a model using airpred can be easily transported between computer systems as a zipped directory.

Generating the Configuration File

library(airpred)
gen_config()

The default file generated here is the following:

cat(readLines("config.yml"), sep = "\n")

There are several ways to edit the default values. The simplest is to edit the fields by hand. Configuration files can also be customized using R's list objects: create a list and assign the desired value to each field you would like to change. For example, to adjust the path to the uncleaned data stored as a csv, as well as the paths to the files generated during the cleaning process, one would do the following:

config_list <- list()
config_list$csv_path <- "input_file/data.csv"
config_list$imputation_models <- "imputation_models"
config_list$mid_process_data <- "mid_process_data"

gen_config(in_list = config_list)

The config file generated by this code now looks like this:

cat(readLines("config.yml"), sep = "\n")

Note that the fields not specified in the list retain their default values. The field names in the input list must also match the field names used in the configuration file exactly. Generating configuration files from lists has the further benefit that multiple configuration files can be generated procedurally in a loop, potentially allowing multiple datasets to be processed with minimal human interaction after setup.
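
The loop-based approach could be sketched as follows. This is only an illustration under stated assumptions: the dataset names and the renaming scheme are hypothetical, and the sketch relies on gen_config(in_list = ...) writing config.yml to the working directory, as in the examples above.

```r
# Hypothetical dataset names; replace these with your own.
datasets <- c("dataset_a", "dataset_b")

# Build one list of overrides per dataset. The field names must match
# the field names used in the configuration file.
config_lists <- lapply(datasets, function(name) {
  list(
    csv_path          = file.path(name, "data.csv"),
    imputation_models = file.path(name, "imputation_models"),
    mid_process_data  = file.path(name, "mid_process_data")
  )
})
names(config_lists) <- datasets

# Generate one configuration file per dataset. Each call overwrites
# config.yml in the working directory, so the file is renamed afterwards
# (the config_<name>.yml naming is an assumption made for this sketch).
if (requireNamespace("airpred", quietly = TRUE)) {
  for (name in datasets) {
    airpred::gen_config(in_list = config_lists[[name]])
    file.rename("config.yml", paste0("config_", name, ".yml"))
  }
}
```

Keeping the overrides in a named list makes it easy to inspect what each run will use before any file is written.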

The Fields in the Configuration File

The following are the items contained in the config file. All of them must be present in order for the model to run successfully.

monitor: The pollution type the data will be trained on

data_location: The directory holding the required data files

input_file_type: The extension of the files holding the data matrices

data_save_location: The directory processed data files should be saved in

use_default_vars: A boolean. Should the default list of files and its file structure be used when reading the .mat files?

add_custom_vars: A boolean. Should a custom list of .mat files be looked for? If this is TRUE and use_default_vars is FALSE, then only the custom variables will be used.

custom_var_list: The location of the .yml file specifying the file structure of the custom variable files.

train: A boolean. If TRUE, the model run is a training run. If FALSE, the run is used to create predictions.

finalday: The date of the last day covered by the data set

csv_path: The path where the assembled data is stored as a csv

rds_path: The path where the assembled data is stored as an rds file

imputation_models: The path where the imputation models should be saved.

mid_process_data: The path where data should be saved between imputation, normalization and transformation steps

training_models: A list of the models to be used in training and in the ensemble model.

monitor_list: The location of the file containing the coordinates of the monitors

training_data: The file containing the transformed and imputed data to be used for training. Currently must be an RDS file.

training_output: The directory to be used for storing the output of the training models

predict_data: The input data for a given round of prediction

predict_mid_process: The directory that holds all saved files generated in the prediction process.

predict_output: The directory that holds the generated predictions
clean_up_config()
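
Because every field listed above must be present for the model to run, it can be useful to check a hand-edited file before starting a run. The following is a minimal sketch of such a check, not part of airpred: the helper missing_config_fields is hypothetical, it assumes each field appears as a top-level "name:" line in the file, and it does a simple text match rather than a full YAML parse. The values in the demonstration file are made up.

```r
# The field names listed in this section.
required_fields <- c(
  "monitor", "data_location", "input_file_type", "data_save_location",
  "use_default_vars", "add_custom_vars", "custom_var_list", "train",
  "finalday", "csv_path", "rds_path", "imputation_models",
  "mid_process_data", "training_models", "monitor_list", "training_data",
  "training_output", "predict_data", "predict_mid_process", "predict_output"
)

# Hypothetical helper: return the required fields that do not appear as
# top-level keys in the file at `path`.
missing_config_fields <- function(path, required = required_fields) {
  lines <- readLines(path)
  # Keep only lines that start with a key name followed by a colon.
  top_level <- grep("^[A-Za-z_][A-Za-z0-9_]*:", lines, value = TRUE)
  present <- sub(":.*$", "", top_level)
  setdiff(required, present)
}

# Demonstration on a deliberately incomplete temporary file.
demo_path <- tempfile(fileext = ".yml")
writeLines(c("monitor: example_value", "train: yes"), demo_path)
missing_fields <- missing_config_fields(demo_path)
print(missing_fields)
```

An empty result means that every listed field was found; the check says nothing about whether the values themselves are valid.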


NSAPH/airpred documentation built on May 7, 2020, 10:49 a.m.