knitr::opts_chunk$set(echo = TRUE)
library(airpred)

Data Preparation

There is a three-part process to get the assembled data sets ready for use in the training process. The first step is the transformation step, which is intended to give each variable in the dataset a less dispersed distribution. Second is the normalization step, which rescales the values to lie on a scale of 0 to 1. Finally, an imputation step is carried out on the subset of variables with a large number of missing values.

Notation

The following notation will be used throughout this vignette. $x_i$ represents a given value of the particular variable being transformed. $\bar{x}$ represents the mean of a given variable. $x_{min}$ represents the minimum value of a given variable. $x_{max}$ represents the maximum value of a given variable. $x_{pN}$ represents the Nth percentile of a given variable (for example $x_{p80}$ represents the 80th percentile). $x_{min-}$ represents the value produced by the expression $x_{min} - (x_{p1} - x_{min})$. $x_{max+}$ represents the value generated by the expression $x_{max} + (x_{max} - x_{p99})$.
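The derived quantities $x_{min-}$ and $x_{max+}$ can be computed directly from a variable's percentiles. The following base R snippet is a hypothetical illustration of the notation (variable names are ours, not airpred's), using the default `quantile()` percentile estimator:

```r
# Illustration of the notation above (hypothetical, not airpred code).
x <- c(1, 2, 3, 4, 100)                 # a variable with an outlier
x_min <- min(x)
x_max <- max(x)
x_p1  <- quantile(x, 0.01, names = FALSE)
x_p99 <- quantile(x, 0.99, names = FALSE)
x_min_minus <- x_min - (x_p1 - x_min)   # x_{min-}
x_max_plus  <- x_max + (x_max - x_p99)  # x_{max+}
```

Note that `x_min_minus` lies strictly below the observed minimum and `x_max_plus` strictly above the observed maximum whenever the 1st and 99th percentiles differ from the extremes.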

Transformation

The transformation step uses a formula developed by Qian Di to reduce the impact of outliers on the values generated during the normalization process. Each variable is transformed as follows:

A scaling factor $k$ is defined in the following way:

$$ k = \frac{x_{p80} - x_{p20}}{2\,\text{arctanh}(0.80)} $$

Then, the following calculation is made:

$$ x_{out} = \bar{x} + k\text{ arctanh}\bigg(2\frac{x_i - 0.5(x_{min-} + x_{max+})}{x_{max+} - x_{min-}}\bigg) $$
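The two formulas above can be sketched as a single base R function. This is a minimal illustration of the math, assuming the default `quantile()` percentile estimator; the function name is hypothetical and not part of the airpred API:

```r
# Sketch of the transformation step (hypothetical, not airpred's internals).
transform_variable <- function(x) {
  p  <- quantile(x, c(0.01, 0.20, 0.80, 0.99), names = FALSE)
  lo <- min(x) - (p[1] - min(x))            # x_{min-}
  hi <- max(x) + (max(x) - p[4])            # x_{max+}
  k  <- (p[3] - p[2]) / (2 * atanh(0.80))   # scaling factor k
  # x_out = mean(x) + k * arctanh(2 * (x - midpoint) / range)
  mean(x) + k * atanh(2 * (x - 0.5 * (lo + hi)) / (hi - lo))
}
```

Because `lo` and `hi` sit strictly outside the observed range, the argument of `atanh()` stays inside $(-1, 1)$ and the output is finite for every observation.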

$x_{max+}$ and $x_{min-}$ are used instead of $x_{max}$ and $x_{min}$ because with the true minimum and maximum the argument of arctanh reaches $\pm 1$, producing undefined values and preventing further analysis. Additionally, since the objective of this transformation is to change the shape of the distribution rather than to preserve exact values, we do not believe that using bounds that differ from the true minimum and maximum meaningfully affects the output.

Normalization

The output of the transformation process is used as input for the normalization process. All of the notation here refers to the values generated by the transformation. The normalization step rescales each variable to lie on a scale from 0 to 1, where the minimum value maps to 0 and the maximum to 1. The following formula is used for this rescaling:

$$ x_{out} = \frac{x_i - x_{min}}{x_{max} - x_{min}} $$
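This min-max rescaling is a one-liner in base R; the following sketch (with a hypothetical function name, not airpred's internal code) shows the formula applied to a whole vector:

```r
# Min-max rescaling to [0, 1] (hypothetical sketch, not airpred's code).
normalize_variable <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
```

Note that this assumes `max(x) > min(x)`; a constant variable would produce division by zero.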

Imputation

The imputation step is the final step of data preparation. The list of variables imputed by this method can be generated by calling the following function:

list_imputed_variables()

Additional Notation

$m^{k}_i$ represents a binary valued variable that is 1 if the ith value of the kth variable is missing, and is 0 otherwise.

The Algorithm

The imputation is a two-step process, performed for each variable. First, the probability of a given variable being missing is predicted using a GLM logit model. The right-hand-side variables in this model are the same for every imputed variable and are chosen because they have no missing values.

The following are the variables currently used as inputs to the logit model:

print_logit_inputs()

Ultimately, a predicted probability of missingness ($\hat{m}_i$) is generated by this model. From this, a vector of weights $\mathbf{w}$ is generated using the following formula:

$$ w_i = \frac{1}{1 - \hat{m}_i} $$

Second, a linear mixed model is fit for the variable in question, using the weights generated by the logit model. The outputs of this model are used to fill in the missing values in the variable being imputed. The fitted models are saved so that the same process can be applied during the prediction process.
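The two steps above can be sketched for a single variable using base R. This is an illustration only: it substitutes a weighted `lm()` for the linear mixed model to stay self-contained, and the data frame and column names (`pm25`, `temperature`, `humidity`) are hypothetical, not airpred's:

```r
# Hypothetical two-step imputation sketch (weighted lm() stands in for the
# mixed model; names are illustrative, not airpred's).
set.seed(1)
df <- data.frame(temperature = rnorm(200), humidity = runif(200))
df$pm25 <- 0.5 * df$temperature + rnorm(200)
df$pm25[sample(200, 40)] <- NA              # introduce missingness

# Step 1: logit model for the probability that pm25 is missing.
miss  <- glm(is.na(pm25) ~ temperature + humidity,
             family = binomial(), data = df)
m_hat <- predict(miss, type = "response")
w     <- 1 / (1 - m_hat)                    # w_i = 1 / (1 - m_hat_i)

# Step 2: weighted model on observed rows, then predict the missing ones.
obs <- !is.na(df$pm25)
fit <- lm(pm25 ~ temperature + humidity, data = df[obs, ],
          weights = w[obs])
df$pm25[!obs] <- predict(fit, newdata = df[!obs, ])
```

The weights up-weight observations that resemble the missing rows, which is the inverse-probability-weighting idea behind the formula for $w_i$.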

The following are the inputs currently used by the linear mixed effects model:

print_MLE_inputs()


NSAPH/airpred documentation built on May 7, 2020, 10:49 a.m.