```r
library(pacheck)
data(df_pa)
```
This article describes an example workflow for fitting and validating metamodels, and for making predictions with these metamodels, using the pacheck package. The package supports three types of metamodel: a linear model, a random forest model, and a lasso model. All three are covered below. Note: this vignette is still in development.

We will use the example dataframe `df_pa`, which is included in the package.
***Note: the functions for fitting the linear model and the random forest model both include a `validation` argument, so the model can also be validated from within those functions. However, a separate function for the validation process is available as well.***
Each of the three metamodels (linear, random forest, and lasso) is fitted with its own function: `fit_lm_metamodel`, `fit_rf_metamodel`, and `fit_lasso_metamodel`.
The random seed number chosen for these examples is 24.
Arguments: `df`, `y_var`, `x_vars`, `seed_num`.
We fit the linear model and obtain its coefficients.
```r
lm_fit <- fit_lm_metamodel(df = df_pa,
                           x_vars = c("rr", "u_pfs", "u_pd",
                                      "c_pfs", "c_pd", "c_thx",
                                      "p_pfspd", "p_pfsd", "p_pdd"),
                           y_var = "inc_qaly",
                           seed_num = 24,
                           show_intercept = TRUE)
lm_fit$fit
```
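Assuming the `fit` element of the returned object is a regular `lm` object (an assumption based on the printed output, not documented behaviour), the usual base-R methods can be applied to inspect it:

```r
# Assumption: lm_fit$fit behaves like a standard lm object
summary(lm_fit$fit)   # coefficients, standard errors, R-squared
confint(lm_fit$fit)   # confidence intervals for the coefficients
```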
Arguments: `standardise`, `x_poly_2`, `x_poly_3`, `x_exp`, `x_log`, `x_inter`.
The function also allows for the transformation of input variables.

In the following model, we standardise the regression parameters by setting the `standardise` argument to TRUE. Standardisation is performed as follows: $\frac{x_j - \overline{x_j}}{\sigma_j}$, where $x_j$ is the value of parameter $j$, $\overline{x_j}$ the mean value of parameter $j$, and $\sigma_j$ its standard deviation.
```r
lm_fit <- fit_lm_metamodel(df = df_pa,
                           x_vars = c("rr", "u_pfs", "u_pd",
                                      "c_pfs", "c_pd", "c_thx",
                                      "p_pfspd", "p_pfsd", "p_pdd"),
                           y_var = "inc_qaly",
                           seed_num = 24,
                           standardise = TRUE)
lm_fit$fit
```
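For intuition, the formula above is the same centring and scaling that base R's `scale()` performs; a minimal sketch applying it to one column by hand (not part of the package):

```r
# Manual standardisation of one input, matching (x - mean(x)) / sd(x)
z_rr <- (df_pa$rr - mean(df_pa$rr)) / sd(df_pa$rr)
# scale() performs the same centring and scaling
all.equal(z_rr, as.numeric(scale(df_pa$rr)))  # TRUE
```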
Here we transform several variables: `rr` is raised to the power 2, `c_pfs` and `c_pd` to the power 3, we take the exponential of `u_pfs` and `u_pd`, and the logarithm of `p_pfsd`.
```r
lm_fit <- fit_lm_metamodel(df = df_pa,
                           x_vars = c("p_pfspd", "p_pdd"),
                           x_poly_2 = "rr",
                           x_poly_3 = c("c_pfs", "c_pd"),
                           x_exp = c("u_pfs", "u_pd"),
                           x_log = "p_pfsd",
                           y_var = "inc_qaly",
                           seed_num = 24)
lm_fit$fit
```
And lastly, we can also include an interaction term between `p_pfspd` and `p_pdd`.
```r
lm_fit <- fit_lm_metamodel(df = df_pa,
                           x_vars = c("p_pfspd", "p_pdd"),
                           x_poly_2 = "rr",
                           x_poly_3 = c("c_pfs", "c_pd"),
                           x_exp = c("u_pfs", "u_pd"),
                           x_log = "p_pfsd",
                           x_inter = c("p_pfspd", "p_pdd"),
                           y_var = "inc_qaly",
                           seed_num = 24)
lm_fit$fit
```
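For reference, the transformations above correspond roughly to the following base-R formula. This is a sketch, not the package's documented internals; in particular, whether the polynomial arguments also keep the lower-order terms is an assumption here.

```r
# Approximate base-R equivalent of the transformed metamodel above
lm(inc_qaly ~ p_pfspd + p_pdd +
     I(rr^2) +                  # x_poly_2
     I(c_pfs^3) + I(c_pd^3) +   # x_poly_3
     exp(u_pfs) + exp(u_pd) +   # x_exp
     log(p_pfsd) +              # x_log
     p_pfspd:p_pdd,             # x_inter
   data = df_pa)
```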
The random forest metamodel is fitted using the randomForestSRC R package. Arguments: `df`, `y_var`, `x_vars`, `seed_num`.
Note that for the random forest model several variables are omitted to reduce computation time.
```r
rf_fit <- fit_rf_metamodel(df = df_pa,
                           x_vars = c("rr",
                                      "u_pfs",
                                      "u_pd"),
                                      #"c_pfs",
                                      #"c_pd",
                                      #"c_thx",
                                      #"p_pfspd",
                                      #"p_pfsd",
                                      #"p_pdd"),
                           y_var = "inc_qaly",
                           seed_num = 24)
```
Setting the `var_importance` argument to TRUE returns a plot which shows the importance of the included x-variables. The larger the number, the more important the variable, meaning that it reduces the out-of-bag (OOB) error rate more than the other x-variables do.
```r
rf_fit <- fit_rf_metamodel(df = df_pa,
                           x_vars = c("rr",
                                      "u_pfs",
                                      "u_pd"),
                                      #"c_pfs",
                                      #"c_pd",
                                      #"c_thx",
                                      #"p_pfspd",
                                      #"p_pfsd",
                                      #"p_pdd"),
                           y_var = "inc_qaly",
                           var_importance = TRUE,
                           seed_num = 24)
```
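The importance values presumably come from randomForestSRC's permutation (VIMP) machinery; a minimal sketch of obtaining them with that package directly (an assumption about the wrapper, not documented behaviour):

```r
library(randomForestSRC)
# Fit the forest directly with permutation (VIMP) importance enabled
rf_direct <- rfsrc(inc_qaly ~ rr + u_pfs + u_pd,
                   data = df_pa,
                   importance = TRUE)
rf_direct$importance   # importance value per x-variable
plot(rf_direct)        # includes a variable-importance panel
```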
By setting the `tune` argument to TRUE, two parameters which can have a significant impact on the model fit, `nodesize` (the minimal size of the terminal nodes) and `mtry` (the number of variables to possibly split at in each node), are optimised to improve model fit. The tuning process consists of a grid search. If `tune = FALSE`, default values for `nodesize` and `mtry` are used, which are 5 and (number of x-variables)/3, respectively.
```r
rf_fit <- fit_rf_metamodel(df = df_pa,
                           x_vars = c("rr",
                                      "u_pfs",
                                      "u_pd"),
                                      #"c_pfs",
                                      #"c_pd",
                                      #"c_thx",
                                      #"p_pfspd",
                                      #"p_pfsd",
                                      #"p_pdd"),
                           y_var = "inc_qaly",
                           var_importance = FALSE,
                           tune = TRUE,
                           seed_num = 24)
rf_fit$tune_fit$optimal
rf_fit$tune_plot
```
The optimal `nodesize` and `mtry` can be found in the `rf_fit$tune_fit$optimal` element of the list, and the results are shown in the plot (in this case `rf_fit$tune_plot`). In this plot, the black dots represent the combinations of `nodesize` and `mtry` that have been tried; the colour gradient is obtained through two-dimensional interpolation. The black cross marks the optimal combination, i.e., the combination yielding the lowest OOB error.
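randomForestSRC itself ships a tuner that performs this kind of OOB grid search; a minimal sketch of calling it directly (assuming, but not confirmed here, that the wrapper builds on it):

```r
library(randomForestSRC)
# Grid search over nodesize and mtry, minimising the OOB error
tuned <- tune(inc_qaly ~ rr + u_pfs + u_pd, data = df_pa)
tuned$optimal   # named vector with the optimal nodesize and mtry
tuned$results   # OOB error for every combination tried
```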
We can obtain partial and marginal plots by setting `pm_plot` equal to 'partial', 'marginal', or 'both'. `pm_vars` specifies for which variables the plot must be constructed.
```r
rf_fit <- fit_rf_metamodel(df = df_pa,
                           x_vars = c("rr",
                                      "u_pfs",
                                      "u_pd"),
                                      #"c_pfs",
                                      #"c_pd",
                                      #"c_thx",
                                      #"p_pfspd",
                                      #"p_pfsd",
                                      #"p_pdd"),
                           y_var = "inc_qaly",
                           var_importance = FALSE,
                           tune = TRUE,
                           pm_plot = "both",
                           pm_vars = c("rr", "u_pfs"),
                           seed_num = 24)
```
We fit the lasso model and obtain the coefficients using all parameter values from all iterations.
```r
lasso_fit <- fit_lasso_metamodel(df = df_pa,
                                 x_vars = c("rr", "u_pfs", "u_pd",
                                            "c_pfs", "c_pd", "c_thx",
                                            "p_pfspd", "p_pfsd", "p_pdd"),
                                 y_var = "inc_qaly",
                                 tune_plot = TRUE,
                                 seed_num = 24)
lasso_fit$fit
```
The plot shows the tuning results: the error for each value of lambda. The lambda yielding the smallest error is chosen for the final model.
Generally, the lasso procedure yields sparse models, i.e., models that involve only a subset of the input variables. In this case, however, no coefficients are set to 0, and we obtain the same results as from the linear model.
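For reference, a minimal sketch of the same tuning performed with glmnet directly (assuming the wrapper builds on glmnet; the internals are not documented here):

```r
library(glmnet)
# Cross-validated lasso: lambda is tuned over a grid of penalty values
x <- as.matrix(df_pa[, c("rr", "u_pfs", "u_pd", "c_pfs", "c_pd",
                         "c_thx", "p_pfspd", "p_pfsd", "p_pdd")])
cv_fit <- cv.glmnet(x, df_pa$inc_qaly, alpha = 1)  # alpha = 1: lasso
plot(cv_fit)                    # CV error for each lambda
coef(cv_fit, s = "lambda.min")  # coefficients at the error-minimising lambda
```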
Validating the metamodels can be performed using the `validation` arguments of the metamodel functions, or by using the `validate_metamodel` function.

Through the `validate_metamodel` function, three types of validation methods can be used: the train/test split, (K-fold) cross-validation, and using a new dataset. We will discuss each validation method.
Arguments: `method`, `partition`.
First we use the train/test split. For this we need to specify `partition`, which sets the proportion of the data that will be used as training data. We also set `show_intercept` to TRUE for the calibration plot.
```r
lm_validation <- validate_metamodel(lm_fit,
                                    method = "train_test_split",
                                    partition = 0.8,
                                    show_intercept = TRUE,
                                    seed_num = 24)
lm_validation$stats_validation
lm_validation$calibration_plot
```
The output shows the validation results: the R-squared, mean absolute error, mean relative error, and mean squared error of the model applied to the test data. Using `lm_validation$calibration_plot`, we can display the calibration plot.
Arguments: `method`, `folds`.
To use K-fold cross-validation, one needs to specify `method = "cross_validation"`. In the example, two folds are used by specifying `folds = 2`.
```r
rf_validation <- validate_metamodel(rf_fit,
                                    method = "cross_validation",
                                    folds = 2,
                                    seed_num = 24)
rf_validation$stats_validation
```
Arguments: `method`, `df_validate`.
It may also happen that a validation dataset only becomes available some time after the model has been fitted. The function enables us to use such a dataset for the validation process.

Since we only have one dataset, we first construct two datasets: one is used to fit the model, and the other is used to validate it. Note that, done this way, the procedure is very similar to the train/test split method.
```r
# use 8,000 shuffled rows to fit and the remaining rows to validate
# (this assumes df_pa has more than 8,000 rows)
indices <- sample(8000)
df_original <- df_pa[indices, ]
df_new <- df_pa[-indices, ]
lasso_fit2 <- fit_lasso_metamodel(df = df_original,
                                  x_vars = c("rr", "u_pfs", "u_pd",
                                             "c_pfs", "c_pd", "c_thx",
                                             "p_pfspd", "p_pfsd", "p_pdd"),
                                  y_var = "inc_qaly",
                                  tune_plot = FALSE,
                                  seed_num = 24)
lasso_validation <- validate_metamodel(lasso_fit2,
                                       method = "new_test_set",
                                       df_validate = df_new,
                                       seed_num = 24)
lasso_validation$stats_validation
```
Function: `predict_metamodel`. Arguments: `inputs`, `output_type`.

There is one function which can be used to make predictions using the fitted metamodels: `predict_metamodel`.
As an example, we will fit a four-variable model for all three metamodel types and with the following inputs:
```r
library(knitr)
p_pfspd <- c(0.1, 0.2)
p_pfsd <- c(0.08, 0.15)
p_pdd <- c(0.06, 0.25)
rr <- c(0.1, 0.23)
newdata <- data.frame(p_pfspd, p_pfsd, p_pdd, rr)
kable(newdata, caption = "Example inputs for metamodels")
```
Thus we will obtain two predictions.
There are two ways to enter the inputs for the metamodel: as a vector or as a dataframe. The same holds for the output of the function: a vector or a dataframe. We will cover all four scenarios.
First we fit the models and define the inputs. Note that we do not validate the models first (which is not best practice) and use all available data to fit them.
```r
# fit the models
lm_fit2 <- fit_lm_metamodel(df = df_pa,
                            x_vars = c("p_pfspd", "p_pfsd", "p_pdd", "rr"),
                            y_var = "inc_qaly",
                            seed_num = 24)
rf_fit2 <- fit_rf_metamodel(df = df_pa,
                            x_vars = c("p_pfspd", "p_pfsd", "p_pdd", "rr"),
                            y_var = "inc_qaly",
                            tune = TRUE,
                            seed_num = 24)
lasso_fit3 <- fit_lasso_metamodel(df = df_pa,
                                  x_vars = c("p_pfspd", "p_pfsd", "p_pdd", "rr"),
                                  y_var = "inc_qaly",
                                  tune_plot = FALSE,
                                  seed_num = 24)

# define the inputs
ins_vec <- c(0.1, 0.2, 0.08, 0.15, 0.06, 0.25, 0.1, 0.23)  # vector
ins_df <- newdata                                          # dataframe (defined above)
```
Note that the order of the input vector matters. If we have a four-variable model and we want to make two predictions, the first two values are for the first variable, the next two values for the second variable, etc. 'First' and 'second' refer to the placement of the variable in the model as defined in the R code. So in our example, the first and second variables are `p_pfspd` and `p_pfsd`, respectively.
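One way to see this ordering is to reshape the vector column by column with base R; each resulting column then corresponds to one model variable:

```r
# Columns correspond to p_pfspd, p_pfsd, p_pdd, rr; rows to the two predictions
matrix(ins_vec, ncol = 4)
#>      [,1] [,2] [,3] [,4]
#> [1,] 0.10 0.08 0.06 0.10
#> [2,] 0.20 0.15 0.25 0.23
```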
```r
predictions <- predict_metamodel(lm_fit2,
                                 inputs = ins_vec,
                                 output_type = "vector")
print(predictions)
```

```r
predictions <- predict_metamodel(lasso_fit3,
                                 inputs = ins_df,
                                 output_type = "vector")
print(predictions)
```

```r
predictions <- predict_metamodel(rf_fit2,
                                 inputs = ins_vec,
                                 output_type = "vector")
print(predictions)
```

```r
predictions <- predict_metamodel(lm_fit2,
                                 inputs = ins_vec,
                                 output_type = "dataframe")
print(predictions)
```