Note, this vignette uses fake data - it is for illustrative purposes only and should not be used to inform decision making.

The steps in this exploratory analysis workflow may need to be performed iteratively, both in order to identify the optimal model types, predictors and covariates to use and modify default values to ensure model convergence. Once you are satisfied that you have identified the optimal set of input values, you can implement the reporting workflow described in another vignette.

library(TTU)

Set-up workspace

To facilitate reproducibility, we begin by setting a seed.

seed_1L_int = 12345
set.seed(seed_1L_int)

We next create a directory in which we will write all output.

path_to_write_to_1L_chr <- "Output" 
dir.create(path_to_write_to_1L_chr)

Import and describe data

We import our data - in this example we are using fake data.

ds_tb <- readRDS(url("https://dataverse.harvard.edu/api/access/datafile/4959218")) 

The dataset we are using has r nrow(ds_tb) records on r length(ds_tb$uid %>% unique()) study participants. The first six records are reproduced below.

ds_tb %>%
    head() %>%
  ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Input dataset",
                          mkdn_tbl_ref_1L_chr = "tab:inputds")

We next create a lookup table summarising key information about the variables from our dataset that we will explore as candidate predictors in our models. The predictors lookup table must be of the specific_predictors class.

predictors_lup <- specific::specific_predictors(specific::make_pt_specific_predictors(short_name_chr = c("K10_int","Psych_well_int"),
                                              long_name_chr = c("Kessler Psychological Distress - 10 Item Total Score",
                                                   "Overall Wellbeing Measure (Winefield et al. 2012)"),
                                              min_val_dbl = c(10,18),
                                              max_val_dbl = c(50,90),
                                              class_chr = "integer",
                                              increment_dbl = 1,
                                              class_fn_chr = "as.integer",
                                              mdl_scaling_dbl = 0.01,
                                              covariate_lgl = F))

We also require a data dictionary for our input dataset. The dictionary must be of the ready4use::ready4use_dictionary class. In the example below we are using an existing dictionary.

dictionary_tb <- readRDS(url("https://dataverse.harvard.edu/api/access/datafile/4959217"))

Score utility and summarise meta-data

We now convert study participant responses to the EQ-5D questionnaire to health utility scores. We do this using the eq5d package.

ds_tb <- ds_tb %>% 
        dplyr::rename_with(~stringr::str_replace(.x,"eq5dq_",""),dplyr::starts_with("eq5dq_")) %>%
  dplyr::mutate(`:=`(EQ5D_total_dbl, 
                     eq5d::eq5d(., country="UK", version = "5L", type = "CW"))) %>%
    dplyr::rename_with(~paste0("eq5dq_",.x),c("MO","SC","UA","PD","AD"))

We can label dataset variables using descriptions from the data dictionary.

ds_tb <- ds_tb %>%
  ready4use::add_labels_from_dictionary(dictionary_tb = dictionary_tb,
                                        remove_old_lbls_1L_lgl = T) 

The first six records of the scored and labelled dataset are reproduced below.

ds_tb %>%
    head() %>%
  ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Scored and labelled dataset",
                          use_lbls_as_col_nms_1L_lgl = T,
                          mkdn_tbl_ref_1L_chr = "tab:scoredds")

We next create an object ds_smry_ls to store a description of some structural properties of the dataset such as:

ds_smry_ls <- make_ds_smry_ls(candidate_predrs_chr = predictors_lup$short_name_chr,
                              candidate_covar_nms_chr = c("d_sex_birth_s", "d_age",  "d_sexual_ori_s", "d_relation_s", "d_studying_working"),
                              depnt_var_nm_1L_chr = "EQ5D_total_dbl",
                              dictionary_tb = dictionary_tb,
                              id_var_nm_1L_chr = "uid",
                              round_var_nm_1L_chr = "Timepoint",
                              round_bl_val_1L_chr = "BL",
                              predictors_lup = predictors_lup)

Test candidate models, predictors and covariates

Our first analytic task is to test which models, predictors and covariates perform best. This tasks can be broken down into the following steps.

Preliminary tasks

We begin by reviewing the TTU packages lookup table of types of models we can test.

mdl_types_lup <- TTU::get_cndts_for_mxd_mdls()
mdl_types_lup %>%
    ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Model types lookup table")

We modify this table as appropriate (in the below example, we leave it unchanged). We then add the lookup table to a new object, mdl_smry_ls which stores information about the candidate models we will be using.

mdl_smry_ls <- mdl_types_lup %>%
  make_mdl_smry_ls()
cmprsns_ls <- write_mdl_cmprsn(scored_data_tb = ds_tb,
                                ds_smry_ls = ds_smry_ls,
                                mdl_smry_ls = mdl_smry_ls,
                                output_data_dir_1L_chr = path_to_write_to_1L_chr,
                                seed_1L_int = seed_1L_int)

The write_mdl_cmprsn function will write each model to be tested to a new sub-directory of "r path_to_write_to_1L_chr" called "A_Candidate_Mdls_Cmprsn". Also written to that sub-directory are a number of plots for each model. The write_mdl_cmprsn function also outputs a table summarising the performance of each of the candidate models.

cmprsns_ls$mdl_smry_ls$smry_of_sngl_predr_mdls_tb %>%
  ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Comparison of model types for highest correlated predictor using baseline data",
                          mkdn_tbl_ref_1L_chr = "tab:mdl_cmprsn")

We can now identify the highest performing model in each category of candidate model based on the testing R^2^ statistic (RsquaredP in the previous table).

cmprsns_ls$mdl_smry_ls$prefd_mdl_types_chr

We can override these automated selections and instead incorporate other considerations (possibly based on judgments informed by visual inspection of the plots and the desirability of constraining predictions to a maximum value of one). We do this in the following command, specifying new preferred model types, in descending order of preference.

cmprsns_ls$mdl_smry_ls$prefd_mdl_types_chr <- c("OLS_NTF", "BET_CLL")

Use most preferred model to compare all candidate predictors

We can now compare all of our candidate predictors (with and without candidate covariates) using the most preferred model type.

cmprsns_ls <- write_predr_and_covars_cmprsn(scored_data_tb = ds_tb,
                                            bl_tb = cmprsns_ls$bl_tb,
                                            ds_smry_ls = cmprsns_ls$ds_smry_ls,
                                            mdl_smry_ls = cmprsns_ls$mdl_smry_ls,
                                            output_data_dir_1L_chr = path_to_write_to_1L_chr,
                                            seed_1L_int = seed_1L_int)

Now, we compare the performance of single predictor models of our preferred model type (in our case, a r mdl_smry_ls$mdl_types_lup %>% ready4::get_from_lup_obj(target_var_nm_1L_chr = "long_name_chr",match_value_xx = cmprsns_ls$mdl_smry_ls$prefd_mdl_types_chr[1], match_var_nm_1L_chr = "short_name_chr", evaluate_1L_lgl = F)) for each candidate predictor. The call to the write_predr_and_covars_cmprsn{.R} function saved the tested models along with model plots in the "B_Candidate_Predrs_Cmprsn" sub-directory of "Output". These results are also viewable as a table.

cmprsns_ls$mdl_smry_ls$predr_cmprsn_tb %>%
    ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Comparison of all candidate predictors using preferred model")

The call to the write_predr_and_covars_cmprsn{.R} function also saved single predictor R model objects (one for each candidate predictors) along with the two plots for each model in the "C_Predrs_Sngl_Mdl_Cmprsn" sub-directory of "Output". The performance of each single predictor model can also be summarised in a table.

cmprsns_ls$mdl_smry_ls$smry_of_mdl_sngl_predrs_tb %>%
    ready4show::print_table(output_type_1L_chr = "HTML",
                          caption_1L_chr = "Preferred single predictor model performance by candidate predictor")

Updated versions of each of the models in the previous step (this time with covariates added) are saved to the "D_Predr_Covars_Cmprsn" subdirectory of "Output" and we can summarise the performance of each of the updated models, along with all signficant model terms, in a table.

cmprsns_ls$mdl_smry_ls$mdls_with_covars_smry_tb %>%
    ready4show::print_table(output_type_1L_chr = "HTML",
                            caption_1L_chr = "Performance of preferred models with covariate predictors by candidate predictor", 
                          use_lbls_as_col_nms_1L_lgl = T)

We can now identify which, if any, of the candidate covariates we previously specified are significant predictors in any of the models.

cmprsns_ls$mdl_smry_ls$signt_covars_chr

We can override the covariates to select, potentially because we want to select only covariates that are significant for all or most of the models. However, in the below example we have opted not to do so and continue to use r ifelse(is.na(mdl_smry_ls$signt_covars_chr),"no covariates",paste0(mdl_smry_ls$signt_covars_chr, collapse = "and ")) as selected by the algorithm in the previous step.

# cmprsns_ls$mdl_smry_ls$prefd_covars_chr <- c("COVARIATE WE WANT TO USE", "ANOTHER COVARIATE")

Test preferred model with preferred covariates for each candidate predictor

We now conclude our model testing by rerunning the previous step, except confining our covariates to those we prefer.

outp_smry_ls <- write_mdls_with_covars_cmprsn(scored_data_tb = ds_tb,
                                           bl_tb = cmprsns_ls$bl_tb,
                                           ds_smry_ls = cmprsns_ls$ds_smry_ls,
                                           mdl_smry_ls = cmprsns_ls$mdl_smry_ls,
                                           output_data_dir_1L_chr = path_to_write_to_1L_chr,
                                           seed_1L_int = seed_1L_int)

The previous call to the write_mdls_with_covars_cmprsn{.R} function saves the tested models along with the two plots for each model in the "E_Predrs_W_Covars_Sngl_Mdl_Cmprsn" sub-directory of "Output".

Apply preferred model types and predictors to longitudinal data

The next main step is to use the preferred model types and covariates identified from the preceding analysis of cross-sectional data in longitudinal analysis.

Longitudinal mixed modelling

Next we create longitudinal and mixed versions of each of the models we specified in the previous step. Note, that in many cases both the prior_ls and control_ls arguments can be set to NULL (which may speed up execution). However, in this example doing so would result in warning messages suggesting a change to the adapt_delta control value (default = 0.8). We have therefore passed a value to the control_ls argument that addresses this issue.

outp_smry_ls <- write_ts_mdls_from_alg_outp(outp_smry_ls,
                                            predictors_lup = predictors_lup ,
                                            utl_min_val_1L_dbl = 0.03,
                                            backend_1L_chr = "cmdstanr",
                                            new_dir_nm_1L_chr = "F_TS_Mdls",
                                            iters_1L_int = 4000L,
                                            prior_ls = NULL, 
                                            control_ls = list(adapt_delta = 0.99))

The write_ts_mdls_from_alg_outp{.R} function writes the models it tests to the "F_TS_Mdls" sub-directory of "Output" along with three plots for each model.

Purge dataset copies

Because the files created in analysis are large, multiple objects containing copies of the source dataset have been saved to our output directory during the analysis process. We therefore need to delete all of these copies.

write_to_delete_mdl_fls(outp_smry_ls)
outp_smry_ls$scored_data_tb <- NULL

Save work

Finally, we can save a complete record of our analysis.

saveRDS(outp_smry_ls,paste0(outp_smry_ls$path_to_write_to_1L_chr,"/I_ALL_OUTPUT_.RDS"))


ready4-dev/TTU documentation built on July 2, 2024, 8:12 a.m.