```r library(PatientLevelPrediction) knitr::opts_chunk$set( cache=FALSE, comment = "#>", error = FALSE, tidy = FALSE)

# Introduction

This vignette describes how one can populate the SkeletonExistingModelStudy package with the target cohort, outcome cohorts and model settings.

The first step is to make a copy of the 'SkeletonExistingModelStudy' R package template.  You can do this by cloning the github package and then copy and paste the directory to a location in your computer.

Open the Skeleton R project in R studio, this can be done by finding and double clicking on the SkeletonExistingModelStudy.Rproj file in the folder that you just copy and pasted.  Once the package project is opened in R studio there are 5 steps that must be followed:

1. Source extras/populatePackage.Rto add functions needed to populate the package
2. Rename the package using: replaceName()
3. Add the target and outcome cohort using: populatePackageCohorts()
4. Specify the model using: populatePackageModels() - this can be run multiple times to add multiple models
5. Build the study package

The package is now ready to run or share on GitHub.

## Step 1: Source './extras/populatePackage.R'
The R script extras/populatePackage.R contains multiple functions that can be used to populate the skeleton package.

To add the functions to your environment, make sure the package R project is open in R studio and run:
  ```r
source('./extras/populatePackage.R')

This will make the functions 'replaceName()', 'populatePackageCohorts()' and 'populatePackageModels()' available to use within your R session.

Step 2: Renaming the package

The package is still currently called 'EmcWaltersDementiaModel' but you need to rename all the files with the name of your study. To make this easy I created a function 'replaceName()' that will automatically replace 'EmcWaltersDementiaModel' with your new study name. Simply run:

```r replaceName(packageLocation = getwd(), packageName = 'PooledCohortvalidation')

When you open the r project it will set your working directory to the package directory - so 'packageLocation = getwd()' should work.  If this has not occurred, you may need to manually specify the location where 'SkeletonExistingModelStudy.Rproj' exists in the skeleton copy you made. Set 'packageName' to be the name you want to call your study.  The example above will replace 'SkeletonExistingModelStudy' with 'PooledCohortvalidation' throughout the skeleton package.

It is a good practice to close the R project and reopen it once renamed.  The R project will now be renamed, in the example about it would be called 'PooledCohortvalidation.Rproj'.


## Step 3: Adding the Target and Outcome cohorts
The 'populatePackageCohorts()' function requires users to specify:

* targetCohortIds - The ATLAS id/s for the target cohort/s 
* targetCohortNames - A string or vector of string with sharable name/s for the target cohort/s
* outcomeIds - The ATLAS id/s for the outcome cohort/s 
* outcomeNames - A string or vector of string with sharable name/s for the outcome cohort/s
* baseUrl - The url for the ATLAS webapi (this will be used to extract the ATLAS cohorts)

For example, to insert the target cohort: 

* 'Pooled cohort equation target - non-black male' - that has the atlas ID 18941
* 'Pooled cohort equation target - non-black female' - that has the atlas ID 18942
* 'Pooled cohort equation target - black female' - that has the atlas ID 18943
* 'Pooled cohort equation target - black male' - that has the atlas ID 18944

and the outcome cohorts:

* 'Earliest of AMI ischemic stroke or cardiovascular death' - that has the atlas ID 18945
* 'Acute myocardial infarction events' - that has the atlas ID 18935
* 'Ischemic stroke events' - that has the atlas ID 18936

from the webapi at 'https://yourWebAPI' run:

```r
populatePackageCohorts(targetCohortIds = c(18941, 18942, 18943, 18944),
                       targetCohortNames = c('Pooled cohort equation target - non-black male',
                                             'Pooled cohort equation target - non-black female',
                                             'Pooled cohort equation target - black female',
                                             'Pooled cohort equation target - black male'),
                       outcomeIds = c(18945, 18935, 18936),
                       outcomeNames = c('Earliest of AMI ischemic stroke or cardiovascular death',
                                       'Acute myocardial infarction events',
                                       'Ischemic stroke events'),
                       baseUrl = 'https://yourWebAPI')

This then inserts each cohort into 'inst/cohorts' as a json with the file name targetCohortNames or outcomeNames '.json' and inserts in 'inst/sql/sql_server' as a sql with the file name targetCohortNames or outcomeNames '.sql'.

It also creates a csv file in 'inst/settings' named 'CohortToCreate.csv' that specifies all the target and outcome cohorts to generate when the study is executed.

Step 4: Specify the model

The next step is to insert the model settings into the package. This can be done by running the function 'populatePackageModels()' per model to incorporate.

There are 6 main inputs into this function:

data <- data.frame(Input = c('modelname',
                             'standardCovariates',
                             'cohortCovariateSettings',
                             'measurementCovariateSettings',
                             'measurementCohortCovariateSettings',
                             'finalMapping'

                             ),
                   Description = c('The name of the model you are specifying',
                                   'A list of settings for predictors that are standardCovariates',
                                   'A list of settings for predictors that are cohort covariates',
                                   'A list of settings for predictors that are measurment covariates',
                                   'A list of settings for predictors that are measurement cohort covariates',
                                   'The mapping from total score to predicted risk'
                                   ) )
library(knitr)
kable(data, caption = 'The inputs into the populatePackageModels function')

This 6 inputs specifiy the model name, the predictors and the final model mapping.

modelname

The 'modelname' input is use when saving the model settings. The main model settings are saved in inst/settings as modelname_model.csv. Each row in modelname_model.csv specifies a covariate and the corresponding number of points (i.e., coefficient value) and settings required to create the covariate using standardCovariates, createCohortCovariate, createMeasurementCovariate, createMeasurementCohortCovariate or createAgeCovariate.

standardCovariates

The 'standardCovariates' input is used when you want to include standard FeatureExtraction covariates. The input is a list of settings that specify what standard covariates to use (when you execute the study these will be used to configure FeatureExtraction::createCovariateSettings() ) and the corresponding points.

data <- data.frame(Input = c('covariateId',
                             'covariateName',
                             'points',
                             'featureExtraction'),
                   Description = c('A vector of the standard covariateIds to include',
                                   'A vector of the corresponding names',
                                   'A vector of the corresponding points',
                                   'A vector of the inputs needed in FeatureExtraction::createCovariateSettings'
                                   ),
                   Example = c('c(12003, 13003, 8507001)', 
                               'c("Age 60-64", "Age 65-70", "Male")',
                               'c(10,10,15)',
                               'c("useDemographicsAgeGroup","useDemographicsAgeGroup", "useDemographicsGender")'
                               ))
library(knitr)
kable(data, caption = 'The inputs into the standardCovariates function')

The 'covariateId' values must match the standard ids created by 'FeatureExtraction::createCovariateSettings'. These are often the conceptId*1000+analysisId, where the analysisId can be found in https://github.com/OHDSI/FeatureExtraction/blob/master/inst/csv/PrespecAnalyses.csv . When the 'populatePackageModels' function is executed with 'standardCovariates' set then the model settings 'inst/settings/modelname_model.csv' is created or appended to with the standard covariate settings. If your model does not require standard covariates, then set 'standardCovariates' to NULL.

cohortCovariateSettings

The 'cohortCovariateSettings' is used when you want to include covariates using cohort definitions. This can be useful for more complex features such as a diabetes definition that requires a diabetes condition record and a treatment record within 60 days. The 'cohortCovariateSettings' input is a list containing the following entries:

data <- data.frame(Input = c('baseUrl',
                             'atlasCovariateIds',
                             'atlasCovariateNames',
                             'analysisIds',
                             'startDays',
                             'endDays',
                             'points',
                             'count',
                             'ageInteraction',
                             'lnAgeInteraction'),
                   Description = c('A baseUrl for the ATLAS webapi that contains the cohorts',
                                   'A vector of the atlas cohort ids',
                                   'A vector of the atlas cohort names',
                                   'A vector of analysis ids for the cohort covariates',
                                   'Find cohort entries where the cohort_start_date occurs after the the startDays relative to index',
                                   'Find cohort entries where the cohort_start_date occurs before the the endsDays relative to index', 
                                   'A vector of the points',
                                   'A vector specifying whether to use binary indicators or count the number of cohort_start_dates per patients between index + startDays and index + endsDays',
                                   'A vector specifying whether to use the age interaction',
                                   'A vector specifying whether to use the natural logarithm log(age) interaction'
                                   ),
                   Example = c('https://yourWebAPI', 
                               'c(191324,184343)',
                               'c("Diabetes","Hypertension")',
                               'c(456,456)',
                               'c(-9999,-9999)',
                               'c(-1,-1)',
                               'c(5,10)',
                               'c(F,F)',
                               'c(F,F)',
                               'c(F,F)'
                               ))
library(knitr)
kable(data, caption = 'The inputs into the cohortCovariateSettings function')

The 'cohortCovariateSettings' saves settings required by 'createCohortCovariateSettings'. When the 'populatePackageModels' function is executed with 'cohortCovariateSettings' set then the specified ATLAS cohorts are downloaded into the package using the webapi and the model settings 'inst/settings/modelname_model.csv' is created or appended to with the cohort covariate settings. If you do not want to include cohort covariates set 'cohortCovariateSettings' to NULL.

measurementCovariateSettings

The 'measurementCovariateSettings' is used when you want to include measurement covariates. This can be useful for more complex features such as total cholesterol. The 'measurementCovariateSettings' input is a list containing the following entries:

data <- data.frame(Input = c('names',
                             'conceptSets',
                             'startDays',
                             'endDays',
                             'scaleMaps',
                             'points',
                             'aggregateMethods',
                             'imputationValues',
                             'ageInteractions',
                             'lnAgeInteractions',
                             'lnValues',
                             'measurementIds',
                             'analysisIds'),
                   Description = c('A vector of the measurement covariate names',
                                   'A list containing vectors of the measurement concept sets',
                                   'Find measurement entries where the measurement_date occurs after the the startDays relative to index',
                                   'Find measurement entries where the measurement_date occurs before the the endsDays relative to index', 
                                   'A list of functions specifying how to standardise the measurements',
                                   'A vector of the points',
                                   'A vector specifying how to pick a single measurement value per patient.  Choose from recent (closest to index), min, max or mean',
                                   'A vector of values used to impute when a measurement is missing for a patient',
                                   'A vector specifying whether to use the age interaction',
                                   'A vector specifying whether to use the natural logarithm log(age) interaction',
                                   'A vector specifying whether to use the log measurement value',
                                   'A vector specifying the measurement ids',
                                   'A vector specifying the analysis ids'
                                   ),
                   Example = c('c("HDL-C_mgdL", "HDL-C_mgdL")', 
                               'list(c(2212449, 3003767, 3007070, 3007676, 3011884, 3013473, 3015204, 3016945, 3022449, 3023602), c(2212449, 3003767, 3007070, 3007676, 3011884, 3013473, 3015204, 3016945, 3022449, 3023602))',
                               'c(-365, -365)',
                               'c(60,60)',
                               'list(function(x){return(x)}, function(x){return(x)} )',
                               'c(5,10)',
                               'c("recent","recent")',
                               'c(50,50)',
                               'c(F,F)',
                               'c(F,F)',
                               'c(F,F)',
                               'c(1,2)',
                               'c(458,458)'
                               ))
library(knitr)
kable(data, caption = 'The inputs into the measurementCovariateSettings function')

The 'measurementCovariateSettings' saves settings required by 'createMeasurementCovariateSettings'. When the 'populatePackageModels' function is executed with 'measurementCovariateSettings' set then the model settings 'inst/settings/modelname_model.csv' is created or appended to with the neasurement covariate settings. The mapping functions for each measurement covariate are saved as rds files into the 'inst/settings' directory. If you do not want to include measurement covariates set 'measurementCovariateSettings' to NULL.

measurementCohortCovariateSettings

The 'measurementCohortCovariateSettings' is used when you want to include measurement covariates where patients are also in or not in some specified cohort. This can be useful for more complex features such as treated systolic blood pressure or untreated systolic blood pressure. The 'measurementCohortCovariateSettings' input is a list containing the following entries:

data <- data.frame(Input = c('names',
                             'atlasCovariateIds',
                             'types',
                             'conceptSets',
                             'startDays',
                             'endDays',
                             'scaleMaps',
                             'points',
                             'aggregateMethods',
                             'ageInteractions',
                             'lnAgeInteractions',
                             'lnValues',
                             'measurementIds',
                             'analysisIds',
                             'baseUrl'),
                   Description = c('A vector of the measurement covariate names',
                                   'A vector of the cohort ids',
                                   'A vector of "in" or "out" indicating whether to require being in, or not in, the cohort.',
                                   'A list containing vectors of the measurement concept sets',
                                   'Find measurement entries where the measurement_date occurs after the the startDays relative to index',
                                   'Find measurement entries where the measurement_date occurs before the the endsDays relative to index', 
                                   'A list of functions specifying how to standardise the measurements',
                                   'A vector of the points',
                                   'A vector specifying how to pick a single measurement value per patient.  Choose from recent (closest to index), min, max or mean',
                                   'A vector specifying whether to use the age interaction',
                                   'A vector specifying whether to use the natural logarithm log(age) interaction',
                                   'A vector specifying whether to use the log measurement value',
                                   'A vector specifying the measurement ids',
                                   'A vector specifying the analysis ids',
                                   'ATLAS webapi address'
                                   ),
                   Example = c('c("HDL-C_mgdL", "HDL-C_mgdL")', 
                               'c(18946,18946)',
                               'c("in","out")',
                               'list(c(2212449, 3003767, 3007070, 3007676, 3011884, 3013473, 3015204, 3016945, 3022449, 3023602), c(2212449, 3003767, 3007070, 3007676, 3011884, 3013473, 3015204, 3016945, 3022449, 3023602))',
                               'c(-365, -365)',
                               'c(60,60)',
                               'list(function(x){return(x)}, function(x){return(x)} )',
                               'c(5,10)',
                               'c("recent","recent")',
                               'c(F,F)',
                               'c(F,F)',
                               'c(F,F)',
                               'c(1,2)',
                               'c(458,458)',
                               'https://yourWebAPI'
                               ))
library(knitr)
kable(data, caption = 'The inputs into the measurementCohortCovariateSettings function')

The 'measurementCohortCovariateSettings' saves settings required by 'createMeasurementCohortCovariateSettings'. When the 'populatePackageModels' function is executed with 'measurementCohortCovariateSettings' set then the model settings 'inst/settings/modelname_model.csv' is created or appended to with the neasurement covariate settings and all specified ATLAS cohort are downloaded using the webapi into 'inst/cohorts' and 'inst/sql/sql_server'. The mapping functions for each measurement covariate are saved as rds files into the 'inst/settings' directory. If you do not want to include measurement based on being in/not in a cohort covariates set 'measurementCohortCovariateSettings' to NULL.

finalMapping

The 'finalMapping' input is a function that maps from the sum of the points to a predicted risk. For example, if the model is a logistic regression, the following mapping would be used:

function(x){return(1/(1+exp(-x)))}

This gets saved as an rds file in the 'inst/setting' directory.

Example

A complete example code for inserting the pooled cohort question non-black model into the skeleton package:

populatePackageModels(modelname = 'pooled_male_non_black',
                      standardCovariates = NULL,
                      cohortCovariateSettings = list(baseUrl = 'https://yourWebAPI',
                                                     atlasCovariateIds = c(17720,18957,18957),
                                                     atlasCovariateNames = c('diabetes', 'smoking','smoking'),
                                                     analysisIds = c(456,456,455),
                                                     startDays = c(-9999,-730,-730),
                                                     endDays = c(-1,60,60),
                                                     points = c(0.658,7.837,-1.795),
                                                     count = rep(F, 3),
                                                     ageInteraction = c(F,F,F), 
                                                     lnAgeInteraction = c(F,F,T)) ,

                      measurementCovariateSettings = list(names = c('Total_Cholesterol_mgdL', 'Total_Cholesterol_mgdL',
                                                                    'HDL-C_mgdL', 'HDL-C_mgdL'
                                                                    ),
                                                          conceptSets = list(c(2212267,3015232,3019900,3027114,4008265,4190897,4198448,4260765,37393449,37397989,40484105,44791053,44809580),
                                                                             c(2212267,3015232,3019900,3027114,4008265,4190897,4198448,4260765,37393449,37397989,40484105,44791053,44809580),
                                                                             c(2212449,3003767,3007070,3007676,3011884,3013473,3015204,3016945,3022449,3023602,3023752,3024401,3030792,3032771,3033638,3034482,3040815,3053286,4005504,4008127,4011133,4019543,4041557,4041720,4042059,4042081,4055665,4076704,4101713,4195503,4198116,37208659,37208661,37392562,37392938,37394092,37394229,37394230,37398699,40757503,40765014,44789188,45768617,45768651,45768652,45768653,45768654,45771001,45772902),
                                                                             c(2212449,3003767,3007070,3007676,3011884,3013473,3015204,3016945,3022449,3023602,3023752,3024401,3030792,3032771,3033638,3034482,3040815,3053286,4005504,4008127,4011133,4019543,4041557,4041720,4042059,4042081,4055665,4076704,4101713,4195503,4198116,37208659,37208661,37392562,37392938,37394092,37394229,37394230,37398699,40757503,40765014,44789188,45768617,45768651,45768652,45768653,45768654,45771001,45772902)
                                                          ),
                                                          startDays = c(-365, -365,-365, -365),
                                                          endDays = c(60,60,60,60),
                                                          scaleMaps= list(function(x){ x = dplyr::mutate(x, rawValue = dplyr::case_when(unitConceptId == 8753 ~ rawValue*38.6, unitConceptId %in% c(8840,8954,9028 ) ~ rawValue, TRUE ~ 0)); x= dplyr::filter(x, rawValue >= 80 & rawValue <= 500 ); x = dplyr::mutate(x,valueAsNumber = log(rawValue)); return(x)},
                                                                          function(x){ x = dplyr::mutate(x, rawValue = dplyr::case_when(unitConceptId == 8753 ~ rawValue*38.6, unitConceptId %in% c(8840,8954,9028 ) ~ rawValue, TRUE ~ 0)); x= dplyr::filter(x, rawValue >= 80 & rawValue <= 500 ); x = dplyr::mutate(x,valueAsNumber = log(rawValue)*log(age)); return(x)},
                                                                          function(x){ x = dplyr::mutate(x, rawValue = dplyr::case_when(unitConceptId == 8753 ~ rawValue*38.6, unitConceptId %in% c(8840,8954,9028 ) ~ rawValue, TRUE ~ 0)); x= dplyr::filter(x, rawValue >= 10 & rawValue <= 130 ); x = dplyr::mutate(x,valueAsNumber = log(rawValue)); return(x)},
                                                                          function(x){ x = dplyr::mutate(x, rawValue = dplyr::case_when(unitConceptId == 8753 ~ rawValue*38.6, unitConceptId %in% c(8840,8954,9028 ) ~ rawValue, TRUE ~ 0)); x= dplyr::filter(x, rawValue >= 10 & rawValue <= 130 ); x = dplyr::mutate(x,valueAsNumber = log(rawValue)*log(age)); return(x)}), 
                                                          points = c(11.853,-2.664,-7.990,1.769),
                                                          aggregateMethods = c('recent','recent','recent', 'recent'),
                                                          imputationValues = c(150,150,50,50),
                                                          ageInteractions = c(F,F,F,F),
                                                          lnAgeInteractions = c(F,T,F,T),
                                                          lnValues = c(T,T,T,T),
                                                          measurementIds = c(1,2,3,4), 
                                                          analysisIds = c(457,457,457,457)


                      ),

                      ageCovariateSettings = list(names = c('log(age)'),
                                                  ageMaps = list(function(x){return(log(x))}),
                                                  ageIds = 1,
                                                  analysisIds = c(458),
                                                  points = c(12.344)

                      ),

                      measurementCohortCovariateSettings = list(names = c('treated_Systolic_BP_mm_Hg','untreated_Systolic_BP_mm_Hg'),
                                                                atlasCovariateIds = c('18946','18946'),
                                                                types = c('in', 'out'),
                                                                conceptSets = list(c(3004249,3009395,3018586,3028737,3035856,4152194,4153323,4161413,4197167,4217013,4232915,4248525,4292062,21492239,37396683,44789315,44806887,45769778),
                                                                                   c(3004249,3009395,3018586,3028737,3035856,4152194,4153323,4161413,4197167,4217013,4232915,4248525,4292062,21492239,37396683,44789315,44806887,45769778)

                                                                ),
                                                                startDays = c(-365,-365),
                                                                endDays = c(60,60),
                                                                scaleMaps= list(function(x){ x = dplyr::filter(x, rawValue >= 50 & rawValue <= 250 ); return(x)},
                                                                                function(x){ x = dplyr::filter(x, rawValue >= 50 & rawValue <= 250 ); return(x)}
                                                                ),
                                                                points = c(1.797,1.764),
                                                                aggregateMethods = c('recent','recent'),
                                                                imputationValues = c(120,120),
                                                                ageInteractions = c(F,F),
                                                                lnAgeInteractions = c(F,F),
                                                                lnValues = c(T,T),
                                                                measurementIds = c(1,2), 
                                                                analysisIds = c(459,459),
                                                                baseUrl = 'https://yourWebAPI'


                      ),

                      finalMapping = function(x){ 1- 0.9144^exp(x-61.18)}
)

Step 5: Build the study package

Aftering adding the settings into the package, you now need to build the package. Use the standard process (in R studio press the 'Build' tab in the top right corner and then select the 'Install and Restart' button) to build the study package so an R library is created.



lhjohn/EmcWaltersDementiaModel documentation built on Feb. 25, 2021, 2:54 p.m.