R/RunPlp.R
In PatientLevelPrediction: Develop Clinical Prediction Models Using the Common Data Model

Documented in runPlp

# @file RunPlp.R
#
# Copyright 2025 Observational Health Data Sciences and Informatics
#
# This file is part of PatientLevelPrediction
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#' runPlp - Develop and internally evaluate a model using specified settings
#'
#' @description
#' This provides a general framework for training patient level prediction models.  The user can select 
#' various default feature selection methods or incorporate their own,  The user can also select from
#' a range of default classifiers or incorporate their own.  There are three types of evaluations for the model
#' patient (randomly splits people into train/validation sets) or year (randomly splits data into train/validation sets
#' based on index year - older in training, newer in validation) or both (same as year spliting but checks there are
#' no overlaps in patients within training set and validaiton set - any overlaps are removed from validation set)
#' 
#' @details
#' This function takes as input the plpData extracted from an OMOP CDM database and follows the specified settings to
#' develop and internally validate a model for the specified outcomeId.
#'
#' @param plpData                    An object of type \code{plpData} - the patient level prediction
#'                                   data extracted from the CDM.  Can also include an initial population as 
#'                                   plpData$popualtion.
#' @param outcomeId                  (integer) The ID of the outcome.                                       
#' @param analysisId                 (integer) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp.
#' @param analysisName               (character) Name for the analysis
#' @param populationSettings         An object of type \code{populationSettings} created using \code{createStudyPopulationSettings} that
#'                                   specifies how the data class labels are defined and addition any exclusions to apply to the 
#'                                   plpData cohort
#' @param splitSettings              An object of type \code{splitSettings} that specifies how to split the data into train/validation/test.  
#'                                   The default settings can be created using \code{createDefaultSplitSetting}.                               
#' @param sampleSettings             An object of type \code{sampleSettings} that specifies any under/over sampling to be done.
#'                                   The default is none.
#' @param featureEngineeringSettings An object of \code{featureEngineeringSettings} specifying any feature engineering to be learned (using the train data)                                                        
#' @param preprocessSettings         An object of \code{preprocessSettings}. This setting specifies the minimum fraction of 
#'                                   target population who must have a covariate for it to be included in the model training                            
#'                                   and whether to normalise the covariates before training  
#' @param modelSettings              An object of class \code{modelSettings} created using one of the function:
#'                                         \itemize{
#'                                         \item setLassoLogisticRegression() A lasso logistic regression model
#'                                         \item setGradientBoostingMachine() A gradient boosting machine
#'                                         \item setAdaBoost() An ada boost model
#'                                         \item setRandomForest() A random forest model
#'                                         \item setDecisionTree() A decision tree model
#'                                         \item setKNN() A KNN model
#'                                         } 
#' @param logSettings                An object of \code{logSettings} created using \code{createLogSettings} 
#'                                   specifying how the logging is done                                                                            
#' @param executeSettings            An object of \code{executeSettings} specifying which parts of the analysis to run
#' 
#'                                                                                 
#' @param saveDirectory         The path to the directory where the results will be saved (if NULL uses working directory)
#' @return
#' An plpResults object containing the following:
#'
#'  \itemize{
#'           \item model The developed model of class \code{plpModel}
#'           \item executionSummary A list containing the hardward details, R package details and execution time
#'           \item performanceEvaluation Various internal performance metrics in sparse format
#'           \item prediction The plpData cohort table with the predicted risks added as a column (named value)
#'           \item covariateSummary A characterization of the features for patients with and without the outcome during the time at risk
#'           \item analysisRef A list with details about the analysis
#'           } 
#' @examples
#' \donttest{ \dontshow{ # takes too long }
#' # simulate some data
#' data('simulationProfile')
#' plpData <- simulatePlpData(simulationProfile, n = 1000)
#' # develop a model with the default settings
#' saveLoc <- file.path(tempdir(), "runPlp")
#' results <- runPlp(plpData = plpData, outcomeId = 3, analysisId = 1,
#'                   saveDirectory = saveLoc) 
#' # to check the results you can view the log file at saveLoc/1/plpLog.txt
#' # or view with shiny app using viewPlp(results)
#' # clean up
#' unlink(saveLoc, recursive = TRUE)
#' }
#' @export
runPlp <- function(
  plpData,
  outcomeId = plpData$metaData$databaseDetails$outcomeIds[1],
  analysisId = paste(Sys.Date(), outcomeId, sep = "-"),
  analysisName = "Study details",
  populationSettings = createStudyPopulationSettings(),
  splitSettings = createDefaultSplitSetting(
    type = "stratified", 
    testFraction = 0.25, 
    trainFraction = 0.75, 
    splitSeed = 123, 
    nfold = 3), 
  sampleSettings = createSampleSettings(type = "none"),
  featureEngineeringSettings = createFeatureEngineeringSettings(type = "none"),
  preprocessSettings = createPreprocessSettings(
    minFraction = 0.001,
    normalize = TRUE
    ),
  modelSettings = setLassoLogisticRegression(),
  logSettings = createLogSettings(
    verbosity = "DEBUG",
    timeStamp = TRUE,
    logName = "runPlp Log"
    ),
  executeSettings = createDefaultExecuteSettings(),
  saveDirectory =  NULL
) {
  start <- Sys.time()
  
  if (is.null(saveDirectory)) {
    stop("Please provide a saveDirectory")
  }

  # start log 
  analysisPath <- file.path(saveDirectory, analysisId)
  logSettings$saveDirectory <- analysisPath
  logSettings$logFileName <- "plpLog"
  logger <- do.call(createLog, logSettings)
  ParallelLogger::registerLogger(logger)
  on.exit(closeLog(logger))
  
  #check inputs + print 
  settingsValid <- tryCatch(
    {
      checkInputs(
        inputs = list(
          plpData = plpData, 
          outcomeId = outcomeId,
          populationSettings = populationSettings, 
          splitSettings = splitSettings,
          sampleSettings = sampleSettings,
          featureEngineeringSettings = featureEngineeringSettings, 
          preprocessSettings = preprocessSettings, 
          modelSettings = modelSettings,
          executeSettings = executeSettings
        )
      )
    },
    error = function(e) {
      ParallelLogger::logError(e)
      return(NULL)
    })
  
  if (is.null(settingsValid)) {
    stop("Settings are invalid - check log for error message")
  }
  
  # log the start time:
  executionDateTime <- Sys.time()
  
  # print the header in the log
  tryCatch({
    printHeader(
      plpData, 
      plpData$metaData$databaseDetails$targetId, 
      outcomeId, 
      analysisId, 
      analysisName,
      executionDateTime
    )
  })
  
  # create the population
  if (!is.null(plpData$population)) {
    ParallelLogger::logInfo("Using existing population")
    population <- plpData$population
  } else {
    ParallelLogger::logInfo("Creating population")
    population <- tryCatch({
      do.call(createStudyPopulation,
              list(plpData = plpData,
                   outcomeId = outcomeId,
                   populationSettings = populationSettings,
                   population = plpData$population
                   )
      )},
    error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
  }
    
  if (is.null(population)) {
    stop("population NULL")
  }
  
  if (executeSettings$runSplitData) {
    # split the data (test/train/cv) + summarise at the end
    data <- tryCatch(
      {
        splitData(
          plpData = plpData,
          population = population,
          splitSettings = splitSettings
        )
      },
      error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
    if (is.null(data)) {
      stop("data NULL after splitting")
    }
    
    dataSummary(data)
  } 
  
  if (executeSettings$runSampleData) {
    # sampling
    data$Train <- tryCatch(
      {
        sampleData(
          trainData = data$Train, 
          sampleSettings = sampleSettings
        )
      },
      error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
    if (is.null(data$Train)) {
      stop("train data NULL after sample")
    }
    dataSummary(data)
  }
  
  if (executeSettings$runFeatureEngineering) {
    
    data$Train <- tryCatch(
      {
        featureEngineer(
          data = data$Train, 
          featureEngineeringSettings = featureEngineeringSettings
        )
      },
      error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
    if (is.null(data$Train)) {
      stop("train data NULL after feature engineering")
    }
    dataSummary(data)
  }
  
  if (executeSettings$runPreprocessData) {
    
    data$Train$covariateData <- tryCatch(
      {
        preprocessData(
          covariateData = data$Train$covariateData, 
          preprocessSettings = preprocessSettings
        )
      },
      error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
    if (is.null(data$Train$covariateData)) {
      stop("train data NULL after preprocessing")
    }
    dataSummary(data)
  }
  
  
  model <- NULL
  prediction <- NULL
  performance <- NULL
  if (executeSettings$runModelDevelopment) {
    # fit model
    settings <- list(
      trainData = data$Train, 
      modelSettings = modelSettings,
      analysisId = analysisId,
      analysisPath = analysisPath
    )
    
    ParallelLogger::logInfo(sprintf("Training %s model",
      settings$modelSettings$name))  

    model <- tryCatch(
      {
        do.call(fitPlp, settings)
      },
      error = function(e) {
        ParallelLogger::logError(e)
        return(NULL)
      }
    )
    
    if (!is.null(model)) {
      prediction <- model$prediction
      # remove prediction from model
      model$prediction <- NULL
      
      #apply to test data if exists:
      if ("Test" %in% names(data)) {
        predictionTest <- tryCatch(
          {
            predictPlp(
              plpModel = model, 
              plpData = data$Test,
              population = data$Test$labels
            )
          },
          error = function(e) {
            ParallelLogger::logError(e)
            return(NULL)
          }
        )
        
        predictionTest$evaluationType <- "Test"
        
        if (!is.null(predictionTest)) {
          prediction <- rbind(predictionTest, prediction[, colnames(prediction) != "index"])
        } 
        
        
      }
      
      # evaluate model
      performance <- tryCatch(
        {
          evaluatePlp(prediction, typeColumn = "evaluationType")
        },
        error = function(e) {
          ParallelLogger::logError(e)
          return(NULL)
        }
      )
    }
    
  }
  
  
  # covariateSummary
  covariateSummaryResult <- NULL
  if (executeSettings$runCovariateSummary) {
    
    if (!is.null(data$Test)) {
      strata <- data.frame(
        rowId = c(
          data$Train$labels$rowId, 
          data$Test$labels$rowId 
        ),
        strataName = c(
          rep("Train", nrow(data$Train$labels)), 
          rep("Test", nrow(data$Test$labels))
        )
      )
    } else {
      strata <- data.frame(
        rowId = c(data$Train$labels$rowId),
        strataName = c(rep("Train", nrow(data$Train$labels)))
      )
    }
    
    variableImportance <- plpData$covariateData$covariateRef %>% 
      dplyr::mutate(covariateValue = 0) %>% 
      dplyr::select("covariateId", "covariateValue") %>% 
      dplyr::collect()
    if (!is.null(model)) {
      if (!is.null(model$covariateImportance)) {
        variableImportance <- model$covariateImportance %>% 
          dplyr::select("covariateId", "covariateValue")
      }
    }
    
    # apply FE if it is used
    featureEngineering <- NULL
    if (!is.null(model)) {
      featureEngineering <- model$preprocessing$featureEngineering
    }
    
    covariateSummaryResult <- do.call(covariateSummary,   
      list(
        covariateData = plpData$covariateData,
        cohort = population %>% dplyr::select("rowId"),
        labels = population %>% dplyr::select("rowId", "outcomeCount"), 
        strata = strata,
        variableImportance = variableImportance,
        featureEngineering = featureEngineering
        )
    )
  
  }
  
  #  ExecutionSummary details:
  # log the end time:
  endTime <- Sys.time()
  TotalExecutionElapsedTime <- difftime(endTime, executionDateTime, units = "mins")
  
  executionSummary <- list(
    PackageVersion = list(
      rVersion = R.Version()$version.string,
      packageVersion = utils::packageVersion("PatientLevelPrediction")
    ),
    PlatformDetails = list(
      platform = R.Version()$platform,
      cores = Sys.getenv("NUMBER_OF_PROCESSORS"),
      RAM = memuse::Sys.meminfo()[1]
      ),
    TotalExecutionElapsedTime = TotalExecutionElapsedTime,
    ExecutionDateTime = executionDateTime,
    Log = logSettings$logFileName # location for now
    #Not available at the moment: CDM_SOURCE -  meta-data containing CDM version, release date, vocabulary version
  )
  
  # if model is NULL convert it to list for saving 
  if (is.null(model)) {
    model <- list(noModel = TRUE)
    attr(model, "predictionFunction") <- "noModel"
    attr(model, "saveType") <- "RtoJson"
    class(model) <- "plpModel"
  }
  
  results <- list(
    executionSummary = executionSummary, 
    model = model,
    prediction = prediction,
    performanceEvaluation = performance,
    covariateSummary = covariateSummaryResult,
    analysisRef = list(
      analysisId = analysisId,
      analysisName = analysisName
      )
    )
  class(results) <- c("runPlp")
  
  ParallelLogger::logInfo("Run finished successfully.")
  # save the results
  ParallelLogger::logInfo(paste0("Saving PlpResult"))
  tryCatch(savePlpResult(results, file.path(analysisPath, "plpResult")),
    finally = ParallelLogger::logTrace("Done."))
  ParallelLogger::logInfo(paste0("plpResult saved to ..\\", analysisPath, "\\plpResult"))
  delta <- Sys.time() - start
  ParallelLogger::logInfo(paste0("runPlp time taken: ", signif(delta, 3), " ", attr(delta, "units")))
  return(results)
  
}
Any scripts or data that you put into this service are public.
PatientLevelPrediction documentation built on April 3, 2025, 9:58 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
PatientLevelPrediction
Develop Clinical Prediction Models Using the Common Data Model

R/RunPlp.R
In PatientLevelPrediction: Develop Clinical Prediction Models Using the Common Data Model

Defines functions runPlp

Documented in runPlp

Try the PatientLevelPrediction package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

PatientLevelPrediction Develop Clinical Prediction Models Using the Common Data Model

R/RunPlp.R In PatientLevelPrediction: Develop Clinical Prediction Models Using the Common Data Model

Defines functions runPlp

Documented in runPlp

Try the PatientLevelPrediction package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

PatientLevelPrediction
Develop Clinical Prediction Models Using the Common Data Model

R/RunPlp.R
In PatientLevelPrediction: Develop Clinical Prediction Models Using the Common Data Model