knitr::opts_chunk$set(echo = TRUE)
library(dplyr)

Baseline Regularization is a package for estimating the effects of drugs on condition risk, designed to work with data in OMOP CDM format. This document contains a few examples illustrating how to run Baseline Regularization. For this vignette we will assume you have loaded the package.

library(BaselineRegularization)
default_params <- defineBRParameters()

Data

Typically, one would connect to a database containing the CDM using the DBI package. For the examples here, we will use a small subset of CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (SynPUF) included with the package.

data("synpuf_mini")

First example: one condition, all drugs

The simplest way to use Baseline Regularization is to apply it directly to the data to predict the effects of all drugs in the data on the risk of one condition of interest. We separate the process into three major steps:

  1. generating feature matrix and response vector from the data,
  2. fitting baseline regularization, and
  3. inspecting the results.

Preparing the data: generating the feature matrix and response vector

To generate the feature matrix and response vector, we need a condition of interest whose risk we wish to model, and data from which to extract the drug features and patient timelines. For our example, our condition of interest will be Aplastic anemia.

response_event = 137829 # The concept id for "Aplastic anemia"

When available, we recommend using the derived CDM tables for analysis; our SynPUF subset only includes the clinical tables, so we will use those for this example.

br_data <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_event )

See [Using derived CDM tables for preparing the data] for an example with derived tables. See [Using a database source for preparing the data] for an example using a database source.

Fitting baseline regularization

Once the data is built, running baseline regularization is simple:

fit1 <- fitBaselineRegularization( br_data )

This runs baseline regularization with the default parameters, which are

knitr::kable( data.frame("Default Value" = unlist(default_params), check.names = F) )

(see the documentation for defineBRParameters() for further details).

You may set different parameters using [defineBRParameters()] like so:

parameters <- defineBRParameters( lambda1 = 1 )

fit2 <- fitBaselineRegularization( br_data, parameters )

Note that only the parameters you would like to change from the default need to be specified; the rest keep their default values.
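As a quick check, inspecting the parameter object shows which values are in effect (here we assume, as suggested by the use of unlist() above, that the parameters are stored as a plain named list):

# Only lambda1 differs from the defaults shown in the table above
str( parameters )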

Inspecting the results
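The estimated coefficients can be extracted from a fit with getCoefficients, the same function used in the examples below; a table of concept names can optionally be attached via its names_table argument (see the second example):

getCoefficients( fit1 )
getCoefficients( fit2 )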

Second example: managing drugs and drug categories

Without guidance from the user, Baseline Regularization treats each drug that has a distinct concept_id and occurs in the data as a feature for the model. However, we are often interested in examining broader categories of drugs. As an example, consider the drug categories used in the OMOP task [TODO: cite]:

data("omop_concept_names")
knitr::kable(
  ( doi_concept_names <- omop_concept_names %>%
    select( concept_id, concept_name ) %>%
    filter( concept_id %in% 600000001:600000010 ) %>%
    arrange( concept_id ) ) %>%
    mutate( concept_id = as.character(concept_id) ), align = "rl" )

These drug categories are broad. In the OMOP CDM vocabulary, the concept_ancestor table can be used to infer which concept_id's are encompassed by such broader concepts. Here, we load a small subset of this table included with our package:

data( "omop_doi_concept_ancestor" )

To illustrate, here are a few of the drug concept ids in our data that fall into the "ACE Inhibitor" category:

data( "cdm_some_concept_names" )
drug_concept_names <-
  synpuf_mini$drug_exposure %>%
  distinct( drug_concept_id ) %>%
  left_join( cdm_some_concept_names, by = c("drug_concept_id"="concept_id") )
doi_drug_names <- drug_concept_names %>%
  inner_join( omop_doi_concept_ancestor, by = c("drug_concept_id"="descendant_concept_id") )
doi_drug_names %>%
  filter( ancestor_concept_id == 600000001 ) %>%
  select( concept_id = drug_concept_id, concept_name ) %>%
  head() %>%
  mutate( concept_id = as.character(concept_id) ) %>%
  knitr::kable()

To make use of the concept_ancestor table, we introduce the function ancestorConceptProcessor, which we can use as follows:

drug_concept_processor <- ancestorConceptProcessor(
  concept_ancestor = omop_doi_concept_ancestor,
  concept_list = 600000001:600000010,
  handle_remaining = "drop" ) 

See ?ancestorConceptProcessor for more details. Briefly, concept_ancestor is the ancestry mapping table, concept_list lists the ancestor concept ids that should become the model's drug features, and handle_remaining controls what happens to drugs that do not descend from any listed concept (here we "drop" them).

Then, we add the processor to the call of the data preparation function.

br_data_omop_drugs <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_event,
  drug_concept_processor = drug_concept_processor )

Nothing changes when fitting the model:

fit_omop_drugs <- fitBaselineRegularization( br_data_omop_drugs, defineBRParameters( lambda1 = 0.01 ) )

And we can inspect the results with getCoefficients as before:

getCoefficients( fit_omop_drugs, names_table = doi_concept_names )

We can see that now each beta coefficient corresponds to a high-level drug concept that we have specified.

Third example: estimating the risk of multiple conditions

It is often of interest to use the same data to model the risk of multiple conditions. We provide a convenient way to run Baseline Regularization for multiple condition targets:

response_events = c(
  137829, # The concept id for "Aplastic anemia"
  315296, # The concept id for "Preinfarction syndrome"
  432585) # The concept id for "Blood coagulation disorder"

br_data_multiple_conditions <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_events, # <- We simply pass the vector of concept ids to the response_event parameter
  drug_concept_processor = drug_concept_processor ) # We can still specify drug categories

# The call to the fit function doesn't change
fit_multiple <- fitBaselineRegularization( br_data_multiple_conditions, defineBRParameters( lambda1 = 0.01 ) )

This yields a list of fits, one for each condition being modeled:

getCoefficients( fit_multiple[[1]], names_table = doi_concept_names )
getCoefficients( fit_multiple[[2]], names_table = doi_concept_names )
getCoefficients( fit_multiple[[3]], names_table = doi_concept_names )
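
If you prefer, the coefficient tables for all three conditions can also be collected in one step, for example with lapply (a small convenience sketch rather than part of the package API):

all_coefficients <- lapply( fit_multiple, getCoefficients, names_table = doi_concept_names )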

Fourth example: using condition categories

Just as we specified drug categories, we can specify condition categories. Again, we will use the OMOP task as an example:

data("omop_hoi_conditions_map")

Next, construct a processor for condition concept IDs:

condition_concept_processor <- ancestorConceptProcessor(
  concept_ancestor = omop_hoi_conditions_map,
  concept_list = 501L:510L,
  ancestor_column = "hoi_id", # Note that omop_hoi_conditions_map uses custom column names
  descendant_column = "condition_concept_id" )

Simply add the processor to the data preparation call:

omop_events = 501L:510L # These are the IDs we gave the OMOP event categories

br_data_omop_conditions <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = omop_events,
  drug_concept_processor = drug_concept_processor, # We can still specify drug categories
  condition_concept_processor = condition_concept_processor )

# The call to the fit function doesn't change
fit_omop <- fitBaselineRegularization( br_data_omop_conditions, defineBRParameters( lambda1 = 0.01 ) )
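
As in the third example, fitting multiple response events should yield a list of fits, so each event category can be inspected with getCoefficients as before; since the drug features are unchanged, the drug names table from the second example still applies:

getCoefficients( fit_omop[[1]], names_table = doi_concept_names )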

Using a database source for preparing the data

Typically, the patient data resides in a database, and we designed prepareBRData to work with databases. There are two approaches to using BaselineRegularization with a database: explicitly, by providing a connection object, or implicitly, by using dplyr (and dbplyr) table references.

Providing a connection object

We use the DBI API (available on CRAN) for database access. To create a connection object, your code will look something like this:

con <- DBI::dbConnect( RPostgreSQL::PostgreSQL() # Database driver
                     , host = "localhost"        # Host name/IP
                     , user = "user"             # Username
                     , dbname = "omop_example"   # Database name
                     , password = rstudioapi::askForPassword("Database Password") )
                     # This opens a dialogue if you're using RStudio.
                     # Never include passwords in source code.

The particular arguments to DBI::dbConnect will depend on the database driver selected; see the DBI documentation for a more comprehensive overview of connecting to databases with DBI.
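
For instance, a file-based SQLite database (convenient for local testing; this assumes the RSQLite package is installed and the file already contains the CDM tables) can be opened with:

con <- DBI::dbConnect( RSQLite::SQLite(), dbname = "omop_example.sqlite" )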

You can also use the DatabaseConnector package available on CRAN and at https://github.com/OHDSI/DatabaseConnector to create the connection object:

library(DatabaseConnector)

connectionDetails <- createConnectionDetails(dbms="postgresql", 
                                             server="localhost",
                                             user="user",
                                             password=rstudioapi::askForPassword("Database Password"),
                                             schema="cdm_v4")
con <- connect(connectionDetails)

(see the documentation for DatabaseConnector for more details).

Once you have a connection object, you can simply pass it to prepareBRData:

# Extract relevant data
br_data <- prepareBRData( con = con, response_event = response_event )

This will try to use the default lower-case table names for the clinical data, i.e. tables named after the corresponding parameters: observation_period, condition_occurrence, condition_era, drug_exposure, and drug_era.

Some common customization options:

br_data <- prepareBRData( con = con,
                          observation_period = "OBSERVATION_PERIOD", # Different capitalization from the default
                          drug_era = "my_drug_era_table", # Different table name to use
                          condition_era = NULL, # Don't use condition_era from the DB
                          response_event = response_event )

As in the examples above using data frames, one can use condition and drug category processors and pass multiple response events at once. ancestorConceptProcessor can also accept a con parameter, in which case the ancestry table can be specified by name.
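
For instance, the drug-category processor from the second example might be built from an ancestry table stored in the database. This is only a sketch, assuming the table name is passed through the concept_ancestor argument and that the table is called concept_ancestor in the database:

drug_concept_processor <- ancestorConceptProcessor(
  con = con,
  concept_ancestor = "concept_ancestor", # table name in the database (assumed)
  concept_list = 600000001:600000010,
  handle_remaining = "drop" )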

Using dplyr and dbplyr

Instead of passing a connection object explicitly, one can use dplyr and dbplyr to get R data table-like representations of database tables. This can be useful if you want to perform some of your own data manipulation before passing the tables to prepareBRData.

E.g., given a connection object (see above), one can use

observation_period_tbl <- tbl( con, "observation_period" )

and further manipulate observation_period_tbl with dplyr verbs. See the documentation for dplyr and dbplyr for more details.

This is in fact what prepareBRData does internally when you specify a connection object and provide table names.
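
For example, one might restrict the observation periods before handing the tables over (an illustrative sketch; the column name follows the OMOP CDM and the date cut-off is arbitrary):

observation_period_tbl <- tbl( con, "observation_period" ) %>%
  filter( observation_period_start_date >= "2008-01-01" )

br_data <- prepareBRData(
  observation_period = observation_period_tbl,
  condition_occurrence = tbl( con, "condition_occurrence" ),
  drug_exposure = tbl( con, "drug_exposure" ),
  response_event = response_event )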

Using derived CDM tables for preparing the data

One can provide the drug_era and condition_era tables instead of the drug_exposure and condition_occurrence tables.

# Assuming observation_period, condition_era, and drug_era hold the corresponding
# derived CDM tables (for example, loaded from your database):
br_data <- prepareBRData(
  observation_period = observation_period,
  condition_era = condition_era,
  drug_era = drug_era,
  response_event = response_event )

When both are provided, prepareBRData prioritizes the era tables (drug_era, condition_era) over drug_exposure and condition_occurrence.


