knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
Baseline Regularization is a package for estimating the effects of drugs on condition risk, designed to work with data in OMOP CDM format. This document contains a few examples illustrating how to run Baseline Regularization. For this vignette we will assume you have loaded the package.
library(BaselineRegularization)
default_params <- defineBRParameters()
Typically, one would connect to a database containing the CDM using the DBI package. For the examples here, we will use a small subset of CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (SynPUF) included with the package.
data("synpuf_mini")
The simplest way to use Baseline Regularization is to apply it directly to the data to predict the effects of all drugs in the data on the risk of one condition of interest. We separate the process into three major steps:
To generate the feature matrix and response vector, we need a condition of interest whose risk we wish to model, and data from which to extract the drug features and patient timelines. For our example, the condition of interest will be Aplastic anemia.
response_event = 137829 # The concept id for "Aplastic anemia"
When available, we recommend using derived CDM tables for analysis, but in our SynPUF data, we only have clinical tables available, and we will use them for this example.
br_data <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_event
)
See [Using derived CDM tables for preparing the data] for an example with derived tables. See [Using a database source for preparing the data] for an example using a database source.
Once the data is built, running baseline regularization is simple:
fit1 <- fitBaselineRegularization( br_data )
This runs baseline regularization with the default parameters, which are
knitr::kable( data.frame("Default Value" = unlist(default_params), check.names = F) )
(see the documentation for defineBRParameters() for further details).
You may set different parameters using [defineBRParameters()] like so:
parameters <- defineBRParameters( lambda1 = 1 )
fit2 <- fitBaselineRegularization( br_data, parameters )
Note that only the parameters you'd like to change from the default need to be specified; the rest keep their default values.
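The "defaults plus overrides" behavior described above can be illustrated with a small base R sketch. This is not the package's implementation: `define_parameters` is a hypothetical stand-in for `defineBRParameters()`, and every default value and every parameter name other than `lambda1` is made up for illustration.

```r
# Hypothetical defaults; only the name lambda1 comes from the text,
# and all values here are invented for the example.
default_values <- list(lambda1 = 0.5, tolerance = 1e-6, max_iterations = 100L)

# Hypothetical helper: start from the defaults and overwrite only
# the entries the caller supplies.
define_parameters <- function(...) {
  utils::modifyList(default_values, list(...))
}

# Only lambda1 is specified; the other entries keep their defaults.
custom <- define_parameters(lambda1 = 1)
str(custom)
```

The design choice this mirrors is that a call such as `defineBRParameters( lambda1 = 1 )` only needs to name what differs from the defaults.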
Without guidance from the user, Baseline Regularization treats each drug that has a distinct concept_id and occurs in the data as a feature for the model. However, we are often interested in examining broader categories of drugs. As an example, consider the drug categories used in the OMOP task [TODO: cite]:
data("omop_concept_names")
knitr::kable(
  ( doi_concept_names <- omop_concept_names %>%
      select( concept_id, concept_name ) %>%
      filter( concept_id %in% 600000001:600000010 ) %>%
      arrange( concept_id ) ) %>%
    mutate( concept_id = as.character(concept_id) ),
  align = "rl"
)
These drug categories are broad.
In the OMOP CDM vocabulary, the concept_ancestor table can be used to infer which concept_ids are encompassed by such broader concepts.
Here, we load a small subset of this table included with our package:
data( "omop_doi_concept_ancestor" )
To illustrate, here are a few drug concept ids in our data that fall into the "ACE Inhibitor" category:
data( "cdm_some_concept_names" )
drug_concept_names <- synpuf_mini$drug_exposure %>%
  distinct( drug_concept_id ) %>%
  left_join( cdm_some_concept_names, by = c("drug_concept_id"="concept_id") )
doi_drug_names <- drug_concept_names %>%
  inner_join( omop_doi_concept_ancestor, by = c("drug_concept_id"="descendant_concept_id") )
doi_drug_names %>%
  filter( ancestor_concept_id == 600000001 ) %>%
  select( concept_id = drug_concept_id, concept_name ) %>%
  head() %>%
  mutate( concept_id = as.character(concept_id) ) %>%
  knitr::kable()
To make use of the concept_ancestor table, we introduce the function ancestorConceptProcessor, which we can use as follows:
drug_concept_processor <- ancestorConceptProcessor(
  concept_ancestor = omop_doi_concept_ancestor,
  concept_list = 600000001:600000010,
  handle_remaining = "drop"
)
See ?ancestorConceptProcessor for more details; briefly, the parameters are:
- concept_ancestor: the table that defines the concept ancestor relationships
- concept_list: the array of ancestor concepts to use as features (in this case the concept_id numbers are conveniently consecutive numbers from 600000001 to 600000010)
- handle_remaining: a string that defines how to handle concepts that aren't descendants of the concepts in concept_list: "drop" (the default) simply removes them from the analysis.

Then, we add it to the call of the data preparation function.
br_data_omop_drugs <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_event,
  drug_concept_processor = drug_concept_processor
)
Nothing changes when fitting the model:
fit_omop_drugs <- fitBaselineRegularization( br_data_omop_drugs, defineBRParameters( lambda1 = 0.01 ) )
And we can inspect the results with getCoefficients as before:
getCoefficients( fit_omop_drugs, names_table = doi_concept_names )
We can see that now each beta coefficient corresponds to a high-level drug concept that we have specified.
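To make the role of the concept_ancestor table more concrete, here is a minimal base R sketch of the idea: specific drug concepts are relabeled with their broad ancestor category, and concepts without a listed ancestor are dropped. This is an illustration under stated assumptions, not how `ancestorConceptProcessor` is implemented; all table contents and ids below are made up.

```r
# Toy ancestor table: which specific concepts fall under which broad category.
# All ids are invented for illustration.
toy_concept_ancestor <- data.frame(
  ancestor_concept_id   = c(900001L, 900001L, 900002L),
  descendant_concept_id = c(11L, 12L, 21L)
)

# Toy drug exposures; concept 99 has no ancestor in the table above.
toy_drug_exposure <- data.frame(
  person_id       = c(1L, 1L, 2L, 2L),
  drug_concept_id = c(11L, 99L, 12L, 21L)
)

# An inner join keeps only exposures with a listed ancestor, which mirrors
# the idea behind handle_remaining = "drop".
mapped <- merge(
  toy_drug_exposure, toy_concept_ancestor,
  by.x = "drug_concept_id", by.y = "descendant_concept_id"
)
mapped <- mapped[order(mapped$person_id, mapped$drug_concept_id), ]

# The model features would then be the broad categories, not the raw concepts.
feature_ids <- sort(unique(mapped$ancestor_concept_id))
print(mapped)
print(feature_ids)
```

In this toy data, the exposure with concept 99 disappears after the join, and the two exposures with concepts 11 and 12 collapse into the same category.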
It is often of interest to use the same data to model the risk of multiple conditions. We provide a convenient way to run Baseline Regularization for multiple condition targets:
response_events = c( 137829,  # The concept id for "Aplastic anemia"
                     315296,  # The concept id for "Preinfarction syndrome"
                     432585 ) # The concept id for "Blood coagulation disorder"
br_data_multiple_conditions <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = response_events, # <- We simply pass the array of concept ids to the response_event parameter
  drug_concept_processor = drug_concept_processor ) # We can still specify drug categories
# The call to the fit function doesn't change
fit_multiple <- fitBaselineRegularization( br_data_multiple_conditions, defineBRParameters( lambda1 = 0.01 ) )
This yields a list of fits, one for each condition being modeled:
getCoefficients( fit_multiple[[1]], names_table = doi_concept_names )
getCoefficients( fit_multiple[[2]], names_table = doi_concept_names )
getCoefficients( fit_multiple[[3]], names_table = doi_concept_names )
Just as we specified drug categories, we can specify condition categories. Again, we will use the OMOP task as an example:
data("omop_hoi_conditions_map")
Next, construct a processor for condition concept IDs:
condition_concept_processor <- ancestorConceptProcessor(
  concept_ancestor = omop_hoi_conditions_map,
  concept_list = 501L:510L,
  ancestor_column = "hoi_id", # Note that omop_hoi_conditions_map uses custom column names
  descendant_column = "condition_concept_id"
)
Simply add the processor to the data preparation call:
omop_events = 501L:510L # These are the IDs we gave the OMOP event categories
br_data_omop_conditions <- prepareBRData(
  observation_period = synpuf_mini$observation_period,
  condition_occurrence = synpuf_mini$condition_occurrence,
  drug_exposure = synpuf_mini$drug_exposure,
  response_event = omop_events,
  drug_concept_processor = drug_concept_processor, # We can still specify drug categories
  condition_concept_processor = condition_concept_processor
)
# The call to the fit function doesn't change
fit_omop <- fitBaselineRegularization( br_data_omop_conditions, defineBRParameters( lambda1 = 0.01 ) )
Typically, the patient data resides in a database, and we designed prepareBRData to work with databases.
There are two approaches to working with BaselineRegularization using a database: explicitly, by providing a connection object, or implicitly, using dplyr (and dbplyr).
We use the DBI API (available on CRAN) for database access.
To create a connection object, your code will look something like this:
con <- DBI::dbConnect(
  RPostgreSQL::PostgreSQL()   # Database driver
  , host = "localhost"        # Host name/IP
  , user = "user"             # Username
  , dbname = "omop_example"   # Database name
  , password = rstudioapi::askForPassword("Database Password")
)
# This opens a dialogue if you're using RStudio.
# Never include passwords in source code.
The particular arguments to DBI::dbConnect will depend on the database driver selected.
See this guide for a more comprehensive overview of using DBI for database connection.
You can also use the DatabaseConnector package available on CRAN and at https://github.com/OHDSI/DatabaseConnector to create the connection object:
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(dbms="postgresql",
                                             server="localhost",
                                             user="user",
                                             password=rstudioapi::askForPassword("Database Password"),
                                             schema="cdm_v4")
con <- connect(connectionDetails)
(see the documentation for DatabaseConnector for more details).
Once you have a connection object, you can simply pass it to prepareBRData:
# Extract relevant data
br_data <- prepareBRData( con = con, response_event = response_event )
This will try to use the default lower-case names for clinical data, specifically:
- drug_era table; if no such table is found, will try to derive drug intervals from the drug_exposure table; if neither is found, fails with an error.
- condition_era table; if no such table is found, will try to use the condition_occurrence table; if neither is found, fails with an error.
- observation_period table; if no such table is found, will try to infer the observation periods (one per patient) from as many of the following tables as are found in the database:
  - drug_era
  - condition_era
  - drug_exposure
  - condition_occurrence
  - visit_occurrence
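The last fallback above, inferring one observation period per patient from event tables, can be sketched in base R as follows. This is only an illustration of the idea (earliest start and latest end over all available events); the package may well compute it differently, for example inside the database. The toy tables and the simplified column names `start_date` and `end_date` are assumptions made for this example.

```r
# Toy event tables with simplified, made-up columns and values.
toy_drug_exposure <- data.frame(
  person_id  = c(1L, 1L, 2L),
  start_date = as.Date(c("2008-01-10", "2008-06-01", "2009-03-05")),
  end_date   = as.Date(c("2008-02-10", "2008-06-30", "2009-03-20"))
)
toy_condition_occurrence <- data.frame(
  person_id  = c(1L, 2L),
  start_date = as.Date(c("2007-12-01", "2009-08-15")),
  end_date   = as.Date(c("2007-12-01", "2009-08-15"))
)

# Stack whichever event tables are available.
all_events <- rbind(toy_drug_exposure, toy_condition_occurrence)

# One row per patient: earliest start and latest end across all events.
by_person <- split(all_events, all_events$person_id)
inferred_observation_period <- do.call(rbind, lapply(by_person, function(events) {
  data.frame(
    person_id = events$person_id[1],
    observation_period_start_date = min(events$start_date),
    observation_period_end_date   = max(events$end_date)
  )
}))
rownames(inferred_observation_period) <- NULL
print(inferred_observation_period)
```

A period inferred this way can only be as wide as the recorded events, so a real observation_period table is preferable when one exists.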
Some additional common customization options include giving a table argument a different table name, or disabling the use of a table by setting its argument to NULL:
br_data <- prepareBRData(
  con = con,
  observation_period = "OBSERVATION_PERIOD", # Different capitalization from the default
  drug_era = "my_drug_era_table",            # Different table name to use
  condition_era = NULL,                      # Don't use condition_era from the DB
  response_event = response_event
)
As in the examples above using data frames, one can use condition and drug category processors and pass multiple response events at once.
ancestorConceptProcessor can also accept a con parameter along with which you can specify the ancestry table by name.
dplyr and dbplyr
Instead of passing a connection object explicitly, one can use dplyr and dbplyr to get R data table-like representations of database tables.
This can be useful if you want to perform some of your own data manipulation before passing the tables to prepareBRData.
E.g., given a connection object (see above), one can use
observation_period_tbl <- tbl( con, "observation_period" )
and further manipulate observation_period_tbl with dplyr verbs.
See the documentation for dplyr and dbplyr for more details.
This is in fact what prepareBRData does internally when you specify a connection object and provide table names.
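As a self-contained illustration of the implicit approach, the sketch below uses an in-memory SQLite database in place of a real CDM database. The use of RSQLite and the toy table contents are assumptions for this example only; it requires the DBI, RSQLite, dplyr and dbplyr packages to be installed.

```r
library(dplyr)

# In-memory SQLite database standing in for a real CDM database (assumption).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy observation_period table with made-up values; dates are stored as text
# to keep the example independent of driver-specific date handling.
DBI::dbWriteTable(con, "observation_period", data.frame(
  person_id = 1:3,
  observation_period_start_date = c("2008-01-01", "2008-03-15", "2009-02-01"),
  observation_period_end_date   = c("2010-12-31", "2009-06-30", "2010-12-31")
))

# A lazy reference to the database table; no rows are pulled into R yet.
observation_period_tbl <- tbl(con, "observation_period")

# dplyr verbs are translated to SQL and evaluated in the database.
selected_tbl <- observation_period_tbl %>%
  filter(person_id %in% c(1L, 3L))

# collect() runs the query and returns a local data frame.
selected <- collect(selected_tbl)
print(selected)

DBI::dbDisconnect(con)
```

A lazily filtered table such as `selected_tbl` is the kind of object the text above describes manipulating before handing tables to prepareBRData.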
One can provide the drug_era and condition_era tables instead of the drug_exposure and condition_occurrence tables.
br_data <- prepareBRData(
  observation_period = observation_period,
  condition_era = condition_era,
  drug_era = drug_era,
  response_event = response_event
)
As mentioned above, prepareBRData prioritizes the former over the latter when both are provided.