library(SelfControlledCaseSeries)
outputFolder <- "d:/temp/vignetteSccs"
folderExists <- dir.exists(outputFolder)

Introduction

This vignette describes how you can use the SelfControlledCaseSeries package to perform a single Self-Controlled Case Series (SCCS) study. We will walk through all the steps needed to perform an exemplar study, and we have selected the well-studied topic of the effect of aspirin on epistaxis.

Terminology

The following terms are used consistently throughout the package:

Installation instructions

Before installing the SelfControlledCaseSeries package make sure you have Java available. For Windows users, RTools is also necessary. See these instructions for properly configuring your R environment.

The SelfControlledCaseSeries package is maintained in a Github repository, and can be downloaded and installed from within R using the remotes package:

install.packages("remotes")
library(remotes)
install_github("ohdsi/SelfControlledCaseSeries") 

Once installed, you can type library(SelfControlledCaseSeries) to load the package.

Overview

In the SelfControlledCaseSeries package a study requires at least three steps:

  1. Loading the necessary data from the database.

  2. Transforming the data into a format suitable for an SCCS study. This step can be broken down in multiple tasks:

  3. Defining a study population, people having a specific outcome, with possible further restrictions such as age.

  4. Creating covariates based on the variables extracted from the database, such as defining risk windows based on exposures.
  5. Transforming the data into non-overlapping time intervals, with information on the various covariates and outcomes per interval.

  6. Fitting the model using conditional Poisson regression.

In the following sections these steps will be demonstrated for increasingly complex studies.

Studies with a single drug

Configuring the connection to the server

We need to tell R how to connect to the server where the data are. SelfControlledCaseSeries uses the DatabaseConnector package, which provides the createConnectionDetails function. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:

connectionDetails <- createConnectionDetails(dbms = "postgresql", 
                                             server = "localhost/ohdsi", 
                                             user = "joe", 
                                             password = "supersecret")

cdmDatabaseSchema <- "my_cdm_data"
cohortDatabaseSchema <- "my_results"
cohortTable <- "my_cohorts"
options(sqlRenderTempEmulationSchema = NULL)

The last lines define the cdmDatabaseSchema, cohortDatabaseSchema, and cohortTable variables. We'll use these later to tell R where the data in CDM format live, where we have stored our cohorts of interest, and what version CDM is used. Note that for Microsoft SQL Server, 'databaseSchemas' need to specify both the database and the schema, so for example cdmDatabaseSchema <- "my_cdm_data.dbo". For database platforms that do not support temp tables, such as Oracle, it is also necessary to provide a schema where the user has write access that can be used to emulate temp tables. PostgreSQL supports temp tables, so we can set options(sqlRenderTempEmulationSchema = NULL) (or not set the sqlRenderTempEmulationSchema at all.)

Preparing the exposure and outcome of interest

We need to define the exposure and outcome for our study. For the exposure, we will directly use the drug_era table in the CDM. For the outcome we can use the OHDSI PhenotypeLibrary package to retrieve a community-approved definition of epistaxis:

epistaxis <- 356
cohortDefinitionSet <- PhenotypeLibrary::getPlCohortDefinitionSet(epistaxis)

We can use the CohortGenerator package to instantiate this cohort:

connection <- DatabaseConnector::connect(connectionDetails)
cohortTableNames <- CohortGenerator::getCohortTableNames(cohortTable)
CohortGenerator::createCohortTables(connection = connection,
                                    cohortDatabaseSchema = cohortDatabaseSchema,
                                    cohortTableNames = cohortTableNames)
counts <- CohortGenerator::generateCohortSet(connection = connection,
                                             cdmDatabaseSchema = cdmDatabaseSchema,
                                             cohortDatabaseSchema = cohortDatabaseSchema,
                                             cohortTableNames = cohortTableNames,
                                             cohortDefinitionSet = cohortDefinitionSet)
DatabaseConnector::disconnect(connection)

Extracting the data from the server

Now we can tell SelfControlledCaseSeries to extract all necessary data for our analysis:

aspirin <- 1112807

sccsData <- getDbSccsData(connectionDetails = connectionDetails,
                          cdmDatabaseSchema = cdmDatabaseSchema,
                          outcomeDatabaseSchema = cohortDatabaseSchema,
                          outcomeTable = cohortTable,
                          outcomeIds = epistaxis,
                          exposureDatabaseSchema = cdmDatabaseSchema,
                          exposureTable = "drug_era",
                          exposureIds = aspirin,
                          studyStartDates = "20100101",
                          studyEndDates = "21000101")
sccsData
aspirin <- 1112807
if (folderExists) {
  sccsData <- loadSccsData(file.path(outputFolder, "data1.zip"))
} 
if (folderExists) {
  sccsData
}

There are many parameters, but they are all documented in the SelfControlledCaseSeries manual. In short, we are pointing the function to the table created earlier and indicating which cohort ID in that table identifies the outcome. Note that it is possible to fetch the data for multiple outcomes at once. We further point the function to the drug_era table, and specify the concept ID of our exposure of interest: aspirin. Again, note that it is also possible to fetch data for multiple drugs at once. In fact, when we do not specify any exposure IDs the function will retrieve the data for all the drugs found in the drug_era table.

All data about the patients, outcomes and exposures are extracted from the server and stored in the sccsData object. This object uses the Andromeda package to store information in a way that ensures R does not run out of memory, even when the data are large.

We can use the generic summary() function to view some more information of the data we extracted:

summary(sccsData)
if (folderExists) {
  summary(sccsData)
}

Saving the data to file

Creating the sccsData file can take considerable computing time, and it is probably a good idea to save it for future sessions. Because sccsData uses Andromeda, we cannot use R's regular save function. Instead, we'll have to use the saveSccsData() function:

saveSccsData(sccsData, "sccsData.zip")

We can use the loadSccsData() function to load the data in a future session.

Creating the study population

From the data fetched from the server we can now define the population we wish to study. If we retrieved data for multiple outcomes, we should now select only one, and possibly impose further restrictions:

studyPop <- createStudyPopulation(sccsData = sccsData,
                                  outcomeId = epistaxis,
                                  firstOutcomeOnly = FALSE,
                                  naivePeriod = 180)

Here we specify we wish to study the outcome with the ID stored in epistaxis. Since this was the only outcome for which we fetched the data, we could also have skipped this argument. We furthermore specify that the first 180 days of observation of every person, the so-called 'naive period', will be excluded from the analysis. Note that data in the naive period will be used to determine exposure status at the start of follow-up (after the end of the naive period). We also specify we will use all occurrences of the outcome, not just the first one per person.

We can find out how many people (if any) were removed by any restrictions we imposed:

getAttritionTable(studyPop)
if (folderExists) {
  studyPop <- readRDS(file.path(outputFolder, "studyPop.rds"))
  getAttritionTable(studyPop)
}

Defining a simple model

Next, we can use the data to define a simple model to fit:

covarAspirin <- createEraCovariateSettings(label = "Exposure of interest",
                                           includeEraIds = aspirin,
                                           start = 0,
                                           end = 0,
                                           endAnchor = "era end")

sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = covarAspirin
)

summary(sccsIntervalData)
if (folderExists) {
  sccsIntervalData <- loadSccsIntervalData(file.path(outputFolder, "intervalData1.zip"))
  summary(sccsIntervalData)
}

In this example, we use the createEraCovariateSettings to define a single covariate: exposure to aspirin. We specify that the risk window is from start of exposure to the end by setting start and end to 0, and defining the anchor for the end to be the era end, which for drug eras is the end of exposure.

We then use the covariate definition in the createSccsIntervalData function to generate the sccsIntervalData. This represents the data in non-overlapping time intervals, with information on the various covariates and outcomes per interval.

Model fitting

The fitSccsModel function is used to fit the model:

model <- fitSccsModel(sccsIntervalData)

We can inspect the resulting model:

model
if (folderExists){
  model <- readRDS(file.path(outputFolder, "simpleModel.rds"))
  model
}

This tells us what the estimated relative risk (the incidence rate ratio) is during exposure to aspirin compared to non-exposed time.

Adding a pre-exposure window

The fact that platelet aggregation inhibitors like aspirin can cause epistaxis is well known to doctors, and this knowledge affects prescribing behavior. For example, a patient who has just had a nose bleed is not likely to be prescribed aspirin. This may lead to underestimation of the rate during unexposed time, because the unexposed time includes time just prior to exposure where observing of the outcome is unlikely because of this behavior. One solution to this problem that is often used is to introduce a separate 'risk window' just prior to exposure, to separate it from the remaining unexposed time. We can add such a 'pre-exposure window' to our analysis:

covarPreAspirin <- createEraCovariateSettings(label = "Pre-exposure",
                                              includeEraIds = aspirin,
                                              start = -60,
                                              end = -1,
                                              endAnchor = "era start")

sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin, 
                              covarPreAspirin)
)

model <- fitSccsModel(sccsIntervalData)

Here we created a new covariate definition in addition to the first one. We define the risk window to start 60 days prior to exposure, and end on the day just prior to exposure. We combine the two covariate settings in a list for the createSccsIntervalData function. Again, we can take a look at the results:

model
if (folderExists) {
  model <- readRDS(file.path(outputFolder, "preExposureModel.rds"))
  model
}

Including seasonality, and calendar time

Often both the rate of exposure and the outcome change with age, and can even depend on the season or calendar time in general (e.g. rates may be higher in 2021 compared to 2020). This may lead to confounding and may bias our estimates. To correct for this we can include age, season, and/or calendar time into the model.

For computational reasons we assume the effect of age, season, and calendar time are constant within each calendar month. We assume that the rate from one month to the next can be different, but we also assume that subsequent months have somewhat similar rates. This is implemented by using spline functions.

Spline for seasonality Figure 1. Example of how a spline is used for seasonality: within a month, the risk attributable to seasonality is assumed to be constant, but from month to month the risks are assumed to follow a cyclic quadratic spline.

Note that by default all people that have the outcome will be used to estimate the effect of age, seasonality, and calendar time on the outcome, so not just the people exposed to the drug of interest. Adjusting for age typically only makes sense for small children, where a small difference in age can still make a big difference (remember: we are modeling the effect of age in each person by themselves, not the effect of age across persons). Since age and calendar time are often hard to fit simultaneously, it is often best to only model seasonality and calendar time, like this:

seasonalityCovariateSettings <- createSeasonalityCovariateSettings(seasonKnots = 5)

calendarTimeSettings <- createCalendarTimeCovariateSettings(calendarTimeKnots = 5)

sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin,
                              covarPreAspirin),
  seasonalityCovariateSettings = seasonalityCovariateSettings,
  calendarTimeCovariateSettings = calendarTimeSettings
)

model <- fitSccsModel(sccsIntervalData)

Again, we can inspect the model:

model
if (folderExists) {
  model <- readRDS(file.path(outputFolder, "seasonCalendarTimeModel.rds"))
  model
}

We see that our estimates for exposed and pre-exposure time have not changes much. We can plot the spline curves for season, and calendar time to learn more:

plotSeasonality(model)
if (folderExists) {
  plotSeasonality(model)
}
plotCalendarTimeEffect(model)
if (folderExists) {
  plotCalendarTimeEffect(model)
}

We see some effect for season: epistaxis tends to be more prevalent during winter. We should verify if our model accounts for time trends, by plotting the observed to expected rate of the outcome across time, before and after adjustment:

plotEventToCalendarTime(studyPopulation = studyPop,
                        sccsModel = model)
if (folderExists) {
  plot <- readRDS(file.path(outputFolder, "stabilityPlot.rds"))
  print(plot)
}

Here we see that after adjustment, in general, the rate of outcome appears fairly stable across time, with the exception of some months in 2020. This is what we refer to as the 'COVID-blip', because during the time of the COVID-19 pandemic regular healthcare was put on hold.

Removing the COVID blip

The discontinuity in the rate of the outcome during the COVID pandemic could cause bias. For this reason it is probably best to remove this time from the analysis altogether. We can achieve this by defining two separate study periods:

sccsData <- getDbSccsData(connectionDetails = connectionDetails,
                          cdmDatabaseSchema = cdmDatabaseSchema,
                          outcomeDatabaseSchema = cohortDatabaseSchema,
                          outcomeTable = cohortTable,
                          outcomeIds = epistaxis,
                          exposureDatabaseSchema = cdmDatabaseSchema,
                          exposureTable = "drug_era",
                          exposureIds = aspirin,
                          studyStartDates = c("19000101", "20220101"),
                          studyEndDates = c("20191231", "21001231"))
studyPop <- createStudyPopulation(sccsData = sccsData,
                                  outcomeId = epistaxis,
                                  firstOutcomeOnly = FALSE,
                                  naivePeriod = 180)
sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin,
                              covarPreAspirin),
  seasonalityCovariateSettings = seasonalityCovariateSettings,
  calendarTimeCovariateSettings = calendarTimeSettings
)
model <- fitSccsModel(sccsIntervalData)

If we plot the outcomes over time, we see the entire time of the COVID pandemic has been removed:

plotEventToCalendarTime(studyPopulation = studyPop,
                        sccsModel = model)
if (folderExists) {
  plot <- readRDS(file.path(outputFolder, "stabilityPlot2.rds"))
  print(plot)
}

Considering event-dependent observation time

The SCCS method requires that observation periods are independent of outcome times. This requirement is violated when outcomes increase the mortality rate, since censoring of the observation periods is then event-dependent. A modification to the SCCS has been proposed that attempts to correct for this. First, several models are fitted to estimate the amount and shape of the event-dependent censoring, and the best fitting model is selected. Next, this model is used to reweigh various parts of the observation time. This approach is also implemented in this package, and can be turned on using the eventDependentObservation argument of the createSccsIntervalData function. However, this method has proven to be somewhat unstable in combinations with other corrections, so for now it is best to keep the model simple:

sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin,
                              covarPreAspirin),
  eventDependentObservation = TRUE
)

model <- fitSccsModel(sccsIntervalData)

Again, we can inspect the model:

model
if (folderExists) {
  model <- readRDS(file.path(outputFolder, "eventDepModel.rds"))
  model
}

Studies with more than one drug

Although we are usually interested in the effect of a single drug or drug class, it could be beneficial to add exposure to other drugs to the analysis if we believe those drugs represent time-varying confounders that we wish to correct for.

Adding a class of drugs

For example, SSRIs might also cause epistaxis, and if aspirin is co-prescribed with SSRIs we don't want the effect of SSRIs attributed to aspirin. We would like our estimate to represent just the effect of the aspirin, so we need to keep the effect of the SSRIs separate. First we have to retrieve the information on SSRI exposure from the database:

ssris <- c(715939, 722031, 739138, 751412, 755695, 797617, 40799195)
sccsData <- getDbSccsData(connectionDetails = connectionDetails,
                          cdmDatabaseSchema = cdmDatabaseSchema,
                          outcomeDatabaseSchema = cohortDatabaseSchema,
                          outcomeTable = cohortTable,
                          outcomeIds = epistaxis,
                          maxCasesPerOutcome = 100000,
                          exposureDatabaseSchema = cdmDatabaseSchema,
                          exposureTable = "drug_era",
                          exposureIds = c(aspirin, ssris),
                          studyStartDates = c("19000101", "20220101"),
                          studyEndDates = c("20191231", "21001231"))
sccsData
ssris <- c(715939, 722031, 739138, 751412, 755695, 797617, 40799195)
if (folderExists) {
  sccsData <- loadSccsData(file.path(outputFolder, "data2.zip"))
} 
if (folderExists) {
  sccsData
}

Once retrieved, we can use the data to build and fit our model:

studyPop <- createStudyPopulation(sccsData = sccsData,
                                  outcomeId = epistaxis,
                                  firstOutcomeOnly = FALSE,
                                  naivePeriod = 180)
covarSsris <- createEraCovariateSettings(label = "SSRIs",
                                         includeEraIds = ssris,
                                         stratifyById = FALSE,
                                         start = 1,
                                         end = 0,
                                         endAnchor = "era end")
sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin,
                              covarPreAspirin,
                              covarSsris),
  seasonalityCovariateSettings = seasonalityCovariateSettings,
  calendarTimeCovariateSettings = calendarTimeSettings
)

model <- fitSccsModel(sccsIntervalData)

Here, we added a new covariate based on the list of concept IDs for the various SSRIs. In this example we set stratifyById to FALSE, meaning that we will estimate a single incidence rate ratio for all SSRIs, so one estimate for the entire class of drugs. Note that duplicates will be removed: if a person is exposed to two SSRIs on the same day, this will be counted only once when fitting the model. Again, we can inspect the model:

model
if (folderExists) {
  model <- readRDS(file.path(outputFolder, "ssriModel.rds"))
  model
}

Adding all drugs

Another approach could be to add all drugs into the model. Again, the first step is to get all the relevant data from the database:

sccsData <- getDbSccsData(connectionDetails = connectionDetails,
                          cdmDatabaseSchema = cdmDatabaseSchema,
                          outcomeDatabaseSchema = cohortDatabaseSchema,
                          outcomeTable = cohortTable,
                          outcomeIds = 1,
                          exposureDatabaseSchema = cdmDatabaseSchema,
                          exposureTable = "drug_era",
                          exposureIds = c(),
                          cdmVersion = cdmVersion)

Note that the exposureIds argument is left empty. This will cause data for all concepts in the exposure table to be retrieved. Next, we simply create a new set of covariates, and fit the model:

studyPop <- createStudyPopulation(sccsData = sccsData,
                                  outcomeId = 1,
                                  firstOutcomeOnly = FALSE,
                                  naivePeriod = 180)

covarAllDrugs <- createEraCovariateSettings(label = "Other exposures",
                                            excludeEraIds = aspirin,
                                            stratifyById = TRUE,
                                            start = 1,
                                            end = 0,
                                            endAnchor = "era end",
                                            allowRegularization = TRUE)

sccsIntervalData <- createSccsIntervalData(
  studyPopulation = studyPop,
  sccsData = sccsData,
  eraCovariateSettings = list(covarAspirin,
                              covarPreAspirin,
                              covarAllDrugs),
  seasonalityCovariateSettings = seasonalityCovariateSettings,
  calendarTimeCovariateSettings = calendarTimeSettings
)

model <- fitSccsModel(sccsIntervalData)

The first thing to note is that we have defined the new covariates to be all drugs except aspirin by not specifying the includeEraIds and setting the excludeEraIds to the concept ID of aspirin. Furthermore, we have specified that stratifyById is TRUE, meaning an estimate will be produced for each drug.

We have set allowRegularization to TRUE, meaning we will use regularization for all estimates in this new covariate set. Regularization means we will impose a prior distribution on the effect size, effectively penalizing large estimates. This helps fit the model, for example when some drugs are rare, and when drugs are almost often prescribed together and their individual effects are difficult to untangle.

Because there are now so many estimates, we will export all estimates to a data frame using getModel():

estimates <- getModel(model)
estimates[estimates$originalEraId == aspirin, ]
if (folderExists) {
  model <- readRDS(file.path(outputFolder, "allDrugsModel.rds"))
  estimates <- getModel(model)
  estimates[estimates$originalEraId == aspirin, ]
}

Here we see that despite the extensive adjustments that are made in the model, the effect estimates for aspirin have remained nearly the same.

In case we're interested, we can also look at the effect sizes for the SSRIs:

estimates[estimates$originalEraId %in% ssris, ]
if (folderExists) {
  estimates[estimates$originalEraId %in% ssris, ]
}

Note that because we used regularization, we are not able to compute the confidence intervals for these estimates.

Diagnostics

We can perform several diagnostics on the data to verify whether our assumptions underlying the SCCS are met.

Power calculations

We might be interested to know whether we have sufficient power to detect a particular effect size. It makes sense to perform these power calculations once the study population has been fully defined, so taking into account loss to the various inclusion and exclusion criteria. This means we will use the sccsIntervalData object we've just created as the basis for our power calculations. Since the sample size is fixed in retrospective studies (the data has already been collected), and the true effect size is unknown, the SelfControlledCaseSeries package provides a function to compute the minimum detectable relative risk (MDRR) instead:

computeMdrr(sccsIntervalData,
            exposureCovariateId = 1000,
            alpha = 0.05,
            power = 0.8,
            twoSided = TRUE,
            method = "binomial")
if (folderExists) {
  computeMdrr(sccsIntervalData,
              exposureCovariateId = 1000,
              alpha = 0.05,
              power = 0.8,
              twoSided = TRUE,
              method = "binomial")
}

Note that we have to provide the covariate ID of the exposure of interest, which we learned by calling summary on sccsIntervalData earlier. This is because we may have many covariates in our model, but will likely only be interested in the MDRR of one.

Time from exposure start to event

To gain a better understanding of when the event occurs relative to the start of exposure, we can plot their relationship:

plotExposureCentered(studyPop, sccsData, exposureEraId = aspirin)
if (folderExists) {
  plotExposureCentered(studyPop, sccsData, exposureEraId = aspirin)
}

Increased risk pre-exposure

If the rate just before exposure is higher compared to during exposure, this indicates there might reverse causality: that the outcome, or some precursor of the outcome, increases the probability of having the exposure. To avoid incorrect causal interpretation, we want to detect these situations. We can compute a p-value for whether rate is increased pre-exposure:

computePreExposureGainP(sccsData, studyPopulation, exposureEraId = aspirin) 
if (folderExists) {
  computePreExposureGainP(sccsData, studyPop, exposureEraId = aspirin) 
}

If this p-value is lower than a prespecified threshold (e.g. p < 0.05), we can decide to reject the null hypothesis that the rate is not increased, and we probably should not trust our analysis to produce reliable estimates.

Ages covered per subject

We can visualize which age ranges are covered by each subject's observation time:

plotAgeSpans(studyPop)
if (folderExists) {
  plotAgeSpans(studyPop)
}

Here we see that most observation periods span only a small age range, making it unlikely that any within-person age-related effect will be large.

Dependency between events and observation end

To understand whether censoring is dependent on the event, which would violate one of the assumptions of the SCCS, we can plot the difference in distribution between censored and uncensored events. By 'censored' we mean periods that end before we would normally expect. Here, we define periods to be uncensored if they end at either the study end date (if specified), database end date (i.e. the date after which no data is captured in the database), or maximum age (if specified). All other periods are assumed to be censored.

plotEventObservationDependence(studyPop)
if (folderExists) {
  sccsData <- loadSccsData(file.path(outputFolder, "data1.zip"))
  plotEventObservationDependence(studyPop)
}

Here we see that overall the two distributions are similar, suggesting epistaxis does not lead to earlier observation end.

Stability of the outcome over calendar time

If the rate of the outcome changes as a function of calendar time, this could introduce bias. For example, if the outcome is more prevalent during winter, and the exposure also tends to occur in winter, this will create an association between the two that likely doesn't imply causation. We can check for patterns over time:

plotEventToCalendarTime(studyPopulation = studyPop,
                        sccsModel = model)
if (folderExists) {
  plot <- readRDS(file.path(outputFolder, "stabilityPlot2.rds"))
  print(plot)
}

We see that, after removing the COVID blip, the observed-to-expected ratio seems stable. We can perform a formal test for stability:

stability <- computeTimeStability(studyPopulation = studyPop,
                                  sccsModel = model)
stability %>%
  filter(!stable)
if (folderExists) {
  stability <- readRDS(file.path(outputFolder, "stability2.rds"))
  stability 
}

Acknowledgments

Considerable work has been dedicated to provide the SelfControlledCaseSeries package.

citation("SelfControlledCaseSeries")

Furthermore, SelfControlledCaseSeries makes extensive use of the Cyclops package.

citation("Cyclops")

Part of the code (related to event-dependent observation periods) is based on the SCCS package by Yonas Ghebremichael-Weldeselassie, Heather Whitaker, and Paddy Farrington.

This work is supported in part through the National Science Foundation grant IIS 1251151.



OHDSI/SelfControlledCaseSeries documentation built on Jan. 31, 2024, 7:30 p.m.