knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

ROMOPOmics

Introduction

library(tidyverse)
library(dbplyr)
library(data.table)
library(rmarkdown)
library(knitr)
library(kableExtra)
library(rprojroot)
library(RSQLite)
library(ROMOPOmics)
knitr::opts_chunk$set(echo = TRUE)

dirs          <- list()
dirs$base     <- file.path(find_root(criterion = criteria$is_r_package))
dirs$figs     <- file.path(dirs$base,"man/figures")
dirs$data     <- file.path(dirs$base,"data")
dirs$masks    <- file.path(dirs$base,"data")

#Files.
dm_file     <- system.file("extdata","OMOP_CDM_v6_0_custom.csv",package="ROMOPOmics",mustWork = TRUE)
tcga_files  <- 
  c(
    "brca_clinical" = system.file("extdata","brca_clinical.csv",package="ROMOPOmics",mustWork = TRUE),
    "brca_mutation" = system.file("extdata","brca_mutation.csv",package="ROMOPOmics",mustWork = TRUE)
  )

#select <- dplyr::select

dm <- loadDataModel(as_table_list = FALSE)

ROMOPOmics was developed to standardize the metadata of high-throughput assays together with the associated patient clinical data. Biomedical research datasets such as RNA-Seq experiments and other next-generation sequencing datasets mix clinical traits of the studied patients with data derived from samples that were processed by different assays, at different timepoints, with varying protocols, and more. Next-generation sequencing datasets also include metadata on pipeline byproducts (alignment files, raw reads, readmes, etc.) and analysis results of any type (gene counts, differential expression data, quality control analyses, etc.). ROMOPOmics provides a framework to standardize these datasets and a pipeline to convert this information into a SQL-friendly database that users can easily access. After installing the R package from its GitHub repository, users specify a data directory and a mask file describing how to map their data's fields into a common data model. The resulting standardized data tables are then written to a SQLite database, which makes the dataset easy to share and to interoperate with.

knitr::include_graphics("../man/figures/romopomics_code_flow.png")

Background

The OMOP common data model

The foundation of the ROMOPOmics package is the common data model (CDM) developed by the Observational Medical Outcomes Partnership (OMOP) and now used by the Observational Health Data Sciences and Informatics (OHDSI) network. The CDM contains all the fields and tables needed to standardize patient clinical data in the OMOP framework. Unless a custom data model is provided, the package defaults to a customized version of the OMOP 6.0 data model, which is packaged within extdata. This data model includes `r nrow(dm)` fields distributed among `r length(unique(dm$table))` tables. Our version differs from the standard CDM in two ways. First, we include an hla_source_value field in the PERSON table, meant to incorporate histocompatibility complex types as a key individual characteristic rather than as a separate observation. Second, our customized version includes a SEQUENCING table:

dm %>% 
  filter(table=="sequencing") %>%
  mutate(description = gsub(" CUSTOM$","",description)) %>%
  select(-table_index) %>%
  mutate_all(function(x) gsub("_"," ",x)) %>%
  kable(escape = TRUE) %>%
  kable_styling(full_width=FALSE,latex_options = "striped")

There are two main reasons for including this custom table.

1) Sequencing data is becoming ubiquitous in contemporary research, and is an increasingly common component of personalized medicine treatment regimens.

2) Unlocking the full information within next-generation sequencing experiments requires these "Sequencing" datasets to include the spectrum of products generated along the testing pipeline, from library preparation to the sequencing machine to data analysis. This allows intermediate steps and files to be used (raw files rather than processed and normalized gene counts, for example) and, crucially, it facilitates comparisons between studies and treatments by allowing library preparation, quality control, alignment methods, reference data, etc. to be compared. Including this data is essential, but incorporating the variety of available variables is not intuitive in the existing OMOP model.

Package functionality

ROMOPOmics is a small package with 9 functions in total, 4 of which are internal functions required by the other package functions. The steps for using the package are outlined below.

Package algorithm

Step 1: Load the data model.

Load the master common data model file from extdata. The function returns the table's information as a data dictionary, which is referenced later when standardizing data through the masks.

dm <- loadDataModel(as_table_list = FALSE)

Step 2: Design and load input masks.

msks <- loadModelMasks(mask_files = "../inst/extdata")

"Masks" streamline the mapping of values from existing datasets to the OMOP format, or at least to how the database's administrator thinks these datasets should be mapped. See the files in `r dirs$masks` for examples of the mask files used here.

Mask files are tables whose alias, table, and field columns describe each term's name in the input dataset, its destination OMOP table, and its name within that table, respectively. For instance, patient_name in the user's database will likely map to person_source_value in current OMOP parlance. Using multiple masks should also streamline the use of multiple analysis types: database administrators can develop and implement masks, so users won't need to know that patient_name and cell_line_name are both synonymous with person_source_value in the OMOP framework. Next-generation sequencing data can be added using the sequencing mask, while "HLA" data can be incorporated using an hla mask.
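As a minimal sketch of the idea, a mask can be pictured as a small lookup table. The column names (alias, table, field) follow the description above, but the rows and the lookup_alias() helper are hypothetical and are not part of the package.

```r
# Hypothetical mask rows; only the column names come from the text above.
mask <- data.frame(
  alias = c("patient_name", "cell_line_name"),
  table = c("person", "person"),
  field = c("person_source_value", "person_source_value"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the destination table and field for an input column.
lookup_alias <- function(mask, alias) {
  hit <- mask[mask$alias == alias, c("table", "field"), drop = FALSE]
  if (nrow(hit) == 0) stop("No mapping for alias: ", alias)
  hit
}

lookup_alias(mask, "patient_name")
```

Both aliases resolve to the same destination, which is why a user of the database does not need to know the original column names.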

Here's an example of a mask for TCGA clinical data, provided to the loadModelMasks() function as a CSV:

msks$brca_clinical %>%
  select(alias,table,field,field_idx,set_value,example1) %>%
  mutate_all(function(x) ifelse(is.na(x),"",x)) %>%
  rename(example=example1) %>%
  mutate_all(function(x) gsub("_"," ",x)) %>%
  kable() %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0,font_size = 20,italic=TRUE,hline_after = TRUE) %>%
  column_spec(c(1:3),color="black") %>%
  column_spec(c(1),background = "lightgray",border_right = TRUE,border_left = TRUE) %>%
  column_spec(c(2),bold=TRUE) %>%
  column_spec(c(4,5),color = "gray")

Column names:

  1. alias: Field name for a value according to the input dataset.
  2. table: Table name for a value according to the selected data model.
  3. field: Field name for a value according to the selected data model.
  4. field_idx: Used to differentiate between observations (see below).
  5. set_value: Default value that is added to the given table and field regardless of input. Useful for specifying units, descriptions, etc. when several input columns are converted into separate observations.

Patient vs. observation-centric datasets

The OMOP format anticipates data tables with one row per observation, and one observation type per table. As an example, consider this translation of a simple input dataset with two descriptors and two observations:

knitr::include_graphics("../man/figures/patient_to_observation_centric.PNG")

ROMOPOmics converts this wide, unstandardized data format by using a field_idx variable. Appending a field_idx value allows related values to be "grouped" into a single observation, complete with its units, descriptions, etc.
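The grouping can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the input values, the mask rows, and the get_value() helper are all made up, and values are kept as character strings for simplicity.

```r
# One wide input row: two descriptors and two observations (made-up values).
input <- data.frame(patient = "p1", sex = "F", height = 170, weight = 65,
                    stringsAsFactors = FALSE)

# Hypothetical mask: rows sharing a field_idx belong to the same observation.
# set_value supplies constants (names, units) that are not read from the input.
mask <- data.frame(
  alias     = c(NA, "height", NA, NA, "weight", NA),
  field     = rep(c("measurement_source_value", "value_as_number",
                    "unit_source_value"), 2),
  field_idx = rep(1:2, each = 3),
  set_value = c("height", NA, "cm", "weight", NA, "kg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use set_value when given, otherwise read the input column.
get_value <- function(alias, set_value, input) {
  if (!is.na(set_value)) set_value else as.character(input[[alias]])
}

# Build one output row per field_idx, then stack the rows.
rows <- lapply(split(mask, mask$field_idx), function(m) {
  vals <- mapply(get_value, m$alias, m$set_value,
                 MoreArgs = list(input = input))
  as.data.frame(as.list(setNames(vals, m$field)), stringsAsFactors = FALSE)
})
observations <- do.call(rbind, rows)
observations
```

Each field_idx produces one row, so the two wide columns (height, weight) end up as two observations that carry their own names and units.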

Step 3: Translate input datasets into common data model format.

Using the readInputFile() function, input data tables are translated into the destination format according to the provided mask (in this case brca_clinical and brca_mutation). Tables in this format are "exhaustive" in that they include all possible fields and tables in the data model, including unused ones. Not every variable in an input table needs to appear in its mask: only the variables that are mapped to the common data model are listed there.

omop_inputs <- lapply(names(tcga_files), 
                function(x) readInputFile(input_file = tcga_files[[x]],
                                           mask_table = msks[[x]],
                                           data_model = dm))
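To illustrate what "exhaustive" means here, the following sketch fills every field of a toy data model and leaves NA wherever nothing was mapped. The data model slice and the mapped values are hypothetical, and the real output of readInputFile() may be laid out differently.

```r
# Hypothetical slice of a data model: every field of two tables.
data_model <- data.frame(
  table = c("person", "person", "measurement", "measurement"),
  field = c("person_source_value", "birth_datetime",
            "measurement_source_value", "value_as_number"),
  stringsAsFactors = FALSE
)

# Values actually supplied by a masked input (illustrative only).
mapped <- c(person_source_value = "p1", value_as_number = "170")

# Exhaustive form: one entry per data model field, NA where nothing was mapped.
exhaustive <- data_model
exhaustive$value <- unname(mapped[data_model$field])
exhaustive
```

Because every translated input shares the same set of rows, inputs produced with different masks line up field by field.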

Step 4: Combine all input tables into SQL-friendly tables.

Since tables read via readInputFile() include all fields and tables from the data model, they can be combined using combineInputTables() regardless of the input type or mask used. This function combines all datasets from all mask types and filters out every OMOP table in the data model that is unused (no entries in any of its fields). Tables are not "partially" used: if any field from a table is included, all fields from that table are included. The only exception is table indices: if a table inherits an index from an unused table, that index column is dropped.

Once the data has been loaded into a single comprehensive table, the function assigns an index column (<table_name>_index) for each permutation of all datasets included in each used table, and sets the type of each column based on the data model's specification (VARCHAR(50) becomes "character", INTEGER becomes "integer", etc.). Finally, it returns each formatted OMOP table in a named list.

db_inputs   <- combineInputTables(input_table_list = omop_inputs)

In this example using these masks, the OMOP tables included are `r paste(names(db_inputs)[1:(length(db_inputs)-1)],collapse=", ")`, and `r names(db_inputs)[length(db_inputs)]`.
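The filtering and type-formatting rules described for this step can be sketched as below. This is not the code of combineInputTables(): the toy table, the FLOAT type, and the coerce_sql() helper are assumptions made for illustration.

```r
# Hypothetical exhaustive table with the SQL type of each field.
exhaustive <- data.frame(
  table = c("person", "person", "measurement", "measurement", "specimen"),
  field = c("person_id", "person_source_value",
            "value_as_number", "unit_source_value", "specimen_id"),
  type  = c("INTEGER", "VARCHAR(50)", "FLOAT", "VARCHAR(50)", "INTEGER"),
  value = c("1", "p1", "170", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: coerce a value according to its SQL type declaration.
coerce_sql <- function(x, sql_type) {
  sql_type <- toupper(sql_type)
  if (grepl("^VARCHAR", sql_type)) {
    as.character(x)
  } else if (sql_type == "INTEGER") {
    as.integer(x)
  } else if (sql_type == "FLOAT") {
    as.numeric(x)
  } else {
    x
  }
}

# A table is kept, with all of its fields, when any of its fields has a value.
used <- tapply(!is.na(exhaustive$value), exhaustive$table, any)
kept <- exhaustive[exhaustive$table %in% names(used)[used], ]

# One data frame per kept table, with each column coerced to its R type.
tables <- lapply(split(kept, kept$table), function(tb) {
  cols <- Map(coerce_sql, tb$value, tb$type)
  as.data.frame(setNames(cols, tb$field), stringsAsFactors = FALSE)
})
str(tables)
```

In this toy input the specimen table is dropped because it has no entries, while measurement keeps its empty unit_source_value column because another field of that table is used.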

Step 5: Add OMOP-formatted tables to a database.

The tables compiled in db_inputs are now formatted for creating a SQLite database. The dplyr package can work with SQLite databases (through dbplyr and RSQLite), and this functionality is wrapped in the function buildSQLDBR(). However, any other database package could be used here instead.

omop_db     <- buildSQLDBR(db_inputs,sql_db_file = file.path(tempdir(),"TCGA_brca.sqlite"))

dbListTables(omop_db)
dbListFields(omop_db,"PERSON")
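For readers who prefer to build the database without buildSQLDBR(), a named list of tables can be written with DBI and RSQLite directly. The sketch below uses made-up tables and an in-memory database; it shows one possible approach and is not a description of what buildSQLDBR() does internally.

```r
library(DBI)

# Made-up stand-ins for the named list of OMOP tables.
tables <- list(
  PERSON = data.frame(person_id = 1:2,
                      person_source_value = c("p1", "p2"),
                      stringsAsFactors = FALSE),
  MEASUREMENT = data.frame(person_id = c(1L, 1L, 2L),
                           value_as_number = c(170, 65, 180))
)

# Write each list element to its own table in an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
for (nm in names(tables)) {
  dbWriteTable(con, nm, tables[[nm]])
}

tbls <- dbListTables(con)
flds <- dbListFields(con, "PERSON")
dbDisconnect(con)

tbls
flds
```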

Raw SQLite query

With a SQL database, data can be queried with simple, clear, and transportable queries for retrieving patient and sample data.

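# Join on person_id so each person is only paired with their own sequencing rows.
# Assumes both tables carry person_id, as the qualified person.person_id suggests.
dbGetQuery(omop_db,
"SELECT person_source_value, person.person_id, file_remote_repo_id, file_remote_repo_value
 FROM person INNER JOIN sequencing ON person.person_id = sequencing.person_id
 WHERE file_remote_repo_id IS NOT NULL AND person_source_value = 'tcga-3c-aaau'
 ORDER BY file_remote_repo_value") %>%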
  mutate_all(function(x) gsub("_"," ",x)) %>%
  distinct() %>% 
  head(20) %>% 
  kable() %>%
  kable_styling(full_width=FALSE) 
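When the patient identifier comes from user input, a parameterized query avoids pasting values into the SQL string. The following sketch runs against toy tables in an in-memory database; the table contents are made up, and the presence of person_id in both tables is an assumption.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy tables; real column sets will be much larger.
dbWriteTable(con, "person",
             data.frame(person_id = 1:2,
                        person_source_value = c("p1", "p2"),
                        stringsAsFactors = FALSE))
dbWriteTable(con, "sequencing",
             data.frame(person_id = c(1L, 1L, 2L),
                        file_remote_repo_value = c("b.bam", "a.bam", "c.bam"),
                        stringsAsFactors = FALSE))

# The ? placeholder is filled from params, in order.
res <- dbGetQuery(con,
  "SELECT p.person_source_value, s.file_remote_repo_value
   FROM person AS p
   INNER JOIN sequencing AS s ON p.person_id = s.person_id
   WHERE p.person_source_value = ?
   ORDER BY s.file_remote_repo_value",
  params = list("p1"))

dbDisconnect(con)
res
```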

dbplyr query

Alternatively, one can use commands from dplyr to query a SQL database, abstracting out SQL queries into possibly more intuitive R functions.

inner_join(tbl(omop_db,"PERSON"),
           tbl(omop_db,"MEASUREMENT"),
           by=c("person_id", "provider_id")) %>%
  select(person_source_value,
         birth_datetime,
         death_datetime,
         measurement_source_value,
         value_as_number,
         unit_source_value) %>%
  filter(charindex("lymph",measurement_source_value)) %>%
  as_tibble() %>%
  mutate_all(function(x) gsub("_"," ",x)) %>%
  head(20) %>% 
  kable() %>%
  kable_styling(full_width=FALSE)

DBI::dbDisconnect(omop_db)
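To see the SQL that dbplyr generates for a pipeline like the one above, show_query() can be called before collecting the results. This sketch uses toy tables in an in-memory database and the %LIKE% infix, which dbplyr passes through to SQL; the table contents are made up.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy tables standing in for the OMOP tables.
DBI::dbWriteTable(con, "PERSON",
                  data.frame(person_id = 1:2,
                             person_source_value = c("p1", "p2"),
                             stringsAsFactors = FALSE))
DBI::dbWriteTable(con, "MEASUREMENT",
                  data.frame(person_id = c(1L, 2L),
                             measurement_source_value = c("lymph_nodes_examined",
                                                          "tumor_size"),
                             value_as_number = c(3, 2.5),
                             stringsAsFactors = FALSE))

# Build the query lazily; nothing runs on the database until collect().
qry <- inner_join(tbl(con, "PERSON"), tbl(con, "MEASUREMENT"),
                  by = "person_id") %>%
  filter(measurement_source_value %LIKE% "%lymph%") %>%
  select(person_source_value, measurement_source_value, value_as_number)

show_query(qry)          # prints the generated SQL
result <- collect(qry)   # runs the query and returns a tibble

DBI::dbDisconnect(con)
result
```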

TL;DR:

Here is the "too long; didn't read" section. Below are the steps to convert patient and sample data into the OMOP framework using ROMOPOmics:

TCGA data

library(ROMOPOmics)
dm_file     <- system.file("extdata","OMOP_CDM_v6_0_custom.csv",package="ROMOPOmics",mustWork = TRUE)
dm          <- loadDataModel(master_table_file = dm_file)
tcga_files  <- 
  list(
    "brca_clinical" = system.file("extdata","brca_clinical.csv",package="ROMOPOmics",mustWork = TRUE),
    "brca_mutation" = system.file("extdata","brca_mutation.csv",package="ROMOPOmics",mustWork = TRUE)
  )
msks        <- list(brca_clinical=loadModelMasks(system.file("extdata","brca_clinical_mask.csv",package="ROMOPOmics",mustWork = TRUE)),
                    brca_mutation=loadModelMasks(system.file("extdata","brca_mutation_mask.csv",package="ROMOPOmics",mustWork = TRUE)))
omop_inputs <- list(brca_clinical=readInputFile(input_file = tcga_files$brca_clinical,
                                                 data_model = dm,
                                                 mask_table = msks$brca_clinical),
                    brca_mutation=readInputFile(input_file = tcga_files$brca_mutation,
                                                 data_model = dm,
                                                 mask_table = msks$brca_mutation))
db_inputs   <- combineInputTables(input_table_list = omop_inputs)
omop_db     <- buildSQLDBR(omop_tables = db_inputs, sql_db_file = file.path(tempdir(),"TCGA.sqlite"))
dbListTables(omop_db)
DBI::dbDisconnect(omop_db)

ATAC-seq data

dm_file     <- system.file("extdata","OMOP_CDM_v6_0_custom.csv",package="ROMOPOmics",mustWork = TRUE)
dm          <- loadDataModel(master_table_file = dm_file)

msk_file    <- system.file("extdata","GSE60682_standard_mask.csv",package="ROMOPOmics",mustWork = TRUE)
msks        <- loadModelMasks(msk_file)

in_file     <- system.file("extdata","GSE60682_standard.csv",package="ROMOPOmics",mustWork = TRUE)
omop_inputs <- readInputFile(input_file=in_file,data_model=dm,mask_table=msks,transpose_input_table = TRUE)
db_inputs   <- combineInputTables(input_table_list = omop_inputs)
omop_db     <- buildSQLDBR(omop_tables = db_inputs, sql_db_file=file.path(dirs$data,"GSE60682_sqlDB.sqlite"))
dbListTables(omop_db)
DBI::dbDisconnect(omop_db)


AndrewC160/ROMOPOmics documentation built on Jan. 27, 2021, 6:57 p.m.