knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(tidyverse)
library(dbplyr)
library(data.table)
library(rmarkdown)
library(knitr)
library(kableExtra)
library(rprojroot)
library(RSQLite)
library(ROMOPOmics)
knitr::opts_chunk$set(echo = TRUE)

dirs <- list()
dirs$base <- file.path(find_root(criterion = criteria$is_r_package))
dirs$figs <- file.path(dirs$base, "man/figures")
dirs$data <- file.path(dirs$base, "data")
dirs$masks <- file.path(dirs$base, "data")

# Files.
dm_file <- system.file("extdata", "OMOP_CDM_v6_0_custom.csv", package = "ROMOPOmics", mustWork = TRUE)
tcga_files <- c(
  "brca_clinical" = system.file("extdata", "brca_clinical.csv", package = "ROMOPOmics", mustWork = TRUE),
  "brca_mutation" = system.file("extdata", "brca_mutation.csv", package = "ROMOPOmics", mustWork = TRUE)
)

#select <- dplyr::select
dm <- loadDataModel(as_table_list = FALSE)
ROMOPOmics was developed to standardize the metadata of high-throughput assays alongside the associated patient clinical data. Biomedical research datasets such as RNA-Seq experiments and other next generation sequencing datasets mix clinical traits of the studied patients with data derived from samples perturbed by different assays, at different timepoints, with varying protocols, and more. Next generation sequencing datasets also include metadata on pipeline byproducts (alignment files, raw reads, readmes, etc.) and analysis results of any type (gene counts, differential expression data, quality control analyses, etc.). ROMOPOmics provides a framework to standardize these datasets and a pipeline to convert this information into a SQL-friendly database that users can easily access. After installing the R package from the GitHub repository, users specify a data directory and a mask file describing how to map their data's fields into a common data model. The resulting standardized data tables are then formatted into a SQLite database so that the dataset is easy to share and interoperate with.
knitr::include_graphics("../man/figures/romopomics_code_flow.png")
The CDM allows for standardizing patient clinical data.
The foundation of the ROMOPOmics package is the common data model (CDM) developed by the Observational Medical Outcomes Partnership (OMOP) and now used by the Observational Health Data Sciences and Informatics (OHDSI) network. The data model contains all the fields and tables for standardizing patient clinical data in the OMOP framework. Unless a custom data model is provided, the package defaults to a customized version of the OMOP 6.0 data model, which is packaged within `extdata`. The OMOP data model includes `r nrow(dm)` fields distributed among `r length(unique(dm$table))` tables. We use the CDM with a few custom characteristics. First, we include an `hla_source_value` field in the `PERSON` table, meant to incorporate histocompatibility complex types as a key individual characteristic rather than as a separate observation. Second, our customized version includes a `SEQUENCING` table:
dm %>%
  filter(table == "sequencing") %>%
  mutate(description = gsub(" CUSTOM$", "", description)) %>%
  select(-table_index) %>%
  mutate_all(function(x) gsub("_", " ", x)) %>%
  kable(escape = TRUE) %>%
  kable_styling(full_width = FALSE, latex_options = "striped")
There are two main reasons for including this custom `SEQUENCING` table.
1) Sequencing data is becoming ubiquitous in contemporary research, and is an increasingly common component of personalized medicine treatment regimens.
2) Unlocking the full information within next generation sequencing experiments requires that these "Sequencing" datasets include the spectrum of products generated along any testing pipeline, from library preparation to sequencing machine to data analysis. This allows intermediate steps and files to be used (generating and using raw files rather than processed and normalized gene counts, for example). Crucially, it also facilitates comparisons between different studies and treatments by allowing comparisons of library preparation, quality control, alignment methods, reference data, etc. Including this data is crucial, but incorporating the variety of available variables is not intuitive in the existing OMOP model.
ROMOPOmics is a package with 9 functions in total, 4 of which are internal functions required by other package functions. The package is simple to use; we outline the steps below.
Load the master common data model file from extdata and return the table's information as a data dictionary to be referenced later to standardize the mask data.
dm <- loadDataModel(as_table_list = FALSE)
msks <- loadModelMasks(mask_files = "../inst/extdata")
"Masks" streamline the mapping of values from existing data sets to OMOP format, or at least to how the database's administrator thinks these data sets should be mapped. See the files in r dirs$masks
for examples of masks files used here.
Mask files are tables that provide `alias`, `table`, and `field` columns, which describe each term's name in the input dataset, its destination OMOP table, and its name within that table, respectively. For instance, `patient_name` in the user's database will likely map to `person_source_value` in current OMOP parlance. Using multiple masks should streamline the use of multiple analysis types as well: database administrators can develop and implement masks, and users won't need to know that `patient_name` and `cell_line_name` are both synonymous with `person_source_value` in the OMOP framework, for instance. Next generation sequencing data can be added using the `sequencing` mask, while "HLA" data can be incorporated using an `hla` mask.
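To make the alias lookup concrete, here is a minimal base R sketch of how two masks could resolve different input names to the same OMOP field. It does not use ROMOPOmics itself: the mask rows, the `map_aliases()` helper, and the input column names are hypothetical and only mirror the `alias`, `table`, and `field` columns described above.

```r
# Two hypothetical masks, one per data type. Only the column names
# (alias, table, field) come from the text; the rows are illustrative.
masks <- list(
  clinical = data.frame(
    alias = c("patient_name", "birth_date"),
    table = c("person", "person"),
    field = c("person_source_value", "birth_datetime"),
    stringsAsFactors = FALSE
  ),
  cell_line = data.frame(
    alias = "cell_line_name",
    table = "person",
    field = "person_source_value",
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: look up each input column name in a mask.
# Names that the mask does not mention come back as NA (unmapped).
map_aliases <- function(input_names, mask) {
  idx <- match(input_names, mask$alias)
  data.frame(
    alias = input_names,
    table = mask$table[idx],
    field = mask$field[idx],
    stringsAsFactors = FALSE
  )
}

clinical_map  <- map_aliases(c("patient_name", "birth_date", "favorite_color"),
                             masks$clinical)
cell_line_map <- map_aliases("cell_line_name", masks$cell_line)

print(clinical_map)
print(cell_line_map)
```

Both `patient_name` and `cell_line_name` end up at `person_source_value`, so a user only needs to pick the right mask rather than know the OMOP field names.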
Here's an example of a mask for formatting TCGA clinical data, provided to the `loadModelMasks()` function as a CSV:
msks$brca_clinical %>%
  select(alias, table, field, field_idx, set_value, example1) %>%
  mutate_all(function(x) ifelse(is.na(x), "", x)) %>%
  rename(example = example1) %>%
  mutate_all(function(x) gsub("_", " ", x)) %>%
  kable() %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, font_size = 20, italic = TRUE, hline_after = TRUE) %>%
  column_spec(c(1:3), color = "black") %>%
  column_spec(c(1), background = "lightgray", border_right = TRUE, border_left = TRUE) %>%
  column_spec(c(2), bold = TRUE) %>%
  column_spec(c(4, 5), color = "gray")
Column names:
The OMOP format anticipates data tables with one row per observation, and one observation type per table. As an example, consider this translation of a simple input dataset with two descriptors and two observations:
knitr::include_graphics("../man/figures/patient_to_observation_centric.PNG")
ROMOPOmics converts this wide, unstandardized data format by using a `field_idx` variable. Appending a `field_idx` value allows observations to be "grouped" into a single observation, complete with their units, descriptions, etc.
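The grouping idea can be sketched in a few lines of base R. This is not the package's implementation: the patient record, the alias names, and the way values are attached to the mask are all assumptions made for illustration. Only the `field_idx` column and the OMOP field names `value_as_number` and `unit_source_value` come from this document.

```r
# One hypothetical patient record in wide form, keyed by alias.
patient <- c(height = "170", height_unit = "cm",
             weight = "65",  weight_unit = "kg")

# Hypothetical mask rows: field_idx ties each value to its own unit.
mask <- data.frame(
  alias     = names(patient),
  field     = c("value_as_number", "unit_source_value",
                "value_as_number", "unit_source_value"),
  field_idx = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
mask$value <- unname(patient[mask$alias])

# Rows sharing a field_idx are collapsed into one observation row,
# with the OMOP field names as columns.
groups <- split(mask, mask$field_idx)
observations <- do.call(rbind, lapply(groups, function(g) {
  as.data.frame(as.list(setNames(g$value, g$field)), stringsAsFactors = FALSE)
}))

print(observations)
```

Without `field_idx`, the two `value_as_number` entries would be indistinguishable and there would be no way to tell which unit belongs to which value.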
Using the `readInputFile()` function, data table inputs are translated into the destination format according to the provided `mask` (in this case `brca_clinical` and `brca_mutation`). Tables in this format are "exhaustive" in that they include all possible fields and tables in the data model, including unused ones. Not every variable in the input tables needs to be present in the mask tables: only input variables that are mapped to the common data model format appear there.
omop_inputs <- lapply(names(tcga_files), function(x)
  readInputFile(input_file = tcga_files[[x]],
                mask_table = msks[[x]],
                data_model = dm))
Since tables read via `readInputFile()` include all fields and tables from the data model, these tables can be combined using `combineInputTables()` regardless of the input type or mask used. This function combines all data sets from all mask types, and filters out all OMOP tables from the data model that are unused (no entries in any of the associated fields). Tables are not "partially" used; if any field is included from that table, all fields from that table are included. The only exception to this is table indices: if a table inherits an index from an unused table, that index column is dropped.
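The "all or nothing" rule for tables can be illustrated with a small base R sketch. The long-form layout below (one row per data model field, one column per input record) is an assumption made for the example and may not match the internal format that `readInputFile()` actually returns; the record values are made up.

```r
# Hypothetical "exhaustive" input: every data model field is present,
# and fields that no mask mapped are NA for every record.
exhaustive <- data.frame(
  table   = c("person", "person", "measurement", "measurement", "specimen"),
  field   = c("person_source_value", "birth_datetime",
              "value_as_number", "unit_source_value",
              "specimen_source_value"),
  record1 = c("patient_a", NA, "170", "cm", NA),
  record2 = c("patient_b", NA, "65",  "kg", NA),
  stringsAsFactors = FALSE
)

# A field is "used" if at least one record has a value for it.
value_cols <- setdiff(names(exhaustive), c("table", "field"))
row_used   <- rowSums(!is.na(exhaustive[, value_cols, drop = FALSE])) > 0

# A table is kept whole as soon as any one of its fields is used.
used_tables <- unique(exhaustive$table[row_used])
kept        <- exhaustive[exhaustive$table %in% used_tables, ]

print(used_tables)
print(kept)
```

Here `specimen` is dropped entirely, while `person` keeps its unused `birth_datetime` field because `person_source_value` has entries.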
Once data has been loaded into a single comprehensive table, an index column (`<table_name>_index`) is assigned for each permutation of all data sets included in each used table, and the type of each column is set according to the data model's specification (`VARCHAR(50)` is changed to "character", `INTEGER` is changed to "integer", etc.). Finally, `combineInputTables()` returns each formatted OMOP table in a named list.
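The type conversion step can be sketched as a small lookup from SQL type strings to R types. The helper names `sql_to_r_type()` and `cast_column()` are hypothetical, and the list of recognized SQL types is an assumption; only the `VARCHAR(50)` and `INTEGER` cases are taken from the text above.

```r
# Hypothetical helper: map a data model type string to an R type name.
# Anything unrecognized falls back to "character".
sql_to_r_type <- function(sql_type) {
  sql_type <- toupper(sql_type)
  if (grepl("^(VARCHAR|CHAR|TEXT)", sql_type)) return("character")
  if (grepl("^(INTEGER|INT|BIGINT)", sql_type)) return("integer")
  if (grepl("^(FLOAT|NUMERIC|REAL)", sql_type)) return("numeric")
  "character"
}

# Hypothetical helper: convert a column according to its declared type.
cast_column <- function(x, sql_type) {
  switch(sql_to_r_type(sql_type),
         character = as.character(x),
         integer   = as.integer(x),
         numeric   = as.numeric(x))
}

# Everything arrives as text from the input files, then is cast.
raw <- data.frame(person_id = c("1", "2"),
                  person_source_value = c("patient_a", "patient_b"),
                  stringsAsFactors = FALSE)
spec <- c(person_id = "INTEGER", person_source_value = "VARCHAR(50)")

typed <- raw
for (col in names(spec)) {
  typed[[col]] <- cast_column(raw[[col]], spec[[col]])
}
str(typed)
```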
db_inputs <- combineInputTables(input_table_list = omop_inputs)
In this example using these masks, the OMOP tables included are `r paste(names(db_inputs)[1:(length(db_inputs)-1)],collapse=", ")`, and `r names(db_inputs)[length(db_inputs)]`.
The tables compiled in `db_inputs` are now formatted for creating a SQLite database. The package `dplyr` has built-in SQLite functionality, which is wrapped in the function `buildSQLDBR()`. However, the database can equally be built with any other package.
omop_db <- buildSQLDBR(db_inputs, sql_db_file = file.path(tempdir(), "TCGA_brca.sqlite"))
dbListTables(omop_db)
dbListFields(omop_db, "PERSON")
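As an illustration of building the database with another package, the sketch below writes a named list of data frames to SQLite using `DBI` and `RSQLite` directly. It is not how `buildSQLDBR()` is implemented; the two tables and their contents are made up, and an in-memory database is used so that nothing is written to disk.

```r
library(DBI)

# Hypothetical stand-in for db_inputs: a named list of OMOP-style tables.
tables <- list(
  PERSON = data.frame(
    person_id = 1:2,
    person_source_value = c("patient_a", "patient_b"),
    stringsAsFactors = FALSE
  ),
  MEASUREMENT = data.frame(
    person_id = c(1L, 2L),
    value_as_number = c(170, 65)
  )
)

# Write each list element to a table named after it.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
for (nm in names(tables)) {
  dbWriteTable(con, nm, tables[[nm]])
}
table_names <- dbListTables(con)

# A join with an explicit ON clause keeps rows matched per person.
res <- dbGetQuery(con, "
  SELECT p.person_source_value, m.value_as_number
  FROM PERSON p
  INNER JOIN MEASUREMENT m ON p.person_id = m.person_id
  ORDER BY p.person_id")

dbDisconnect(con)
print(table_names)
print(res)
```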
With a SQL database, data can be queried with simple, clear, and transportable queries for retrieving patient and sample data.
dbGetQuery(omop_db,
  'SELECT person_source_value, person.person_id,file_remote_repo_id,file_remote_repo_value
   FROM person INNER JOIN sequencing
   WHERE file_remote_repo_id IS NOT NULL and person_source_value is "tcga-3c-aaau"
   ORDER BY "file_remote_repo_value"') %>%
  mutate_all(function(x) gsub("_", " ", x)) %>%
  distinct() %>%
  head(20) %>%
  kable() %>%
  kable_styling(full_width = FALSE)
Alternatively, one can use commands from `dplyr` to query a SQL database, abstracting out SQL queries into possibly more intuitive R functions.
inner_join(tbl(omop_db, "PERSON"),
           tbl(omop_db, "MEASUREMENT"),
           by = c("person_id", "provider_id")) %>%
  select(person_source_value, birth_datetime, death_datetime,
         measurement_source_value, value_as_number, unit_source_value) %>%
  filter(charindex("lymph", measurement_source_value)) %>%
  as_tibble() %>%
  mutate_all(function(x) gsub("_", " ", x)) %>%
  head(20) %>%
  kable() %>%
  kable_styling(full_width = FALSE)

DBI::dbDisconnect(omop_db)
Here is the "too long; didn't read" section. Below are the steps to convert patient and sample data into the OMOP framework using ROMOPOmics:
library(ROMOPOmics)

# Load the data model.
dm_file <- system.file("extdata", "OMOP_CDM_v6_0_custom.csv", package = "ROMOPOmics", mustWork = TRUE)
dm <- loadDataModel(master_table_file = dm_file)

# Input files.
tcga_files <- list(
  "brca_clinical" = system.file("extdata", "brca_clinical.csv", package = "ROMOPOmics", mustWork = TRUE),
  "brca_mutation" = system.file("extdata", "brca_mutation.csv", package = "ROMOPOmics", mustWork = TRUE)
)

# Load one mask per input file.
msks <- list(
  brca_clinical = loadModelMasks(system.file("extdata", "brca_clinical_mask.csv", package = "ROMOPOmics", mustWork = TRUE)),
  brca_mutation = loadModelMasks(system.file("extdata", "brca_mutation_mask.csv", package = "ROMOPOmics", mustWork = TRUE))
)

# Translate each input file through its mask.
omop_inputs <- list(
  brca_clinical = readInputFile(input_file = tcga_files$brca_clinical,
                                data_model = dm,
                                mask_table = msks$brca_clinical),
  brca_mutation = readInputFile(input_file = tcga_files$brca_mutation,
                                data_model = dm,
                                mask_table = msks$brca_mutation)
)

# Combine the inputs and build the SQLite database.
db_inputs <- combineInputTables(input_table_list = omop_inputs)
omop_db <- buildSQLDBR(omop_tables = db_inputs, file.path(tempdir(), "TCGA.sqlite"))
dbListTables(omop_db)
DBI::dbDisconnect(omop_db)
dm_file <- system.file("extdata", "OMOP_CDM_v6_0_custom.csv", package = "ROMOPOmics", mustWork = TRUE)
dm <- loadDataModel(master_table_file = dm_file)

msk_file <- system.file("extdata", "GSE60682_standard_mask.csv", package = "ROMOPOmics", mustWork = TRUE)
msks <- loadModelMasks(msk_file)

in_file <- system.file("extdata", "GSE60682_standard.csv", package = "ROMOPOmics", mustWork = TRUE)
omop_inputs <- readInputFile(input_file = in_file,
                             data_model = dm,
                             mask_table = msks,
                             transpose_input_table = TRUE)

db_inputs <- combineInputTables(input_table_list = omop_inputs)
omop_db <- buildSQLDBR(omop_tables = db_inputs,
                       sql_db_file = file.path(dirs$data, "GSE60682_sqlDB.sqlite"))
dbListTables(omop_db)