The purpose of the `r Githubpkg("isglobal-brge/dsOmicsClient")` package is to provide a set of functions to perform omic association analyses when data are stored in federated databases or, more generally, in different repositories. In particular, the package uses the DataSHIELD infrastructure, a software solution that allows simultaneous co-analysis of data from multiple studies stored on different servers without the need to physically pool the data or disclose sensitive information [@wilson_datashield_2017]. DataSHIELD relies on Opal servers to perform such analyses. Our bookdown introduces Opal, DataSHIELD and other related features. Here, we describe the ones most relevant for reproducing this document.

At a high level, DataSHIELD is set up as a client-server model in which each server houses the data for a particular study. The client sends a request to run specific functions on the remote servers, where the analysis is performed. Only non-sensitive, pre-approved summary statistics are returned from each study to the client, where they can be combined for an overall analysis. An overview of a single-site DataSHIELD architecture is shown in Figure \@ref(fig:dsArchitec).

knitr::include_graphics(tools::file_path_as_absolute("../fig/singleSiteDSInfrastructure.jpg"))
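
The flow described above (each study computes aggregates locally and the client combines them) can be illustrated with a minimal base R sketch. It does not use the DataSHIELD API: the function names `server_summary()` and `combine_summaries()` are hypothetical, the data are simulated, and the three "studies" live in one R session purely for illustration.

```r
# Simulated data held by three separate studies. In a real deployment these
# vectors would never leave their servers (values are made up for illustration).
set.seed(1)
studies <- list(
  study1 = rnorm(120, mean = 5.0, sd = 1),
  study2 = rnorm(80,  mean = 5.4, sd = 1),
  study3 = rnorm(200, mean = 4.8, sd = 1)
)

# Hypothetical "server-side" function: returns only aggregate statistics,
# never the individual-level values.
server_summary <- function(x) {
  list(n = length(x), sum = sum(x), sum_sq = sum(x^2))
}

# Hypothetical "client-side" function: combines the per-study aggregates
# into a pooled mean and variance.
combine_summaries <- function(summaries) {
  n      <- sum(vapply(summaries, function(s) s$n, numeric(1)))
  total  <- sum(vapply(summaries, function(s) s$sum, numeric(1)))
  sum_sq <- sum(vapply(summaries, function(s) s$sum_sq, numeric(1)))
  pooled_mean <- total / n
  pooled_var  <- (sum_sq - n * pooled_mean^2) / (n - 1)
  list(n = n, mean = pooled_mean, var = pooled_var)
}

summaries <- lapply(studies, server_summary)  # runs "at each study"
pooled    <- combine_summaries(summaries)     # runs "at the client"
```

In this sketch the pooled mean and variance match what would be obtained from the physically pooled data, although the client only ever sees three numbers per study.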

One of the main limitations of DataSHIELD is how to deal with large data, given the restrictions Opal imposes on databases. Nonetheless, the recent development of the `r Githubpkg("obiba/resourcer")` R package allows DataSHIELD developers to overcome this drawback by enabling Opal servers to deal with any type of data (i.e. resources). So far, Opal can register access to data resources in different formats (csv, tsv, R data, SQL, tidy, ...) that can also be located in different places (local, http, ssh, AWS S3 or MongoDB file stores, ...). This is an important advancement for a second reason: `r Githubpkg("obiba/resourcer")` also addresses the issue of having duplicated data in different research centers or hospitals.

The `r Githubpkg("obiba/resourcer")` package makes it possible to work with specific R data classes. This is highly important in our setting, since it allows us to use Bioconductor classes to properly manage omic data with efficient infrastructures such as ExpressionSet or RangedSummarizedExperiment, among others. Another important asset of the `r Githubpkg("obiba/resourcer")` package is that it can be extended to new data types by writing specific functions (see how to extend resources). We have used this feature to create functions for the analysis of Variant Call Format (VCF) files, which are loaded into R as Genomic Data Structure (GDS) objects. These functions, along with others that allow the management of Bioconductor classes in DataSHIELD, have been included in a new DataSHIELD package, `r Githubpkg("isglobal-brge/dsOmics")`, which is able to manage the different Bioconductor data infrastructures required to perform omic association analyses. These include ExpressionSet, RangedSummarizedExperiment and GDS, among others. Generally speaking, any data format and storage that can be read by R can be expressed as a resource.
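
The idea that a resource is a description of where data live and in which format, and that support for a new format is added by writing one more function, can be sketched in base R as follows. This is a toy analogue, not the resourcer API: `new_resource()`, `register_resolver()` and `resolve_resource()` are hypothetical names, and the package's own extension mechanism is not reproduced here.

```r
# A toy resource: only a description of where the data live and their format.
# (Hypothetical helper; not part of the resourcer package.)
new_resource <- function(name, url, format) {
  structure(list(name = name, url = url, format = format),
            class = "toy_resource")
}

# Registry mapping a format name to the function that reads it into R.
resolvers <- new.env()
register_resolver <- function(format, reader) {
  assign(format, reader, envir = resolvers)
}

# Look up the reader for the resource's format and apply it to its location.
resolve_resource <- function(res) {
  if (!exists(res$format, envir = resolvers, inherits = FALSE)) {
    stop("no resolver registered for format: ", res$format)
  }
  reader <- get(res$format, envir = resolvers, inherits = FALSE)
  reader(res$url)
}

# Formats supported from the start.
register_resolver("csv", function(path) read.csv(path))
register_resolver("tsv", function(path) read.delim(path))

# Extending to a new format only requires registering one more reader.
register_resolver("rds", function(path) readRDS(path))

# Example data written to temporary local files (paths stand in for URLs).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)),
          csv_path, row.names = FALSE)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(a = 1), rds_path)

pheno <- resolve_resource(new_resource("pheno", csv_path, "csv"))
obj   <- resolve_resource(new_resource("obj", rds_path, "rds"))
```

The same pattern is what makes it plausible to return richer objects: a reader registered for a given format is free to return any R class, which in the setting described above would be a Bioconductor container rather than a data frame.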

In the next sections we first describe how to deal with Opal servers and resources. We illustrate how we prepared a test environment to show how Opal must be set up, as well as how to provide the appropriate R/DataSHIELD configuration on both the Opal server and the client side to perform omic association analyses. Then, the different types of omic data analyses that can be performed with the `r Githubpkg("isglobal-brge/dsOmicsClient")` functionality are described and further illustrated using real data examples, including epigenome, transcriptome and genomic data analyses.



isglobal-brge/dsOmicsClient documentation built on March 20, 2023, 3:52 p.m.