In isglobal-brge/dsOmicsClient: DataSHIELD client site Omics association functions.

Let us illustrate how to perform transcriptomic data analysis using data from TCGA project. We have uploaded to the opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. This resource contains the RangeSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. Next, we illustrate how a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using r Biocpkg("limma") package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.

Let us start by creating the connection to the opal server:

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org", 
               user = "dsuser", password = "P@ssw0rd", 
               resource = "RSRC.tcga_liver", profile = "omics")

logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE, 
                          symbol = "res")

Then, let us coerce the resource to a RangedSummarizedExperiment which is the type of object that is available in the recount project.

datashield.assign.expr(conns, symbol = "rse", 
                       expr = quote(as.resource.object(res)))
ds.class("rse")

The number of features and samples can be inspected by

ds.dim("rse")

And the names of the features using the same function used in the case of analyzing an ExpressionSet

name.features <- ds.featureNames("rse")
lapply(name.features, head)

Also the covariate names can be inspected by

name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)

We can visualize the levels of the variable having gender information that will be our condition (i.e., we are interested in obtaining genes that are differentially expressed between males and females)

ds.table1D("rse$gdc_cases.demographic.gender")

We have implemented a function called ds.RNAseqPreproc() to perform RNAseq data pre-processing that includes:

transforming data into log2 CPM units
filtering lowly-expressed genes
data normalization

ds.RNAseqPreproc('rse', group= 'gdc_cases.demographic.gender', 
                 newobj.name = 'rse.pre')

Note that it is recommended to indicate the grouping variable (i.e., condition). Once data have been pre-processed, we can perform differential expression analysis. Notice how dimensions have changed given the fact that we have removed genes with low expression which are expected to do not be differentially expressed.

ds.dim('rse')
ds.dim('rse.pre')

The differential expression analysis is ´dsOmicsClient/dsOmics´ is implemented in the funcion ds.limma(). This functions runs a limma-pipeline for microarray data and for RNAseq data allows:

voom + limma
DESeq2
edgeR

We recommend to use the voom + limma pipeline proposed here given its versatility and that limma is much faster than DESeq2 and edgeR. By default, the function consider that data are obtained from a microarray experiment (type.data = "microarray"). Therefore, as we are analyzing RNAseq data, we much indicate that type.data = "RNAse"

ans.gender <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq")

The top differentially expressed genes can be visualized by:

ans.gender

We can verify whether the distribution of the observed p-values are the ones we expect in this type of analyses

hist(ans.gender$study1$P.Value, xlab="Raw p-value gender effect",
     main="", las=1, cex.lab=1.5, cex.axis=1.2, col="gray")

We can also check whether there is inflation just executing

qqplot(ans.gender$study1$P.Value)

So, in that case, the model needs to remove unwanted variability ($\lambda>2$). If so, we can use surrogate variable analysis just changing the argument sva=TRUE

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE)

Now the inflation has dramatically been reduced ($\lambda>1.12$)

qqplot(ans.gender.sva$study1$P.Value)

We can add annotation to the output that is available in our RSE object. We can have access to this information by

ds.fvarLabels('rse.pre')

So, we can run

ans.gender.sva <- ds.limma(model =  ~ gdc_cases.demographic.gender, 
                       Set = "rse.pre", type.data = "RNAseq",
                       sva = TRUE, annotCols = c("chromosome"))

The results are:

ans.gender.sva

The function has another arguments that can be used to fit other type of models:

sva: estimate surrogate variables
annotCols: to add annotation available in the
method: Linear regression ("ls") or robust regression ("robust") used in limma (lmFit)
robust: robust method used for outlier sample variances used in limma (eBayes)
normalization: normalization method used in the voom transformation (default "none")
voomQualityWeights: should voomQualityWeights function be used instead of voom? (default FALSE)
big: should SmartSVA be used instead of SVA (useful for big sample size or when analyzing epigenome data. Default FALSE)

We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:

To be supplied

We close the DataSHIELD session by: