We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf the phenotypes are available in another resource called brge that is a .txt file. The GWAS analysis is then perform as follows

We first start by preparing login data

builder <- newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org",
               user = "dsuser", password = "password",
               resource = "RSRC.brge_vcf")
logindata <- builder$build()

conns <- datashield.login(logins = logindata, assign = TRUE,
                          symbol = "res")

In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the r BiocStyle::CRANpkg("resourcer")

datashield.assign.resource(conns, symbol = "vcf.res", 
                           resource = list(study1 = "RSRC.brge_vcf"))
datashield.assign.expr(conns, symbol = "gds", 
                       expr = quote(as.resource.object(vcf.res)))


datashield.assign.resource(conns, symbol = "covars.res", 
                           resource = list(study1 = "RSRC.brge"))
datashield.assign.expr(conns, symbol = "covars", 
                       expr = quote(as.resource.data.frame(covars.res)))

These are the objects available in the Opal server

ds.ls()

We can use r Githubpkg("datashield/dsBaseClient") functions to inspect the variables that are in the covars data.frame. The variables are

ds.colnames("covars")

The asthma variable has this number of individuals at each level (0: controls, 1: cases)

ds.table("covars$asthma")

There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes)

The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name can be used to create a new object on the server with the subsetted VCF, preserving the complete one.

genes <- c("A1BG","A2MP1")
ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")

Then, an object of class GenotypeData must be created at the server side to perform genetic data analyses. This is a container defined in the r Biocpkg("GWASTools") package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by

ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, sexId = 2, male_encoding = "2", female_encoding = "1", newobj.name = 'gds.Data')

Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.

To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time

pca <- ds.PCASNPS("gds", prune = TRUE)
plotPCASNPS(pca)

To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars"). In our case, this variable is called "gender", and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels since gender variable is not a factor):

ds.table1D("covars$gender")$counts

The HWE test can be performed to selected chromosomes using the argument chromosomes, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes present on the object can be fetched with

ds.genoDimensions('gds.Data')$chromosomes

Therefore, HWE can be performed by:

ds.exactHWE("gds.Data", chromosome = "20")

The same test can be conducted only for the control individuals (on this example taking the asthma covariates columnas as a cas/control) by:

ds.exactHWE("gds.Data", chromosome = "20", controls = TRUE, controls_column = "asthma")

Similarly, allele frequencies estimates can be estimated by:

ds.alleleFrequency("gds.Data")

In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.

Association analysis for a given SNP is performed by simply

ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')

The analysis of all available SNPs is performed when the argument snps.fit is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the r Biocpkg("limma") at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the r Biocpkg("GWASTools") and r Biocpkg("GENESIS") packages. The complete pipeline is implemented in this function

ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)

This close the DataSHIELD session

datashield.logout(conns)


isglobal-brge/dsOmicsClient documentation built on March 20, 2023, 3:52 p.m.