knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  message = FALSE, warning = FALSE, cache=TRUE
)

Introduction

This document provides an overview of the bayesOmic package that will be available at Bioconductor (http://bioconductor.org/). The package implements a Bayesian approach for omic association studies. We propose a shared-component model to tease out the feature information (e.g. SNPs, CNVs, genes, transcripts, CpGs) that is common to cases and controls from the one that is specific to cases. This allows to detect the SNPs (bayesSNPassoc function) or any continuos feature (bayesOmicAssoc function) that show the strongest association with the disease. The model can nevertheless be applied to more than one disease. More detailed information about the model and assumptions are given in @abellan2010bayesian and @gonzalez2012bayesian in the case of analyzing SNPs or CNVs, respectively. We illustrate how to analyze SNP data by using a synthetic data set of a targeted study (e.g. a selection of SNPs), GWAS data comparing different populations from 1000 Genomes project (To be supplied) and expression data (i.e. a RangedSummarizedExperiment object for an RNA-Seq experiment) on four human airway smooth muscle cell lines.

Getting started

The bayesOmic package uses JAGS, a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation @plummer2003jags, to estimate model parameters. The current implementation of bayesOmic is based on JAGS version 4.3.0. JAGS has an R interface rjags that is used by the package (rjags version 4-9).

The development version of bayesOmic can be installed from our GitHub repository

library(devtools)
install_github("isglobal-brge/bayesOmic")

Then, the package is loaded as usual

library(bayesOmic)

Analysis of SNP data: small datasets

Let us illustrate how to apply our model to a GWAS study as described in @abellan2010bayesian. The examples correspond to a simulated data including information about 72 SNPs. It contains information of 800 individuals divided in 4 populations: controls and 3 type of cases (M1, M2, and M3). We simulated SNPs that are specific from gene 1 to be associated with M1 and SNPs from gene 3 to M3.

Let us start by describing how to get the data into R

The data

Data can be imported from a text file or can be loaded using snpMatrix package (to be supplied). We provide a simulated example that can be loaded by typing:

data(sim.data)

This is a simulated data having 72 SNPs and one column with case/control status

dim(sim.data)
sim.data[1:5, 1:5]

We can use SNPassoc package to prepare the data to perform data analysis.

library(SNPassoc)
ii <- grep("snp", colnames(sim.data))
SNPs <- setupSNP(sim.data, colSNPs = ii, sep="")

Model parameter estimates

The model can be fitted by using the function bayesSNPassoc as follows:

mod <- bayesSNPassoc( ~ caco, data=SNPs, n.cores=3)

The function bayesSNPassoc prepares the data (i.e, it aggregates the number of rare alleles by the grouping variable). Then the function calls jags.parfit from dclone package. The argument n.cores indicates the number of cores used to fit the model if multiple processors are available. By default n.cores=1.

We have set up the following default arguments to be passed through rjags functions:

args(bayesSNPassoc)

Notice that other arguments realated to MCMC estimation using JAGS can be passed through this function. More details about them can be obtained at http://calvin.iarc.fr/ martyn/software/jags/.

The function also has two arguments to perform a simple quality control. The argument call.rate stands for the call rate of a given individual (e.g. proportion of non-missing SNPs, default is 0.9) and min.freq provides the minimum allele frequency allowed for a given SNP (default is 0.05).

Checking convergence

Before interpreting the simulations obtained from de a posteriori distri bution (e.g. model parameters), Markov chains convergence migth be verified. This can be done by using the function checkConvergence. This function has an argument called type that defines the type of plot to be obtained. When type='Markov chain' (default value) the function calls plot.mcmc() function from package coda Ontherwise, Gelman-Rubin plots are displayed. The function checkConvergence has another argument, parameter, to indicate the model parameter to be summaryzed. The default is alpha. For example, Figure \ref{fig:checkConvergence} can be obtained by executing:

checkConvergence(mod)

Other model parameters (Figure \ref{fig:checkConvergence2) are summaryzed by changing the argument called parameter.

checkConvergence(mod, parameter = "beta")

The trace plot can be obtained by

checkConvergence(mod, type = "trace")

The JAGS estimation can be very time consuming. INLA (Integrated Nested Laplace Approximation) can be used instead

mod.inla <- bayesSNPassoc( ~ caco , data=SNPs, method="INLA")

Results

Model parameters (intercept and shared component) can be obtained by using print function:

mod
mod.inla

On the other hand, specific components can be visualize by:

plot(mod, type="specific")

and

plot(mod.inla, type = "specific")

respectively.

Finally, a hierarchical clustering can be performed by using the predicted probabilities by typing:

makeHeatmap(mod.inla)

The figure shows a Heatmap were we can observe that groups M1 and M3 are different from controls and group M2.

Analysis of GWAS data: data in PLINK format

GWAS data is normally stored in PLINK format. Genomic Data Storage (GDS) is a very efficient way of deal with this data in R. Also, GWASTools package can be use to properly store both genomic and phenotypic dat and help to perform statistical data analyses. Here we illustrate how to estimate the shared component model using a real data example interested in determining whether there are specific obesity genes in males or females.

library(GWASTools)
library(SNPRelate)

Let us load the phenotypic data

pp <- "c:/juan/CREAL/GitHub/brgedata/inst/extdata"
obesity <- read.delim(file.path(pp, "obesity.txt"), as.is=TRUE)

Now, lets create our variable of interest

obesity$group <- rep(NA, nrow(obesity))
obesity$group[obesity$obese==0] <- "control"
obesity$group[obesity$obese==1 & obesity$gender=="Male"] <- "caseMale"
obesity$group[obesity$obese==1 & obesity$gender=="Female"] <- "caseFemale"
table(obesity$group)

It is always recommended to relevel the grouping variable having the control samples as the reference

obesity$group <- factor(obesity$group, 
                        levels = c("control", "caseFemale", "caseMale"))
table(obesity$group)

``` {r create_gds} snpgdsBED2GDS(file.path(pp, "obesity.bed"), file.path(pp, "obesity.fam"), file.path(pp, "obesity.bim"), out.gdsfn = "obesity.gds")

```r
geno <- GdsGenotypeReader("obesity.gds", 
                        YchromCode=24L, 
                        XYchromCode=25L)
genoData <- createGenotypeData(geno, obesity)
mod.chr1 <- bayesSNPassoc( ~ group, genoData, chr="1", n.cores=3)
close(geno)

Analysis of CNV data

We provide a real data example that can be loaded by typing:

data(armengol)

Model parameter estimates are obtained using the function bayesOmicAssoc by executing:

mod.CNV <- bayesOmicAssoc(group="pop", data=armengol)

The intercept ans coefficient of shared components of populations can be obtained by:

mod.CNV

We can check model convergence by

checkConvergence(mod.CNV)
checkConvergence(mod.CNV, type="trace")

The shared components are

plot(mod.CNV, type="shared")

The Specific components for each population can be obtained by typing:

plot(mod.CNV, type="specific")

We can obtain the name of the specific features of a given population by

getSpecific(mod.CNV, group="YRI")

The heatmap is obtained by:

makeHeatmap(mod.CNV, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))

Analysis of expression data (SummarizedExperiment objects)

Dataset can be loaded by typing:

library(airway)
data(airway)

We can analyse a specific genomic range of interest (i.e. chromosome 1, positions 1 to 10000000). We define the roi by executing:

library(GenomicRanges)
library(IRanges)
roi <- GRanges(seqnames = "1", ranges= IRanges(1,10000000))

Model parameter estimates are obtained using the function bayesOmicAssoc with the argument roi by executing:

mod.Expr <- bayesOmicAssoc(group="dex", data=airway, roi = roi)

The intercept ans coefficient of shared components of populations can be obtained by:

mod.Expr

We can check model convergence by

checkConvergence(mod.Expr)
checkConvergence(mod.Expr, type="trace")

The shared components are

plot(mod.Expr, type="shared")

The Specific components for each population can be obtained by typing:

plot(mod.Expr, type="specific")

We can obtain the name of the specific features of a given population by

getSpecific(mod.Expr, group="trt")

The heatmap is obtained by:

makeHeatmap(mod.Expr, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))

Acknowledgments

This work has been partly supported by ....

References



isglobal-brge/bayesOmic documentation built on April 22, 2020, 8:35 p.m.