knitr::opts_chunk$set( collapse = TRUE, comment = "", message = FALSE, warning = FALSE, cache=TRUE )
This document provides an overview of the bayesOmic
package that will be available at Bioconductor (http://bioconductor.org/). The package implements a Bayesian approach for omic association studies. We propose a shared-component model to tease out the feature information (e.g. SNPs, CNVs, genes, transcripts, CpGs) that is common to cases and controls from the one that is specific to cases. This allows to detect the SNPs (bayesSNPassoc function) or any continuos feature (bayesOmicAssoc
function) that show the strongest association with the disease. The model can nevertheless be applied to more than one disease. More detailed information about the model and assumptions are given in @abellan2010bayesian and @gonzalez2012bayesian in the case of analyzing SNPs or CNVs, respectively. We illustrate how to analyze SNP data by using a synthetic data set of a targeted study (e.g. a selection of SNPs), GWAS data comparing different populations from 1000 Genomes project (To be supplied) and expression data (i.e. a RangedSummarizedExperiment object for an RNA-Seq experiment) on four human airway smooth muscle cell lines.
The bayesOmic
package uses JAGS, a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation @plummer2003jags, to estimate model parameters. The current implementation of bayesOmic
is based on JAGS version 4.3.0. JAGS has an R interface rjags that is used by the package (rjags version 4-9).
The development version of bayesOmic
can be installed from our GitHub repository
library(devtools) install_github("isglobal-brge/bayesOmic")
Then, the package is loaded as usual
library(bayesOmic)
Let us illustrate how to apply our model to a GWAS study as described in @abellan2010bayesian. The examples correspond to a simulated data including information about 72 SNPs. It contains information of 800 individuals divided in 4 populations: controls and 3 type of cases (M1, M2, and M3). We simulated SNPs that are specific from gene 1 to be associated with M1 and SNPs from gene 3 to M3.
Let us start by describing how to get the data into R
Data can be imported from a text file or can be loaded using snpMatrix package (to be supplied). We provide a simulated example that can be loaded by typing:
data(sim.data)
This is a simulated data having 72 SNPs and one column with case/control status
dim(sim.data) sim.data[1:5, 1:5]
We can use SNPassoc
package to prepare the data to perform data analysis.
library(SNPassoc) ii <- grep("snp", colnames(sim.data)) SNPs <- setupSNP(sim.data, colSNPs = ii, sep="")
The model can be fitted by using the function bayesSNPassoc
as follows:
mod <- bayesSNPassoc( ~ caco, data=SNPs, n.cores=3)
The function bayesSNPassoc
prepares the data (i.e, it aggregates the number of rare alleles by the grouping variable). Then the function calls jags.parfit
from dclone
package. The argument n.cores
indicates the number of cores used to fit the model if multiple processors are available. By default n.cores=1
.
We have set up the following default arguments to be passed through rjags functions:
args(bayesSNPassoc)
Notice that other arguments realated to MCMC estimation using JAGS can be passed through this function. More details about them can be obtained at http://calvin.iarc.fr/ martyn/software/jags/.
The function also has two arguments to perform a simple quality control. The argument call.rate
stands for the call rate of a given individual (e.g. proportion of non-missing SNPs, default is 0.9) and min.freq
provides the minimum allele frequency allowed for a given SNP (default is 0.05).
Before interpreting the simulations obtained from de a posteriori distri bution (e.g. model parameters), Markov chains convergence migth be verified. This can be done by using the function checkConvergence
. This function has an argument called
type
that defines the type of plot to be obtained. When type
='Markov
chain' (default value) the function calls plot.mcmc()
function from package coda
Ontherwise, Gelman-Rubin plots are displayed. The function checkConvergence
has another argument, parameter, to indicate the model parameter to be summaryzed. The default is alpha
. For example, Figure \ref{fig:checkConvergence} can be obtained by executing:
checkConvergence(mod)
Other model parameters (Figure \ref{fig:checkConvergence2) are summaryzed by changing the argument called parameter
.
checkConvergence(mod, parameter = "beta")
The trace plot can be obtained by
checkConvergence(mod, type = "trace")
The JAGS estimation can be very time consuming. INLA (Integrated Nested Laplace Approximation) can be used instead
mod.inla <- bayesSNPassoc( ~ caco , data=SNPs, method="INLA")
Model parameters (intercept and shared component) can be obtained by using print function:
mod
mod.inla
On the other hand, specific components can be visualize by:
plot(mod, type="specific")
and
plot(mod.inla, type = "specific")
respectively.
Finally, a hierarchical clustering can be performed by using the predicted probabilities by typing:
makeHeatmap(mod.inla)
The figure shows a Heatmap were we can observe that groups M1 and M3 are different from controls and group M2.
GWAS data is normally stored in PLINK format. Genomic Data Storage (GDS) is a very efficient way of deal with this data in R. Also, GWASTools
package can be use to properly store both genomic and phenotypic dat and help to perform statistical data analyses. Here we illustrate how to estimate the shared component model using a real data example interested in determining whether there are specific obesity genes in males or females.
library(GWASTools) library(SNPRelate)
Let us load the phenotypic data
pp <- "c:/juan/CREAL/GitHub/brgedata/inst/extdata" obesity <- read.delim(file.path(pp, "obesity.txt"), as.is=TRUE)
Now, lets create our variable of interest
obesity$group <- rep(NA, nrow(obesity)) obesity$group[obesity$obese==0] <- "control" obesity$group[obesity$obese==1 & obesity$gender=="Male"] <- "caseMale" obesity$group[obesity$obese==1 & obesity$gender=="Female"] <- "caseFemale" table(obesity$group)
It is always recommended to relevel the grouping variable having the control samples as the reference
obesity$group <- factor(obesity$group, levels = c("control", "caseFemale", "caseMale")) table(obesity$group)
``` {r create_gds} snpgdsBED2GDS(file.path(pp, "obesity.bed"), file.path(pp, "obesity.fam"), file.path(pp, "obesity.bim"), out.gdsfn = "obesity.gds")
```r geno <- GdsGenotypeReader("obesity.gds", YchromCode=24L, XYchromCode=25L)
genoData <- createGenotypeData(geno, obesity)
mod.chr1 <- bayesSNPassoc( ~ group, genoData, chr="1", n.cores=3)
close(geno)
We provide a real data example that can be loaded by typing:
data(armengol)
Model parameter estimates are obtained using the function bayesOmicAssoc
by executing:
mod.CNV <- bayesOmicAssoc(group="pop", data=armengol)
The intercept ans coefficient of shared components of populations can be obtained by:
mod.CNV
We can check model convergence by
checkConvergence(mod.CNV)
checkConvergence(mod.CNV, type="trace")
The shared components are
plot(mod.CNV, type="shared")
The Specific components for each population can be obtained by typing:
plot(mod.CNV, type="specific")
We can obtain the name of the specific features of a given population by
getSpecific(mod.CNV, group="YRI")
The heatmap is obtained by:
makeHeatmap(mod.CNV, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))
Dataset can be loaded by typing:
library(airway) data(airway)
We can analyse a specific genomic range of interest (i.e. chromosome 1, positions 1 to 10000000). We define the roi
by executing:
library(GenomicRanges) library(IRanges) roi <- GRanges(seqnames = "1", ranges= IRanges(1,10000000))
Model parameter estimates are obtained using the function bayesOmicAssoc
with the argument roi
by executing:
mod.Expr <- bayesOmicAssoc(group="dex", data=airway, roi = roi)
The intercept ans coefficient of shared components of populations can be obtained by:
mod.Expr
We can check model convergence by
checkConvergence(mod.Expr)
checkConvergence(mod.Expr, type="trace")
The shared components are
plot(mod.Expr, type="shared")
The Specific components for each population can be obtained by typing:
plot(mod.Expr, type="specific")
We can obtain the name of the specific features of a given population by
getSpecific(mod.Expr, group="trt")
The heatmap is obtained by:
makeHeatmap(mod.Expr, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))
This work has been partly supported by ....
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.