knitr::opts_chunk$set(message = FALSE, warning = FALSE)
MAD (mosaic alteration detector) is a fast algorithm to detect allelic imbalances using SNP array data (B-allele frequency (BAF) and log R ratio (LRR). The method detects different types of mosaic alterations (UPDs, deletions, duplications and trisomies) by searching for segments were B-deviation is different from 0. Segmentation procedure is carried out using an R implementation of GADA algorithm (Pique-Regi R, Caceres A, Gonzalez JR. BMC Bioinformatics, 2010). R-GADA is available at https://github.com/isglobal-brge/R-GADA.
Herein, we briefly describe how to use the main functions by using a real data set of 5 samples: CASE369.txt, CASE371.txt, CASE377.txt, CONTROL152.txt, CONTROL191.txt. These files can be loaded into R from our package called brgedata
which is also available at our GitHub repository https://github.com/isglobal-brge/brgedata.
The package depends on R-GADA package that can be installed from our GitHub repository:
devtools::install_github("isglobal-brge/R-GADA")
After that, MAD can be installed by
devtools::install_github("isglobal-brge/MAD")
The package is loaded as usually
library(mad)
Detection of mosaic alterations enforces a strict directory structure on the working directory to perform the analysis of multiple samples. Nonetheless, the only required directory to be set up by the user is that containing the raw data. This is an example of how data are organized before the analysis is completed
Notice that rawData directory contains all data files corresponding to each individual from a particular assay. Each text file must be prepared in pennCNV format http://http://penncnv.openbioinformatics.org. This can be performed by using the BeadStudio tool when analyzing Illumina data (see http://www.illumina.com/), or Affymetrix power tools (see https://www.affymetrix.com/ or other Bioconductor packages if Affymetrix SNP data is analyzed: crlmm, affy2sv, etc. This is the structure of a given example:
Name Chr Position Log.R.Ratio B.Allele.Freq GType rs758676 7 12878632 0.1401 0.4977 AB rs3916934 13 103143536 0.3934 0.4610 AA rs2711935 4 38838852 -0.1091 0.0026 AA rs17126880 1 64922104 0.0478 0.9910 AA rs12831433 12 4995220 -0.1661 0.0000 AA
\textcolor{red}{NOTE:} It is important to mention that all files included in the folder rawData must belong to the same type of array (e.g. same number of probes) because the annotation data is obtained from one of these files. If samples have been analyzed using different platforms, the best option is to create another working directory containing another rawData folder.
Let us prepare a rawData
folder containing 5 files in the required format. PennCNV files are available in our package called brgedata
that can be installed by executing:
devtools::install_github("isglobal-brge/brgedata")
Once brgedata
package is available in your computer we can copy the required files into a rawData
folder by executing
ss1 <- system.file("extdata/madData", package="brgedata") dir.create("rawData") ss2 <- "rawData" files <- list.files(ss1) file.copy(file.path(ss1,files), ss2)
As a result, our working directory contains a folder called rawData
having all the files we want to process.
dir() dir("rawData")
we are ready to start the analyses. First, data can be prepared for the analysis by:
example <- setupParGADA.B.deviation(NumCols=6, GenoCol=6, BAFcol=5, log2ratioCol=4)
By default, the function considers that the first column of pennCNV-style files correspond to the marker id (e.g SNP name). The second columd stands for the chromosome name and the third one corresponds to the genomic position. These can be changed by the arguments MarkerIdCol, ChrNameCol and ChrPosCol, respectively. Information about the number of columns of each file (argument NumCols) as well as the column containing Genotypes, BAF and LRR are mandatory and should be passed through the arguments GenoCol, BAFcol and log2ratioCol, respectively. The function also contains an argument called folder that allow the user to indicate where the rawData is located if the working directory does not contains the required rawData
folder.
The object example contains the following information:
example
Segmentation is performed by using two consecutive algorithms implemented in two separate R-functions whithin the gada
package. The first function (parSBL
) uses sparse Bayesian learning (SBL) to discover the most likely positions and magnitudes for a segment, i.e. the breakpoints. The SBL model is governed by a hierarchical Bayesian prior, which is uninformative with respect to the location and magnitude of the copy number changes but restricts the total number of breakpoints. Sensitivity, given by the maximum breakpoint
sparseness, is controlled by the hyperparameter aAlpha
. The second function (parBE
) is an algorithm that uses a backward elimination (BE) strategy to rank the statistical significance of each breakpoint obtained from SBL
. We have programmed another function called parBE.B.deviation
to detect segments that can be considered as an allelic imbalance. The results from parSBL
and parBE.B.deviation
are stored in separate files, one for each sample, in a folder called SBL
.
Therefore, the segmentation procedure can be performed by executing two steps:
parSBL(example, estim.sigma2=TRUE, aAlpha=0.8)
and
parBE.B.deviation(example, T=9, MinSegLen=100)
As previously mentioned, the parameter aAlpha
controls the number of breakpoints. We consider that aAlpha=0.8
is a good choice for Illumina 1M. The parameter T
controls the False Discovery Rate (FDR). We are currently working on estimating the FDR but the recommended settings of aAlpha
and T
depending on the desired level of sensitivity and FDR for a Illumina 650K data are.
\begin{table}[h] \begin{center} \begin{tabular}{r c c } \hline (higher sensitivity , higher FDR ) & $<-->$ & ($a_{\alpha}=0.2$,$T>3$)\ & $<-->$ & ($a_{\alpha}=0.5$,$T>4$)\ (lower sensitivity , lower FDR ) & $<-->$ & ($a_{\alpha}=0.8$,$T>5$)\ \hline \end{tabular} \end{center} \end{table}
Anyway the backward elimination procedure is very fast and the user can change the parameter T
and obtain different results. The argument MinSegLen
indicates the number of consecutive probes that have a B-deviation different from 0.
The function getMosaics
creates a GenomicRanges
object which includes the altered regions:
mosaics <- getMosaics(example) mosaics
The column State
is a preliminary attempt to classify mosaic alterations after MAD calling process based on log2-ratio segment values (LRR) together with the percentage of normal heterozygous (BAF $\sim$ 0.5) and homozygous probes. The number codes correspond to the following abnormalities: UPD (1), deletion (2), duplication (3), trisomy (4) and LOH (5). It is recommendable to check called segments by making chromosome plots and observing LRR and BAF values to confirm those segments with limit cut-off values. The column Bdev
stands for B-deviation that can be used to estimate cellularity. Our experience by analyzing several real datasets is that the method is highly sensible to detect alterations having B-deviations larger than 0.05.
LOH can be removed from the candidate list by simply
mosaics <- subset(mosaics, State!=5) mosaics
For those users who are not familiar with GenomicRanges
the resulting information can be obtained as a data.frame
and exported into a text file by simply writting:
altered <- as(mosaics, "data.frame") head(altered) write.table(altered, file="file.txt", quote=FALSE, sep="\t", row.names=FALSE)
The user can change T
parameter to get more candidate regions. It should be consider that lower values of T
argument will increase false positive results. We start by executing
parBE.B.deviation(example, T=7, MinSegLen=75)
and then the new altered regions can be obtained by:
mosaics2 <- getMosaics(example)
The user can visually inspect the BAF and LRR of given region to verify wheather it is a real mosaic alteration. The function plotQMosaic
can be used to this end:
example <- addAnnot(example) # add annotation plotQMosaic(example, sample="CONTROL191", chr=20, regions=mosaics)
We can also zoom out the region of interest. For instance, let us better visualize the second alteration of CONTROL191 individual that can be accessed by
library(GenomicRanges) mosaics[mosaics$sample=="CONTROL191" & seqnames(mosaics)=="chr20"]
plotZoomQMosaic(example, sample="CONTROL191", chr=20, regions = mosaics[mosaics$sample=="CONTROL191" & seqnames(mosaics)=="chr20"])
We can also visualize all the individuals having an alteration in a given genomic region (passed as a GenomicRange
that can be easily created using GRanges
function) using plotCNV
function from R-GADA
package indicating mosaic=TRUE
. The function requires annotation from Homo.sapiens
and the Gviz
library. Genes can be annotated by setting drawGenes = TRUE
By default, it depicts mosaic gains are in green and loses in blue, UPD are in orange, trisomies in tomato and LOH in red.
library(Gviz) library(Homo.sapiens) rr <- GRanges("chr16:63.5e6-67.0e6") plotCNVs(mosaics, range=rr, mosaic=TRUE, drawGenes = TRUE)
Mosaic segments at individual level can be plotted using GenVisR
Bioconductor package. The legend correspond to: UPD (1), deletion (2), duplication (3), trisomy (4).
library(ggplot2) plotSegments(mosaics)
unlink("rawData", recursive = TRUE) unlink("SBL", recursive = TRUE)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.