knitr::opts_chunk$set(message = FALSE, warning = FALSE)

Introduction

MAD (mosaic alteration detector) is a fast algorithm to detect allelic imbalances using SNP array data (B-allele frequency (BAF) and log R ratio (LRR). The method detects different types of mosaic alterations (UPDs, deletions, duplications and trisomies) by searching for segments were B-deviation is different from 0. Segmentation procedure is carried out using an R implementation of GADA algorithm (Pique-Regi R, Caceres A, Gonzalez JR. BMC Bioinformatics, 2010). R-GADA is available at https://github.com/isglobal-brge/R-GADA.

Herein, we briefly describe how to use the main functions by using a real data set of 5 samples: CASE369.txt, CASE371.txt, CASE377.txt, CONTROL152.txt, CONTROL191.txt. These files can be loaded into R from our package called brgedata which is also available at our GitHub repository https://github.com/isglobal-brge/brgedata.

Getting Started

The package depends on R-GADA package that can be installed from our GitHub repository:

devtools::install_github("isglobal-brge/R-GADA")

After that, MAD can be installed by

devtools::install_github("isglobal-brge/MAD")

The package is loaded as usually

library(mad)

Required data format

Detection of mosaic alterations enforces a strict directory structure on the working directory to perform the analysis of multiple samples. Nonetheless, the only required directory to be set up by the user is that containing the raw data. This is an example of how data are organized before the analysis is completed

Notice that rawData directory contains all data files corresponding to each individual from a particular assay. Each text file must be prepared in pennCNV format http://http://penncnv.openbioinformatics.org. This can be performed by using the BeadStudio tool when analyzing Illumina data (see http://www.illumina.com/), or Affymetrix power tools (see https://www.affymetrix.com/ or other Bioconductor packages if Affymetrix SNP data is analyzed: crlmm, affy2sv, etc. This is the structure of a given example:

Name       Chr Position   Log.R.Ratio B.Allele.Freq GType
rs758676   7   12878632   0.1401      0.4977        AB
rs3916934  13  103143536  0.3934      0.4610        AA
rs2711935  4   38838852  -0.1091      0.0026        AA
rs17126880 1   64922104   0.0478      0.9910        AA
rs12831433 12  4995220   -0.1661      0.0000        AA

\textcolor{red}{NOTE:} It is important to mention that all files included in the folder rawData must belong to the same type of array (e.g. same number of probes) because the annotation data is obtained from one of these files. If samples have been analyzed using different platforms, the best option is to create another working directory containing another rawData folder.

Importing array data

Let us prepare a rawData folder containing 5 files in the required format. PennCNV files are available in our package called brgedata that can be installed by executing:

devtools::install_github("isglobal-brge/brgedata")

Once brgedata package is available in your computer we can copy the required files into a rawData folder by executing

ss1 <- system.file("extdata/madData", package="brgedata")
dir.create("rawData")
ss2 <- "rawData"
files <- list.files(ss1)
file.copy(file.path(ss1,files), ss2)

As a result, our working directory contains a folder called rawData having all the files we want to process.

dir()
dir("rawData")

we are ready to start the analyses. First, data can be prepared for the analysis by:

example <- setupParGADA.B.deviation(NumCols=6, GenoCol=6, 
                                    BAFcol=5, log2ratioCol=4)

By default, the function considers that the first column of pennCNV-style files correspond to the marker id (e.g SNP name). The second columd stands for the chromosome name and the third one corresponds to the genomic position. These can be changed by the arguments MarkerIdCol, ChrNameCol and ChrPosCol, respectively. Information about the number of columns of each file (argument NumCols) as well as the column containing Genotypes, BAF and LRR are mandatory and should be passed through the arguments GenoCol, BAFcol and log2ratioCol, respectively. The function also contains an argument called folder that allow the user to indicate where the rawData is located if the working directory does not contains the required rawData folder.

The object example contains the following information:

example

Segmentation procedure

Segmentation is performed by using two consecutive algorithms implemented in two separate R-functions whithin the gada package. The first function (parSBL) uses sparse Bayesian learning (SBL) to discover the most likely positions and magnitudes for a segment, i.e. the breakpoints. The SBL model is governed by a hierarchical Bayesian prior, which is uninformative with respect to the location and magnitude of the copy number changes but restricts the total number of breakpoints. Sensitivity, given by the maximum breakpoint sparseness, is controlled by the hyperparameter aAlpha. The second function (parBE) is an algorithm that uses a backward elimination (BE) strategy to rank the statistical significance of each breakpoint obtained from SBL. We have programmed another function called parBE.B.deviation to detect segments that can be considered as an allelic imbalance. The results from parSBL and parBE.B.deviation are stored in separate files, one for each sample, in a folder called SBL.

Therefore, the segmentation procedure can be performed by executing two steps:

parSBL(example, estim.sigma2=TRUE, aAlpha=0.8)

and

parBE.B.deviation(example, T=9, MinSegLen=100)

As previously mentioned, the parameter aAlpha controls the number of breakpoints. We consider that aAlpha=0.8 is a good choice for Illumina 1M. The parameter T controls the False Discovery Rate (FDR). We are currently working on estimating the FDR but the recommended settings of aAlpha and T depending on the desired level of sensitivity and FDR for a Illumina 650K data are.

\begin{table}[h] \begin{center} \begin{tabular}{r c c } \hline (higher sensitivity , higher FDR ) & $<-->$ & ($a_{\alpha}=0.2$,$T>3$)\ & $<-->$ & ($a_{\alpha}=0.5$,$T>4$)\ (lower sensitivity , lower FDR ) & $<-->$ & ($a_{\alpha}=0.8$,$T>5$)\ \hline \end{tabular} \end{center} \end{table}

Anyway the backward elimination procedure is very fast and the user can change the parameter T and obtain different results. The argument MinSegLen indicates the number of consecutive probes that have a B-deviation different from 0.

Obtaining allelic imbalance regions

The function getMosaics creates a GenomicRanges object which includes the altered regions:

mosaics <- getMosaics(example)
mosaics

The column State is a preliminary attempt to classify mosaic alterations after MAD calling process based on log2-ratio segment values (LRR) together with the percentage of normal heterozygous (BAF $\sim$ 0.5) and homozygous probes. The number codes correspond to the following abnormalities: UPD (1), deletion (2), duplication (3), trisomy (4) and LOH (5). It is recommendable to check called segments by making chromosome plots and observing LRR and BAF values to confirm those segments with limit cut-off values. The column Bdev stands for B-deviation that can be used to estimate cellularity. Our experience by analyzing several real datasets is that the method is highly sensible to detect alterations having B-deviations larger than 0.05.

LOH can be removed from the candidate list by simply

mosaics <- subset(mosaics, State!=5)
mosaics

For those users who are not familiar with GenomicRanges the resulting information can be obtained as a data.frame and exported into a text file by simply writting:

altered <- as(mosaics, "data.frame")
head(altered)
write.table(altered, file="file.txt", quote=FALSE, sep="\t",
            row.names=FALSE)

The user can change T parameter to get more candidate regions. It should be consider that lower values of T argument will increase false positive results. We start by executing

parBE.B.deviation(example, T=7, MinSegLen=75) 

and then the new altered regions can be obtained by:

mosaics2 <- getMosaics(example) 

Data visualization

The user can visually inspect the BAF and LRR of given region to verify wheather it is a real mosaic alteration. The function plotQMosaic can be used to this end:

example <- addAnnot(example) # add annotation
plotQMosaic(example, sample="CONTROL191", chr=20, 
            regions=mosaics)

We can also zoom out the region of interest. For instance, let us better visualize the second alteration of CONTROL191 individual that can be accessed by

library(GenomicRanges)
mosaics[mosaics$sample=="CONTROL191" &
          seqnames(mosaics)=="chr20"]
plotZoomQMosaic(example, sample="CONTROL191", chr=20,
            regions = mosaics[mosaics$sample=="CONTROL191" &
                                seqnames(mosaics)=="chr20"])

We can also visualize all the individuals having an alteration in a given genomic region (passed as a GenomicRange that can be easily created using GRanges function) using plotCNV function from R-GADA package indicating mosaic=TRUE. The function requires annotation from Homo.sapiens and the Gviz library. Genes can be annotated by setting drawGenes = TRUE By default, it depicts mosaic gains are in green and loses in blue, UPD are in orange, trisomies in tomato and LOH in red.

library(Gviz)
library(Homo.sapiens)
rr <- GRanges("chr16:63.5e6-67.0e6")
plotCNVs(mosaics, range=rr, mosaic=TRUE, drawGenes = TRUE)

Mosaic segments at individual level can be plotted using GenVisR Bioconductor package. The legend correspond to: UPD (1), deletion (2), duplication (3), trisomy (4).

library(ggplot2)
plotSegments(mosaics)
unlink("rawData", recursive = TRUE)
unlink("SBL", recursive = TRUE)
sessionInfo()


isglobal-brge/MAD documentation built on April 21, 2020, 3:26 p.m.