In isglobal-brge/madloy_v1: Package to detect mosaic LOY from SNP array data

knitr::opts_chunk$set(cache=TRUE)

Introduction

MADloy is a package to detect mosaic loss of chromosome Y (LOY) events from genotype-array-intensity data. This vignette illustrates how to obtain summarized log R ratio (LRR) values of SNPs probes in the male-specific region of chromosome Y (mLRR-Y) of a set of samples. The mLRR-Y is located in the 56-Mb region between pseudoautosomal regions 1 and 2 (PAR1 and PAR2) on chromosome Y (chrY:2,694,521-59,034,049, hg19/GRCh37). The median value of mLRR-Y can be use as a quantitative proxy of LOY. Downstream association analyses can be performed by regressing median mLRR-Y values with quantitative (e.g. age, gene expression, ...) or qualitative (e.g., case/control, smoking, ...) traits. The p-values of those regression models can be used values to establish correlations between LOY and traits. Another strategy consists on performing calling of LOY samples by using median mLRR-Y and then associate this binary variable (normal/LOY) with the trait of interest. This will be addressed in a different vignette.

Getting started

You can install MADloy from Github by typing

devtools::install_github("isglobal-brge/MADloy_v1")

Then the package is loaded as usual

library(MADloy)

We have prepared a set of data files including 124 males and 2 females to be used as an illustrative example about how to get mLRR-Y data of each sample. Files and data have been anonymized and belong to individuals from general population. These files can be downloaded from this link LOYdata - https://goo.gl/tZu2Pt.

The zipped file includes one file per sample in the required format to be processed with MADloy package. This format is described in the next section.

In order to reproduce this vignette, decompress the .zip file in a folder and set this folder path in an R object (for instance rawDataPath).

rawDataPath <- "/SYNCRW10125/DATASETS/STUDY/EGCUT/rawData_anon"
rawDataPath
files <- dir(rawDataPath)
length(files)
files[1:5]

Processing SNP data with `madloy`: loading and summarizing LRR (and BAF)

Required input data

The function madloy processes individual SNP array data in pennCNV format. Basically, different files of each sample must be created containing information about SNP, chromosome, position, LRR, BAF and genotype (although having only the first 4 columns is enough to summarized mLRR-Y). Different tools can be used to get the required information. Affymetrix data (.CEL files) can be processed by using Birdseed v2 algorithm. Affymetrix power tools can also be used to process .CEL files as well as affy2sv R package. Illumina data (.idat files) can be processed by using Genome Studio software. crlmm Bioconductor package can also be used to get LRR and BAF.

Filtering female samples

LOY association studies are performed only using male samples. Obviously, clinical data can be used to filter such individuals. However, in some ocassions there are errors in those databases that can be detected by using genomic data. The function checkSex can be used to further verify that there are no female samples in our data. Let us perform this filtering using HapMap data

sex <- checkSex(rawDataPath, mc.cores=20)

This function only requires the path containing the raw data in pennCNV format. By default the function is assuming that the LRR information is in column number 4. This can be changed through the argument LRRCol. Notice that this function speed up the process by changing the argument mc.cores.

The function checkSex returns an object that can be plotted by using the generic function plot. The figure depicts the LRR in both X and Y chromosomes.

plot(sex)

This figure shows that there are 4 female samples, although only 2 were identified as female in the epidemiological data. This information can also be seen by typing

sex

These samples can be identified

sex$par$files[sex$class=="FEMALE"]

and removed from the next analyses by selecting the files corresponding to males samples:

files.males <- sex$par$files[sex$class!="FEMALE"]

Data normalization

Summarized (median) mLRR-Y data is used as a proxy of LOY events. The median mLRR-Y value can be affected by several artifacts that have to be corrected before analyzing mLRR-Y data. First, it can be a systematic bias in the mLRR-Y due to the fact that the overall intensity of LRR distribution shifted slightly away from 0 in the whole array. This issue is addressed by normalizing the median mLRR-Y data using the LRR intensity in the autosomes (reference). In particular, we propose to compute the 5% trimmed-mean of LRR to avoid regions having copy number alterations. This parameter can be tuned to take into account the different nature of the data we are dealing with. As an example, studies in cancer are expected to have individuals with large number of aneuploidies. Therefore, the trimmed value of the LRR may be increased up to, for instance, 25%.

Second, some of the existing algorithms used to get LRR information do not normalize the intensity of mLRR-Y to be 0. They provide values close to -0.46 indicating that only 1 copy is present in males (e.g ploidy is equal to 1) and hence, the LRR is centered at 2/3log(1/2) = r round(2/3*log(1/2),2). We address this issue by shifting the observed values of mLRR-Y towards 0 by removing the median value of the mLRR-Y in all individuals.

Get summarized mLRR-Y data

madloy function processes raw data (e.g. separate files in pennCNV format of each sample) and provides the normalized median mLRR-Y value of each sample. The normalization procedure consists on considering technical artifacts that may affect the LRR values in the mLRR-Y region by removing:

The sample mean-trimmed LRR values in autosomes

The median value of summarized mLRR-Y region of all samples

The madloy function is design to process LRR files and only requires the path where those files are located. Let us illustrate how to get summarized data of our illustrative example available at LOYdata package (see Getting Started section). NOTE: if checkSex is not executed, files.males should be replaced by rawDataPath

ex <- madloy(files.males, mc.cores=20)

The function creates an object of class MADloy that can be inspected by using the generic print function.

ex

We observe that LRR data has been summarized in a target and a reference region. By default the target region corresponds to the mLRR-Y region and the reference region (the one used to normalized LRR in the target region) corresponds to autosomal chromosomes. The arguments target.region and ref.region can be used to change those values. This information has to be passed in UCSC format (e.g. "chr21" or "chr21:1000-10000").

By default the human genome reference is hg18 that can be changed in the argument hg. The package also contains files encoding the required information to retrieve summarized data in X and Y PAR regions, p and q arms, and msY region that are used to better describe the characteristics of LOY samples.

The LRR data of the reference is summarized by using the trimmed-mean. The argument trim controls the fraction of probes (e.g. LRR values) that are removed from each end before the mean is computed (0 to 0.5). This is a robust summary of LRR in a given region since it takes into account possible regions in the genome having CNV alterations. By default 5% of probes are trimed. This values is recommended to be increased, for instance, when analyzing cancer data where a large number of alterations is expected to be found. In case of being interested in summaryzing LRR data by using the median value, trim should be set equal to 0.5.

Samples with bad quality, which is, having large variability in the LRR data are recommended to do not be included in the analysis. Therefore, the function returns NA values of summarized mLRR-Y for those samples having LRR variability larger than 2 times the observed mean standard deviation among samples.

Visualizing summarized mLRR-Y data in a set of samples

Data can be visually inspected by using the generic plot function. This function depicts the mean difference between the LLR of Y chromosome and the reference region that is performed in order to control for possible technical artifacts. As previously mentioned, the reference is consider the autosomes and LRR is summarized by using the trimmed mean in order to remove the effect of having any gain or lose in the genome. Notice that this reference chromosome can be changed by the user.

plot(ex, print.labels=TRUE, threshold=-0.3)

This figure shows several samples that may have a LOY event (those in the -.2, -.8 range of Y-axis). These samples can be further analyzed by looking at the mLRR-Y region to verify whether they are real LOY rearrangements as described in the next section.

Visualizing mLRR-Y region of a single sample

We can visually inspect the information of mLRR-Y region (e.g., one by one sample) and decide whether a given individuals is having a LOY or not. This can be performed either by using plotIndSNP or plotIndLRR functions. Let's create these plot of a given sample having a normalized mLRR-Y value around 0 (i.e SAMPLE_1). The figure represents the expected values of normal LRR at 0 (red line). However, as previously mentioned, depending the algorithm used to get LRR in the mLRR-Y region these values can be centered around -0.46 indicating that only 1 chromosome is present in males. Orange line represents the median value of mLRR-Y in the subset of analyzed samples. This is the value that must be considered as the reference to indicate whether the mLRR-Y is a LOY event or not.

```{r, plotSample1, fig.cap="LRR and BAF in the mLRR-Y region (shaded) of SAMPLE_1 from illustrative example"} plotIndSNP(ex, sample="SAMPLE_1")

```r
plotIndLRR(ex, sample="SAMPLE_1")

Now, let us create the plot of SAMPLE_21 and SAMPLE_4 that are having the lowest values of mLRR-Y.

```r plotIndLRR(ex, sample="SAMPLE_29") plotIndLRR(ex, sample="SAMPLE_81") ````

The plots indicate that both samples are probably carrying a LOY because the LRR (brown dots) in the mLRR-Y region (shaded area) is far below from the reference (orange line). The blue line represents the median LRR values in the mLRR-Y region. Notice that the example of SAMPLE_1 is having the blue line very close to the reference one (orange) indicating that it is a normal sample.

isglobal-brge/madloy_v1 documentation built on May 6, 2019, 11:46 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com