knitr::opts_chunk$set(cache=TRUE, message = FALSE,
                      warning = FALSE)

Introduction

MADloy is a package to detect mosaic loss of chromosome Y (LOY) events from genotype-array-intensity data. This vignette illustrates how to obtain summarized log R ratio (LRR) values of SNPs probes in the male-specific region of chromosome Y (mLRR-Y) of a set of samples, and checking the putative LOY events with B Allele Frequency (BAF) values of SNP probes in the pseudoautosomal regions (PAR1, PAR2) and X-transposed region (XTR) on chromosome X. The mLRR-Y is located in the 56-Mb region between PAR1 and PAR2 on chromosome Y (chrY:2,694,521-59,034,049, hg19/GRCh37). The median value of mLRR-Y can be use as a quantitative proxy of LOY, but checking Bdev in shared regions with chromosome X is necessary to assess calling. Downstream association analyses can be performed by regressing median mLRR-Y values with quantitative (e.g. age, gene expression, ...) or qualitative (e.g., case/control, smoking, ...) traits. The p-values of those regression models can be used values to establish correlations between LOY and traits.

Previous strategy considers mLRR-Y as a surrogate of LOY status @forsberg2014mosaic. Here we illustrate how to perform calling (e.g LOY/normal/XYY) to stablish whether an individual is carrying an alteration in chromosome Y or not using the methodology described in @gonzalezMADloy that uses a new method to thresholding mLRR-Y data and incorporates information about B-deviation to reduce missclassification and, hence, improve statistical power.

Getting started

You can install MADloy from Github by typing

devtools::install_github("isglobal-brge/MADloy")

Then the package is loaded as usual

library(MADloy)
library(parallel)
options(mc.cores=detectCores()-1)

We have prepared a set of data files including 124 reported males and 2 females to be used as an illustrative example about how to get mLRR-Y data of each sample. Files and data have been anonymized and belong to individuals from general population. These files can be downloaded from this link LOYdata - github repository

The zipped file includes one file per sample in the required format to be processed with MADloy package. This format is described in the next section.

In order to reproduce this vignette, decompress the .zip file in a folder and set this folder path in an R object (for instance rawDataPath).

rawDataPath <- "C:/rawData"
rawDataPath
files <- dir(rawDataPath, pattern = "*.txt")
length(files)
files[1:5]

Processing SNP data with madloy: loading and summarizing LRR (and BAF)

Required input data

The function madloy processes individual SNP array data in pennCNV format. Basically, different files of each sample must be created containing information about SNP, chromosome, position, LRR, BAF and genotype (although having only the first 4 columns is enough to summarized mLRR-Y). Different tools can be used to get the required information. Affymetrix data (.CEL files) can be processed by using Birdseed v2 algorithm. Affymetrix power tools can also be used to process .CEL files as well as affy2sv R package. Illumina data (.idat files) can be processed by using Genome Studio software. crlmm Bioconductor package can also be used to get LRR and BAF.

This is an example of an input data for one individual:

Name       Chr Position   Log.R.Ratio B.Allele.Freq GType
rs758676   7   12878632   0.1401      0.4977        AB
rs3916934  13  103143536  0.3934      0.4610        AA
rs2711935  4   38838852  -0.1091      0.0026        AA
rs17126880 1   64922104   0.0478      0.9910        AA
rs12831433 12  4995220   -0.1661      0.0000        AA

Notice that the SNP name is in the first column, the chromosome in the second and the log-r-ratio in the third. These are the columns that are assumed by default in the arguments of all functions in the package (arguments rsCol = 1, ChrCol = 2, PosCol = 3, LRRCol = 4). However, these can be changed by the user.

\textcolor{red}{NOTE:} It is important to mention that all files of the analyzed individuals must belong to the same type of array (e.g. same number of probes) because the annotation data is obtained from one of these files. If samples have been analyzed using different platforms, the best option is to perform the analyses by batches and then merge the results.

Filtering female samples

LOY association studies are performed only using male samples. Obviously, phenotype (e.g. clinical/epidemiological/questionaire) data can be used to filter such individuals. However, in some occasions there are errors in those databases that can be detected by using genomic data. The function checkSex can be used to further verify that there are no female samples in our data. Let us perform this filtering using the test data.

sex <- checkSex(rawDataPath)

This function only requires the path containing the raw data in pennCNV format. By default the function is assuming that the LRR information is in column number 4. As previously mentioned, this can be changed through the argument LRRCol. If any of the samples lacks information on some of the chromosomes, the samples are omitted. Notice that this function speed up the process by changing the argument mc.cores.

The function checkSex returns an object that can be plotted by using the generic function plot. The figure depicts the LRR in both X and Y chromosomes.

plot(sex)

This figure shows that there are 4 female samples, although only 2 were identified as female in the phenotype data. This information can also be seen by typing

sex

These samples can be identified

sex$par$files[na.omit(sex$class=="FEMALE")]

as we can observe, there two reported female samples are classified as "FEMALE" and we also detect some discrepancies in other two samples that were reported as "MALE" while their genetic data indicates the contrary. Those samples are removed from downstream analysis since this is a normal QC step in genetic studies:

files.males <- sex$par$files[na.omit(sex$class!="FEMALE")]

Data normalization

Summarized (median) mLRR-Y data is used as a proxy of LOY events. The median mLRR-Y value can be affected by several artifacts that have to be corrected before analyzing mLRR-Y data. First, it can be a systematic bias in the mLRR-Y due to the fact that the overall intensity of LRR distribution shifted slightly away from 0 in the whole array. This issue is addressed by normalizing the median mLRR-Y data using the LRR intensity in the autosomes (reference). In particular, we propose to compute the 5% trimmed-mean of LRR to avoid regions having copy number alterations. This parameter can be tuned to take into account the different nature of the data we are dealing with. As an example, studies in cancer are expected to have individuals with large number of aneuploidies. Therefore, the trimmed value of the LRR may be increased up to, for instance, 25%.

Second, some of the existing algorithms used to get LRR information do not normalize the intensity of mLRR-Y to be 0. They provide values close to -0.46 indicating that only 1 copy is present in males (e.g ploidy is equal to 1) and hence, the LRR is centered at $\frac{2}{3}\log(1/2)$ = r round(2/3*log(1/2),2). We address this issue by shifting the observed values of mLRR-Y towards 0 by removing the median value of the mLRR-Y in all individuals.

Get summarized mLRR-Y data

madloy function processes raw data (e.g. separate files in pennCNV format of each sample) and provides the normalized median mLRR-Y value of each sample. The normalization procedure consists on considering technical artifacts that may affect the LRR values in the mLRR-Y region by removing:

The madloy function is designed to process LRR files and only requires the path where those files are located. Let us illustrate how to get summarized data of our illustrative example available at LOYdata package (see Getting Started section). NOTE: if checkSex is not executed, files.males should be replaced by rawDataPath

ex <- madloy(files.males, hg="hg18")

The function creates an object of class MADloy that can be inspected by using the generic print function.

ex

We observe that LRR data has been summarized in a target and a reference region. By default the target region corresponds to the mLRR-Y region and the reference region (the one used to normalized LRR in the target region) corresponds to autosomal chromosomes. The arguments target.region and ref.region can be used to change those values. This information has to be passed in UCSC format (e.g. "chr21" or "chr21:1000-10000").

By default the human genome reference is GRCh38 that can be changed in the argument hg as we did in this example (we use hg18 in the argument hg). The package also contains files encoding the required information to retrieve summarized data in X and Y PAR regions, XTR region, p and q arms, and msY region that are used to better describe the characteristics of LOY samples.

The LRR data of the reference is summarized by using the trimmed-mean. The argument trim of madloy function controls the fraction of probes (e.g. LRR values) that are removed from each end before the mean is computed (0 to 0.5). This is a robust summary of LRR in a given region since it takes into account possible regions in the genome having CNV alterations. By default 5% of probes are trimed. This values is recommended to be increased, for instance, when analyzing cancer data where a large number of alterations are expected. In case of being interested in summaryzing LRR data by using the median value, trim should be set equal to 0.5.

Quality control

Samples with bad quality, that is, having large variability in the LRR data are recommended to do not be included in the analysis. madloy function returns NA values of summarized mLRR-Y for those samples having LRR standard deviation larger than 0.28 in the reference chromosome as recommended here: www.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/appnote_cnv_loh.pdf.

IMPORTANT NOTE: This value is the default and correspond to Illumina arrays. If analyzing Affymetrix arrays, or you expect to have data with large variability, set the argument qc.sds of madloy function to the desired value. Sometimes it is complicated to know which is the expected variability in the arrays. In that case, set the argument qc.sds=NULL to remove from the analysis those samples having LRR in target region (chrY:2,694,521-59,034,049, hg19/GRCh37) larger than 2 times the LRR standard deviation of autosomes.

ex$par$files[ex$par$QCremoved]

Visualizing summarized mLRR-Y data in a set of samples

Data can be visually inspected by using the generic plot function. This function depicts the mean difference between the LLR of Y chromosome and the reference region that is performed in order to control for possible technical artifacts. As previously mentioned, the reference is consider the autosomes and LRR is summarized by using the trimmed mean in order to remove the effect of having any gain or lose in the genome. Notice that this reference chromosome can be changed by the user.

plot(ex, print.labels = TRUE, threshold = -0.2)

This figure shows four samples that could have a LOY event (those in the -.2, -.8 range of Y-axis). Visual inspection of the mLRR-Y region can be used to verify whether they are real LOY rearrangements as we are describing in the next section.

Visualizing mLRR-Y region of a single sample

We can visually inspect the information of mLRR-Y region (e.g., one by one sample) and decide whether a given individuals is having a LOY or not. This can be performed either by using plotIndSNP or plotIndLRR functions. Let's create these plot of a given sample having a normalized mLRR-Y value around 0 (i.e SAMPLE_2) with the plotIndLRR function. The figure represents the expected values of normal LRR at 0 (red line). However, as previously mentioned, depending the algorithm used to get LRR in the mLRR-Y region these values can be centered around -0.46 indicating that only 1 chromosome is present in males. Orange line represents the median value of mLRR-Y in the subset of analyzed samples. This is the value that must be considered as the reference to indicate whether the mLRR-Y is a LOY event or not.

plotIndLRR(ex, sample="SAMPLE_2")

Now, let us create the plot of SAMPLE_21 and SAMPLE_81 that are having the lowest values of mLRR-Y.

plotIndLRR(ex, sample="SAMPLE_21")
plotIndLRR(ex, sample="SAMPLE_81")

The plots indicate that both samples are probably carrying a LOY because the LRR (brown dots) in the mLRR-Y region (shaded area) is far below from the reference (orange line). The blue line represents the median LRR values in the mLRR-Y region. Notice that the example of SAMPLE_2 is having the blue line very close to the reference one (orange) indicating that it is a normal sample.

NOTE: Sometimes, double-checking that a sample carries a real LOY requires to plot the b-allele frequency (BAF). To do so, the function plotIndSNP can be used. In addition to the chromosome Y, we can plot the chromosome X values with the plotIndSNPX function, with the shared PAR1, PAR2 and XTR regions seen in the chromosome Y and the B-deviation should be computed. This is illustrated in the section Checking BAF in X-Y shared regions for LOY calls below.

Calling LOY

Calling LOY is currently based on assuming that experimental variation in mLRR-Y is distributed in a non-skewed fashion. @forsberg2014mosaic propose to generate the expected experimental background noise of mLRR-Y data using a symmetrical distribution. They impose the observed variation in the positive tail of the mLRR-Y distribution into a reflected negative tail by mirroring over median value of mLRR-Y the data observed in the positive tail. However, this approach has some drawbacks since it is expected that large values of mLRR-Y indicates the existence of XYY cases as observed with cancer patients in @gonzalezMADloy. In order to overcome this diffitulties and to account for the posibility of having XYY individuals, we first set a calling threshold of XYY status, following this equation

$$\text{mLRRy} < \text{median} + 1.2 \times \text{IQR}(\text{mLRRy})$$

for the symmetric distribution that is obtained by reflecting the positive side of the centered mLRR-Y distribution around its maximum. LOY status is then called with the negative value of the XYY threshold.

The calling is performed by using getLOY function. This function classifies the samples according the method that we propose, the method proposed by @forsberg2014mosaic that considers mLRR-Y as a continuous surrogate variable of LOY.

ex.call <- getLOY(ex)
ex.call

The plot indicating those sample with LOY alteration can be obtained by:

plot(ex.call, ylim=c(-2, 1), print.labels=TRUE, pos.leg="bottomright")

Checking BAF in X-Y shared regions for LOY calls

To assess the LOY events detected with the mLRRY values and increase conficende of the call, we propose to check the B Allele Frequency (BAF) values of heterozygous probes in the shared regions PAR1, PAR2 and XTR for all the samples. These values measures the balance between A and B alleles, and are expected to be near 0.5 due to the balance between alleles. However, if there is an increased or decreased number of alleles due to a loss (LOY) or a gain (XYY), BAF values will be altered depending on the number of cell affected by the event. In addition to checking the BAF values in shared regions, we also propose to check the number of heterozygous probes in the non-shared region of chromosome X, in order to remove possible other events that are not LOY events. This checking is performed by using the checkBdev function. By default this function takes into account the PAR1, PAR2 and XTR probes, but can be used also with samples only with the XTR region. The calling obtained with this function is compared with the previous calling and the resulting call can be the same one, "discordant" if there is no concordance between the mLRRY and BAF calls, or "other" if there are anomalous values of heterozygous probes in the non-shared region of chromosome X that could explain the decrease in mLRRY (contamination of other samples, or other chromosomal events, for example).

ex.Bdev <- checkBdev(ex.call)
ex.Bdev

The resulting call show that 4 samples have true LOY events which are confirmed with b-deviation information. We can have a look a the entire information used to process b-deviation data ans see, for instance, why some samples have been called as 'discordant'

ex.Bdev$Bdev$data[1:5,]

The sample ids having LOY can be obtained by:

getSamples(ex.Bdev, type="LOY")

The samples classified as "other" are:

getSamples(ex.Bdev, type="other")

can be visually inspect those samples (using plotIndLRR, plotIndSNP and plotIndSNPX functions) to identify their type of rearrangements. Let's illustrate three of them:

par(mfrow=c(1,2))
plotIndSNPX(ex.Bdev, "SAMPLE_47")
plotIndSNP(ex.Bdev, "SAMPLE_47")

plotIndSNPX(ex.Bdev, "SAMPLE_89")
plotIndSNP(ex.Bdev, "SAMPLE_89")

plotIndSNPX(ex.Bdev, "SAMPLE_98")
plotIndSNP(ex.Bdev, "SAMPLE_98")

"SAMPLE_47" belongs to an XY individual with a contamination of a non-related XX sample of 12\% aprox of the whole sample. This can be seen in non-shared regions of chromosome X by the presence of three-band clusters in bottom and top regions of B Allele Frequency values (red points). This contamination can generate a false LOY call due to reduction in expected probe intensity for msY specific probes.

"SAMPLE_89" belongs to an XY individual with a contamination of a non-related XY sample of 50\% aprox. of the whole sample. This can be seen in the heterozygous band in non-shared regions of chromosome X (deviated upwards due to the decrease in intensity), the presence of three-band BAF centered at 0.5 in shared regions of chromosome X and Y, and a high number of heterozygous probes in non-shared region of chromosome Y.

"SAMPLE_98" belongs to an XY individual with a contamination of a non-related XY sample of 38\% aprox. of the whole sample. This can be seen in the BAF split in non-shared regions of chromosome X without the presence of the central band indicative of three diferent chromosomes, and the presence of the BAF split in PAR regions.

References



isglobal-brge/MADloy documentation built on March 3, 2024, 7:27 p.m.