knitr::opts_chunk$set(collapse = TRUE, comment = "", cache=TRUE, message = FALSE )
The aim of this package is to help the user to detect individuals who present extreme downregulation of chromosome Y (EDY) from microarray or RNA-seq experiments. For each individual, we measured the relative expression of the entire chromosome Y with respect to the autosomes. For $n$ probes (exons) in chromosome Y, with $x_i$ intensity (read counts) for the $i$-th probe, we computed $y=1/n \sum_{i=1}^N \log_2(x_i)$ as a measure of the average expression of chromosome Y. Likewise, for $m$ probes in the autosomes, we computed the mean expression of autosomes $a = 1/m \sum_{i=1}^M \log_2(x_i)$ [NOTE: for RNAseq data $\log_2(x_i + 1)$ is computed to avoid problems with zero counts]. The relative amount of an individual's Y expression with respect to the individual's autosomes was then $Ry=y-a$, and, in a population sample, the individual $j$ was considered with EDY if
$$Ry_j < median(Ry) - 1.2 IQR(Ry)$$,
where IQR is the inter-quartile range across the sample.
First, let us load the required packages
library(EDY) library(SummarizedExperiment) library(caret) library(tidyverse) library(Biobase)
EDY can be computed using getEDY()
function from EDY
package. The main input is an object of class ExpressionSet
or RangedSummarizedExperiment
. This function requires also some information, including:
microarray
or RNAseq
.exprs
in ExpressionSets
or with assay
in RangedSummarizedExperiments
) is in logarithmic scale or not. fData
in ExpressionSets
or with rowData
in RangedSummarizedExperiments
) contains the Gene symbol
. This is required to get the relative expression of chromosome Y.pData
in ExpressionSets
or with colData
in RangedSummarizedExperiments
) contains the information about the gender
and the symbol that codes for males
. This is required since EDY is computed only in male samples.pData
in ExpressionSets
or with colData
in RangedSummarizedExperiments
) contains the information about the groups, and the symbol that codes for control. When this information is available, the $Ry$ value is estimated only in controls. Let us illustrate how to compute EDY for a real ExpressionSet
from a microarray experiment available at GEO. Let us illustrate how to compute EDY for a real data set available at GEO. It contains information about DNA methylation and gene expression in brainstem, thalamic, and supratentorial gliomas (GEO accesion number GSE50774). Data can be downloaded into R by:
library(GEOquery) GSE50774.expr <- getGEO("GSE50774")[[2]]
In order to facilitate reproducibility, we have also include this ExpressionSet
into our package and it is loaded by default. We can visualize the content by:
GSE50774.expr
The information about whether the expression data is in logarithmic scale or not should be specified in the description of the data set, but it can also be checked with exprs
(or assay
in RangedSummarizedExperiments
). In this case, it is logarithmic:
exprs(GSE50774.expr)[1:5, 1:5]
The information about gene symbol can be accessed by fData
(or rowData
in RangedSummarizedExperiments
). If you cannot find it there, please see Missing information section below. In this case, the gene symbol column is named Gene Symbol
:
fData(GSE50774.expr)[1:5,1:4] names(fData(GSE50774.expr))
In some occasions, the same entry has several gene symbols. In this example, the gene symbols are separted by ///
. To select only the first symbol we can use the following code, otherwise, getEDY
will not recognise the transcript:
tmp <- strsplit(fData(GSE50774.expr)[, "Gene Symbol"], " /// ") fData(GSE50774.expr)$Symbol <- unlist(lapply(tmp, function(x) x[1])) head(fData(GSE50774.expr)[,c("Gene Symbol","Symbol")])
The information about the sex of individuals must be found using pData
(or colData
in RangedSummarizedExperiments
). In this case, gender column is called gender:ch1
and males key is M
. If, for some reason, your data do not contain this information, it can be estimated from expression data in microarrays (see Missing information section below).
names(pData(GSE50774.expr))
This is the total number of male and female samples
table(GSE50774.expr$`gender:ch1`)
Once this information is obtained, the EDY status of male samples can be obtained by:
edy.GSE50774 <- getEDY(x = GSE50774.expr, gender.var = 'gender:ch1', male.key = 'M', gene.key = 'Symbol', experiment.type = "microarray") edy.GSE50774
Notice that the argument log = TRUE
has been omitted in the function since this is the default value. In RNA-seq experiments you usually must specify log = FALSE
since the expression is measured in transcript counts and it is not logarithmic. This function returns an EDY
object that contains a factor with two levels, NO
and YES
, that indicates whether the individual has EDY or not ($EDY
); a matrix containing the relative expresison of chromosome Y of each individual with respect to the autosomes ($EDYcontinuous
); the threshold from wich down an individual is considered to have EDY ($threshold
), and an ExpressionSet
or RangedSummarizedExperiment
containing the same information than the original one but with only male individuals and a new column in pData
or colData
that corresponds to the EDY status of each individual.
table(edy.GSE50774$EDY)
A scatter plot can be easily generated by using the generic function plot
. It allows to visualize the relative expression of chromosome Y with regard to autosomal genes for each individual, as well as the threshold (dashed line) and the individuals considered as EDY:
plot(edy.GSE50774)
The ids of individuals classified as EDY can be obtained by
EDYindividuals <- names(edy.GSE50774$EDY[edy.GSE50774$EDY=="Yes"]) EDYindividuals
In this section we are describing how to obtain some of the required information to estimate EDY. It includes examples about how to get gender of individuals or gene symbol that is required to select probes from chromosome Y.
In case your data set comes from a microarray experiment and it does not contain information about the gender of the individuals, you can assess it using massiR.
If your dataset does not contain information about the gene symbol, the function get_hgnc
can be used to get the hgnc (HUGO gene nomenclature committee) symbols from genBank accession numbers, entrezgenes (NCBI gene IDs) or other types of identifiers. This function uses dictionaries constructed from the information available at genenames.org and, additionally, it uses the package org.Hs.eg.db when we need to obtain the gene symbols from genBank accession numbers.
The required inputs are:
ExpressionSet
, a RangedSummarizedExperiment
or a vector of IDs.refseq
, uniprot
, ensembl
, entrezgene
or genbank
.fData
.In the example data set, we can use for instance entrezgene IDs. If the same entry has several entrezgene IDs, we need to select just one:
head(fData(GSE50774.expr)[,"ENTREZ_GENE_ID"]) tmp <- strsplit(fData(GSE50774.expr)[, "ENTREZ_GENE_ID"], " /// ") fData(GSE50774.expr)$ENTREZ_GENE_ID <- unlist(lapply(tmp, function(x) x[1])) head(fData(GSE50774.expr)[,"ENTREZ_GENE_ID"])
# Get HGNC symbols from entrezgene IDs: GSE50774.expr_2 <- get_hgnc(x = GSE50774.expr, key.type = "entrezgene", key.col = "ENTREZ_GENE_ID") fData(GSE50774.expr_2)[1:5, 1:5]
Observe that column hgnc_symbol
has been added to the annotation. Similarly, in this example we could have used genBank accession numbers by simply:
# Get HGNC symbols from genBank accession numbers: tmp <- strsplit(fData(GSE50774.expr)[, "GB_ACC"], " /// ") fData(GSE50774.expr)$GB_ACC <- unlist(lapply(tmp, function(x) x[1])) GSE50774.expr_2 <- get_hgnc(x = GSE50774.expr, key.type = "genbank", key.col = "GB_ACC")
It is recommended not to use genbank accession numbers when possible since in this case, the function must consult an external database and lots of gene IDs are lost during the process, so the results may not be reliable. Now getEDY
can be applied with gene.key = "hgnc_symbol"
:
edy.GSE50774_2 <- getEDY(x = GSE50774.expr_2, gender.var = 'gender:ch1', male.key = 'M', gene.key = 'hgnc_symbol', experiment.type = "microarray") edy.GSE50774_2 plot(edy.GSE50774_2)
table(edy.GSE50774_2$EDY)
EDYindividuals2 <- names(edy.GSE50774_2$EDY[edy.GSE50774_2$EDY=="Yes"]) EDYindividuals2
As we can see, the results are the same as when using gene symbols from the original data set except from slight differences in the relative expression of chromosome Y in each individual ($EDYcontinuous
). It is due to the loss of some gene IDs during the translation, since not all the IDs find their equivalence in the dictionaries.
If your dataset does not have gene symbols, genBank accession numbers, entrezgene IDs nor any of the accepted identifiers, you can try to get this information from other columns in fData
or rowData
using biomaRt.
Another utility of this package it to predict extreme downregulation of chromosome Y using methylation data. For this purpose, we use the function predictEDY
. The only input for this function is one of:
The input must contain only male individuals, since it does not make sense to do a prediction of EDY in females. If it contains NAs, some predictions may be NA too.
The prediction is based on an Elastic Net regression model, trained with data from The Cancer Genome Atlas (TCGA).In the following example, we use a matrix containing the methylation information of chromosome Y of the previous data set (GSE50774
). It is already filtered and contains only the same individuals for which we calculated EDY previously. It is available in this package and loaded by default to facilitate reproducibility:
# You can download the whole methylation data using: # GSE50774.meth <- getGEO('GSE50774')[[1]] chrY.meth.GSE50774[1:5,1:5] edy.prediction <- predictEDY(chrY.meth.GSE50774)
The accuracy and the confidence interval 95% of the model is calculated by using the same CpGs of chromosome Y in your data to predict EDY in the TCGA dataset and then comparing with the real EDY calculated with getEDY
in the TCGA dataset. Next, the model is used to predict EDY in the problem matrix. The results are the following:
edy.prediction
table(edy.prediction)
table(edy.GSE50774$EDY, edy.prediction)
As we can observe, 10 individuals are negative for EDY both in the prediction and with getEDY
, 1 individual is positive for EDY both in the prediction and with getEDY
, 1 individual is classified as EDY with getEDY
but it is not predicted as so, and there are no individuals predicted as EDY but not classified as so with getEDY
.
We also can observe that some individuals can not be predicted because there are NAs in the methylaltion matrix that impede the prediction. We could remove this NAs but in that case, the available CpGs would be too few for the algorithm to do the calculations.
The conclusion we can extract is that the more reliable EDY individual is GSM1228961
because it is classified as EDY with the both methods. We must have into account that getEDY
just selects the individuals from a group which have the lowest relative expression of chromosome Y with respect to the autosomal genes, so the results vary dependeing on the individuals in the group. However, the prediction method is based on individual CpG methylation in chromosome Y, the results of one individual does not depend on the others. This means that, depending on what we are looking for, one method is preferable over the other. If we want to compare the expression levels of chromosome Y of some individuals with a condition, we may use the getEDY
method, but if we want to take the EDY status as an absolute, we should use the prediction method as it always has the same value for the same individual regardless of the rest of the samples.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.