DMRs identification with mCSEA package
In mCSEA: Methylated CpGs Set Enrichment Analysis

Previous steps

The input of mCSEA is tipically a matrix with the processed $\beta$-values for each probe and sample. If you start from the raw methylation files (like .idat files) you should first preprocess the data with any of the available packages for that purpose (e.g. r BiocStyle::Biocpkg("minfi") or r BiocStyle::Biocpkg("ChAMP")). Minfi includes functions to get a matrix of $\beta$-values (getBeta()) or M-values (getM()). ChAMP output class depends on the type of analysis performed. For instance, champ.norm() function returns a matrix, while champ.load() returns a list of results, and one of them is a $\beta$-values matrix. So mCSEA is totally compatible with minfi and ChAMP outputs as long as a matrix with the methylation values is obtained.

Reading .idat files

Here we provide a minimal example to read .idat files with r BiocStyle::Biocpkg("minfi") package. A samplesheet must be provided in order to read the raw files and generate a RGChannelSet object. For more information about this step, read minfi's vignette.

library(minfi)
minfiDataDir <- system.file("extdata", package = "minfiData")
targets <- read.metharray.sheet(minfiDataDir, verbose = FALSE)
RGset <- read.metharray.exp(targets = targets)

Cell type heterogeneity correction

Different cell types proportions across samples are one of the major sources of variability in methylation data from tissues like blood or saliva (@mcgregor16). There are a lot of packages which can be used to estimate cell types proportions in each sample in order to correct for this bias (reviewed in @mcgregor16). Here we supply an example where blood reference data is used to estimate cell types proportions of each blood sample.

library(FlowSorted.Blood.450k)
library(mCSEA)
data(mcseadata)

cellCounts = estimateCellCounts(RGset)
print(cellCounts)

These proportions could be introduced as covariates in the linear models.

Step 1: Ranking CpGs probes

To run a mCSEA analysis, you must rank all the evaluated CpGs probes with some metric (e.g. t-statistic, Fold-Change...). You can use rankProbes() function for that aim, or prepare a ranked list with the same structure as the rankProbes() output.

We load sample data to show how rankProbes() works:

library(mCSEA)
data(mcseadata)

We loaded to our R environment betaTest and phenoTest objects, in addition to exprTest, annotation objects and association objects (we will talk about these after). betaTest is a matrix with the $\beta$-values of 10000 EPIC probes for 20 samples. phenoTest is a dataframe with the explanatory variable and covariates associated to the samples. When you load your own data, the structure of your objects should be similar.

head(betaTest, 3)
print(phenoTest)

rankProbes() function uses these two objects as input and apply a linear model with r BiocStyle::Biocpkg("limma") package. By default, rankProbes() considers the first column of the phenotypes table as the explanatory variable in the model (e.g. cases and controls) and does not take into account any covariate to adjust the models. You can change this behaviour modifying explanatory and covariates options.

By default, rankProbes() assumes that the methylation data object contains $\beta$-values and transform them to M-values before calculating the linear models. If your methylation data object contains M-values, you must specify it (typeInput = "M"). You can also use $\beta$-values for models calculation (typeAnalysis = "beta"), although we do not recommend it due to it has been proven that M-values better accomplish the statistical assumptions of limma analysis (@du10).

myRank <- rankProbes(betaTest, phenoTest, refGroup = "Control")

myRank is a named vector with the t-values for each CpG probe.

head(myRank)

You can also supply rankProbes() function with a SummarizedExperiment object. In that case, if you don't specify a pheno object, phenotypes will be extracted from the SummarizedExperiment object with colData() function.

Paired analysis

It is possible to take into account paired samples in rankProbes() analysis. For that aim, you should use paired = TRUE parameter and specify the column in pheno containing pairing information (pairColumn parameter).

Step 2: Searching DMRs in predefined regions

Once you calculated a score for each CpG, you can perform the mCSEA analysis. For that purpose, you should use mCSEATest() function. This function takes as input the vector generated in the previous step, the methylation data and the phenotype information. By default, it searches for differentially methylated promoters, gene bodies and CpG Islands. You can specify the regions you want to test with regionsTypes option. minCpGs option specifies the minimum amount of CpGs in a region to be considered in the analysis (5 by default). You can increase the number of processors to use with nproc option (recommended if you have enough computational resources). Finally, you should specify if the array platform is 450k or EPIC with the platform option.

myResults <- mCSEATest(myRank, betaTest, phenoTest, 
                        regionsTypes = "promoters", platform = "EPIC")

mCSEATest() returns a list with the GSEA results and the association objects for each region type analyzed, in addition to the input data (methylation, phenotype and platform).

ls(myResults)

promoters is a data frame with the following columns (partially extracted from r BiocStyle::Biocpkg("fgsea") help):

pval: Estimated P-value.
padj: P-value adjusted by BH method.
log2err: Expected error for the standard deviation of the P-value logarithm.
ES: Enrichment score.
NES: Normalized enrichment score by number of CpGs associated to the feature.
size: Number of CpGs associated to the feature.
leadingEdge: Leading edge CpGs which drive the enrichment.

head(myResults[["promoters"]][,-7])

On the other hand, promoters_association is a list with the CpG probes associated to each feature:

head(myResults[["promoters_association"]], 3)

You can also provide a custom association object between CpG probes and regions (customAnnotation option). This object should be a list with a structure similar to this:

head(assocGenes450k, 3)

Step 3: Plotting the results

Once you found some DMRs, you can make a plot with the genomic context of the interesting ones. For that, you must provide mCSEAPlot() function with the mCSEATest() results, and you must specify which type of region you want to plot and the name of the DMR to be plotted (e.g. gene name). There are some graphical parameters you can adjust (see mCSEAPlot() help). Take into account that this function connects to some online servers in order to get genomic information. For that reason, this function could take some minutes to finish the plot, specially the first time it is executed.

mCSEAPlot(myResults, regionType = "promoters", 
           dmrName = "CLIC6",
           transcriptAnnotation = "symbol", makePDF = FALSE)

You can also plot the GSEA results for a DMR with mCSEAPlotGSEA() function.

mCSEAPlotGSEA(myRank, myResults, regionType = "promoters", dmrName = "CLIC6")

Integrating methylation and expression data

If you have both methylation and expression data for the same samples, you can integrate them in order to discover significant associations between methylation changes in a DMR and an expression alterations in a close gene. mCSEAIntegrate() considers the DMRs identified by mCSEATest() passing a P-value threshold (0.05 by default). It calculates the mean methylation for each condition using the leading edge CpGs and performs a correlation test between this mean DMR methylation and the expression of close genes. This function automatically finds the genes located within a determined distance (1.5 kb) from the DMR. Only correlations passing thresholds (0.5 for correlation value and 0.05 por P-value by default) are returned. For promoters, only negative correlations are returned due to this kind of relationship between promoters methylation and gene expression has been largely observed (@jones12). On the contrary, only positive correlations between gene bodies methylation and gene expression are returned, due to this is a common relationship observed (@jones12). For CpG islands and custom regions, both positive and negative correlations are returned, due to they can be located in both promoters and gene bodies.

To test this function, we extracted a subset of 100 genes expression from bone marrows of 10 healthy and 10 leukemia patients (exprTest). Data was extracted from r BiocStyle::Biocpkg("leukemiasEset") package.

# Explore expression data
head(exprTest, 3)

# Run mCSEAIntegrate function
resultsInt <- mCSEAIntegrate(myResults, exprTest, "promoters", "ENSEMBL")

resultsInt

It is very important to specify the correct gene identifiers used in the expression data (geneIDs parameter). mCSEAIntegrate() automatically generates correlation plots for the significant results and save them in the directory specified by folder parameter (current directory by default).

Integration plot for GATA2 promoter methylation and ENSG00000179348
expression. Note that, actually, both names refers to the same gene, but
SYMBOL was used to analyze promoters methylation and ENSEMBL ID was used as
gene identifiers in the expression data.