Identifying differentially methylated regions using dmrff

Download and prepare an example DNA methylation dataset

We'll use a small cord blood DNA methylation dataset available on GEO: GSE79056.

library(geograbi) ## https://github.com/yousefi138
samples <- geograbi.get.samples("GSE79056")
vars <- geograbi.extract.characteristics(samples)
methylation <- geograbi.get.data("GSE79056") ## 86Mb

We'll be investigating associations with gestational age.

colnames(vars)[which(colnames(vars) == "ga weeks")] <- "ga"
vars$ga <- as.numeric(vars$ga)

Test DNA methylation associations

Prepare surrogate variables to handle unknown confounding.

library(sva, quietly=T)
## construct EWAS model
mod <- model.matrix(~ gender + ga, vars)
## construct null model
mod0 <- mod[,1]
## to save time, SVA will be applied to a random selection of 5000 CpG sites
set.seed(20191029)
random.idx <- sample(1:nrow(methylation), 5000)
methylation.sva <- methylation[random.idx,]
## missing methylation values are replaced with mean values
methylation.mean <- rowMeans(methylation.sva, na.rm=T)
idx <- which(is.na(methylation.sva), arr.ind=T)
if (nrow(idx) > 0)
    methylation.sva[idx] <- methylation.mean[idx[,"row"]]
sva.fit <- sva(methylation.sva, mod=mod, mod0=mod0)

Add surrogate variables to the model.

design <- cbind(mod, sva.fit$sv)

Test the associations using limma.

library(limma)
fit <- lmFit(methylation, design)
fit <- eBayes(fit)

Save the summary statistics for gestational age.

stats <- data.frame(estimate=fit$coefficients[,"ga"],
                    se=sqrt(fit$s2.post) * fit$stdev.unscaled[,"ga"],
                    p.value=fit$p.value[,"ga"])

Add genomic coordinates for CpG sites (Illumina Beadchips)

dmrff and all other methods for identifying differentially methylated regions require information about the genomic locations of all CpG sites.

If your data was generated using one of the Illumina BeadChips, then it is possible to annotate your EWAS summary statistics using the appropriate Bioconductor annotation package.

For the Illumina 450K microaray, we use the IlluminaHumanMethylation450kanno.ilmn12.hg19 package.

if (!require(IlluminaHumanMethylation450kanno.ilmn12.hg19, quietly=T)) {
    install.packages("BiocManager")
    BiocManager::install("IlluminaHumanMethylation450kanno.ilmn12.hg19")
}
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)

If we were using the more recent Illumina MethylationEPIC BeadChip, then we would load the IlluminaHumanMethylationEPICanno.ilm10b2.hg19 package instead.

We construct an annotation data frame as follows:

data(list="IlluminaHumanMethylation450kanno.ilmn12.hg19")
data(Locations)
data(Other)
annotation <- cbind(as.data.frame(Locations), as.data.frame(Other))

We then add the annotation to the stats object (Warning: we assume that the row names of the stats object are the Illumina CpG site identifiers, e.g. cg05775921).

common <- intersect(rownames(methylation), rownames(annotation))
annotation <- annotation[match(common, rownames(annotation)),]
stats <- stats[match(common, rownames(stats)),]
methylation <- methylation[match(common, rownames(methylation)),]

stats <- cbind(stats, annotation)

Apply dmrff to summary statistics

If dmrff is not already installed, we install it and then load it:

if (!require(dmrff, quietly=T)) {
    library(devtools)
    install_github("perishky/dmrff")
}
library(dmrff)

Below we assume that you have already performed an epigenome-wide association analysis and have loaded your summary statistics in R as a data frame stats which has the following columns: - estimate (regression coefficient), - se (standard error of the coefficient), - p.value, - chr (chromosome of the CpG site), - pos (position of the CpG site on the chromosome).

We also assume that you have loaded your DNA methylation dataset in R as matrix methylation for which rows correspond to CpG sites and columns to samples. The DNA methylation levels are necessary for dmrff to calculate and adjust for dependencies between CpG sites.

dmrff can then be applied as follows.

dmrs <- dmrff(estimate=stats$estimate,
              se=stats$se,
              p.value=stats$p.value,
              methylation=methylation,
              chr=stats$chr,
              pos=stats$pos,
              maxgap=500,
              verbose=T)

The algorithm will then identify differentially methylated regions by:

Our output dmrs is a data frame listing all genomic regions tested along with summary statistics for each.

We just keep regions with > 1 CpG site and Bonferroni adjusted p < 0.05.

dmrs <- dmrs[which(dmrs$p.adjust < 0.05 & dmrs$n > 1),]

There are r nrow(dmrs) such DMRs.

Here are the first 10 regions:

kable(dmrs[1:10,])

We can also list the information for each CpG site in these regions.

sites <- dmrff.sites(dmrs, stats$chr, stats$pos)
sites <- cbind(sites, stats[sites$site, c("UCSC_RefGene_Name", "estimate", "p.value")])
sites <- cbind(sites, dmr=dmrs[sites$region,c("start","end","z","p.adjust")])

Here are the sites in the largest DMR:

max.region <- which.max(dmrs$n)
kable(sites[which(sites$region == max.region),],row.names=F)

Here we generate a simple plot of the largest DMR:

dmrff.plot(
    dmrs$chr[max.region], dmrs$start[max.region], dmrs$end[max.region],
    stats$estimate, stats$se, stats$chr, stats$pos, ci=T, expand=1)


perishky/dmrff documentation built on Jan. 4, 2024, 10:23 p.m.