We'll use a small cord blood DNA methylation dataset available on GEO: GSE79056.
library(geograbi) ## https://github.com/yousefi138 samples <- geograbi.get.samples("GSE79056") vars <- geograbi.extract.characteristics(samples) methylation <- geograbi.get.data("GSE79056") ## 86Mb
We'll be investigating associations with gestational age.
colnames(vars)[which(colnames(vars) == "ga weeks")] <- "ga" vars$ga <- as.numeric(vars$ga)
Prepare surrogate variables to handle unknown confounding.
library(sva, quietly=T) ## construct EWAS model mod <- model.matrix(~ gender + ga, vars) ## construct null model mod0 <- mod[,1] ## to save time, SVA will be applied to a random selection of 5000 CpG sites set.seed(20191029) random.idx <- sample(1:nrow(methylation), 5000) methylation.sva <- methylation[random.idx,] ## missing methylation values are replaced with mean values methylation.mean <- rowMeans(methylation.sva, na.rm=T) idx <- which(is.na(methylation.sva), arr.ind=T) if (nrow(idx) > 0) methylation.sva[idx] <- methylation.mean[idx[,"row"]] sva.fit <- sva(methylation.sva, mod=mod, mod0=mod0)
Add surrogate variables to the model.
design <- cbind(mod, sva.fit$sv)
Test the associations using limma.
library(limma) fit <- lmFit(methylation, design) fit <- eBayes(fit)
Save the summary statistics for gestational age.
stats <- data.frame(estimate=fit$coefficients[,"ga"], se=sqrt(fit$s2.post) * fit$stdev.unscaled[,"ga"], p.value=fit$p.value[,"ga"])
dmrff and all other methods for identifying differentially methylated regions
require information about the genomic locations of all CpG sites.
If your data was generated using one of the Illumina BeadChips, then it is possible to annotate your EWAS summary statistics using the appropriate Bioconductor annotation package.
For the Illumina 450K microaray, we use the
IlluminaHumanMethylation450kanno.ilmn12.hg19 package.
if (!require(IlluminaHumanMethylation450kanno.ilmn12.hg19, quietly=T)) { install.packages("BiocManager") BiocManager::install("IlluminaHumanMethylation450kanno.ilmn12.hg19") } library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
If we were using the more recent Illumina MethylationEPIC BeadChip,
then we would load the IlluminaHumanMethylationEPICanno.ilm10b2.hg19 package instead.
We construct an annotation data frame as follows:
data(list="IlluminaHumanMethylation450kanno.ilmn12.hg19") data(Locations) data(Other) annotation <- cbind(as.data.frame(Locations), as.data.frame(Other))
We then add the annotation to the stats object
(Warning: we assume that the row names of the stats object
are the Illumina CpG site identifiers, e.g. cg05775921).
common <- intersect(rownames(methylation), rownames(annotation)) annotation <- annotation[match(common, rownames(annotation)),] stats <- stats[match(common, rownames(stats)),] methylation <- methylation[match(common, rownames(methylation)),] stats <- cbind(stats, annotation)
dmrff to summary statisticsIf dmrff is not already installed, we install it and then load it:
if (!require(dmrff, quietly=T)) { library(devtools) install_github("perishky/dmrff") } library(dmrff)
Below we assume that you have already performed an epigenome-wide association analysis
and have loaded your summary statistics in R as a data frame stats
which has the following columns:
- estimate (regression coefficient),
- se (standard error of the coefficient),
- p.value,
- chr (chromosome of the CpG site),
- pos (position of the CpG site on the chromosome).
We also assume that you have loaded your DNA methylation dataset in R as matrix methylation
for which rows correspond to CpG sites and columns to samples.
The DNA methylation levels are necessary for dmrff to calculate
and adjust for dependencies between CpG sites.
dmrff can then be applied as follows.
dmrs <- dmrff(estimate=stats$estimate, se=stats$se, p.value=stats$p.value, methylation=methylation, chr=stats$chr, pos=stats$pos, maxgap=500, verbose=T)
The algorithm will then identify differentially methylated regions by:
maxgap parameter).Our output dmrs is a data frame listing all genomic regions tested
along with summary statistics for each.
We just keep regions with > 1 CpG site and Bonferroni adjusted p < 0.05.
dmrs <- dmrs[which(dmrs$p.adjust < 0.05 & dmrs$n > 1),]
There are r nrow(dmrs) such DMRs.
Here are the first 10 regions:
kable(dmrs[1:10,])
We can also list the information for each CpG site in these regions.
sites <- dmrff.sites(dmrs, stats$chr, stats$pos) sites <- cbind(sites, stats[sites$site, c("UCSC_RefGene_Name", "estimate", "p.value")]) sites <- cbind(sites, dmr=dmrs[sites$region,c("start","end","z","p.adjust")])
Here are the sites in the largest DMR:
max.region <- which.max(dmrs$n) kable(sites[which(sites$region == max.region),],row.names=F)
Here we generate a simple plot of the largest DMR:
dmrff.plot( dmrs$chr[max.region], dmrs$start[max.region], dmrs$end[max.region], stats$estimate, stats$se, stats$chr, stats$pos, ci=T, expand=1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.