require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
There are several approaches available to adjust for differences in the relative proportions of cell types when DNA methylation (DNAm) is measured in whole blood. For example, reference-based approaches require reference data sets made up of purified cell types to identify cell type-specific DNAm signatures. These cell type-specific DNAm signatures are used to estimate the relative proportions of cell types directly, but the reference data sets are laborious and expensive to collect. Furthermore, they will need to be collected again as new platform technologies for measuring DNAm emerge, because the observed methylation levels for the same CpGs in the same sample vary depending on the platform technology.
In contrast, reference-free approaches are based on methods related to surrogate variable analysis or linear mixed models. These approaches do not provide estimates of the relative proportions of cell types. Instead, they only remove the variability induced by differences in relative cell type proportions across whole blood samples.
Here, we present a statistical model that estimates the cell composition of whole blood samples from DNAm measurements. The method can be applied to microarray or sequencing data (for example, whole-genome bisulfite sequencing, WGBS, or reduced representation bisulfite sequencing, RRBS). Our method is based on identifying informative genomic regions that are clearly methylated or unmethylated in each cell type. This permits estimation across platform technologies, because cell types preserve their methylation state in these regions regardless of platform, even though the observed measurements are platform dependent.
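To make the idea of regions that are clearly methylated or unmethylated per cell type concrete, here is a minimal toy sketch in base R. It is not the model implemented in methylCC: the reference matrix, the proportions, and the noise level are all hypothetical, and plain least squares followed by clamping and rescaling is swapped in purely for illustration.

```r
# Toy illustration only: NOT the methylCC model.
set.seed(1)

cell_types <- c("Gran", "CD4T", "CD8T", "Bcell", "Mono", "NK")
n_regions_per_type <- 10

# Hypothetical reference: each block of 10 regions is methylated (1) in
# exactly one cell type and unmethylated (0) in all the others.
reference <- kronecker(diag(length(cell_types)),
                       matrix(1, nrow = n_regions_per_type, ncol = 1))
colnames(reference) <- cell_types

# Hypothetical true proportions for one whole blood sample (sum to 1).
true_prop <- c(Gran = 0.60, CD4T = 0.13, CD8T = 0.07,
               Bcell = 0.05, Mono = 0.10, NK = 0.05)

# Observed methylation: mixture of the reference profiles plus small noise.
observed <- as.vector(reference %*% true_prop) +
  rnorm(nrow(reference), sd = 0.02)

# Least squares without an intercept, then clamp negative values to zero
# and rescale so that the estimated proportions sum to one.
fit <- lm.fit(x = reference, y = observed)
est_prop <- pmax(fit$coefficients, 0)
est_prop <- est_prop / sum(est_prop)

round(cbind(true = true_prop, estimated = est_prop), 3)
```

Because each region is informative for a single cell type, the average methylation within a block of regions tracks the proportion of that cell type, which is why a simple fit recovers the simulated proportions here.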
Load the methylCC R package and the other packages that we'll need:
library(FlowSorted.Blood.450k)
library(methylCC)
library(minfi)
library(tidyr)
library(dplyr)
library(ggplot2)

# Phenotypic information about samples
head(pData(FlowSorted.Blood.450k))

# RGChannelSet
rgset <- FlowSorted.Blood.450k[,
    pData(FlowSorted.Blood.450k)$CellTypeLong %in% "Whole blood"]
The estimatecc() function must have one object as input: an object such as an RGChannelSet from the R package minfi or a BSseq object from the R package bsseq. This object should contain observed DNAm levels at CpGs (rows) in a set of $N$ whole blood samples (columns).
In this example, we are interested in estimating the cell
composition of the whole blood samples listed in the
FlowSorted.Blood.450k R/Bioconductor package.
To run the estimatecc() function, just provide the RGChannelSet. This will create an estimatecc object. We will call the object est.
set.seed(12345)
est <- estimatecc(object = rgset)
est
To see the cell composition estimates, use the cell_counts() function.
We can also use the estimateCellCounts() function from the R/Bioconductor package minfi to estimate the cell composition for each of the whole blood samples.
sampleNames(rgset) <- paste0("Sample", 1:6)

est_minfi <- minfi::estimateCellCounts(rgset)
est_minfi
Then, we can compare the estimates from minfi::estimateCellCounts() to those from methylCC::estimatecc().
df_minfi = gather(cbind("samples" = rownames(cell_counts(est)),
                        as.data.frame(est_minfi)),
                  celltype, est, -samples)
df_methylCC = gather(cbind("samples" = rownames(cell_counts(est)),
                           cell_counts(est)),
                     celltype, est, -samples)

dfcombined <- full_join(df_minfi, df_methylCC,
                        by = c("samples", "celltype"))

ggplot(dfcombined, aes(x=est.x, y = est.y, color = celltype)) +
    geom_point() + xlim(0,1) + ylim(0,1) +
    geom_abline(intercept = 0, slope = 1) +
    xlab("Using minfi::estimateCellCounts()") +
    ylab("Using methylCC::estimatecc()") +
    labs(title = "Comparing cell composition estimates")
We see the estimates closely match for the six cell types.
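Agreement between the two sets of estimates can also be summarized numerically. The sketch below is one possible way to do so in base R: rmse_by_celltype() is a hypothetical helper (not part of methylCC or minfi), and the toy data frame holds synthetic values chosen only to show the expected column layout (celltype, est.x, est.y).

```r
# Synthetic values for illustration only; not results from either package.
toy <- data.frame(
  samples  = rep(c("Sample1", "Sample2"), each = 3),
  celltype = rep(c("Gran", "CD4T", "Mono"), times = 2),
  est.x    = c(0.62, 0.15, 0.08, 0.55, 0.20, 0.10),
  est.y    = c(0.60, 0.16, 0.09, 0.57, 0.18, 0.10)
)

# Hypothetical helper: root mean squared difference between the two sets
# of estimates, computed separately for each cell type.
rmse_by_celltype <- function(df) {
  sq_diff <- (df$est.x - df$est.y)^2
  sqrt(tapply(sq_diff, df$celltype, mean))
}

rmse <- rmse_by_celltype(toy)
rmse
```

Assuming dfcombined has the est.x and est.y columns produced by the join above, the same helper could be called as rmse_by_celltype(dfcombined). Smaller values indicate closer agreement for that cell type.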