require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
BiocStyle::markdown()

Introduction

The r Biocpkg("FlowSorted.Blood.WGBS.BLUEPRINT") package provides a R / Bioconductor resource for representing and manipulating N=44 flow sorted purified cell types from whole blood measured on a whole genome bisulfite-sequencing (WGBS) platform generated by the BLUEPRINT consortium (http://www.blueprint-epigenome.eu).

This package makes extensive use of the r Biocpkg("HDF5Array") package to avoid loading the entire data set in memory, instead storing the methylated and coverage reads on disk as HDF5 files and loading subsets of the data into memory upon request.

Work flow

Loading the data

We use the FlowSorted.Blood.WGBS.BLUEPRINT() function to download the relevant files from Bioconductor's ExperimentHub web resource. This includes the HDF5 files containing the number of methylated reads (M) and total reads (Cov), as well as the genomic locations of the CpGs and metadata on the columns (samples). The output is a single BSseq object from the r Biocpkg("bsseq") package. This is equivalent to a SummarizedExperiment class but with a number of features specific to DNA methylation data.

library(FlowSorted.Blood.WGBS.BLUEPRINT)
bs <- FlowSorted.Blood.WGBS.BLUEPRINT()
bs

The colData containing the phenotypic information about each sample can be extracted using

colData(bs)

The coverage matrix (total number of reads or Cov) and methylation matrix (number of methylated reads or M) is represented as a DelayedMatrix from the r Biocpkg("DelayedArray") package. This wraps the underlying HDF5 file in a container that can be manipulated in R.

hdf5_cov <- getCoverage(bs, type = "Cov")
hdf5_meth <- getCoverage(bs, type = "M")

hdf5_cov
hdf5_meth

Exploring the data

To quickly explore the data set, we compute some summary statistics on the coverage matrix. We tell the r Biocpkg("DelayedArray") block size to indicate that we can use up to 1 GB of memory for loading the data into memory from disk.

options(DelayedArray.block.size=1e9)

We are interested in removing all CpGs that have no coverage in at least one sample, a naive implementation might be

keep_ids <- (DelayedArray::rowSums(hdf5_cov == 0) == 0)
table(keep_ids) 

We see that results in about r sum(keep_ids) / 1e6 million CpGs. Now, we can reduce the size of the object to

bs.filtered <- bs[keep_ids,]
bs.filtered

Session information

sessionInfo()


stephaniehicks/FlowSorted.Blood.WGBS.BLUEPRINT documentation built on Nov. 5, 2019, 9:35 a.m.