Parsing output from the Bismark alignment suite.
1 2 3 4 5 6 7 8 9 10 11 12 13
The path to the files created by running Bismark's methylation
extractor, one sample per file.
Files ending in
An optional DataFrame describing the samples.
Row names, if present, become the column names of the BSseq
An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchJobsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details.
Only applicable if
Only applicable if
Only applicable if
Only applicable if
The number of threads used by
A BSseq object.
The format of each file is automatically detected using the internal function
Files ending in
.zip will be automatically decompressed to
Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported.
loci = NULL, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus.
The 'coverage' output does not contain strand information and so the
strand of the returned BSseq object will be set to
* unless stranded
loci are supplied.
Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.
The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the
--zero_based argument was used when running Bismark's methylation extractor.
Furthermore, the default co-ordinate counting system varies by version of Bismark.
bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.
We recommend the following to achieve fast and efficient importing of Bismark files:
Specify the set of methylation loci via the
Use Bismark files in the 'coverage' output format.
rmZeroCov = FALSE.
BPPARAM with a moderate number of workers (cores).
BACKEND = "HDF5Array".
Use multiple threads per worker (i.e.
nThread > 1).
Each point is discussed below.
Specifying the set of methylation loci via the
loci argument means that
read.bismark() does not need to first parse all
files to identify the set of candidate loci.
rmZeroCov = FALSE, this means that each file is only read once.
This may be a considerable saving when there are a large number of
If you are unsure whether the below-described shortcuts apply to your data, leave
loci = NULL and let
read.bismark() identify the set of candidate loci from
You may wish to use the
findLoci() function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the
Alternatively, if all
files are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all
files contain the exact same set of methylation loci.
In this case, you may wish to first construct
loci using the internal function
bsseq:::.readBismarkAsFWGRanges() applied to a single file, e.g.,
loci = bsseq:::.readBismarkAsFWGRanges(files, rmZeroCov, strandCollapse).
It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).
rmZeroCov = FALSE
If you set
rmZeroCov = TRUE, then
read.bismark() must first parse all the
files to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci.
This will happen even if you supply
loci with a GenomicRanges of candidate loci.
Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object.
In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to
In such cases, it is premature to use
rmZeroCov = TRUE when running each
read.bismark(); regretably, combining these objects will often then lead to an inefficiently stored BSseq object.
BPPARAMwith a moderate number of workers (cores)
Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.
BACKEND = "HDF5Array"
By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.
Conversely, if you opt for all data to be in-memory (via
BACKEND = NULL), then each worker will pass each file's data back to the main process and memory usage will steadily accumulate to often unreasonable levels.
read each file, which supports threaded-parallisation. Depending on the
nThread can achieve near-linear speed-ups in the
number of threads for reading each file. Care needs to be taken when
nThread > 1 is used in conjunction with a parallel backend via
BPPARAM to ensure the system isn't overloaded. For example, using
BPPARAM = MulticoreParam(workers = 10) with
nThread = 4 may
use up to 40 workers simultaneously.
read.bismark() function creates a BSseq object with two assays,
The choice of realization backend controls whether these assays are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example.
The choice of realization backend is controlled by the
BACKEND argument, which defaults to the current value of
read.bismark() supports the following realization backends:
NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and parallelization backend are currently not supported.
For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible
with a SnowParam that specifies an ad hoc cluster of
BSmooth() will issue an error when given such incompatible realization and parallelization backends.
Additional arguments related to the realization backend can be passed via the
These arguments must be named and are passed to the relevant RealizationSink constructor.
For example, the
... argument can be used to specify the path to the HDF5 file to be
Please see the examples at the bottom of the page.
read.bismark() now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a
SnowParam object as the value of
Detailed and extensive job logging facilities.
All parallelization options are controlled via the
In general, we recommend that users combine multicore (single-machine)
parallelization with an on-disk realization backend (see section,
'Realization backend'). For Unix and Mac users, this means using
a MulticoreParam. For Windows users, this means using a
single-machine SnowParam. Please consult the BiocParallel
documentation to take full advantage of the more advanced features.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as
BSmooth(). Progress bars are
controlled via the
progressbar argument in the
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Peter Hickey [email protected]
read.bsmooth() for parsing output from the BSmooth
read.umtab() for parsing legacy (old) formats from the BSmooth alignment suite.
collapseBSseq() for collapsing (aggregating) data from sample's with multiple Bismark methylation extractor files (e.g., technical replicates).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
# Run read.bismark() on a single sample to construct a matrix-backed BSseq # object. infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz", package = "bsseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, verbose = TRUE) # This is a matrix-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) bsseq ## Not run: # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq # object (with data written to 'test_dir') test_dir <- tempfile("BSseq") bsseq <- read.bismark(files = infile, colData = DataFrame(row.names = "test_data"), rmZeroCov = FALSE, strandCollapse = FALSE, BACKEND = "HDF5Array", dir = test_dir, verbose = TRUE) # This is a HDF5Array-backed BSseq object. sapply(assays(bsseq, withDimnames = FALSE), class) # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir'). sapply(assays(bsseq, withDimnames = FALSE), path) ## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.