read.bismark | R Documentation |
Parsing output from the Bismark alignment suite.
read.bismark(files,
loci = NULL,
colData = NULL,
rmZeroCov = FALSE,
strandCollapse = TRUE,
BPPARAM = bpparam(),
BACKEND = NULL,
dir = tempfile("BSseq"),
replace = FALSE,
chunkdim = NULL,
level = NULL,
nThread = 1L,
verbose = getOption("verbose"))
files |
The path to the files created by running Bismark's methylation
extractor, one sample per file.
Files ending in |
loci |
|
colData |
An optional |
rmZeroCov |
A |
strandCollapse |
A |
BPPARAM |
An optional
|
BACKEND |
|
dir |
Only applicable if |
replace |
Only applicable if |
chunkdim |
Only applicable if |
level |
Only applicable if |
nThread |
The number of threads used by |
verbose |
A |
A BSseq object.
The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType()
.
Files ending in .gz
, .bz2
, .xz
, or .zip
will be automatically decompressed to tempdir()
.
Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported.
If setting loci = NULL
, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus.
The 'coverage' output does not contain strand information and so the strand
of the returned BSseq object will be set to *
unless stranded loci
are supplied.
Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.
The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based
argument was used when running Bismark's methylation extractor.
Furthermore, the default co-ordinate counting system varies by version of Bismark.
bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.
read.bismark()
We recommend the following to achieve fast and efficient importing of Bismark files:
Specify the set of methylation loci via the loci
argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE
.
Use a BPPARAM
with a moderate number of workers (cores).
Use BACKEND = "HDF5Array"
.
Use multiple threads per worker (i.e. nThread
> 1).
Each point is discussed below.
loci
Specifying the set of methylation loci via the loci
argument means that read.bismark()
does not need to first parse all files
to identify the set of candidate loci.
Provided that rmZeroCov = FALSE
, this means that each file is only read once.
This may be a considerable saving when there are a large number of files
.
If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL
and let read.bismark()
identify the set of candidate loci from files
.
You may wish to use the findLoci()
function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci
argument.
Alternatively, if all files
are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files
contain the exact same set of methylation loci.
In this case, you may wish to first construct loci
using the internal function bsseq:::.readBismarkAsFWGRanges()
applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse)
.
It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).
rmZeroCov = FALSE
If you set rmZeroCov = TRUE
, then read.bismark()
must first parse all the files
to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci.
This will happen even if you supply loci
with a
GenomicRanges
of candidate loci.
Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object.
In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark()
.
In such cases, it is premature to use rmZeroCov = TRUE
when running each read.bismark()
; regretably, combining these objects will often then lead to an inefficiently stored BSseq object.
BPPARAM
with a moderate number of workers (cores)Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.
BACKEND = "HDF5Array"
By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.
Conversely, if you opt for all data to be in-memory (via
BACKEND = NULL
), then each worker will pass each file's data
back to the main process and memory usage will steadily accumulate
to often unreasonable levels.
nThread
> 1read.bismark
uses data.table::fread
to
read each file, which supports threaded-parallisation. Depending on the
system, increasing nThread
can achieve near-linear speed-ups in the
number of threads for reading each file. Care needs to be taken when
nThread
> 1 is used in conjunction with a parallel backend via
BPPARAM
to ensure the system isn't overloaded. For example, using
BPPARAM = MulticoreParam(workers = 10)
with nThread = 4
may
use up to 40 workers simultaneously.
The read.bismark()
function creates a BSseq
object with two assays, M
and Cov
. The choice of
realization backend controls whether these assays are stored
in-memory as an ordinary matrix or on-disk as a
HDF5Array
, for example. The choice of
realization backend is controlled by the BACKEND
argument,
which defaults to the current value of
DelayedArray::getAutoRealizationBackend()
.
read.bismark()
supports the following realization backends:
NULL
(in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array
(on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix
from HDF5Array.
Please note that certain combinations of realization backend and
parallelization backend are currently not supported. For example, the
HDF5Array realization backend is currently only
compatible when used with a single-machine parallelization backend
(i.e. it is not compatible with a SnowParam
that
specifies an ad hoc cluster of multiple machines).
BSmooth()
will issue an error when given such
incompatible realization and parallelization backends.
Additional arguments related to the realization backend can be passed
via the ...
argument. These arguments must be named and are
passed to the relevant RealizationSink
constructor. For
example, the ...
argument can be used to specify the path to
the HDF5 file to be used by BSmooth()
. Please see the examples
at the bottom of the page.
read.bismark()
now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of
a SnowParam
object as the value of
BPPARAM
.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM
argument. In general, we recommend that users combine multicore
(single-machine) parallelization with an on-disk realization backend
(see section, 'Realization backend'). For Unix and Mac users, this
means using a MulticoreParam
. For Windows
users, this means using a single-machine
SnowParam
. Please consult the
BiocParallel documentation to take full advantage of the more
advanced features.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth()
. Progress bars are
controlled via the progressbar
argument in the
BiocParallelParam
constructor.
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Peter Hickey peter.hickey@gmail.com
collapseBSseq()
for collapsing (aggregating) data from sample's with multiple Bismark methylation extractor files (e.g., technical replicates).
# Run read.bismark() on a single sample to construct a matrix-backed BSseq
# object.
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
package = "bsseq")
bsseq <- read.bismark(files = infile,
colData = DataFrame(row.names = "test_data"),
rmZeroCov = FALSE,
strandCollapse = FALSE,
verbose = TRUE)
# This is a matrix-backed BSseq object.
sapply(assays(bsseq, withDimnames = FALSE), class)
bsseq
## Not run:
# Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq
# object (with data written to 'test_dir')
test_dir <- tempfile("BSseq")
bsseq <- read.bismark(files = infile,
colData = DataFrame(row.names = "test_data"),
rmZeroCov = FALSE,
strandCollapse = FALSE,
BACKEND = "HDF5Array",
dir = test_dir,
verbose = TRUE)
# This is a HDF5Array-backed BSseq object.
sapply(assays(bsseq, withDimnames = FALSE), class)
# The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir').
sapply(assays(bsseq, withDimnames = FALSE), path)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.