read.bismark | R Documentation |
Parsing output from the Bismark alignment suite.
read.bismark(files,
loci = NULL,
colData = NULL,
rmZeroCov = FALSE,
strandCollapse = TRUE,
BPPARAM = bpparam(),
BACKEND = NULL,
dir = tempfile("BSseq"),
replace = FALSE,
chunkdim = NULL,
level = NULL,
nThread = 1L,
verbose = getOption("verbose"))
files |
The path to the files created by running Bismark's methylation
extractor, one sample per file.
Files ending in |
loci |
|
colData |
An optional DataFrame describing the samples.
Row names, if present, become the column names of the BSseq
object. If |
rmZeroCov |
A |
strandCollapse |
A |
BPPARAM |
An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details. |
BACKEND |
|
dir |
Only applicable if |
replace |
Only applicable if |
chunkdim |
Only applicable if |
level |
Only applicable if |
nThread |
The number of threads used by |
verbose |
A |
A BSseq object.
The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType()
.
Files ending in .gz
, .bz2
, .xz
, or .zip
will be automatically decompressed to tempdir()
.
Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported.
If setting loci = NULL
, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus.
The 'coverage' output does not contain strand information and so the strand
of the returned BSseq object will be set to *
unless stranded loci
are supplied.
Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.
The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based
argument was used when running Bismark's methylation extractor.
Furthermore, the default co-ordinate counting system varies by version of Bismark.
bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.
read.bismark()
We recommend the following to achieve fast and efficient importing of Bismark files:
Specify the set of methylation loci via the loci
argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE
.
Use a BPPARAM
with a moderate number of workers (cores).
Use BACKEND = "HDF5Array"
.
Use multiple threads per worker (i.e. nThread
> 1).
Each point is discussed below.
loci
Specifying the set of methylation loci via the loci
argument means that read.bismark()
does not need to first parse all files
to identify the set of candidate loci.
Provided that rmZeroCov = FALSE
, this means that each file is only read once.
This may be a considerable saving when there are a large number of files
.
If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL
and let read.bismark()
identify the set of candidate loci from files
.
You may wish to use the findLoci()
function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci
argument.
Alternatively, if all files
are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files
contain the exact same set of methylation loci.
In this case, you may wish to first construct loci
using the internal function bsseq:::.readBismarkAsFWGRanges()
applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse)
.
It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).
rmZeroCov = FALSE
If you set rmZeroCov = TRUE
, then read.bismark()
must first parse all the files
to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci.
This will happen even if you supply loci
with a GenomicRanges of candidate loci.
Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object.
In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark()
.
In such cases, it is premature to use rmZeroCov = TRUE
when running each read.bismark()
; regretably, combining these objects will often then lead to an inefficiently stored BSseq object.
BPPARAM
with a moderate number of workers (cores)Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.
BACKEND = "HDF5Array"
By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.
Conversely, if you opt for all data to be in-memory (via BACKEND = NULL
), then each worker will pass each file's data back to the main process and memory usage will steadily accumulate to often unreasonable levels.
nThread
> 1read.bismark
uses data.table::fread
to
read each file, which supports threaded-parallisation. Depending on the
system, increasing nThread
can achieve near-linear speed-ups in the
number of threads for reading each file. Care needs to be taken when
nThread
> 1 is used in conjunction with a parallel backend via
BPPARAM
to ensure the system isn't overloaded. For example, using
BPPARAM = MulticoreParam(workers = 10)
with nThread = 4
may
use up to 40 workers simultaneously.
The read.bismark()
function creates a BSseq object with two assays, M
and Cov
.
The choice of realization backend controls whether these assays are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example.
The choice of realization backend is controlled by the BACKEND
argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend()
.
read.bismark()
supports the following realization backends:
NULL
(in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array
(on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.
Please note that certain combinations of realization backend and parallelization backend are currently not supported.
For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible
with a SnowParam that specifies an ad hoc cluster of
multiple machines).
BSmooth()
will issue an error when given such incompatible realization and parallelization backends.
Additional arguments related to the realization backend can be passed via the ...
argument.
These arguments must be named and are passed to the relevant RealizationSink constructor.
For example, the ...
argument can be used to specify the path to the HDF5 file to be
used by BSmooth()
.
Please see the examples at the bottom of the page.
read.bismark()
now uses the BiocParallel package to implement
parallelization. This brings some notable improvements:
Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a
SnowParam object as the value of BPPARAM
.
Detailed and extensive job logging facilities.
All parallelization options are controlled via the BPPARAM
argument.
In general, we recommend that users combine multicore (single-machine)
parallelization with an on-disk realization backend (see section,
'Realization backend'). For Unix and Mac users, this means using
a MulticoreParam. For Windows users, this means using a
single-machine SnowParam. Please consult the BiocParallel
documentation to take full advantage of the more advanced features.
A useful feature of BiocParallel are progress bars to monitor the
status of long-running jobs, such as BSmooth()
. Progress bars are
controlled via the progressbar
argument in the
BiocParallelParam constructor.
BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.
Peter Hickey peter.hickey@gmail.com
collapseBSseq()
for collapsing (aggregating) data from sample's with multiple Bismark methylation extractor files (e.g., technical replicates).
# Run read.bismark() on a single sample to construct a matrix-backed BSseq
# object.
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
package = "bsseq")
bsseq <- read.bismark(files = infile,
colData = DataFrame(row.names = "test_data"),
rmZeroCov = FALSE,
strandCollapse = FALSE,
verbose = TRUE)
# This is a matrix-backed BSseq object.
sapply(assays(bsseq, withDimnames = FALSE), class)
bsseq
## Not run:
# Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq
# object (with data written to 'test_dir')
test_dir <- tempfile("BSseq")
bsseq <- read.bismark(files = infile,
colData = DataFrame(row.names = "test_data"),
rmZeroCov = FALSE,
strandCollapse = FALSE,
BACKEND = "HDF5Array",
dir = test_dir,
verbose = TRUE)
# This is a HDF5Array-backed BSseq object.
sapply(assays(bsseq, withDimnames = FALSE), class)
# The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir').
sapply(assays(bsseq, withDimnames = FALSE), path)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.