read.bismark: Parsing output from the Bismark alignment suite.
In kasperdanielhansen/bsseq: Analyze, manage and store whole-genome methylation data

read.bismark

R Documentation

Parsing output from the Bismark alignment suite.

Description

Parsing output from the Bismark alignment suite.

Usage

read.bismark(files,
             loci = NULL,
             colData = NULL,
             rmZeroCov = FALSE,
             strandCollapse = TRUE,
             BPPARAM = bpparam(),
             BACKEND = NULL,
             dir = tempfile("BSseq"),
             replace = FALSE,
             chunkdim = NULL,
             level = NULL,
             nThread = 1L,
             verbose = getOption("verbose"))

Arguments

`files`	The path to the files created by running Bismark's methylation extractor, one sample per file. Files ending in `.gz` or `.bz2` will be automatically decompressed to `tempfile()`. We strongly recommend you use the 'genome wide cytosine report' output files. See section 'File formats' for further details.
`loci`	`NULL` (default) or a `GenomicRanges` instance containing methylation loci (all with width equal to 1). If `loci = NULL`, then `read.bismark()` will perform a first pass over the Bismark file to identify candidate loci. If `loci` is a `GenomicRanges` instance, then these form the candidate loci. In either case, the candidate loci will be filtered if `rmZeroCov = TRUE` and collapsed if `strandCollapse = TRUE` to form the final set of methylation loci that form the `rowRanges` of the returned BSseq object. See section 'Efficient use of `read.bismark()`' for further details.
`colData`	An optional `DataFrame` describing the samples. Row names, if present, become the column names of the BSseq object. If `NULL`, then a `DataFrame` will be created with `files` used as the row names.
`rmZeroCov`	A `logical(1)` indicating whether methylation loci that have zero coverage in all samples be removed. For several reasons, the default `rmZeroCov = FALSE` is recommended even in cases where you ultimately want to remove such loci. See section 'Efficient use of `read.bismark()`' for further details.
`strandCollapse`	A `logical(1)` indicating whether strand-symmetric methylation loci (i.e. CpGs) should be collapsed across strands. This is only applicable for stranded methylation loci, e.g., loci extracted from 'genome wide cytosine reports' (see section 'File formats' for further details).
`BPPARAM`	An optional `BiocParallelParam` instance determining the parallel back-end to be used during evaluation. Currently supported are `SerialParam` (Unix, Mac, Windows), `MulticoreParam` (Unix and Mac), `SnowParam` (Unix, Mac, and Windows, limited to single-machine clusters), and `BatchtoolsParam` (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details.
`BACKEND`	`NULL` or a single string specifying the name of the realization backend. When the backend is set to `NULL`, the `M` and `Cov` assays are realized in memory as ordinary matrices, otherwise these are realized with the given `BACKEND`. See section 'Realization backends' for further details.
`dir`	Only applicable if `BACKEND == "HDF5Array"`. The path (as a single string) to the directory where to save the HDF5-based BSseq object. The directory will be created so should not already exist, unless `replace` is set to `TRUE`.
`replace`	Only applicable if `BACKEND == "HDF5Array"`. If directory `dir` already exists, should it be replaced with a new one? The content of the existing directory will be lost!
`chunkdim`	Only applicable if `BACKEND == "HDF5Array"`. The dimensions of the chunks to use for writing the data to disk. By default, `getHDF5DumpChunkDim()` using the dimensions of the returned BSseq object will be used. See `?getHDF5DumpChunkDim` for more information.
`level`	Only applicable if `BACKEND == "HDF5Array"`. The compression level to use for writing the data to disk. By default, `getHDF5DumpCompressionLevel()` will be used. See `?getHDF5DumpCompressionLevel` for more information.
`nThread`	The number of threads used by `fread` when reading the `files`. Be careful when combining a parallel backend specified with `BPPARAM` with `nThread` > 1 because each worker will use `nThread`.
`verbose`	A `logical(1)` indicating whether progress messages should be printed (default `TRUE`).

Value

A BSseq object.

File formats

The format of each file is automatically detected using the internal function bsseq:::.guessBismarkFileType(). Files ending in .gz, .bz2, .xz, or .zip will be automatically decompressed to tempdir().

Supported file formats

Bismark's 'genome wide cytosine report' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-genome-wide-cytosine-report-optional-is-tab-delimited-in-the-following-format-1-based-coords) and 'coverage' (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-coverage-output-looks-like-this-tab-delimited-1-based-genomic-coords) formats are both supported. If setting loci = NULL, then we strongly recommend using the 'genome wide cytosine report' output format because this includes strand information for each locus. The 'coverage' output does not contain strand information and so the strand of the returned BSseq object will be set to * unless stranded loci are supplied.

Unsupported file formats

Neither the 'bedGraph' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bedgraph-output-optional-looks-like-this-tab-delimited-0-based-start-1-based-end-coords) nor the 'bismark_methylation_extractor' output format (https://github.com/FelixKrueger/Bismark/tree/master/Docs#the-bismark_methylation_extractor-output-is-in-the-form-tab-delimited-1-based-coords) are supported. The former does not include the required counts of methylated and unmethylated reads hile the is an intermediate file containing read-level, rather than locus-level, data on methylation.

One-based vs. zero-based genomic co-ordinates

The genomic co-ordinates of the Bismark output files may be zero-based or one-based depending on whether the --zero_based argument was used when running Bismark's methylation extractor. Furthermore, the default co-ordinate counting system varies by version of Bismark. bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.

Since Bioconductor packages typically use one-based co-ordinates, we strongly recommend that your Bismark files are also one-based.

Efficient use of `read.bismark()`

We recommend the following to achieve fast and efficient importing of Bismark files:

Specify the set of methylation loci via the loci argument.
Use Bismark files in the 'coverage' output format.
Leave rmZeroCov = FALSE.
Use a BPPARAM with a moderate number of workers (cores).
Use BACKEND = "HDF5Array".
Use multiple threads per worker (i.e. nThread > 1).

Each point is discussed below.

Specifying `loci`

Specifying the set of methylation loci via the loci argument means that read.bismark() does not need to first parse all files to identify the set of candidate loci. Provided that rmZeroCov = FALSE, this means that each file is only read once. This may be a considerable saving when there are a large number of files.

If you are unsure whether the below-described shortcuts apply to your data, leave loci = NULL and let read.bismark() identify the set of candidate loci from files.

You may wish to use the findLoci() function to find all methylation loci of interest in your reference genome (e.g., all CpGs) and then pass the result via the loci argument.

Alternatively, if all files are 'genome wide cytosine reports' for samples aligned to the same reference genome, then all files contain the exact same set of methylation loci. In this case, you may wish to first construct loci using the internal function bsseq:::.readBismarkAsFWGRanges() applied to a single file, e.g., loci = bsseq:::.readBismarkAsFWGRanges(files[1], rmZeroCov, strandCollapse).

Using the 'coverage' Bismark files

It will generally be faster to parse Bismark files in the 'coverage' output format than those in the 'genome wide cytosine report' format This is because the former only includes loci with non-zero coverage and so the file size is often considerably smaller, particularly for shallowly sequenced samples (e.g., those from single-cell bisulfite sequencing).

Leaving `rmZeroCov = FALSE`

If you set rmZeroCov = TRUE, then read.bismark() must first parse all the files to identify which loci have zero coverage in all samples and then filter these out from the set of candidate loci. This will happen even if you supply loci with a GenomicRanges of candidate loci.

Furthermore, any coverage-based filtering of methylation loci is best left until you have constructed your final BSseq object. In our experience, the final BSseq object is often the product of combining multiple BSseq objects, each constructed with a separate call to read.bismark(). In such cases, it is premature to use rmZeroCov = TRUE when running each read.bismark(); regretably, combining these objects will often then lead to an inefficiently stored BSseq object.

Using a `BPPARAM` with a moderate number of workers (cores)

Each file can be processed on its own, so you can process in parallel as many files as you have workers. However, if using the HDF5Array backend, then writing to the HDF5 file cannot be performed in parallel and so becomes the bottleneck. Nonetheless, by using a moderate number of workers (2 - 10), we can ensure there is processed data available to write to disk as soon as the current write is completed.

Using `BACKEND = "HDF5Array"`

By using the HDF5Array realization backend from HDF5Array, we reduce the amount of data that is kept in-memory at any one time. Once each file is parsed, the data are written to the HDF5 file and are no longer needed in-memory. When combined with multiple workers (cores), this means that each file will only need to read and retain in-memory 1 sample's worth of data at a time.

Conversely, if you opt for all data to be in-memory (via BACKEND = NULL), then each worker will pass each file's data back to the main process and memory usage will steadily accumulate to often unreasonable levels.

Using `nThread` > 1

read.bismark uses data.table::fread to read each file, which supports threaded-parallisation. Depending on the system, increasing nThread can achieve near-linear speed-ups in the number of threads for reading each file. Care needs to be taken when nThread > 1 is used in conjunction with a parallel backend via BPPARAM to ensure the system isn't overloaded. For example, using BPPARAM = MulticoreParam(workers = 10) with nThread = 4 may use up to 40 workers simultaneously.

Realization backends

The read.bismark() function creates a BSseq object with two assays, M and Cov. The choice of realization backend controls whether these assays are stored in-memory as an ordinary matrix or on-disk as a HDF5Array, for example. The choice of realization backend is controlled by the BACKEND argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend().

read.bismark() supports the following realization backends:

NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.
HDF5Array (on-disk): This stores each new assay on-disk in a HDF5 file using an HDF5Matrix from HDF5Array.

Please note that certain combinations of realization backend and parallelization backend are currently not supported. For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible with a SnowParam that specifies an ad hoc cluster of multiple machines). BSmooth() will issue an error when given such incompatible realization and parallelization backends.

Additional arguments related to the realization backend can be passed via the ... argument. These arguments must be named and are passed to the relevant RealizationSink constructor. For example, the ... argument can be used to specify the path to the HDF5 file to be used by BSmooth(). Please see the examples at the bottom of the page.

Parallelization, progress monitoring, and logging

read.bismark() now uses the BiocParallel package to implement parallelization. This brings some notable improvements:

Imported files can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.
Parallelization is now supported on Windows through the use of a SnowParam object as the value of BPPARAM.
Detailed and extensive job logging facilities.

All parallelization options are controlled via the BPPARAM argument. In general, we recommend that users combine multicore (single-machine) parallelization with an on-disk realization backend (see section, 'Realization backend'). For Unix and Mac users, this means using a MulticoreParam. For Windows users, this means using a single-machine SnowParam. Please consult the BiocParallel documentation to take full advantage of the more advanced features.

A useful feature of BiocParallel are progress bars to monitor the status of long-running jobs, such as BSmooth(). Progress bars are controlled via the progressbar argument in the BiocParallelParam constructor.

BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage these advanced features.

Author(s)

Peter Hickey peter.hickey@gmail.com

Examples

  # Run read.bismark() on a single sample to construct a matrix-backed BSseq
  # object.
  infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
                        package = "bsseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        verbose = TRUE)
  # This is a matrix-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  bsseq

  ## Not run: 
  # Run read.bismark() on a single sample to construct a HDF5Array-backed BSseq
  # object (with data written to 'test_dir')
  test_dir <- tempfile("BSseq")
  bsseq <- read.bismark(files = infile,
                        colData = DataFrame(row.names = "test_data"),
                        rmZeroCov = FALSE,
                        strandCollapse = FALSE,
                        BACKEND = "HDF5Array",
                        dir = test_dir,
                        verbose = TRUE)
  # This is a HDF5Array-backed BSseq object.
  sapply(assays(bsseq, withDimnames = FALSE), class)
  # The 'M' and 'Cov' assays are in the HDF5 file 'assays.h5' (in 'test_dir').
  sapply(assays(bsseq, withDimnames = FALSE), path)
  
## End(Not run)

kasperdanielhansen/bsseq documentation built on June 14, 2025, 6:51 a.m.

kasperdanielhansen/bsseq index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kasperdanielhansen/bsseq
Analyze, manage and store whole-genome methylation data

read.bismark: Parsing output from the Bismark alignment suite.
In kasperdanielhansen/bsseq: Analyze, manage and store whole-genome methylation data

Parsing output from the Bismark alignment suite.

Description

Usage

Arguments

Value

File formats

Supported file formats

Unsupported file formats

One-based vs. zero-based genomic co-ordinates

Efficient use of `read.bismark()`

Specifying `loci`

Using the 'coverage' Bismark files

Leaving `rmZeroCov = FALSE`

Using a `BPPARAM` with a moderate number of workers (cores)

Using `BACKEND = "HDF5Array"`

Using `nThread` > 1

Realization backends

Parallelization, progress monitoring, and logging

Author(s)

See Also

Examples

Related to read.bismark in kasperdanielhansen/bsseq...

R Package Documentation

Browse R Packages

We want your feedback!

kasperdanielhansen/bsseq Analyze, manage and store whole-genome methylation data

read.bismark: Parsing output from the Bismark alignment suite. In kasperdanielhansen/bsseq: Analyze, manage and store whole-genome methylation data

Parsing output from the Bismark alignment suite.

Description

Usage

Arguments

Value

File formats

Supported file formats

Unsupported file formats

One-based vs. zero-based genomic co-ordinates

Efficient use of read.bismark()

Specifying loci

Using the 'coverage' Bismark files

Leaving rmZeroCov = FALSE

Using a BPPARAM with a moderate number of workers (cores)

Using BACKEND = "HDF5Array"

Using nThread > 1

Realization backends

Parallelization, progress monitoring, and logging

Author(s)

See Also

Examples

Related to read.bismark in kasperdanielhansen/bsseq...

R Package Documentation

Browse R Packages

We want your feedback!

kasperdanielhansen/bsseq
Analyze, manage and store whole-genome methylation data

read.bismark: Parsing output from the Bismark alignment suite.
In kasperdanielhansen/bsseq: Analyze, manage and store whole-genome methylation data

Efficient use of `read.bismark()`

Specifying `loci`

Leaving `rmZeroCov = FALSE`

Using a `BPPARAM` with a moderate number of workers (cores)

Using `BACKEND = "HDF5Array"`

Using `nThread` > 1