loadCoverage: Load the coverage information from a group of BAM files
In lcolladotor/derfinder: Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution via the DER Finder approach

loadCoverage

R Documentation

Load the coverage information from a group of BAM files

Description

For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff.

Usage

loadCoverage(
  files,
  chr,
  cutoff = NULL,
  filter = "one",
  chrlen = NULL,
  output = NULL,
  bai = NULL,
  ...
)

Arguments

`files`	A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing `files`. `files` can also be a BamFileList, BamFile, BigWigFileList, or BigWigFile object.
`chr`	Chromosome to read. Should be in the format matching the one used in the raw data.
`cutoff`	This argument is passed to filterData.
`filter`	Has to be either `'one'` (default) or `'mean'`. In the first case, at least one sample has to have coverage above `cutoff`. In the second case, the mean coverage has to be greater than `cutoff`.
`chrlen`	The chromosome length in base pairs. If it's `NULL`, the chromosome length is extracted from the BAM files.
`output`	If `NULL` then no output is saved in disk. If `auto` then an automatic name is constructed using UCSC names (chrXCovInfo.Rdata for example). If another character is specified, then that name is used for the output file.
`bai`	The full path to the BAM index files. If `NULL` it is assumed that the BAM index files are in the same location as the BAM files and that they have the .bai extension. Ignored if `files` is a `BamFileList` object or if `inputType=='BigWig'`.
`...`	Arguments passed to other methods and/or advanced arguments. Advanced arguments: verbose If `TRUE` basic status updates will be printed along the way. inputType Has to be either `bam` or `BigWig`. It specifies the format of the raw data files. By default it's set to `bam` before trying to guess it from the file names. tilewidth When specified, tileGenome is used to break up the chromosome into chunks. We don't recommend this for BAM files as the coverage in the borders of the chunks might be slightly off. which `NULL` by default. When a `GRanges` is specified, this specific region of the genome is loaded instead of the full chromosome. fileStyle The naming style of the chromosomes in the input files. If the global option `chrsStyle` is not set, the naming style won't be changed. This option is useful when you want to use a specific naming style but the raw files use another style. protectWhich When not `NULL` and `which` is specified, this argument specifies by how much the ranges in `which` will be expanded. This helps get the same base level coverage you would get from reading the coverage for a full chromosome from BAM files. Otherwise some reads might be excluded and thus the base level coverage will be lower. `NULL` by default. drop.D Whether to drop the bases with 'D' in the CIGAR strings or to include them. Only used with BAM files. `FALSE` by default. sampleNames Column names to be used the samples. By default it's `names(files)`. Passed to extendedMapSeqlevels, define_cluster, scanBamFlag and filterData. Note that filterData is used internally by loadCoverage and has the important arguments `totalMapped` and `targetSize` which can be used to normalize the coverage by library size. See getTotalMapped for calculating `totalMapped`.

Details

The ... argument can be used to control which alignments to consider when reading from BAM files. See scanBamFlag.

Parallelization for loading the data in chunks is used only used when tilewidth is specified. You may use up to one core per tile.

If you set the advanced argument drop.D = TRUE, bases with CIGAR string "D" (deletion from reference) will be excluded from the base-level coverage calculation.

If you are working with data from an organism different from 'Homo sapiens' specify so by setting the global 'species' and 'chrsStyle' options. For example: options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI')

Value

A list with two components.

coverage: is a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base.
position: is a logical Rle with the positions of the chromosome that passed the cutoff.

Author(s)

Leonardo Collado-Torres, Andrew Jaffe

Examples

datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
    datadir = datadir, samplepatt = "*accepted_hits.bam$",
    fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))

## Read and filter the data, only for 2 files
dataSmall <- loadCoverage(files = files[1:2], chr = "21", cutoff = 0)
## Not run: 
## Export to BigWig files
createBw(list("chr21" = dataSmall))

## Load data from BigWig files
dataSmall.bw <- loadCoverage(c(
    ERR009101 = "ERR009101.bw", ERR009102 =
        "ERR009102.bw"
), chr = "chr21")

## Compare them
mapply(function(x, y) {
    x - y
}, dataSmall$coverage, dataSmall.bw$coverage)

## Note that the only difference is the type of Rle (integer vs numeric) used
## to store the data.

## End(Not run)

lcolladotor/derfinder documentation built on Dec. 17, 2024, 4:53 p.m.