readData: Read Data function


Description

This is the core function for reading and parsing raw data based on a config data.frame. At the moment only the BAM format is supported. It is not intended to be called by the user directly, as it is called internally by the GenoGAMDataSet constructor. However, it is exported for users who wish to assemble their data separately and construct the GenoGAMDataSet from a SummarizedExperiment afterwards. It can also store the data using the HDF5 backend.

Usage

readData(config, hdf5 = FALSE, split = FALSE,
  settings = GenoGAMSettings(), ...)

Arguments

config

A data.frame containing the experiment design of the model to be computed, with the first three columns fixed. See the 'experimentDesign' parameter in GenoGAMDataSet or the Details section below.

hdf5

Should the data be stored on disk in HDF5 format? By default this is disabled, as the Rle representation of count data already compresses the data reasonably well. However, for organisms with large genomes, a complex experiment design or limited memory, this might further decrease the memory footprint.

split

If TRUE, the data will be stored as a list of DataFrames by chromosome instead of one big DataFrame. This is only necessary if organisms with a genome size bigger than 2^31 (approx. 2.15 Gbp) are analyzed, in which case R's lack of long integers prevents having a well compressed Rle of sufficient size.

settings

A GenoGAMSettings object. Not needed by default, but might be of use if only specific regions should be read in. See GenoGAMSettings.

...

Further parameters that are passed on to low-level functions, mostly to custom process functions. If the default process functions are used, i.e. the default settings parameter, the most interesting parameter is probably the fragment length estimation method for single-end data; see ?chipseq::estimate.mean.fraglen.
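The 2^31 limit mentioned for the split argument can be checked before reading any data. The following is a minimal sketch in base R, assuming the chromosome lengths are already known; the names and values in chrom_lengths are hypothetical and purely illustrative.

```r
# Hypothetical chromosome lengths in base pairs (illustrative values only).
chrom_lengths <- c(chrA = 1.2e9, chrB = 9.0e8, chrC = 4.0e8)

# Sum as doubles: integer arithmetic beyond the limit would overflow to NA.
genome_size <- sum(as.numeric(chrom_lengths))

# R's largest representable integer is 2^31 - 1.
needs_split <- genome_size > .Machine$integer.max

needs_split
```

The resulting flag could then be passed on as readData(config, split = needs_split).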

Details

The config data.frame contains the actual experiment design. It must contain at least three columns with fixed names: 'ID', 'file' and 'paired'.

The field 'ID' stores a unique identifier for each alignment file. It is recommended to use short and easy to understand identifiers because they are subsequently used for labelling data and plots.

The field 'file' stores the complete path to the BAM file.

The field 'paired' takes the value TRUE for paired-end sequencing data and FALSE for single-end sequencing data.

Other columns will be ignored by this function.
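As an illustration of the layout described above, a config data.frame can also be assembled by hand instead of being read from a file. This is only a sketch: the sample IDs and BAM paths are hypothetical placeholders, and the sanity checks are a suggestion rather than something readData is known to perform itself.

```r
# Hypothetical sample IDs and BAM paths; replace with real files.
config <- data.frame(
    ID = c("wt_1", "wt_2"),
    file = c("/path/to/wt_1.bam", "/path/to/wt_2.bam"),
    paired = c(FALSE, FALSE),
    stringsAsFactors = FALSE
)

# Basic sanity checks mirroring the requirements described above.
required <- c("ID", "file", "paired")
stopifnot(
    all(required %in% colnames(config)),  # the three fixed columns exist
    anyDuplicated(config$ID) == 0,        # identifiers are unique
    is.logical(config$paired)             # TRUE/FALSE, not "yes"/"no"
)
```

With real paths, a check such as all(file.exists(config$file)) would be a natural addition before calling readData(config).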

Value

A DataFrame of counts for each sample and position. If split = TRUE, a list of DataFrames, one per chromosome.

Author(s)

Georg Stricker <georg.stricker@in.tum.de>

Examples

# Read data

## Set config file
config <- system.file("extdata/Set1", "experimentDesign.txt", package = "fastGenoGAM")
config <- read.table(config, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
for (ii in seq_len(nrow(config))) {
    absPath <- system.file("extdata/Set1/bam", config$file[ii], package = "fastGenoGAM")
    config$file[ii] <- absPath
}

## Read all data
df <- readData(config)
df

## Read data of a particular chromosome
settings <- GenoGAMSettings(chromosomeList = "chrI")
df <- readData(config, settings = settings)
df

## Read data of particular range
region <- GenomicRanges::GRanges("chrI", IRanges::IRanges(10000, 20000))
params <- Rsamtools::ScanBamParam(which = region)
settings <- GenoGAMSettings(bamParams = params)
df <- readData(config, settings = settings)
df

gstricker/fastGenoGAM documentation built on May 17, 2019, 8:56 a.m.