GenoGAMDataSet-class: GenoGAMDataSet

Description Usage Arguments Value Methods (by generic) Slots Config Design/Formula Further parameters Author(s) Examples

Description

The GenoGAMDataSet class contains the pre-processed raw data and additional slots that define the input and framework for the model. It extends the RangedSummarizedExperiment class by adding an index that defines ranges on the entire genome, mostly for purposes of parallel evaluation. Furthermore adding a couple more slots to hold information such as experiment design. It also contains the GenoGAMSettings class that defines global settings for the session. For information on the slots inherited from SummarizedExperiment check the respective class.

GenoGAMDataSet is the constructor function for the GenoGAMDataSet-class.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
GenoGAMDataSet(experimentDesign, design, chunkSize = NULL,
  overhangSize = NULL, directory = ".", settings = NULL,
  hdf5 = FALSE, split = hdf5, fromHDF5 = FALSE, ignoreM = FALSE,
  ...)

## S4 method for signature 'GenoGAMDataSet'
getIndex(object)

## S4 method for signature 'GenoGAMDataSet'
getCountMatrix(object)

## S4 method for signature 'GenoGAMDataSet'
tileSettings(object)

## S4 method for signature 'GenoGAMDataSet'
dataRange(object)

## S4 method for signature 'GenoGAMDataSet'
getChromosomes(object)

## S4 method for signature 'GenoGAMDataSet'
getTileSize(object)

## S4 method for signature 'GenoGAMDataSet'
getChunkSize(object)

## S4 method for signature 'GenoGAMDataSet'
getOverhangSize(object)

## S4 method for signature 'GenoGAMDataSet'
getTileNumber(object)

## S4 method for signature 'GenoGAMDataSet'
is.HDF5(object)

## S4 method for signature 'GenoGAMDataSet'
design(object)

## S4 replacement method for signature 'GenoGAMDataSet,ANY'
design(object) <- value

## S4 method for signature 'GenoGAMDataSet'
sizeFactors(object)

## S4 replacement method for signature 'GenoGAMDataSet,ANY'
sizeFactors(object) <- value

## S4 replacement method for signature 'GenoGAMDataSet,numeric'
getChunkSize(object) <- value

## S4 replacement method for signature 'GenoGAMDataSet,numeric'
getTileSize(object) <- value

## S4 replacement method for signature 'GenoGAMDataSet,numeric'
getOverhangSize(object) <- value

## S4 replacement method for signature 'GenoGAMDataSet,numeric'
getTileNumber(object) <- value

Arguments

experimentDesign

Either a character object specifying the path to a delimited text file (the delimiter will be determined automatically), a data.frame specifying the experiment design or a RangedSummarizedExperiment object with the GPos class being the rowRanges. See details for the structure of the experimentDesign.

design

A formula object. See details for its structure.

chunkSize

An integer specifying the size of one chunk in bp.

overhangSize

An integer specifying the size of the overhang in bp. As the overhang is taken to be symmetrical, only the overhang of one side should be provided.

directory

The directory from which to read the data. By default the current working directory is taken.

settings

A GenoGAMSettings object. Not needed by default, but might be of use if only specific regions should be read in. See GenoGAMSettings.

hdf5

Should the data be stored on HDD in HDF5 format? By default this is disabled, as the Rle representation of count data already provides a decent compression of the data. However in case of large organisms, a complex experiment design or just limited memory, this might further decrease the memory footprint. Note this only applies to the input count data, results are usually stored in HDF5 format due to their space requirements for type double. Exceptions are small organisms like yeast.

split

A logical argument specifying if the data should be stored as a list split by chromosome. This is useful and necessary for huge organisms like human, as R does not support long integers.

fromHDF5

A logical argument specifying if the data is already present in form of HDF5 files and should be rather read in from there.

ignoreM

A logical argument to ignore the Mitochondria DNA on data read in. This is useful, if one is not interested in chrM, but it's size prevents the tiles to be larger, as all tiles has to be of some size.

...

Further parameters, mostly for arguments of custom processing functions or to specify a different method for fragment size estimation. See details for further information.

object

For use of S4 methods. The GenoGAMDataSet object.

value

For use of S4 methods. The value to be assigned to the slot.

Value

An object of class GenoGAMDataSet or the respective slot.

Methods (by generic)

Slots

settings

The global and local settings that will be used to compute the model.

design

The formula describing how to evaluate the data. See details.

sizeFactors

The normalized values for each sample. A named numeric vector.

index

A GRanges object representing an index of the ranges defined on the genome. Mostly used to store tiles.

hdf5

A logical slot indicating if the object should be stored as HDF5

countMatrix

Either a matrix or HDF5Matrix to store the sums of counts of the regions (could also be seen as bins) for later use especially by DESeq2

Config

The config file/data.frame contains the actual experiment design. It must contain at least three columns with fixed names: 'ID', 'file' and 'paired'.

The field 'ID' stores a unique identifier for each alignment file. It is recommended to use short and easy to understand identifiers because they are subsequently used for labelling data and plots.

The field 'file' stores the BAM file name.

The field 'paired', values TRUE for paired-end sequencing data, and FALSE for single-end sequencing data.

All other columns are stored in the colData slot of the GenoGAMDataSet object. Note that all columns which will be used for analysis must have at most two conditions, which are for now restricted to 0 and 1. For example, if the IP data schould be corrected for input, then the input will be 0 and IP will be 1, since we are interested in the corrected IP. See examples.

Design/Formula

Design must be a formula. At the moment only the following is possible: Either ~ s(x) for a smooth fit over the entire data or s(x, by = myColumn), where 'myColumn' is a column name in the experimentDesign. Any combination of this is possible:

~ s(x) + s(x, by = myColumn) + s(x, by = ...) + ...

For example the formula for correcting IP for input would look like this:

~ s(x) + s(x, by = experiment)

where 'experiment' is a column with 0s and 1s, with the ip samples annotated with 1 and input samples with 0. '

Further parameters

In case of single-end data it might be usefull to specify a different method for fragment size estimation. The argument 'shiftMethod' can be supplied with the values 'coverage' (default), 'correlation' or 'SISSR'. See ?chipseq::estimate.mean.fraglen for explanation.

Author(s)

Georg Stricker georg.stricker@in.tum.de

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# Build from config file

config <- system.file("extdata/Set1", "experimentDesign.txt", package = "GenoGAM")
dir <- system.file("extdata/Set1/bam", package = "GenoGAM")

## For all data
ggd <- GenoGAMDataSet(config, chunkSize = 1000, overhangSize = 200,
    design = ~ s(x) + s(x, by = genotype), directory = dir)
ggd

## Read data of a particular chromosome
settings <- GenoGAMSettings(chromosomeList = "chrXIV")
ggd <- GenoGAMDataSet(config, chunkSize = 1000, overhangSize = 200,
    design = ~ s(x) + s(x, by = genotype), directory = dir,
    settings = settings)
ggd

## Read data of particular range
region <- GenomicRanges::GRanges("chrI", IRanges(10000, 15000))
params <- Rsamtools::ScanBamParam(which = region)
settings <- GenoGAMSettings(bamParams = params)
ggd <- GenoGAMDataSet(config, chunkSize = 1000, overhangSize = 200,
    design = ~ s(x) + s(x, by = genotype), directory = dir,
    settings = settings)
ggd

# Build from data.frame config

df <- read.table(config, header = TRUE, sep = '\t')
ggd <- GenoGAMDataSet(df, chunkSize = 1000, overhangSize = 200,
    design = ~ s(x) + s(x, by = genotype), directory = dir,
    settings = settings)
ggd

# Build from SummarizedExperiment

gr <- GenomicRanges::GPos(GRanges("chr1", IRanges(1, 10000)))
seqlengths(gr) <- 1e6
df <- S4Vectors::DataFrame(colA = 1:10000, colB = round(runif(10000)))
se <- SummarizedExperiment::SummarizedExperiment(rowRanges = gr, assays = list(df))
ggd <- GenoGAMDataSet(se, chunkSize = 2000, overhangSize = 250, 
                      design = ~ s(x) + s(x, by = experiment))
ggd

GenoGAM documentation built on Nov. 8, 2020, 7:45 p.m.