GPos-class: Memory-efficient representation of genomic positions

Description Usage Arguments Details Value Accessors Coercion Subsetting Concatenation Splitting and Relisting Note Author(s) See Also Examples

Description

The GPos class is a container for storing a set of genomic positions (a.k.a. genomic loci). It exists in 2 flavors: UnstitchedGPos and StitchedGPos. Each flavor uses a particular internal representation:

Because genomic positions can be seen as genomic ranges of width 1, the GPos class extends the GenomicRanges virtual class (via the GRanges class).

Usage

1
2
3
## Constructor function
GPos(seqnames=NULL, pos=NULL, strand=NULL,
     ..., seqinfo=NULL, seqlengths=NULL, stitch=NA)

Arguments

seqnames, strand, ..., seqinfo, seqlengths

See documentation of the GRanges() constructor function for a description of these arguments.

pos

NULL, or an integer or numeric vector, or an IRanges or IPos object (or other IntegerRanges derivative). If not NULL, GPos() will try to turn it into an IPos derivative with IPos(pos, stitch=stitch).

When pos is an IRanges object (or other IntegerRanges derivative), each range in it is interpreted as a run of consecutive positions.

stitch

TRUE, FALSE, or NA (the default).

Controls which internal representation should be used: StitchedGPos (when stitch is TRUE) or UnstitchedGPos (when stitch is FALSE).

When stitch is NA (the default), which internal representation will be used depends on the flavour of the IPos derivative returned by IPos(pos): UnstitchedGPos if IPos(pos) returns an UnstitchedIPos instance, and StitchedGPos if it returns a StitchedIPos instance.

Details

Even though a GRanges object can be used for storing genomic positions, using a GPos object is more efficient. In particular the memory footprint of an UnstitchedGPos object is typically about half that of a GRanges object.

OTOH the memory footprint of a StitchedGPos object can vary a lot but will never be worse than that of a GRanges object. However it will reduce dramatically if the vector of positions contains long runs of consecutive positions. In the worst case scenario (i.e. when the object contains no consecutive positions) its memory footprint will be the same as that of a GRanges object.

Like for any Vector derivative, the length of a GPos object cannot exceed .Machine$integer.max (i.e. 2^31 on most platforms). GPos() will return an error if pos contains too many positions.

Value

An UnstitchedGPos or StitchedGPos object.

Accessors

Getters

GPos objects support the same set of getters as other GenomicRanges derivatives (i.e. seqnames(), ranges(), start(), end(), strand(), mcols(), seqinfo(), etc...), plus the pos() getter which is equivalent to start() or end(). See ?GenomicRanges for the list of getters supported by GenomicRanges derivatives.

Note that ranges() returns an IPos derivative instead of the IRanges object that one gets with other GenomicRanges derivatives. To get an IRanges object, you need to call ranges() again on this IPos derivative i.e. do ranges(ranges(x)) on GPos object x.

Setters

Like GRanges objects, GPos derivatives support the following setters:

However, there is no pos() setter for GPos derivatives at the moment (although one might be added in the future).

Coercion

From UnstitchedGPos to StitchedGPos and vice-versa: coercion back and forth between UnstitchedGPos and StitchedGPos is supported via as(x, "StitchedGPos") and as(x, "UnstitchedGPos"). This is the most efficient and recommended way to switch between the 2 internal representations. Note that this switch can have dramatic consequences on memory usage so is for advanced users only. End users should almost never need to do this switch when following a typical workflow.

From GenomicRanges to UnstitchedGPos, StitchedGPos, or GPos: A GenomicRanges derivative x in which all the ranges have a width of 1 can be coerced to an UnstitchedGPos or StitchedGPos object with as(x, "UnstitchedGPos") or as(x, "StitchedGPos"), respectively. For convenience as(x, "GPos") is supported and is equivalent to as(x, "UnstitchedGPos").

From GPos to GRanges: A GPos derivative x can be coerced to a GRanges object with as(x, "GRanges"). However be aware that the resulting object can use thousands times (or more) memory than x! See "MEMORY USAGE" in the Examples section below.

From GPos to ordinary R objects: Like with any other GenomicRanges derivative, as.character(), as.factor(), and as.data.frame() work on a GPos derivative x. Note however that as.data.frame(x) returns a data frame with a pos column (containing pos(x)) instead of the start, end, and width columns that one gets with other GenomicRanges derivatives.

Subsetting

A GPos derivative can be subsetted exactly like a GRanges object.

Concatenation

GPos derivatives can be concatenated with c() or append(). See ?c in the S4Vectors package for more information about concatenating Vector derivatives.

Splitting and Relisting

Like with any other GRanges object, split() and relist() work on a GPos derivative.

Note

Internal representation of GPos objects has changed in GenomicRanges 1.29.10 (Bioc 3.6). Update any old object x with: x <- updateObject(x, verbose=TRUE).

Author(s)

Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de

See Also

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
showClass("GPos")  # shows the known subclasses

## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------

## Example 1:
gpos1a <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)),
               pos=c(44:53, 5:10, 2:5))
gpos1a  # unstitched

length(gpos1a)
seqnames(gpos1a)
pos(gpos1a)  # same as 'start(gpos1a)' and 'end(gpos1a)'
strand(gpos1a)
as.character(gpos1a)
as.data.frame(gpos1a)
as(gpos1a, "GRanges")
as.data.frame(as(gpos1a, "GRanges"))
gpos1a[9:17]

gpos1b <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)),
               pos=c(44:53, 5:10, 2:5), stitch=TRUE)
gpos1b  # stitched

## 'gpos1a' and 'gpos1b' are semantically equivalent, only their
## internal representations differ:
all(gpos1a == gpos1b)

gpos1c <- GPos(c("chr1:44-53", "chr2:5-10", "chr1:2-5"))
gpos1c  # stitched

identical(gpos1b, gpos1c)

## Example 2:
pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)),
                    strand=c("*", "-", "-", "+"))
gpos2 <- GPos(pos_runs)
gpos2  # stitched
strand(gpos2)

## Example 3:
gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000"))
npos <- length(gpos3A)

mcols(gpos3A)$sample <- Rle("sA")
sA_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3A)$counts <- sA_counts

mcols(gpos3B)$sample <- Rle("sB")
sB_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3B)$counts <- sB_counts

gpos3 <- c(gpos3A, gpos3B)
gpos3

## Example 4:
library(BSgenome.Scerevisiae.UCSC.sacCer2)
genome <- BSgenome.Scerevisiae.UCSC.sacCer2
gpos4 <- GPos(seqinfo(genome))
gpos4  # all the positions along the genome are represented
mcols(gpos4)$dna <- do.call("c", unname(as.list(genome)))
gpos4

## Note however that, like for any Vector derivative, the length of a
## GPos derivative cannot exceed '.Machine$integer.max' (i.e. 2^31 on
## most platforms) so the above only works with a "small" genome.
## For example it doesn't work with the Human genome:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## Not run: 
  GPos(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene))  # error!

## End(Not run)

## You can use isSmallGenome() to check upfront whether the genome is
## "small" or not.
isSmallGenome(genome)  # TRUE
isSmallGenome(TxDb.Hsapiens.UCSC.hg19.knownGene)  # FALSE

## ---------------------------------------------------------------------
## MEMORY USAGE
## ---------------------------------------------------------------------

## Coercion to GRanges works...
gr4 <- as(gpos4, "GRanges")
gr4
## ... but is generally not a good idea:
object.size(gpos4)
object.size(gr4)     # 8 times bigger than the StitchedGPos object!

## Shuffling the order of the positions impacts memory usage:
gpos4r <- rev(gpos4)
object.size(gpos4r)  # significantly
gpos4s <- sample(gpos4)
object.size(gpos4s)  # even worse!

## If one anticipates a lot of shuffling of the genomic positions,
## then an UnstitchedGPos object should be used instead:
gpos4b <- as(gpos4, "UnstitchedGPos")
object.size(gpos4b)  # initial size is bigger than stitched version
object.size(rev(gpos4b))  # size didn't change
object.size(sample(gpos4b))  # size increased, but is still < stitched
                             # version

## AN IMPORTANT NOTE: In the worst situations, GPos still performs as
## good as a GRanges object.
object.size(as(gpos4r, "GRanges"))  # same size as 'gpos4r'
object.size(as(gpos4s, "GRanges"))  # same size as 'gpos4s'

## Best case scenario is when the object is strictly sorted (i.e.
## positions are in strict ascending order).
## This can be checked with:
is.unsorted(gpos4, strict=TRUE)  # 'gpos4' is strictly sorted

## ---------------------------------------------------------------------
## USING MEMORY-EFFICIENT METADATA COLUMNS
## ---------------------------------------------------------------------
## In order to keep memory usage as low as possible, it is recommended
## to use a memory-efficient representation of the metadata columns that
## we want to set on the object. Rle's are particularly well suited for
## this, especially if the metadata columns contain long runs of
## identical values. This is the case for example if we want to use a
## GPos object to represent the coverage of sequencing reads along a
## genome.

## Example 5:
library(pasillaBamSubset)
library(Rsamtools)  # for the BamFile() constructor function
bamfile1 <- BamFile(untreated1_chr4())
bamfile2 <- BamFile(untreated3_chr4())
gpos5 <- GPos(seqinfo(bamfile1))
library(GenomicAlignments)  # for "coverage" method for BamFile objects
cvg1 <- unlist(coverage(bamfile1), use.names=FALSE)
cvg2 <- unlist(coverage(bamfile2), use.names=FALSE)
mcols(gpos5) <- DataFrame(cvg1, cvg2)
gpos5

object.size(gpos5)  # lightweight

## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
gpos5[mcols(gpos5)$cvg1 >= 10 | mcols(gpos5)$cvg2 >= 10]

## ---------------------------------------------------------------------
## USING A GPos OBJECT IN A RangedSummarizedExperiment OBJECT
## ---------------------------------------------------------------------
## Because the GPos class extends the GenomicRanges virtual class, a
## GPos derivative can be used as the rowRanges component of a
## RangedSummarizedExperiment object.

## As a 1st example, we show how the counts for samples sA and sB in
## 'gpos3' can be stored in a SummarizedExperiment object where the rows
## correspond to unique genomic positions and the columns to samples:
library(SummarizedExperiment)
counts <- cbind(sA=sA_counts, sB=sB_counts)
mcols(gpos3A) <- NULL
rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A)
rse3
rowRanges(rse3)
head(assay(rse3))

## Finally we show how the coverage data from Example 5 can be easily
## stored in a lightweight SummarizedExperiment derivative:
cvg <- mcols(gpos5)
mcols(gpos5) <- NULL
rse5 <- SummarizedExperiment(list(cvg=cvg), rowRanges=gpos5)
rse5
rowRanges(rse5)
assay(rse5)

## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
rse5[assay(rse5)$cvg1 >= 10 | assay(rse5)$cvg2 >= 10]

vjcitn/GenomicRangesGHA documentation built on Jan. 18, 2021, 12:39 a.m.