GPos-class: Memory-efficient representation of genomic positions

GPos-classR Documentation

Memory-efficient representation of genomic positions

Description

The GPos class is a container for storing a set of genomic positions (a.k.a. genomic loci). It exists in 2 flavors: UnstitchedGPos and StitchedGPos. Each flavor uses a particular internal representation:

  • In an UnstitchedGPos instance the positions are stored as an integer vector.

  • In a StitchedGPos instance the positions are stored as an IRanges object where each range represents a run of consecutive positions (i.e. a run of positions that are adjacent and in ascending order). This storage is particularly memory-efficient when the vector of positions contains long runs of consecutive positions.

Because genomic positions can be seen as genomic ranges of width 1, the GPos class extends the GenomicRanges virtual class (via the GRanges class).

Usage

## Constructor function
GPos(seqnames=NULL, pos=NULL, strand=NULL,
     ..., seqinfo=NULL, seqlengths=NULL, stitch=NA)

Arguments

seqnames, strand, ..., seqinfo, seqlengths

See documentation of the GRanges() constructor function for a description of these arguments.

pos

NULL, or an integer or numeric vector, or an IRanges or IPos object (or other IntegerRanges derivative). If not NULL, GPos() will try to turn it into an IPos derivative with IPos(pos, stitch=stitch).

When pos is an IRanges object (or other IntegerRanges derivative), each range in it is interpreted as a run of consecutive positions.

stitch

TRUE, FALSE, or NA (the default).

Controls which internal representation should be used: StitchedGPos (when stitch is TRUE) or UnstitchedGPos (when stitch is FALSE).

When stitch is NA (the default), which internal representation will be used depends on the flavour of the IPos derivative returned by IPos(pos): UnstitchedGPos if IPos(pos) returns an UnstitchedIPos instance, and StitchedGPos if it returns a StitchedIPos instance.

Details

Even though a GRanges object can be used for storing genomic positions, using a GPos object is more efficient. In particular the memory footprint of an UnstitchedGPos object is typically about half that of a GRanges object.

OTOH the memory footprint of a StitchedGPos object can vary a lot but will never be worse than that of a GRanges object. However it will reduce dramatically if the vector of positions contains long runs of consecutive positions. In the worst case scenario (i.e. when the object contains no consecutive positions) its memory footprint will be the same as that of a GRanges object.

Like for any Vector derivative, the length of a GPos object cannot exceed .Machine$integer.max (i.e. 2^31 on most platforms). GPos() will return an error if pos contains too many positions.

Value

An UnstitchedGPos or StitchedGPos object.

Accessors

Getters

GPos objects support the same set of getters as other GenomicRanges derivatives (i.e. seqnames(), ranges(), start(), end(), strand(), mcols(), seqinfo(), etc...), plus the pos() getter which is equivalent to start() or end(). See ?GenomicRanges for the list of getters supported by GenomicRanges derivatives.

Note that ranges() returns an IPos derivative instead of the IRanges object that one gets with other GenomicRanges derivatives. To get an IRanges object, you need to call ranges() again on this IPos derivative i.e. do ranges(ranges(x)) on GPos object x.

Setters

Like GRanges objects, GPos derivatives support the following setters:

  • The seqnames() and strand() setters.

  • The names(), mcols() and metadata() setters.

  • The family of setters that operate on the seqinfo component of an object: seqlevels(), seqlevelsStyle(), seqlengths(), isCircular(), genome(), and seqinfo(). These setters are defined and documented in the GenomeInfoDb package.

However, there is no pos() setter for GPos derivatives at the moment (although one might be added in the future).

Coercion

From UnstitchedGPos to StitchedGPos and vice-versa: coercion back and forth between UnstitchedGPos and StitchedGPos is supported via as(x, "StitchedGPos") and as(x, "UnstitchedGPos"). This is the most efficient and recommended way to switch between the 2 internal representations. Note that this switch can have dramatic consequences on memory usage so is for advanced users only. End users should almost never need to do this switch when following a typical workflow.

From GenomicRanges to UnstitchedGPos, StitchedGPos, or GPos: A GenomicRanges derivative x in which all the ranges have a width of 1 can be coerced to an UnstitchedGPos or StitchedGPos object with as(x, "UnstitchedGPos") or as(x, "StitchedGPos"), respectively. For convenience as(x, "GPos") is supported and is equivalent to as(x, "UnstitchedGPos").

From GPos to GRanges: A GPos derivative x can be coerced to a GRanges object with as(x, "GRanges"). However be aware that the resulting object can use thousands times (or more) memory than x! See "MEMORY USAGE" in the Examples section below.

From GPos to ordinary R objects: Like with any other GenomicRanges derivative, as.character(), as.factor(), and as.data.frame() work on a GPos derivative x. Note however that as.data.frame(x) returns a data frame with a pos column (containing pos(x)) instead of the start, end, and width columns that one gets with other GenomicRanges derivatives.

Subsetting

A GPos derivative can be subsetted exactly like a GRanges object.

Concatenation

GPos derivatives can be concatenated with c() or append(). See ?c in the S4Vectors package for more information about concatenating Vector derivatives.

Splitting and Relisting

Like with any other GRanges object, split() and relist() work on a GPos derivative.

Note

Internal representation of GPos objects has changed in GenomicRanges 1.29.10 (Bioc 3.6). Update any old object x with: x <- updateObject(x, verbose=TRUE).

Author(s)

Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de

See Also

  • The IPos class in the IRanges package for storing a set of integer positions (i.e. integer ranges of width 1).

  • The GRanges class for storing a set of genomic ranges of arbitrary width.

  • Seqinfo objects and the seqinfo accessor and family in the GenomeInfoDb package for accessing/modifying information about the underlying sequences of a GenomicRanges derivative.

  • GenomicRanges-comparison for comparing and ordering genomic ranges and/or positions.

  • findOverlaps-methods for finding overlapping genomic ranges and/or positions.

  • intra-range-methods and inter-range-methods for intra range and inter range transformations of GenomicRanges derivatives.

  • coverage-methods for computing the coverage of a set of genomic ranges and/or positions.

  • nearest-methods for finding the nearest genomic range/position neighbor.

  • The snpsBySeqname, snpsByOverlaps, and snpsById methods for SNPlocs objects defined in the BSgenome package for extractors that return a GPos derivative.

  • SummarizedExperiment objects and derivatives in the SummarizedExperiment package.

Examples

showClass("GPos")  # shows the known subclasses

## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------

## Example 1:
gpos1a <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)),
               pos=c(44:53, 5:10, 2:5))
gpos1a  # unstitched

length(gpos1a)
seqnames(gpos1a)
pos(gpos1a)  # same as 'start(gpos1a)' and 'end(gpos1a)'
strand(gpos1a)
as.character(gpos1a)
as.data.frame(gpos1a)
as(gpos1a, "GRanges")
as.data.frame(as(gpos1a, "GRanges"))
gpos1a[9:17]

gpos1b <- GPos(seqnames=Rle(c("chr1", "chr2", "chr1"), c(10, 6, 4)),
               pos=c(44:53, 5:10, 2:5), stitch=TRUE)
gpos1b  # stitched

## 'gpos1a' and 'gpos1b' are semantically equivalent, only their
## internal representations differ:
all(gpos1a == gpos1b)

gpos1c <- GPos(c("chr1:44-53", "chr2:5-10", "chr1:2-5"))
gpos1c  # stitched

identical(gpos1b, gpos1c)

## Example 2:
pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)),
                    strand=c("*", "-", "-", "+"))
gpos2 <- GPos(pos_runs)
gpos2  # stitched
strand(gpos2)

## Example 3:
gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000"))
npos <- length(gpos3A)

mcols(gpos3A)$sample <- Rle("sA")
sA_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3A)$counts <- sA_counts

mcols(gpos3B)$sample <- Rle("sB")
sB_counts <- sample(10, npos, replace=TRUE)
mcols(gpos3B)$counts <- sB_counts

gpos3 <- c(gpos3A, gpos3B)
gpos3

## Example 4:
library(BSgenome.Scerevisiae.UCSC.sacCer2)
genome <- BSgenome.Scerevisiae.UCSC.sacCer2
gpos4 <- GPos(seqinfo(genome))
gpos4  # all the positions along the genome are represented
mcols(gpos4)$dna <- do.call("c", unname(as.list(genome)))
gpos4

## Note however that, like for any Vector derivative, the length of a
## GPos derivative cannot exceed '.Machine$integer.max' (i.e. 2^31 on
## most platforms) so the above only works with a "small" genome.
## For example it doesn't work with the Human genome:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## Not run: 
  GPos(seqinfo(TxDb.Hsapiens.UCSC.hg38.knownGene))  # error!

## End(Not run)

## You can use isSmallGenome() to check upfront whether the genome is
## "small" or not.
isSmallGenome(genome)  # TRUE
isSmallGenome(TxDb.Hsapiens.UCSC.hg38.knownGene)  # FALSE

## ---------------------------------------------------------------------
## MEMORY USAGE
## ---------------------------------------------------------------------

## Coercion to GRanges works...
gr4 <- as(gpos4, "GRanges")
gr4
## ... but is generally not a good idea:
object.size(gpos4)
object.size(gr4)     # 8 times bigger than the StitchedGPos object!

## Shuffling the order of the positions impacts memory usage:
gpos4r <- rev(gpos4)
object.size(gpos4r)  # significantly
gpos4s <- sample(gpos4)
object.size(gpos4s)  # even worse!

## If one anticipates a lot of shuffling of the genomic positions,
## then an UnstitchedGPos object should be used instead:
gpos4b <- as(gpos4, "UnstitchedGPos")
object.size(gpos4b)  # initial size is bigger than stitched version
object.size(rev(gpos4b))  # size didn't change
object.size(sample(gpos4b))  # size increased, but is still < stitched
                             # version

## AN IMPORTANT NOTE: In the worst situations, GPos still performs as
## good as a GRanges object.
object.size(as(gpos4r, "GRanges"))  # same size as 'gpos4r'
object.size(as(gpos4s, "GRanges"))  # same size as 'gpos4s'

## Best case scenario is when the object is strictly sorted (i.e.
## positions are in strict ascending order).
## This can be checked with:
is.unsorted(gpos4, strict=TRUE)  # 'gpos4' is strictly sorted

## ---------------------------------------------------------------------
## USING MEMORY-EFFICIENT METADATA COLUMNS
## ---------------------------------------------------------------------
## In order to keep memory usage as low as possible, it is recommended
## to use a memory-efficient representation of the metadata columns that
## we want to set on the object. Rle's are particularly well suited for
## this, especially if the metadata columns contain long runs of
## identical values. This is the case for example if we want to use a
## GPos object to represent the coverage of sequencing reads along a
## genome.

## Example 5:
library(pasillaBamSubset)
library(Rsamtools)  # for the BamFile() constructor function
bamfile1 <- BamFile(untreated1_chr4())
bamfile2 <- BamFile(untreated3_chr4())
gpos5 <- GPos(seqinfo(bamfile1))
library(GenomicAlignments)  # for "coverage" method for BamFile objects
cvg1 <- unlist(coverage(bamfile1), use.names=FALSE)
cvg2 <- unlist(coverage(bamfile2), use.names=FALSE)
mcols(gpos5) <- DataFrame(cvg1, cvg2)
gpos5

object.size(gpos5)  # lightweight

## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
gpos5[mcols(gpos5)$cvg1 >= 10 | mcols(gpos5)$cvg2 >= 10]

## ---------------------------------------------------------------------
## USING A GPos OBJECT IN A RangedSummarizedExperiment OBJECT
## ---------------------------------------------------------------------
## Because the GPos class extends the GenomicRanges virtual class, a
## GPos derivative can be used as the rowRanges component of a
## RangedSummarizedExperiment object.

## As a 1st example, we show how the counts for samples sA and sB in
## 'gpos3' can be stored in a SummarizedExperiment object where the rows
## correspond to unique genomic positions and the columns to samples:
library(SummarizedExperiment)
counts <- cbind(sA=sA_counts, sB=sB_counts)
mcols(gpos3A) <- NULL
rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A)
rse3
rowRanges(rse3)
head(assay(rse3))

## Finally we show how the coverage data from Example 5 can be easily
## stored in a lightweight SummarizedExperiment derivative:
cvg <- mcols(gpos5)
mcols(gpos5) <- NULL
rse5 <- SummarizedExperiment(list(cvg=cvg), rowRanges=gpos5)
rse5
rowRanges(rse5)
assay(rse5)

## Keep only the positions where coverage is at least 10 in one of the
## 2 samples:
rse5[assay(rse5)$cvg1 >= 10 | assay(rse5)$cvg2 >= 10]

Bioconductor/GenomicRanges documentation built on Nov. 17, 2024, 11:43 a.m.