In LTLA/IndexedRelations: Representing Indexed Relationships Between Features

library(BiocStyle)
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Overview

The r Biocpkg("IndexedRelations") package implements the IndexedRelations class for representing 'omics relationships. This can be used for physical interactions, regulatory relationships or other associations involving any number of features. The IndexedRelations class is primarily intended as a base class from which concrete subclasses can be derived for specific contexts. For example, the r Biocpkg("GenomicInteractions") package derives a subclass to represent physical interactions between genomic intervals.

The `IndexedRelations` class

Each IndexedRelations object holds relationships between "partner" features. Partners can be any instance of a Vector-like class such as GenomicRanges and IRanges^[See r Biocpkg("S4Vectors") for details.]. In a single IndexedRelations object, each entry is a relationship that involves the same number and type of partners.

To illustrate, let's make up a GenomicRanges and an IRanges object. The former might represent a chromosomal interval while the latter might represent, say, an interval on an exogenous sequence like a plasmid.

# Making up features.
library(GenomicRanges)
gr <- GRanges(sample(1:21, 100, replace=TRUE),
    IRanges(sample(1e8, 100, replace=TRUE), width=10))
gr

ir <- IRanges(sample(1000, 20, replace=TRUE), width=5)
ir

Now, assume that there are multiple relationships between individual entries of these two objects. (Let's pretend that the parts of the plasmid come into contact with the chromosomes at various points.) The partner.G vector refers to elements of gr while partner.I refers to elements of ir. The first element of partner.G is in a relationship with the first element of partner.I, and so on.

# Making up relationships.
partner.G <- sample(length(gr), 10000, replace=TRUE)
partner.I <- sample(length(ir), 10000, replace=TRUE)

# Parallel entries across vectors are in a relationship:
gr[partner.G[1]]
ir[partner.I[1]]

It is then straightforward to construct an IndexedRelations object. While this example only uses two partnering features, the class can support any number of partners in a single object.

# Easy but inefficient:
library(IndexedRelations)
rel <- IndexedRelations(list(gr[partner.G], ir[partner.I]))

# More work but more efficient:
rel <- IndexedRelations(list(partner.G, partner.I), list(gr, ir))
rel

The key feature of the IndexedRelations class is that it does not actually store the partnering features explicitly. Rather, the relationships are internally represented by indices that point to sets of features. This avoids redundant representation of the same features when large numbers of relationships are to be stored, reducing memory use and improving the efficiency of algorithms that operate on these relationships.

Getters and setters

The partnerFeatures() and partners() methods are used to extract the partners. The former will extract the partners as a full set of features, which is convenient but less efficient. The latter will only extract the indices, and it is up to the user to use them to index the relevant feature set.

# Extract partnering features:
partnerFeatures(rel, 1)
partnerFeatures(rel, 2)

# Extract partnering indices only:
partners(rel)

The featureSets() method is used to extract the feature sets. The first feature set contains the features for the first partner, the second set for the second partner and so on.

featureSets(rel)[[1]]
featureSets(rel)[[2]]

The IndexedRelations object behaves like a vector. It has length, names and can be subsetted and combined.

length(rel)
rel[1:10,]
c(rel, rel)

You can also modify the partners or the feature sets on an existing IndexedRelations object. This is most obviously done by passing relevant features to the replacement methods:

# Set partner with features.
partnerFeatures(rel, 2) <- ir[sample(length(ir), length(rel), replace=TRUE)]
rel

# Modify the first feature set.
featureSets(rel)[[1]] <- resize(featureSets(rel)[[1]], 20)
rel

Advanced users can also modify the indices directly - though, as anyone who has played with pointers can attest, it is important that the indices point to valid entries in the corresponding feature set!

Npossibles <- length(featureSets(rel)[[2]])
partners(rel)[,2] <- sample(Npossibles, length(rel), replace=TRUE)

In most cases, an IndexedRelations object will behave "as if" it were an object containing the partnering features explicitly. Users do not have to worry about the specifics of the index-based representation unless they are specifically manipulating it.

Comparisons

Users can compare different IndexedRelations objects with the same feature classes. One relationship is considered "less than" or "greater than" another based on the first non-equal partner and the definitions of inequality for that partner's feature class.

scrambled <- sample(rel)
summary(scrambled < rel)

Relationships can be sorted based on the ordering of the partner features. Specifically, the order of the relationships is defined based on the ordering of the first partner; if those are equal, the second partner; and so on.

sort(rel)

It is also possible to match relationships across different IndexedRelations objects. Note that all of these methods are agnostic to the identities of the underlying feature sets, as long as the same classes of features are used in the same order across different objects.

m <- match(rel, scrambled)
head(m)

Conversions

The IndexedRelations object can be converted to and from some similar classes. The most obvious of these is the Pairs class from r Biocpkg("S4Vectors")^[Of course, this only works if the IndexedRelations object has two partners!]:

makePairsFromIndexedRelations(rel)

p <- Pairs(gr[1:10], ir[1:10])
as(p, "IndexedRelations")

Another option is to convert to and from a DataFrame object. This provides a quick and general method to "realize" the indices into the partnering features.

as(rel, "DataFrame")

Advanced manipulation

The rearrangePartners() function can reorder, drop or duplicate partners:

# Duplicate partner
rearrangePartners(rel, c(1,1,2))

# Swap partners
rearrangePartners(rel, c(2,1))

# Drop partner
rearrangePartners(rel, 2)

Developers may find standardizeFeatureSets useful to synchronize feature sets across IndexedRelations instances. This allows downstream procedures to compare integer indices directly for greater efficiency.

# Setting up an alternative object:
gr2 <- GRanges(sample(1:21, 50, replace=TRUE),
    IRanges(sample(1e8, 50, replace=TRUE), width=10))
ir2 <- IRanges(sample(1000, 50, replace=TRUE), width=5)
rel2 <- IndexedRelations(list(gr2, ir2))

identical(featureSets(rel), featureSets(rel2))
out <- standardizeFeatureSets(rel, list(rel2))
identical(featureSets(out$x), featureSets(out$objects[[1]]))

# ... though the partnering features are still the same.
stopifnot(all(out$x==rel))

The cleanFeatureSets() function will remove redundant features and sort each feature set in an IndexedRelations instance. This enables direct comparison of indices across relationships for a given partner, e.g., for sorting.

rel3 <- cleanFeatureSets(rel)
featureSets(rel3)[[2]]

The dropUnusedFeatures() function will discard features in each set that are not used, much like droplevels() does for levels of factors. This is useful for saving memory prior to serialization to file but is generally unnecessary for use within an R session - see ?dropUnusedFeatures for some commentary on this matter.