compartmentalize: Identify genomic compartments

View source: R/compartmentalize.R

compartmentalizeR Documentation

Identify genomic compartments

Description

Use contact matrices to identify self-interacting genomic compartments

Usage

compartmentalize(data, centers=2, dist.correct=TRUE, 
    cov.correct=TRUE, robust.cov=5, ...)

Arguments

data

an InteractionSet object containing bin pair data, like that produced by squareCounts

centers

an integer scalar, specifying the number of clusters to form in kmeans

dist.correct

a logical scalar, indicating whether abundances should be corrected for distance biases

cov.correct

a logical scalar, indicating whether abundances should be corrected for coverage biases

robust.cov

a numeric scalar, specifying the multiple of MADs beyond which coverage outliers are removed

...

other arguments to pass to kmeans

Details

This function uses the interaction space to partition each linear chromosome into compartments. Bins in the same compartment interact more frequently with each other compared to bins in different compartments. This forms a checkerboard-like pattern in the interaction space that can be used to define the genomic intervals in each compartment. Typically, one compartment is gene-rich and is defined as “open”, while the other is gene-poor and defined as “closed”.

Compartment identification is done by setting up a ContactMatrix object, where each row/rolumn represents a bin and each matrix entry contains the frequency of contacts between bins. Bins (i.e., rows) with similar interaction profiles (i.e., entries across columns) are clustered together with the k-means method. Those with the same ID in the output compartment vector are in the same compartment. Note that clustering is done separately for each chromosome, so bins with the same ID across different chromosomes cannot be interpreted as being in the same compartment.

If dist.correct=TRUE, frequencies are normalized to mitigate the effect of distance and to improve the visibility of long-range interactions. This is done by computing the residuals of the distance-dependent trend - see filterTrended for more details. If cov.correct=TRUE, frequencies are also normalized to eliminate coverage biases betwen bins. This is done by computing the average coverage of each row/column, and dividing each matrix entry by the square root averages of the relevant row and column.

Extremely low-coverage regions such as telomeres and centromeres can confound k-means clustering. To protect against this, all bins with (distance-corrected) coverages that are more than robust.cov MADs away from the median coverage of each chromosome are identified and removed. These bins will be marked with NA in the returned compartment for that chromosome. To turn off robustification, set robust.cov to NA.

By default, centers is set to 2 to model the open and closed compartments. While a larger value can be used to obtain more clusters, care is required as the interpretation of the resulting compartments becomes more difficult. If desired, users can also apply their own clustering methods on the matrix returned in the output.

Value

A named list of lists is returned where each internal list corresponds to a chromosome in data and contains compartment, an integer vector of compartment IDs for all bins in that chromosome; and matrix, a ContactMatrix object containing (normalized) contact frequencies for the intra-chromosomal space. Entries in compartment are named according to the matching index of regions(data).

Author(s)

Aaron Lun

References

Lieberman-Aiden E et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289-293.

Lajoie BR, Dekker J, Kaplan N (2014). The hitchhiker's guide to Hi-C analysis: practical guidelines. Methods 72, 65-75.

See Also

squareCounts, filterTrended, kmeans

Examples

# Dummying up some data.
set.seed(3426)
npts <- 100
npairs <- 5000
nlibs <- 4
anchor1 <- sample(npts, npairs, replace=TRUE)
anchor2 <- sample(npts, npairs, replace=TRUE)
data <- InteractionSet(
    list(counts=matrix(rpois(npairs*nlibs, runif(npairs, 10, 100)), nrow=npairs)),
    GInteractions(anchor1=anchor1, anchor2=anchor2,
        regions=GRanges(c(rep("chrA", 80), rep("chrB", 20)), 
            IRanges(c(1:80, 1:20), c(1:80, 1:20))), mode="reverse"),
    colData=DataFrame(totals=runif(nlibs, 1e6, 2e6)), metadata=List(width=1))
data <- unique(data)

# Running compartmentalization.
out <- compartmentalize(data)
head(out$chrA$compartment)
dim(out$chrA$matrix)
head(out$chrB$compartment)
dim(out$chrB$matrix)

test <- compartmentalize(data, cov.correct=FALSE)
test <- compartmentalize(data, dist.correct=FALSE)
test <- compartmentalize(data, robust.cov=NA)

LTLA/diffHic documentation built on Dec. 20, 2024, 7:06 p.m.