connectCounts: Count connecting read pairs

View source: R/connectCounts.R

connectCountsR Documentation

Count connecting read pairs

Description

Count the number of read pairs connecting pairs of user-specified regions

Usage

 
connectCounts(files, param, regions, filter=1L, type="any", 
    second.regions=NULL, restrict.regions=FALSE)

Arguments

files

a character vector containing the paths to the count file for each library

param

a pairParam object containing read extraction parameters

regions

a GRanges object specifying the regions between which read pairs should be counted

filter

an integer scalar specifying the minimum count for each interaction

type

a character string specifying how restriction fragments should be assigned to regions

second.regions

a GRanges object containing the second regions of interest, or an integer scalar specifying the bin size

restrict.regions

A logical scalar indicating whether the output regions should be limited to entries in param$restrict.

Details

Interactions of interest are defined as those formed by pairs of elements in regions. The number of read pairs connecting each pair of elements can then be counted in each library. This can be useful for quantifying/summarizing interactions between genomic features, e.g., promoters or gene bodies.

For a pair of intervals in regions, the interaction count is defined as the number of read pairs with one read in each interval. To save memory, pairs of intervals can be filtered to retain only those with a count sum across all libraries above filter. In each pair, the anchor interval is defined as that with the higher start position. Note that the end position may not be higher if nested intervals are present in regions.

For typical Hi-C experiments, mapping of read pairs into intervals is performed at the level of restriction fragments. The value of type feeds into findOverlaps and controls the manner in which restriction fragments are assigned to each region. By default, a restriction fragment is assigned to one or more regions if said fragment overlaps with any part of those regions. This expands the effective boundaries of each entry of regions to the nearest restriction site. In contrast, setting type="within" would contract each interval.

For DNase Hi-C experiments, the interval spanned by the alignment of each read is overlapped against the intervals in regions. This uses the linkOverlaps function, which responds to any specification of type. The boundaries of regions are not modified as no restriction fragments are involved.

Counting will consider the values of restrict, discard and cap in param - see pairParam for more details. In all cases, strandedness of the intervals is ignored in input and set to "*" in the output object. Any element metadata in the input regions is also removed in the output.

Value

An InteractionSet containing the number of read pairs in each library that are mapped between pairs of regions, or between regions and second.regions. Interacting regions are returned as a ReverseStrictGInteractions object containing the concatenated regions and second.regions.

Matching to a second set of regions

The second.regions argument allows specification of a second set of regions. Interactions are only considered between one entry in regions and one entry in second.regions. This differs from supplying all regions to regions, which would consider all pairwise interactions between regions regardless of whether they belong in the first or second set. Note that the sets are not parallel, and any pairing is considered if as long as it contains one region from the first set and another from the second set.

Specification of second.regions is useful for efficiently identifying interactions between two sets of regions. For example, the first set can be set to several “viewpoint” regions of interest. This is similar to the bait region in 4C-seq, or the captured regions in Capture Hi-C. Interactions between these viewpoints and the rest of the genome can then be examined by setting second.regions to some appropriate bin size.

If an integer scalar is supplied as second.regions, this value is used as a width to partition the genome into bins. These bins are then used as the set of second regions. This is useful for 4C-like experiments where interactions between viewpoints and the rest of the genome are of interest.

Note that this function does not guarantee that the second set of regions will be treated as the second anchor region (or the first) for each interaction. Those definitions are dependent on the sorting order of the coordinates for all regions. Users should only use the is.second field to identify the region from the second set in each interaction.

Format of the output regions

For standard Hi-C experiments, all supplied regions are expanded or contracted to the nearest restriction site. These modified regions can be extracted from the regions slot in the output InteractionSet object, which will be reordered according to the new start positions. The ordering permutation can be recovered from the original metadata field of the GRanges object. Similarly, the number of restriction fragments assigned to each interval is stored in the nfrags metadata field.

For DNase-C experiments, no expansion of the regions is performed, so the coordinates in the output regions slot are the same as those in the input regions. However, reordering may still be necessary, in which case the original field will specify the original index of each entry. All nfrags are set to zero as no restriction fragments are involved.

If second.regions is specified, the output regions slot will contain both the input regions and the second.regions (though not necessarily in that order). Entries that were originally in second.regions can be distinguished with the is.second metadata field. Each original index will also point towards the corresponding entry in the original second.regions when is.second=TRUE. Conversely, if is.second=FALSE, the index will point towards the corresponding entry in the original regions.

If second.regions is an integer scalar, the entries in the output regions slot will contain the coordinates for the resulting bins. Note that the original metadata field is set to NA for these bins, as no original GRanges existed for these intervals.

If restrict.regions=TRUE and param$restrict is not NULL, only bins on the chromosomes in param$restrict will be reported in the output regions slot. This avoids the overhead of constructing many bins when only a small subset of them are used. By default, restrict.regions=FALSE to ensure that the anchor IDs of the output object are directly comparable between different settings of param$restrict,

Author(s)

Aaron Lun

See Also

squareCounts, findOverlaps, InteractionSet-class, ReverseStrictGInteractions-class

Examples

hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts)

# Setting up the parameters
fout <- "output"
invisible(preparePairs(hic.file, param, fout))
regions <- suppressWarnings(c(
    GRanges("chrA", IRanges(c(1, 100, 150), c(20, 140, 160))), 
    GRanges("chrB", IRanges(50, 100))))

# Collating to count combinations.
con <- connectCounts(fout, param, regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param, regions=regions, filter=1L, type="within")
head(assay(con))

# Still works with restriction and other parameters.
con <- connectCounts(fout, param=reform(param, restrict="chrA"), 
    regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param=reform(param, discard=GRanges("chrA", IRanges(1, 50))),
    regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param=reform(param, cap=1), regions=regions, filter=1L)
head(assay(con))

# Specifying a second region.
regions2 <- suppressWarnings(c(
    GRanges("chrA", IRanges(c(50, 100), c(100, 200))), 
    GRanges("chrB", IRanges(1, 50))))

con <- connectCounts(fout, param, regions=regions, filter=1L, second.region=regions2)
head(anchors(con, type="first"))
head(anchors(con, type="second"))
con <- connectCounts(fout, param, regions=regions, filter=1L, second.region=50)
head(anchors(con, type="first"))
head(anchors(con, type="second"))



LTLA/diffHic documentation built on Dec. 20, 2024, 7:06 p.m.