View source: R/connectCounts.R
connectCounts | R Documentation |
Count the number of read pairs connecting pairs of user-specified regions
connectCounts(files, param, regions, filter=1L, type="any",
second.regions=NULL, restrict.regions=FALSE)
files |
a character vector containing the paths to the count file for each library |
param |
a |
regions |
a GRanges object specifying the regions between which read pairs should be counted |
filter |
an integer scalar specifying the minimum count for each interaction |
type |
a character string specifying how restriction fragments should be assigned to regions |
second.regions |
a GRanges object containing the second regions of interest, or an integer scalar specifying the bin size |
restrict.regions |
A logical scalar indicating whether the output regions should be limited to entries in |
Interactions of interest are defined as those formed by pairs of elements in regions
.
The number of read pairs connecting each pair of elements can then be counted in each library.
This can be useful for quantifying/summarizing interactions between genomic features, e.g., promoters or gene bodies.
For a pair of intervals in regions
, the interaction count is defined as the number of read pairs with one read in each interval.
To save memory, pairs of intervals can be filtered to retain only those with a count sum across all libraries above filter
.
In each pair, the anchor interval is defined as that with the higher start position.
Note that the end position may not be higher if nested intervals are present in regions
.
For typical Hi-C experiments, mapping of read pairs into intervals is performed at the level of restriction fragments.
The value of type
feeds into findOverlaps
and controls the manner in which restriction fragments are assigned to each region.
By default, a restriction fragment is assigned to one or more regions if said fragment overlaps with any part of those regions.
This expands the effective boundaries of each entry of regions
to the nearest restriction site.
In contrast, setting type="within"
would contract each interval.
For DNase Hi-C experiments, the interval spanned by the alignment of each read is overlapped against the intervals in regions
.
This uses the linkOverlaps
function, which responds to any specification of type
.
The boundaries of regions
are not modified as no restriction fragments are involved.
Counting will consider the values of restrict
, discard
and cap
in param
- see pairParam
for more details.
In all cases, strandedness of the intervals is ignored in input and set to "*"
in the output object.
Any element metadata in the input regions
is also removed in the output.
An InteractionSet containing the number of read pairs in each library that are mapped between pairs of regions
, or between regions
and second.regions
.
Interacting regions are returned as a ReverseStrictGInteractions object containing the concatenated regions
and second.regions
.
The second.regions
argument allows specification of a second set of regions.
Interactions are only considered between one entry in regions
and one entry in second.regions
.
This differs from supplying all regions to regions
, which would consider all pairwise interactions between regions regardless of whether they belong in the first or second set.
Note that the sets are not parallel, and any pairing is considered if as long as it contains one region from the first set and another from the second set.
Specification of second.regions
is useful for efficiently identifying interactions between two sets of regions.
For example, the first set can be set to several “viewpoint” regions of interest.
This is similar to the bait region in 4C-seq, or the captured regions in Capture Hi-C.
Interactions between these viewpoints and the rest of the genome can then be examined by setting second.regions
to some appropriate bin size.
If an integer scalar is supplied as second.regions
, this value is used as a width to partition the genome into bins.
These bins are then used as the set of second regions.
This is useful for 4C-like experiments where interactions between viewpoints and the rest of the genome are of interest.
Note that this function does not guarantee that the second set of regions will be treated as the second anchor region (or the first) for each interaction.
Those definitions are dependent on the sorting order of the coordinates for all regions.
Users should only use the is.second
field to identify the region from the second set in each interaction.
For standard Hi-C experiments, all supplied regions are expanded or contracted to the nearest restriction site.
These modified regions can be extracted from the regions
slot in the output InteractionSet object, which will be reordered according to the new start positions.
The ordering permutation can be recovered from the original
metadata field of the GRanges object.
Similarly, the number of restriction fragments assigned to each interval is stored in the nfrags
metadata field.
For DNase-C experiments, no expansion of the regions is performed, so the coordinates in the output regions
slot are the same as those in the input regions
.
However, reordering may still be necessary, in which case the original
field will specify the original index of each entry.
All nfrags
are set to zero as no restriction fragments are involved.
If second.regions
is specified, the output regions
slot will contain both the input regions
and the second.regions
(though not necessarily in that order).
Entries that were originally in second.regions
can be distinguished with the is.second
metadata field.
Each original
index will also point towards the corresponding entry in the original second.regions
when is.second=TRUE
.
Conversely, if is.second=FALSE
, the index will point towards the corresponding entry in the original regions
.
If second.regions
is an integer scalar, the entries in the output regions
slot will contain the coordinates for the resulting bins.
Note that the original
metadata field is set to NA
for these bins, as no original GRanges existed for these intervals.
If restrict.regions=TRUE
and param$restrict
is not NULL
,
only bins on the chromosomes in param$restrict
will be reported in the output regions
slot.
This avoids the overhead of constructing many bins when only a small subset of them are used.
By default, restrict.regions=FALSE
to ensure that the anchor IDs of the output object are directly comparable between different settings of param$restrict
,
Aaron Lun
squareCounts
,
findOverlaps
,
InteractionSet-class
,
ReverseStrictGInteractions-class
hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts)
# Setting up the parameters
fout <- "output"
invisible(preparePairs(hic.file, param, fout))
regions <- suppressWarnings(c(
GRanges("chrA", IRanges(c(1, 100, 150), c(20, 140, 160))),
GRanges("chrB", IRanges(50, 100))))
# Collating to count combinations.
con <- connectCounts(fout, param, regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param, regions=regions, filter=1L, type="within")
head(assay(con))
# Still works with restriction and other parameters.
con <- connectCounts(fout, param=reform(param, restrict="chrA"),
regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param=reform(param, discard=GRanges("chrA", IRanges(1, 50))),
regions=regions, filter=1L)
head(assay(con))
con <- connectCounts(fout, param=reform(param, cap=1), regions=regions, filter=1L)
head(assay(con))
# Specifying a second region.
regions2 <- suppressWarnings(c(
GRanges("chrA", IRanges(c(50, 100), c(100, 200))),
GRanges("chrB", IRanges(1, 50))))
con <- connectCounts(fout, param, regions=regions, filter=1L, second.region=regions2)
head(anchors(con, type="first"))
head(anchors(con, type="second"))
con <- connectCounts(fout, param, regions=regions, filter=1L, second.region=50)
head(anchors(con, type="first"))
head(anchors(con, type="second"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.