preparePairs: Prepare Hi-C pairs

View source: R/preparePairs.R

preparePairsR Documentation

Prepare Hi-C pairs

Description

Identifies the interacting pair of restriction fragments corresponding to each read pair in a Hi-C library.

Usage

preparePairs(bam, param, file, dedup=TRUE, minq=NA, ichim=TRUE, 
    chim.dist=NA, output.dir=NULL, storage=5000L)

Arguments

bam

a character string containing the path to a name-sorted BAM file

param

a pairParam object containing read extraction parameters

file

a character string specifying the path to an output index file

dedup

a logical scalar indicating whether marked duplicate reads should be removed

minq

an integer scalar specifying the minimum mapping quality for each read

ichim

a logical scalar indicating whether invalid chimeras should be counted

chim.dist

an integer scalar specifying the maximum distance between segments for a valid chimeric read pair

output.dir

a character string specifying a directory for temporary files

storage

an integer scalar specifying the maximum number of pairs to store in memory before writing to file

Value

Multiple dataframe objects are stored within the specified file using the HDF5 format. Each object corresponds to a pair of chromosomes, designated as the anchor1 (later) or anchor2 (earlier) chromosome based on the order of their names. Each row of the dataframe contains information for a read pair, with one read mapped to each chromosome. The dataframe contains several integer fields:

anchor1.id, anchor2.id:

Index of the anchor1 or anchor2 restriction fragment to which each read was assigned.

anchor1.pos, anchor2.pos:

1-based genomic coordinate of the aligned read (or the 5' segment thereof, for chimeras) on the anchor1 or anchor2 fragment.

anchor1.len, anchor2.len:

Length of the alignment on the anchor1 or anchor2 fragment. This is multiplied by -1 for alignments on the reverse strand.

A list is also returned from the function, containing various diagnostics:

pairs:

an integer vector containing total, the total number of read pairs; marked, read pairs with at least one marked read or 5' segment; filtered, read pairs where the MAPQ score for either read or 5' segment is below minq; mapped, read pairs considered as successfully mapped (i.e., not filtered, and also not marked if dedup=TRUE)

same.id:

an integer vector containing dangling, the number of read pairs that are dangling ends; and self.circles, the number of read pairs forming self-circles

singles:

an integer scalar specifying the number of reads without a mate

chimeras:

an integer vector containing total, the total number of read pairs with one chimeric read; mapped, chimeric read pairs with all 5' segments and non-chimeric reads mapped; multi, mapped chimeric pairs with at least one successfully mapped 3' segment; and invalid, read pairs where the 3' location of one read disagrees with the 5' location of the mate

For DNase Hi-C data, the anchor1.id and anchor2.id fields are set to zero, and the same.id field in the output list is removed.

Converting to restriction fragment indices

The resolution of a Hi-C experiment is defined by the distribution of restriction sites across the genome. Thus, it makes sense to describe interactions in terms of restriction fragments. This function identifies the interacting fragments corresponding to each pair of reads in a Hi-C library. To save space, it stores the indices of the interacting fragments for each read pair, rather than the fragments themselves.

Indexing is performed by matching up the mapping coordinates for each read with the restriction fragment boundaries in param$fragments. Needless to say, the boundary coordinates in param$fragments must correspond to the reference genome being used. In most cases, these can be generated using the cutGenome function from any given BSgenome object. If, for any reason, a modified genome is used for alignment, then the coordinates of the restriction fragments on the modified genome are required.

Each read pair subsequently becomes associated with a pair of restriction fragments. The anchor1 fragment is that with the higher genomic coordinate, i.e., the larger index in param$fragments. The anchor2 fragment is that with the smaller coordinate/index. This definition avoids the need to consider both permutations of indices in a pair during downstream processing.

Details of read pair processing

A read pair is discarded if either read is unavailable, e.g., unmapped, mapping quality score below minq, marked as a duplicate. No MAPQ filtering is performed when minq is set to NA. Any duplicate read must be marked in the bit field of the BAM file using a tool like Picard's MarkDuplicates if it is to be removed with dedup=TRUE.

Self-circles are outward-facing read pairs mapped to the same restriction fragment. These are formed from inefficient cross-linking and are generally uninformative. Dangling ends are inward-facing read pairs mapped to the same fragment, and are generated from incomplete ligation of blunt ends. Both constructs are detected and discarded within the function. Note that this does not consider dangling ends or self-circles formed from incompletely digested fragments, which must be removed with prunePairs.

For pairs with chimeric reads, the segment containing the 5' end of each chimeric read is used to assign the fragment index. Chimeric read pairs are discarded if the 5' segments of the chimeric reads are not available, regardless of what happens with the 3' segment. Note that, when running MarkDuplicates with chimeric reads, the recommended approach is to designate the 5' segment as the only primary or non-supplementary alignment. This ensures that the duplicate computations are performed on the most relevant alignments for each read pair.

Invalid chimeras arise when the index/position of the 3' segment of a chimeric read is not consistent with that of the mate read. These are generally indicative of mapping errors but can also form due to non-specific ligation events. Computationally, invalid chimeras can be defined in two ways:

  • If chim.dist=NA, a chimeric pair is considered to be invalid if the 3' segment and the mate do not map onto the same restriction fragment in an inward-facing orientation. This reflects the resolution limits of the Hi-C protocol.

  • If chim.dist is not NA, chimeras are defined based on distance. A pair is considered invalid if the distance between the segment and mate is greater than chim.dist, or if the alignments are not inward-facing.

The second approach is more relevant in situations involving inefficient cleavage, where the mapping locations are broadly consistent but do not fall in the same restriction fragment. The maximum size of the ligation products can be used as a reasonable value for chim.dist, e.g., 1000 bp. While invalid chimeras can be explicitly removed, keeping ichim=TRUE is recommended to avoid excessive filtering due to inaccurate alignment of short chimeric 3' segments.

Processing DNase Hi-C experiments

DNase Hi-C involves random fragmentation with DNase instead of restriction enzymes. To indicate that the data are generated by DNase Hi-C, an empty GRanges object should be supplied as the fragments in pairParam. Genome information will instead be extracted from the seqlengths of the GRanges object. This empty object can be generated by emptyGenome.

The BAM file can be processed with preparePairs in a manner that is almost identical to that of normal Hi-C experiments. However, there are some key differences:

  • No reporting or removal of self-circles or dangling ends is performed, as these have no meaning when restriction fragments are not involved.

  • Chimeras are considered invalid if the 3' segment of one read and the 5' segment of the mate are not inward-facing or more than chim.dist away from each other. If chim.dist=NA, it will be set to a default of 1000 bp.

  • All fragment IDs in the output HDF5 file will be set to zero. The first read of each pair is defined as the read on the chromosome that is ordered later in seqlengths(fragments). For pairs on the same chromosome, the first read is defined as that with a higher genomic coordinate for its 5' end.

Miscellaneous information

If output.dir is not specified, a directory name is constructed using the tempfile command. If it is specified, users should make sure that no file already exists with that name. Otherwise, an error will be raised. This directory is used to store intermediate files that will be eventually processed into the HDF5 output file.

For low-memory systems or in cases where there are many chromosome pairs, users may need to reduce the value of storage. This will write data to file more frequently, which reduces memory usage at the cost of speed.

Users should note that the use of a pairParam object for input is strictly for convenience. Only the value of param$fragments will be used. Any non-empty values of param$discard and param$restrict will be ignored here. Reads will not be discarded if they lie outside the specified chromosomes, or if they lie within blacklisted regions.

Author(s)

Aaron Lun, with use of Rhtslib based on comments from Alessandro Mammana.

References

Imakaev M et al. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999-1003.

Belton, RP et al. (2012). Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268-276.

See Also

cutGenome, prunePairs, mergePairs, getPairData

Examples

hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <- readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts) 

# Note: don't save to a temporary file for actual data.
tmpf <- tempfile(fileext=".h5")
preparePairs(hic.file, param, tmpf)
preparePairs(hic.file, param, tmpf, minq=50)
preparePairs(hic.file, param, tmpf, ichim=TRUE)
preparePairs(hic.file, param, tmpf, dedup=FALSE)

# Pretending it's DNase Hi-C:
eparam <- pairParam(cuts[0]) 
preparePairs(hic.file, eparam, tmpf)
preparePairs(hic.file, eparam, tmpf, dedup=FALSE)
preparePairs(hic.file, eparam, tmpf, minq=50)
preparePairs(hic.file, eparam, tmpf, chim.dist=20)

LTLA/diffHic documentation built on April 1, 2024, 7:21 a.m.