savePairs: Save Hi-C read pairs
In diffHic: Differential Analyis of Hi-C Data

Description Usage Arguments Details Value Author(s) See Also Examples

Save a dataframe of read pairs into a directory structure for rapid chromosomal access.

1	savePairs(x, file, param)

`x`	A dataframe with integer fields `anchor1.id` and `anchor2.id`. Each row corresponds to a single read pair.
`file`	A character string specifying the path for the output index file.
`param`	A `pairParam` object containing read extraction parameters. In particular, `param$fragments` should contain genomic regions corresponding to the `anchor*.id` values.

This function facilitates the input of processed Hi-C data from other sources into the current pipeline. Each row of x corresponds to a read pair, and each entry in x$anchor1.id and x$anchor2.id contains an index for param$fragments. Thus, the pair of indices for each row denotes the the interacting regions for each read pair. These regions are generally expected to be restriction fragments in conventional Hi-C experiments.

Obviously, the coordinates of the restriction fragment boundaries in param$fragments should correspond to the genome to which the reads were aligned. These can be generated using the cutGenome function from any given BSgenome object or the FASTA files used for alignment. Values of param$discard and param$restrict will not be used here and can be ignored.

Any additional fields in x will also be saved to file. Users are recommended to put in anchor1.pos, anchor1.len, anchor2.pos and anchor2.len fields. These should mimic the output of preparePairs:

anchorY.pos:: Integer field, containing the 1-based genomic position of the left-most aligned base of read Y.
anchorY.len:: Integer field, containing the length of the alignment of read Y on the reference sequence. This should be multiplied by -1 if the alignment was on the negative strand.

These fields enable the use of more diffHic functions, e.g., removal of reads in param$discard during counting with squareCounts, correct calculation of statistics with getPairData, quality control with prunePairs.

For storing DNase Hi-C data, param$fragments should be empty but the seqinfo should contain the lengths and names of all chromosomes. Here, the input anchor1.id and anchor2.id should contain indices of the seqlengths. This specifies the chromosome to which each read is aligned, e.g., an anchor1.id of 2 means that read 1 is aligned to the second chromosome in seqlengths. Note that, for this type of data, it is essential to store the position and length fields mentioned above.

When constructing the output file, x will be resorted by anchor1.id, then anchor2.id. If necessary, anchor1 and anchor2 indices will be switched such that the former is never less than the latter. For DNase Hi-C data, both of these fields will ultimately be set to zero - see prepPseudoPairs for more details.

An index file is produced at the specified file location, containing the interaction data. A NULL value is invisibly returned.

Aaron Lun

preparePairs, cutGenome

hic.file <- system.file("exdata", "hic_sort.bam", package="diffHic")
cuts <-readRDS(system.file("exdata", "cuts.rds", package="diffHic"))
param <- pairParam(cuts)

n <- 1000
all.a <- as.integer(runif(n, 1L, length(cuts)))
all.t <- as.integer(runif(n, 1L, length(cuts)))
x <- data.frame(anchor1.id=all.a, anchor2.id=all.t,
	anchor1.pos=runif(1:100), anchor1.len=10, 
	anchor2.pos=runif(1:100), anchor2.len=-10)

# Note: don't save to a temporary file for actual data.
fout <- tempfile(fileext=".h5")
savePairs(x, fout, param)
require(rhdf5)
head(h5read(fout, "chrA/chrA"))