cutGenome: Cut up the genome
In LTLA/diffHic: Differential Analysis of Hi-C Data

cutGenome

R Documentation

Cut up the genome

Description

Perform an in silico restriction digest of a target genome.

Usage

cutGenome(bs, pattern, overhang=4L)

Arguments

`bs`	A BSgenome object or a character string containing a path to a FASTA file.
`pattern`	A character vector containing one or more recognition sites.
`overhang`	An integer vector specifying the length of the 5' overhang for each element in `pattern`.

Details

This function simulates a restriction digestion of a specified genome, given the recognition site and 5' overhang of the restriction enzyme. The total sequence spanned by each fragment is recorded including the two sticky ends after they are filled in. No support is currently provided for searching the reverse strand, so the recognition site should be an inverse palindrome.

The genome should be specified as a BSgenome object. However, a character string can also be provided, specifying a FASTA file containing all the reference sequences in a genome. The latter may be necessary to synchronise the fragments with the genome used for alignment.

Multiple restriction enzymes can be specified by passing vectors to pattern and overhang. All recognition sites are expected to be inverse palindromes, and for any given enzyme, the width of pattern and overhang must be both odd or even.

No attempt is made to remove fragments that cannot be physically formed, e.g., from recognition sites that overlap with themselves or each other. This generally is not problematic for downstream analysis, as they are short and will not have many assigned reads. If they are a concern, most of them can be removed by simply applying a suitable threshold (e.g., 10 bp) on the fragment width. However, the best solution is to simply choose (combinations of) restriction enzymes that do not overlap.

Value

A GRanges object containing the boundaries of each restriction fragment in the genome.

Note on FASTA sequence names

If bs is a FASTQ file, the chromosome names in the FASTQ headers will be loaded faithfully by cutGenome. However, many mapping pipelines will drop the rest of the name past the first whitespace when constructing the alignment index. To be safe, users should ensure that the chromosome names in the FASTQ headers consist of one word. Otherwise, there will be a discrepancy between the chromosome names in the output GRanges and those in the BAM files after alignment.

Author(s)

Aaron Lun

Examples

require(BSgenome.Ecoli.NCBI.20080805)

cutGenome(Ecoli, "AAGCTT", overhang=4L) # HindIII
cutGenome(Ecoli, "CCGCGG", overhang=2L) # SacII
cutGenome(Ecoli, "AGCT", overhang=0L) # AluI

# Trying with FASTA files.
x <- system.file("extdata", "fastaEx.fa", package="Biostrings")
cutGenome(x, "AGCT", overhang=2)
cutGenome(x, "AGCT", overhang=4)

# Multiple sites with different overhangs are supported.
cutGenome(x, c("AGCT", "AGNCT"), overhang=c(4, 3))

LTLA/diffHic documentation built on Dec. 20, 2024, 7:06 p.m.