EstimRearrScen: Estimate Genome Rearrangement Events with Double Cut and Join...

View source: R/EstimRearrScen.R

EstimRearrScenR Documentation

Estimate Genome Rearrangement Events with Double Cut and Join Operations

Description

Take in a Synteny object and return predicted rearrangement events.

Usage

EstimRearrScen(SyntenyObject, NumRuns = -1,
                Mean = FALSE, MinBlockLength = -1,
                Verbose = TRUE)
                

Arguments

SyntenyObject

Synteny object, as obtained from running FindSynteny. Expected input is unichromosomal sequences, though multichromosomal sequences are supported.

NumRuns

Numeric; Number of times to simulate scenarios. The default value of -1 (and all non-positive values) runs each analysis for \sqrt b iterations, where b is the number of unique breakpoints.

Mean

Logical; If TRUE, returns the mean number of inversions and transpositions found. If FALSE, returns the scenario corresponding to the minimum total number of operations across all runs. This parameter only affects the number of inversions and transpositions reported; the specific scenario returned is one of the runs that resulted in a minimum value.

MinBlockLength

Numeric; Minimum size of syntenic blocks to use for analysis. The default value accepts all blocks. Set to a larger value to ignore sections of short mutations that could be the result of SNPs or other small-scale mutations.

Verbose

Logical; indicates whether or not to display a progress bar and print the time difference upon completion.

Details

EstimRearrScen is an implementation of the Double Cut and Join (DCJ) method for analyzing large scale mutation events.

The DCJ model is commonly used to model genome rearrangement operations. Given a genome, we can create a connected graph encoding the order of conserved genomic regions. Each syntenic region is split into two nodes, with one encoding the beginning and one encoding the end (beginning and end defined relative to the direction of transcription). Each node is then connected to the two nodes it is adjacent to in the genome.

For example, given a genome with 3 syntenic regions a-b-c such that b is transcribed in the opposite direction relative to a,c, our graph would consist of nodes and edges a1-a2-b2-b1-c1-c2.

Given two genomes, we derive syntenic regions between the two samples and then construct two of these graph structures. A DCJ operation is one that cuts two connections of a common color and creates two new edges. The goal of the DCJ model is to rearrange the graph of the first genome into the second genome using DCJ operations. The DCJ distance is defined as the minimum number of DCJ operations to transform one graph into another.

It can be easily shown that inversions can be performed with a single DCJ operation, and block interchanges/order rearrangements can be performed with a sequence of two DCJ operations. DCJ distance defines a metric space, and prior work has demonstrated algorithms for fast computation of the DCJ distance.

However, DCJ distance inherently incentivizes inversions over block interchanges due to the former requiring half as many DCJ operations. This is a strong assumption, and there is no evidence to support gene order rearrangements occuring half as often as gene inversions.

This implementation incentivizes minimum number of total events rather than total number of DCJs. As the search space is large and multiple sequences of events can be equally parsimonious, this algorithm computes multiple scenarios with random sequences of operations to try to find the minimum amount of events. Users can choose to receive the best found solution or the mean number of events from all solutions.

Value

An NxN matrix of lists with the same shape as the input Synteny object. This is wrapped into a GenRearr object for pretty printing.

The diagonal corresponds to total sequence length of the corresponding genome.

In the upper triangle, entry [i,j] corresponds to the percent hits between genome i and genome j. In the lower triangle, entry [i,j] contains a List object with 5 properties:

  • $Inversions and $Transpositions contain the (Mean/min) number of estimated inversions and transpositions (resp.) between genome i and genome j.

  • $pct_hits contains percent hits between the genomes.

  • $Scenario shows the sequence of events corresponding to the minimum rearrangement scenario found. See below for details.

  • $Key provides a mapping between syntenic blocks and genome positions. See below for details.

The print.GenRearr method prints this data out as a matrix, with the diagonal showing the number of chromosomes and the lower triangle displaying xI,yT, where x,y the number of inversions and transpositions (resp.) between the corresponding entries.

The $Scenario entry describes a sequences of steps to rearrange one genome into another, as found by this algorithm. The goal of the DCJ model is to rearrange the second genome into the first. Thus, with N syntenic regions total, we can arbitrarily choose the syntenic blocks in genome 1 to be ordered 1,2,...,N, and then have genome 2 numbers relative to that.

As an example, suppose genome 1 has elements A B E(r) G and genome 2 has elements E B(r) A(r) G, with X(r) denoting block X has reversed direction of transcription. We can then arbitrarily assign blocks to numbers such that genome 1 is (1 2 3 4) and genome 2 is (3 -2 -1 4), where a negative indicates reversed direction of transcription relative to the corresponding syntenic block in genome 1.

Each entry in $Scenario details an operation, the result after that operation, and the number of blocks involved in the operation. If we reversed the middle two entries of genome 2, the entry in $Scenario would be:

inversion: 3 1 2 4 { 2 }

Here we inverted the whole block (-2 -1) into (1 2). We could then finish the rearrangement by performing a transposition to move block 3 between 2 and 4. The entries of $Scenario in this case would be the following:

Original: 3 -2 -1 4

inversion: 3 1 2 4 { 2 }

block interchange: 1 2 3 4 { 3 }

Step 1 is the original state of genome 2, step 2 inverts 2 elements to arrive at (3 1 2 4), and then step 3 moves one element to arrive at (1 2 3 4).

It is important to note that the numbered genomic regions in $Scenario are not genes, they are blocks of conserved syntenic regions between the genomes. These blocks may not match up with the original blocks from the Synteny object, since some are combined during pre-processing to expedite calculations.

$Key is a mapping between these numbered regions and the original genomic regions. This is a 5 column matrix with the following columns (in order):

  1. start1: Nucleotide position for the first nucleotide in of the syntenic region on genome 1.

  2. start2: Same as start1, but for genome 2

  3. length: Length of block, in nucleotides

  4. rel_direction_on_2: 1 if the blocks have the same transcriptonal direction on both genomes, and 0 if the direction is reversed in genome 2

  5. index1: Label of the genetic region used in $Scenario output

Author(s)

Aidan Lakshman (ahl27@pitt.edu)

References

Friedberg, R., Darling, A. E., & Yancopoulos, S. (2008). Genome rearrangement by the double cut and join operation. Bioinformatics, 385-416.

See Also

FindSynteny

Synteny

Examples

db <- system.file("extdata", "Influenza.sqlite", package="DECIPHER")
synteny <- FindSynteny(db)
synteny

rearrs <- EstimRearrScen(synteny)

rearrs          # view whole object
rearrs[[2,1]]   # view details on Genomes 1 and 2

npcooley/SynExtend documentation built on Nov. 15, 2024, 3:02 p.m.