PairSummaries: Summarize connected pairs in a LinkedPairs object
In npcooley/SynExtend: Tools for Comparative Genomics

PairSummaries

R Documentation

Summarize connected pairs in a LinkedPairs object

Description

Takes in a “LinkedPairs” object and gene calls, and returns a data.frame of paired features.

Usage

PairSummaries(SyntenyLinks,
              DBPATH,
              PIDs = FALSE,
              Score = FALSE,
              IgnoreDefaultStringSet = FALSE,
              Verbose = FALSE,
              Model = "Generic",
              DefaultTranslationTable = "11",
              AcceptContigNames = TRUE,
              OffSetsAllowed = NULL,
              Storage = 1,
              ...)

Arguments

`SyntenyLinks`	A `LinkedPairs` object. In previous versions of this function, a `GeneCalls` object was also required, but this object is now carried forward from `NucleotideOverlap` inside the `LinkedPairs` object.
`DBPATH`	A SQLite connection object or a character string specifying the path to the database file constructed with DECIPHER's `Seqs2DB` function. This argument is always required because `PairSummaries` always computes the tetramer distance between paired sequences.
`PIDs`	Logical indicating whether to provide a PID for each pair. If `TRUE` all pairs will be aligned using DECIPHER's `AlignProfiles`. This step can be time consuming, especially for large numbers of pairs. Default is `FALSE`.
`Score`	Logical indicating whether to provide a length normalized score with DECIPHER's `ScoreAlignment` function. If `TRUE` all pairs will be aligned using DECIPHER's `AlignProfiles`. This step can be time consuming, especially for large numbers of pairs. Default is `FALSE`.
`IgnoreDefaultStringSet`	Logical indicating alignment type preferences. If `FALSE` (the default) pairs that can be aligned in amino acid space will be aligned as an `AAStringSet`. If `TRUE` all pairs will be aligned in nucleotide space. For `PairSummaries` to align the translation of a pair of sequences, both sequences must be tagged as coding in the “GeneCalls” object, and be the correct width for translation.
`Verbose`	Logical indicating whether or not to display a progress bar and print the time difference upon completion.
`Model`	A character string specifying a model to use to predict PIDs without performing an alignment. By default this argument is “Generic”, specifying a generic PID prediction model based on PIDs computed from a randomly selected set of genomes. Currently no other models are included. Users may also supply their own model of type “glm” in the form of an RData file. This model will need to take in some or all of the columns of per-pair statistics that `PairSummaries` supplies.
`DefaultTranslationTable`	A character used to set the default translation table for `translate`. Is passed to `getGeneticCode`. Used when no translation table is specified in the “GeneCalls” object.
`AcceptContigNames`	Match names of contigs between gene calls object and synteny object. Where relevant, the first white space and everything following are removed from contig names. If `TRUE`, PairSummaries assumes that the contigs at each position in the synteny object and “GeneCalls” object are in the same order. Is automatically set to `TRUE` when “GeneCalls” are of class “GRanges”. Is currently `TRUE` by default.
`OffSetsAllowed`	Defaults to `NULL`. Supplying an integer vector indicates the gap sizes to attempt to fill. A value of `2` will attempt to span gaps of size 1. If a vector of length greater than 1 is provided, e.g. `c(2, 3)`, `PairSummaries` will attempt to query all gap sizes implied by the vector, in this case gaps of size 1 and 2.
`Storage`	Numeric indicating the approximate size a user wishes to allow for holding `StringSet`s in memory to extract gene sequences, in “Gigabytes”. The lower `Storage` is set, the more likely that `PairSummaries` will need to reaccess `StringSet`s when extracting gene sequences. The higher `Storage` is set, the more sequences `PairSummaries` will attempt to hold in memory, avoiding the need to re-access the source database many times. Set to 1 by default, indicating that `PairSummaries` can store a “Gigabyte” of sequences in memory at a time.
`...`	Arguments to be passed to `AlignProfiles` and `DistanceMatrix`.
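
The relationship between `OffSetsAllowed` values and gap sizes described above can be sketched in base R. This is only an illustration of the documented rule (a value of `2` spans gaps of size 1); `implied_gap_sizes` is a hypothetical helper, not a SynExtend function, and treating `NULL` as "no gap filling" is an assumption drawn from the argument description.

```r
# Hypothetical helper (not part of SynExtend): map OffSetsAllowed values
# to the gap sizes described in the argument documentation.
implied_gap_sizes <- function(OffSetsAllowed) {
  if (is.null(OffSetsAllowed)) {
    # Assumption: the default NULL means no gap filling is attempted
    return(integer(0))
  }
  # Documented rule: an offset of n corresponds to a gap of size n - 1
  sort(unique(as.integer(OffSetsAllowed) - 1L))
}

implied_gap_sizes(NULL)      # no gaps queried
implied_gap_sizes(2)         # gaps of size 1
implied_gap_sizes(c(2, 3))   # gaps of size 1 and 2
```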

Details

The LinkedPairs object generated by NucleotideOverlap is a container for raw data that describes possible orthologous relationships; the ultimate assignment of orthology is left to the user's discretion. PairSummaries generates a clear table of relevant statistics for users to work with as they choose. Aligning all pairs, though onerous, allows users to apply a hard threshold to predictions by PID, while the built-in models allow more expedient thresholding from predicted PIDs.
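
As a minimal sketch of the hard thresholding mentioned above, the following filters a mock table by PID. The column names follow the Value section, but the rows and the cutoff are invented for illustration, and the sketch assumes PID is reported as a fraction between 0 and 1; adjust the cutoff if your table uses a different scale.

```r
# Mock stand-in for a PairSummaries result; all values are invented.
pairs <- data.frame(
  p1  = c("1_1_10", "1_1_11", "1_2_3"),
  p2  = c("2_1_12", "2_1_13", "2_3_7"),
  PID = c(0.92, 0.41, 0.67),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff; choose a value appropriate for your own data.
pid_cutoff <- 0.6

# Keep only the pairs whose PID meets or exceeds the cutoff.
kept <- pairs[pairs$PID >= pid_cutoff, , drop = FALSE]
kept
```

The same subsetting could be applied to a “PredictedPID” column when PIDs are predicted by a model instead of computed from alignments.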

Value

An object of class “data.frame” and “PairSummaries” describing paired genes that are connected by syntenic hits. Contains columns describing the k-mers that link the pair. Columns “p1” and “p2” give the location ids of the genes in the pair in the form “DatabaseIdentifier_ContigIdentifier_GeneIdentifier”. “ExactMatch” provides an integer representing the exact number of nucleotides contained in the linking k-mers. “TotalKmers” provides an integer describing the number of distinct k-mers linking the pair. “MaxKmer” provides an integer describing the largest k-mer that links the pair. A column titled “Consensus” provides a value between 0 and 1 indicating whether the k-mers that link a pair of features are in the same position in each feature, with 1 indicating they are in exactly the same position and 0 indicating they are in as different a position as is possible. The “Adjacent” column provides an integer value ranging between 0 and 2 denoting whether a feature pair's direct neighbors are also paired. Gap-filled pairs neither have neighbors nor are included as neighbors. The “TetDist” column provides the Euclidean distance between the oligonucleotide (size 4) frequencies of the predicted pairs. “PIDType” provides a character vector with values of “NT” where either member of the pair is not a translatable sequence, or “AA” where both sequences are translatable. If users choose to perform pairwise alignments there will be a “PID” column providing a numeric describing the percent identity between the two sequences. If users choose to predict PIDs using their own, or a provided model, a “PredictedPID” column will be provided.
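
To make the “TetDist” definition concrete, here is a base R sketch of a Euclidean distance between tetranucleotide frequency vectors. The functions `kmer_freqs` and `tet_dist` and the example sequences are hypothetical, and `PairSummaries` may count or normalize tetramers differently, so values from this sketch should not be expected to match the “TetDist” column.

```r
# Hypothetical helper: frequencies of all overlapping k-mers in a DNA string.
kmer_freqs <- function(dna, k = 4L) {
  stopifnot(nchar(dna) >= k)
  bases <- c("A", "C", "G", "T")
  # Enumerate all 4^k k-mers in a fixed order so vectors are comparable
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  all_kmers <- do.call(paste0, unname(as.list(grid)))
  # Extract every overlapping window of width k
  starts <- seq_len(nchar(dna) - k + 1L)
  observed <- substring(toupper(dna), starts, starts + k - 1L)
  # Windows containing characters other than A, C, G, T are dropped here
  counts <- table(factor(observed, levels = all_kmers))
  as.numeric(counts) / sum(counts)
}

# Hypothetical helper: Euclidean distance between two frequency vectors.
tet_dist <- function(a, b) {
  sqrt(sum((kmer_freqs(a) - kmer_freqs(b))^2))
}

# Invented sequences for illustration only
s1 <- "ATGGCTAGCTAGGCTAACGTTAGC"
s2 <- "ATGAAAAAAAAAAAAAAAAAATAA"

tet_dist(s1, s1)  # identical sequences give a distance of 0
tet_dist(s1, s2)  # dissimilar composition gives a larger distance
```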

Author(s)

Nicholas Cooley npc19@pitt.edu

Examples

# this function will be deprecated soon,
# please see the new SummarizePairs() function.
DBPATH <- system.file("extdata",
                      "Endosymbionts_v05a.sqlite",
                      package = "SynExtend")
                      
data("Endosymbionts_LinkedFeatures", package = "SynExtend")

Pairs <- PairSummaries(SyntenyLinks = Endosymbionts_LinkedFeatures,
                       PIDs = FALSE,
                       DBPATH = DBPATH,
                       Verbose = TRUE)

npcooley/SynExtend documentation built on June 8, 2025, 5:24 a.m.