AlignDB: Align Two Sets of Aligned Sequences in a Sequence Database
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description Usage Arguments Details Value Author(s) References See Also Examples

Merges the two separate sequence alignments in a database. The aligned sequences must have separate identifiers in the same table or be located in different database tables.

AlignDB(dbFile,
        tblName = "Seqs",
        identifier = "",
        type = "DNAStringSet",
        add2tbl = "Seqs",
        batchSize = 10000,
        perfectMatch = 5,
        misMatch = 0,
        gapOpening = -13,
        gapExtension = -1,
        gapPower = -0.5,
        terminalGap = -5,
        normPower = c(1, 0),
        substitutionMatrix = NULL,
        processors = 1,
        verbose = TRUE,
        ...)

`dbFile`	A SQLite connection object or a character string specifying the path to the database file.
`tblName`	Character string specifying the table(s) where the sequences are located. If two `tblName`s are provided then the sequences in both tables will be aligned.
`identifier`	Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected. If two `identifier`s are provided then the set of sequences matching each `identifier` will be aligned.
`type`	The type of `XStringSet` being processed. This should be (an abbreviation of) one of `"AAStringSet"`, `"DNAStringSet"`, or `"RNAStringSet"`.
`add2tbl`	Character string specifying the table name in which to add the aligned sequences.
`batchSize`	Integer specifying the number of sequences to process at a time.
`perfectMatch`	Numeric giving the reward for aligning two matching nucleotides in the alignment. Only used when `type` is `DNAStringSet` or `RNAStringSet`.
`misMatch`	Numeric giving the cost for aligning two mismatched nucleotides in the alignment. Only used when `type` is `DNAStringSet` or `RNAStringSet`.
`gapOpening`	Numeric giving the cost for opening a gap in the alignment.
`gapExtension`	Numeric giving the cost for extending an open gap in the alignment.
`gapPower`	Numeric specifying the exponent to use in the gap cost function.
`terminalGap`	Numeric giving the cost for allowing leading and trailing gaps ("-" or "." characters) in the alignment. Either two numbers, the first for leading gaps and the second for trailing gaps, or a single number for both.
`normPower`	Numeric giving the exponent that controls the degree of normalization applied to scores by column occupancy. If two numerics are provided, the first controls the normalization power of terminal gaps, while the second controls that of internal gaps. A `normPower` of `0` does not normalize the scores, which results in all columns of the profiles being weighted equally, and is the optimal value for aligning fragmentary sequences. A `normPower` of `1` normalizes the score for aligning two positions by their column occupancy (1 - fraction of gaps). A `normPower` greater than `1` more strongly discourages aligning with “gappy” regions of the alignment.
`substitutionMatrix`	Either a substitution matrix representing the substitution scores for an alignment or the name of the amino acid substitution matrix to use in alignment. The latter may be one of the following: “BLOSUM45”, “BLOSUM50”, “BLOSUM62”, “BLOSUM80”, “BLOSUM100”, “PAM30”, “PAM40”, “PAM70”, “PAM120”, “PAM250”, or “MIQS”. The default (NULL) will use the `perfectMatch` and `misMatch` penalties for DNA/RNA or `PFASUM50` for AA.
`processors`	The number of processors to use, or `NULL` to automatically detect and use all available processors.
`verbose`	Logical indicating whether to display progress.
`...`	Further arguments to be passed directly to `Codec`.

Sometimes it is useful to align two large sets of sequences, where each set of sequences is already aligned but the two sets are not aligned to each other. AlignDB first builds a profile of each sequence set in increments of batchSize so that the entire sequence set is not required to fit in memory. Next the two profiles are aligned using dynamic programming. Finally, the new alignment is applied to all the sequences as they are incrementally added to the add2tbl.

Two identifiers or tblNames must be provided, indicating the two sets of sequences to align. The sequences corresponding to the first identifier and tblName will be aligned to those of the second identifier or tblName. The aligned sequences are added to add2tbl under a new identifier formed from the concatenation of the two identifiers or tblNames. (See examples section below.)

Returns the number of newly aligned sequences added to the database.

Erik Wright eswright@pitt.edu

Wright, E. S. (2015). DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics, 16, 322. http://doi.org/10.1186/s12859-015-0749-z

Wright, E. S. (2020). RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency. RNA 2020, 26, 531-540.

AlignProfiles, AlignSeqs, AlignTranslation, PFASUM

gen <- system.file("extdata", "Bacteria_175seqs.gen", package="DECIPHER")
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")

# Align two tables and place result into a third
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB(gen, "GenBank", dbConn, "Seqs1", tblName="Set1")
Seqs2DB(fas, "FASTA", dbConn, "Seqs2", tblName="Set2")
AlignDB(dbConn, tblName=c("Set1", "Set2"), add2tbl="AlignedSets")
l <- IdLengths(dbConn, "AlignedSets", add2tbl=TRUE)
BrowseDB(dbConn, tblName="AlignedSets") # all sequences have the same width
dbDisconnect(dbConn)

# Align two identifiers and place the result in the same table
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB(gen, "GenBank", dbConn, "Seqs1")
Seqs2DB(fas, "FASTA", dbConn, "Seqs2")
AlignDB(dbConn, identifier=c("Seqs1", "Seqs2"))
l <- IdLengths(dbConn, add2tbl=TRUE)
BrowseDB(dbConn) # note the sequences with a new identifier
dbDisconnect(dbConn)

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: RSQLite

Reading GenBank file chunk 1

175 total sequences in table Set1.
Time difference of 0.27 secs


Reading FASTA file chunk 1

175 total sequences in table Set2.
Time difference of 0.07 secs

================================================================================
Added 350 aligned sequences to table AlignedSets with identifier 'Set1_Set2'.

Lengths counted for 350 sequences.
Added to AlignedSets:  "bases", "nonbases", and "width".

Time difference of 0.03 secs


Reading GenBank file chunk 1

175 total sequences in table Seqs.
Time difference of 0.08 secs


Reading FASTA file chunk 1

Added 175 new sequences to table Seqs.
350 total sequences in table Seqs.
Time difference of 0.05 secs

================================================================================
Added 350 aligned sequences to table Seqs with identifier 'Seqs1_Seqs2'.

Lengths counted for 700 sequences.
Added to Seqs:  "bases", "nonbases", and "width".

Time difference of 0.05 secs

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.

DECIPHER index

Package overview Classify Sequences Design Group-Specific FISH Probes Design Group-Specific Primers Design Microarray Probes Design Primers That Yield Group-Specific Signatures Finding Chimeric Sequences Getting Started DECIPHERing The Art of Multiple Sequence Alignment in R The Magic of Gene Finding

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

AlignDB: Align Two Sets of Aligned Sequences in a Sequence Database
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to AlignDB in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER Tools for curating, analyzing, and manipulating biological sequences

AlignDB: Align Two Sets of Aligned Sequences in a Sequence Database In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to AlignDB in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

AlignDB: Align Two Sets of Aligned Sequences in a Sequence Database
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences