msa: Unified interface to multiple sequence alignment algorithms
In msa: Multiple Sequence Alignment

Description Usage Arguments Details Value Author(s) References See Also Examples

The msa function provides a unified interface to the three multiple sequence alignment algorithms in this package: ‘ClustalW’, ‘ClustalOmega’, and ‘MUSCLE’.

    msa(inputSeqs, method=c("ClustalW", "ClustalOmega", "Muscle"),
        cluster="default", gapOpening="default",
        gapExtension="default", maxiters="default",
        substitutionMatrix="default", type="default",
        order=c("aligned", "input"), verbose=FALSE, help=FALSE,
        ...)

`inputSeqs`	input sequences; this argument can be a character vector, an object of class `XStringSet` (includes the classes `AAStringSet`, `DNAStringSet`, and `RNAStringSet`), or a single character string with a file name. In the latter case, the file name is required to have the suffix ‘.fa’ or ‘.fasta’, and the file must be in FASTA format.
`method`	specifies the multiple sequence alignment to be used; currently, `"ClustalW"`, `"ClustalOmega"`, and `"Muscle"` are supported.
`cluster`	parameter related to sequence clustering; its interpretation and default value depends on the method; see `msaClustalW`, `msaClustalOmega`, or `msaMuscle` for algorithm-specific information.
`gapOpening`	gap opening penalty; the defaults are specific to the algorithm (see `msaClustalW`, and `msaMuscle`). Note that the sign of this parameter is ignored. The sign is automatically adjusted such that the called algorithm penalizes gaps instead of rewarding them.
`gapExtension`	gap extension penalty; the defaults are specific to the algorithm (see `msaClustalW`, and `msaMuscle`). Note that the sign of this parameter is ignored. The sign is automatically adjusted such that the called algorithm penalizes gaps instead of rewarding them.
`maxiters`	maximum number of iterations; its interpretation and default value depends on the method; see `msaClustalW`, `msaClustalOmega`, or `msaMuscle` for algorithm-specific information.
`substitutionMatrix`	substitution matrix for scoring matches and mismatches; format and defaults depend on the algorithm; see `msaClustalW`, `msaClustalOmega`, or `msaMuscle` for algorithm-specific information.
`type`	type of the input sequences `inputSeqs`; possible values are `"dna"`, `"rna"`, or `"protein"`. In the original ClustalW implementation, this parameter is also called `-type`; `"auto"` is also possible in the original ClustalW, but, in this package, `"auto"` is deactivated. The `type` argument is mandatory if `inputSeqs` is a character vector or the file name of a FASTA file (see above). If `inputSeqs` is an object of class `AAStringSet`, `DNAStringSet`, or `RNAStringSet`, the type of sequences is determined by the class of `inputSeqs` and the `type` parameter is not necessary. If it is nevertheless specified and the type does not match the class of `inputSeqs`, the function stops with an error.
`order`	how the sequences should be ordered in the output object; if `"aligned"` is chosen, the sequences are ordered in the way the multiple sequence alignment algorithm orders them. If `"input"` is chosen, the sequences in the output object are ordered in the same way as the input sequences. For MUSCLE, the choice `"input"` is not available for sequence data that is read directly from a FASTA file. Even if sequences are supplied directly via R, the sequences must have unique names, otherwise the input order cannot be recovered. If the sequences do not have names or if the names are not unique, the `msaMuscle` function assignes generic unique names `"Seq1"`-`Seqn` to the sequences and issues a warning.
`verbose`	if `TRUE`, the algorithm displays detailed information and progress messages.
`help`	if `TRUE`, information about algorithm-specific parameters is displayed. In this case, no multiple sequence alignment is performed and the function quits after displaying the additional help information.
`...`	all other parameters are passed on to the multiple sequence algorithm, i.e. to one of the functions `msaClustalW`, `msaClustalOmega`, or `msaMuscle`. An overview of parameters that are available for the chosen method is shown when calling `msa` with `help=TRUE`. For more details, see also the documentation of chosen multiple sequence alignment algorithm.

msa is a simple wrapper function that unifies the interfaces of the three functions msaClustalW, msaClustalOmega, and msaMuscle. Which function is called, is controlled by the method argument.

Note that the input sequences may be reordered by the multiple sequence alignment algorithms in order to group together similar sequences (see also description of argument order above). So, if the input order should be preserved or if the input order should be recovered later, we strongly recommend to always assign unique names to the input sequences. As noted in the description of the inputSeqs argument above, all functions, msa(), msaClustalW, msaClustalOmega, and msaMuscle, also allow for direct reading from FASTA files. This is mainly for the reason of memory efficiency if the sequence data set is very large. Otherwise, we want to encourage users to first read the sequences into the R workspace. If sequences are read from a FASTA file directly, the order of output sequences is completely under the control of the respective algorithm and does not allow for checking whether the sequences are named uniquely in the FASTA file. The preservation of the input order works also for sequence data read from a FASTA file, but only for ClustalW and ClustalOmega; MUSCLE does not support this (see also argument order above and msaMuscle).

Depending on the type of sequences for which it was called, msa returns a MsaAAMultipleAlignment, MsaDNAMultipleAlignment, or MsaRNAMultipleAlignment object. If called with help=TRUE, msa returns an invisible NULL.

Enrico Bonatesta and Christoph Horejs-Kainrath <msa@bioinf.jku.at>

http://www.bioinf.jku.at/software/msa

U. Bodenhofer, E. Bonatesta, C. Horejs-Kainrath, and S. Hochreiter (2015). msa: an R package for multiple sequence alignment. Bioinformatics 31(24):3997-3999. DOI: 10.1093/bioinformatics/btv494.

http://www.clustal.org/download/clustalw_help.txt

http://www.clustal.org/omega/README

http://www.drive5.com/muscle/muscle.html

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-4680. DOI: 10.1093/nar/22.22.4673.

Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Soeding, J., Thompson, J. D., and Higgins, D. G. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:539. DOI: 10.1038/msb.2011.75.

Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797. DOI: 10.1093/nar/gkh340.

Edgar, R. C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113. DOI: 10.1186/1471-2105-5-113.

msaClustalW, msaClustalOmega, msaMuscle, msaPrettyPrint, MsaAAMultipleAlignment, MsaDNAMultipleAlignment, MsaRNAMultipleAlignment, MsaMetaData

## read sequences
filepath <- system.file("examples", "exampleAA.fasta", package="msa")
mySeqs <- readAAStringSet(filepath)

## call unified interface msa() for default method (ClustalW) and
## default parameters
msa(mySeqs)

## call ClustalOmega through unified interface
msa(mySeqs, method="ClustalOmega")

## call MUSCLE through unified interface with some custom parameters
msa(mySeqs, method="Muscle", gapOpening=12, gapExtension=3, maxiters=16,
    cluster="upgmamax", SUEFF=0.4, brenner=FALSE,
    order="input", verbose=FALSE)

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colMeans, colSums, colnames, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

use default substitution matrix
CLUSTAL 2.1  

Call:
   msa(mySeqs)

MsaAAMultipleAlignment with 9 rows and 456 columns
    aln                                                    names
[1] MAAVVLENGVLSRKLSDFGQETSYIE...QLKILADSINSEVGILCNALQKIKS PH4H_Rattus_norve...
[2] MAAVVLENGVLSRKLSDFGQETSYIE...QLKILADSINSEVGILCHALQKIKS PH4H_Mus_musculus
[3] MSTAVLENPGLGRKLSDFGQETSYIE...QLKILADSINSEIGILCSALQKIK- PH4H_Homo_sapiens
[4] MSALVLESRALGRKLSDFGQETSYIE...QLKILADSISSEVEILCSALQKLK- PH4H_Bos_taurus
[5] --------------------------...LNAGDRQGWADTEDV---------- PH4H_Chromobacter...
[6] --------------------------...LNAGTREGWADTADI---------- PH4H_Ralstonia_so...
[7] --------------------------...LTRGT-QAYATAGGRLAGAAAG--- PH4H_Caulobacter_...
[8] --------------------------...------------------------- PH4H_Pseudomonas_...
[9] --------------------------...------------------------- PH4H_Rhizobium_loti
Con --------------------------...??????????????IL??A???--- Consensus 
using Gonnet
ClustalOmega 1.2.0 

Call:
   msa(mySeqs, method = "ClustalOmega")

MsaAAMultipleAlignment with 9 rows and 467 columns
    aln                                                    names
[1] MSALVLESRALGRKLSDFGQETSYIE...LKI-LADSISSEVEILCSALQKLK- PH4H_Bos_taurus
[2] MSTAVLENPGLGRKLSDFGQETSYIE...LKI-LADSINSEIGILCSALQKIK- PH4H_Homo_sapiens
[3] MAAVVLENGVLSRKLSDFGQETSYIE...LKI-LADSINSEVGILCNALQKIKS PH4H_Rattus_norve...
[4] MAAVVLENGVLSRKLSDFGQETSYIE...LKI-LADSINSEVGILCHALQKIKS PH4H_Mus_musculus
[5] --------------------------...------------------------- PH4H_Pseudomonas_...
[6] --------------------------...------------------------- PH4H_Rhizobium_loti
[7] --------------------------...YATAGGRLAGAAAG----------- PH4H_Caulobacter_...
[8] --------------------------...GWADTEDV----------------- PH4H_Chromobacter...
[9] --------------------------...GWADTADI----------------- PH4H_Ralstonia_so...
Con --------------------------...???-?AD???????----------- Consensus 

*** WARNING *** *Warning* Cannot open /proc/meminfo errno=13 Permission denied
MUSCLE 3.8.31   

Call:
   msa(mySeqs, method = "Muscle", gapOpening = 12, gapExtension = 3,     maxiters = 16, cluster = "upgmamax", SUEFF = 0.4, brenner = FALSE,     order = "input", verbose = FALSE)

MsaAAMultipleAlignment with 9 rows and 456 columns
    aln                                                    names
[1] MSTAVLENPGLGRKLSDFGQETSYIE...QLKILADSINSEIGILCSALQKIK- PH4H_Homo_sapiens
[2] MAAVVLENGVLSRKLSDFGQETSYIE...QLKILADSINSEVGILCNALQKIKS PH4H_Rattus_norve...
[3] MAAVVLENGVLSRKLSDFGQETSYIE...QLKILADSINSEVGILCHALQKIKS PH4H_Mus_musculus
[4] --------------------------...NAGDRQGWADTEDV----------- PH4H_Chromobacter...
[5] --------------------------...------------------------- PH4H_Pseudomonas_...
[6] MSALVLESRALGRKLSDFGQETSYIE...QLKILADSISSEVEILCSALQKLK- PH4H_Bos_taurus
[7] --------------------------...NAGTREGWADTADI----------- PH4H_Ralstonia_so...
[8] --------------------------...TRGTQAYATAGGRLAGAAAG----- PH4H_Caulobacter_...
[9] --------------------------...------------------------- PH4H_Rhizobium_loti
Con --------------------------...?????A?????E??????A?----- Consensus