knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(orthGS)
Multiple sequence alignment (MSA) is a fundamental tool in biology. Accurate alignment is crucial for downstream analyses in genomics, proteomics, and evolutionary studies, providing insights into biological processes and molecular interactions essential for understanding life.
Some of the current programs for sequence alignment include, but are not limited to:
Although MSA is not on the focus of the orthGS package, it is an important tool for phylogenetic reconstructions and orthology/paralogy analyses. Thus, in the current vignette we offer some indications to perform multiple alignments without leaving the R context.
The current version of MUSCLE is v5. Nevertheless, since most R packages are generally designed for earlier versions of MUSCLE (like v3), you may want to dowload an install MUSCLE v3.8.31 (https://drive5.com/muscle/downloads_v3.htm). In my computer I have renamed the binary to ´muscle3´ because I have also installed MUSCLE v5 (https://github.com/rcedgar/muscle/releases/tag/v5.3). Both versions of the software are accessible via my system's PATH environment variable and can be invoked as ´muscle3´ (v3.8.31) or ´muscle´ (v5).
To download and install Clustal Omega you can go to http://www.clustal.org/omega/ and follows the instructions. Nevertheless, if you have a linux based OS you may benefit of the suitable package manager. For instance, for macOS I use Homebrew
, which can be installed via CLI:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then:
brew install brewsci/bio/clustal-omega
brew install brewsci/bio/muscle
will do the work for you.
For linux OS you may use either apt
, dnf
, pacman
or whatever the package manager i suitable for your system.
As an example, we will use a small sample of sequences that can be obtained from the orthGS package.
# ara <- subsetGS("Arabidopsis thaliana") # muscle_aln <- msa(sequences = ara$prot, ids = ara$phylo_id, method = "muscle3") # muscle_aln
If we opt to use Clustal Omega instead of MUSCLE:
# clustalo_aln <- msa(sequences = ara$prot, ids = ara$phylo_id, method = "clustalo") # clustalo_aln
The ape
package in R provides a number of functions, which call their respective programs (executables) from R to align multiple nucleotide (DNAbin) or amino acid (AAbin) sequences. The binaries for the programs must be installed separately (as explained above) and it is highly recommended to do this so that the execubables are in a directory located on the PATH of the system. As a note of interest ape
allows to call MUSCLE v5 and T-Coffee.
Note that the input sequences must be in format AAbin or DNAbin, which store sequences as binary vectors (less human-readable but more memory-efficient).
Type in the RStudio terminal ?ape::tcoffee
, for instance, to obtain help in the use of these functions.
The msa
package in R provides an interface to multiple sequence alignment algorithms (including Clustal Omega, ClustalW and Muscle), allowing users to perform alignments directly from within the R environment.
The msa
R package is available via Bioconductor. The simplest way to install the package is the following (uncomment the lines before executing the chunk):
# if (!requireNamespace("BiocManager", quietly=TRUE)) # install.packages("BiocManager") # BiocManager::install("msa") # clustalw_aln <- msa::msa(inputSeqs = ara$prot, method = "ClustalW", type = "protein") # clustalw_aln
The muscle
package in R is an interface to the MUSCLE (Multiple Sequence Comparison by Log-Expectation) algorithm, which is widely used for performing multiple sequence alignments. The package allows users to run the MUSCLE algorithm from within R, simplifying the process of aligning biological sequences such as DNA, RNA, or protein sequences.
# if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("muscle") # # if (!require("Biostrings", quitely = TRUE)) # BiocManager::install("Biostrings") # seqs <- Biostrings::AAStringSet(ara$prot) # mus_aln <- muscle::muscle(seqs) # mus_aln
We can still integrate MUSCLE v5 into our R workflow by calling it as an external command-line tool. Just for the sake of fun:
# # if (!require("BiocManager", quietly = TRUE)) # # install.packages("BiocManager") # # BiocManager::install("Biostrings") # # sq <- ara$prot # names(sq) <- ara$phylo_id # seqs <- Biostrings::AAStringSet(sq) # # ## --- Save the input sequences in a temporary file # destfile <- tempfile(pattern = "input_sequences", # tmpdir = tempdir(), # fileext = ".fasta") # Biostrings::writeXStringSet(seqs, filepath = destfile) # # # ## --- Run MUSCLE on the input sequences file # system(paste("muscle -align ", destfile, " -output aligned_sequences.fasta", sep = "")) # # ## --- Read fasta alignment as dataframe # m5 <- seqinr::read.fasta(file = "./aligned_sequences.fasta") # m5df <- as.data.frame(matrix(rep(NA, length(m5[[1]]) * length(m5)), nrow = length(m5)) ) # for (i in 1:length(m5)){ # m5df[i, ] <- m5[[i]] # }
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.