ggmsa: Plot multiple sequence alignment using ggplot2

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

CRANpkg <- function(pkg) {
    cran <- "https://cran.r-project.org/package"
    fmt <- "[%s](%s=%s)"
    sprintf(fmt, pkg, cran, pkg)
}

Biocpkg <- function(pkg) {
    sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}

library(ggmsa)
library(ggplot2)

Introduction

Plot multiple sequence alignments using `r CRANpkg("ggplot2")`, with multiple color schemes supported.

ggmsa supports visualizing multiple sequence alignments of DNA and protein sequences using ggplot2. It supports a number of colour schemes, including Chemistry, Clustal, Shapely, Taylor and Zappo. A multiple sequence alignment can easily be combined with other 'ggplot2' plots, for example by aligning a phylogenetic tree produced by 'ggtree' with the alignment.

Installation

This R package (`r CRANpkg("ggmsa")`, current version: `r packageVersion('ggmsa')`) is available via CRAN. To install it:

## installing the package
install.packages("ggmsa")
## loading the package
library("ggmsa")

The development version can be installed from GitHub:

if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")
devtools::install_github("YuLab-SMU/ggmsa")

Load sample data

Three sample data sets are shipped with the ggmsa package. Note that ggmsa supports not only fasta files but other objects as well. available_msa() can be used to list the MSA objects currently supported.

available_msa()

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")
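Because system.file() returns an empty string when it cannot locate a file, it can be worth checking the three paths before plotting. The following is a minimal sketch using only base R; the names sample_files and missing are illustrative and not part of ggmsa.

```r
# Paths to the three sample alignments shipped with ggmsa
sample_files <- c(
  protein = system.file("extdata", "sample.fasta", package = "ggmsa"),
  miRNA   = system.file("extdata", "seedSample.fa", package = "ggmsa"),
  nt      = system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")
)

# system.file() returns "" for any file it cannot find
missing <- names(sample_files)[!nzchar(sample_files)]
if (length(missing) > 0) {
  warning("Sample files not found: ", paste(missing, collapse = ", "))
}
```

If the warning fires, the package is probably not installed, or the installed version does not ship that file.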

Visualizing Multiple Sequence Alignments

The simplest way to use ggmsa:

ggmsa(protein_sequences, start = 265, end = 300)
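The plot can also be stored in a variable and written to a file. The sketch below assumes that ggmsa() returns a ggplot object, which the use of `+ theme_void()` elsewhere in this vignette suggests; ggsave() comes from ggplot2, and the file name and dimensions are arbitrary choices.

```r
library(ggmsa)
library(ggplot2)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Keep the plot object instead of printing it straight away
p <- ggmsa(protein_sequences, start = 265, end = 300)

# Write to a temporary directory; width and height are in inches
# and were picked only to suit a wide, short alignment
out_file <- file.path(tempdir(), "msa.png")
ggsave(out_file, plot = p, width = 10, height = 3)
```

A wide, short canvas tends to suit alignments, since each column is a single residue.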

Colour Schemes

Several classic color schemes for MSA are shipped in the package. In the same way, you can use available_colors() to list the colour schemes currently available. Note that amino acids and nucleotides have different colour schemes.

available_colors()
knitr::include_graphics("man/figures/NT_color.png")
knitr::include_graphics("man/figures/AA_color.png")

Clustal X Colour Scheme

This is an emulation of the default colour scheme used for alignments in Clustal X, a graphical interface for the ClustalW multiple sequence alignment program. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific to the residue type.

ggmsa(protein_sequences, start = 320, end = 360, color = "Clustal")

Color by Chemistry (Default)

Amino acids are colored according to their side chain chemistry:

ggmsa(protein_sequences, start = 320, end = 360, color = "Chemistry_AA")

Color by Shapely

This color scheme matches the RasMol amino acid and RasMol nucleotide color schemes, which are, in turn, based on Robert Fletterick's "Shapely models".

ggmsa(protein_sequences, start = 320, end = 360, color = "Shapely_AA")

Color by Taylor

This color scheme is taken from Taylor[@Taylor1997Residual] and is also used in JalView[@Waterhouse2009Jalview].

ggmsa(protein_sequences, start = 320, end = 360, color = "Taylor_AA")

Color by Zappo

This scheme colors residues according to their physico-chemical properties, and is also used in JalView[@Waterhouse2009Jalview].

ggmsa(protein_sequences, start = 320, end = 360, color = "Zappo_AA")

Color by LETTER

Thanks to ShixiangWang for the color scheme suggestions.

https://en.wikipedia.org/wiki/Help:Distinguishable_colors

ggmsa(protein_sequences, start = 320, end = 360, color = "LETTER")
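To compare the schemes on the same region, the calls above can be generated in a loop. This is a minimal sketch that uses only the scheme names appearing in this section and assumes ggmsa() returns a ggplot object, so that ggtitle() from ggplot2 can be added; the names schemes and plots are illustrative.

```r
library(ggmsa)
library(ggplot2)

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Amino acid scheme names used in the examples of this section
schemes <- c("Clustal", "Chemistry_AA", "Shapely_AA",
             "Taylor_AA", "Zappo_AA", "LETTER")

# Build one plot per scheme, titled with the scheme name
plots <- lapply(schemes, function(scheme) {
  ggmsa(protein_sequences, start = 320, end = 360, color = scheme) +
    ggtitle(scheme)
})
names(plots) <- schemes

# Display a single plot by name
plots[["Taylor_AA"]]
```

Keeping the plots in a named list makes it easy to print one at a time or pass them on to a layout package.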

Font

Several classic fonts for MSA are shipped in the package. In the same way, you can use available_fonts() to list the fonts currently available.

available_fonts()
ggmsa(protein_sequences, start = 320, end = 360, font = "helvetical", color = "Chemistry_AA")
ggmsa(protein_sequences, start = 320, end = 360, font = "TimesNewRoman", color = "Chemistry_AA")
ggmsa(protein_sequences, start = 320, end = 360, font = "DroidSansMono", color = "Chemistry_AA")

If you specify font = NULL, only the tiles will be plotted.

ggmsa(protein_sequences, start = 320, end = 360, font = NULL, color = "Chemistry_AA")

Other parameters

Character width can be specified with char_width. The default is 0.9.

ggmsa(protein_sequences, start = 320, end = 360, char_width = 0.5, color = "Chemistry_AA")

The background can be controlled with none_bg. If none_bg = TRUE, only the characters will be plotted.

ggmsa(protein_sequences, start = 320, end = 360, none_bg = TRUE) + theme_void()

Positions to highlight can be specified with posHighligthed. none_bg should be FALSE when you highlight positions this way.

ggmsa(protein_sequences, 164, 213, color = "Chemistry_AA", 
      posHighligthed = c(185, 190))

Sequence names are controlled by seq_name. The default is NULL, which means that sequence names are displayed when font = NULL but not when a font is specified.

If seq_name = TRUE, sequence names are displayed even when a font is specified.

ggmsa(protein_sequences, 164, 213, color = "Chemistry_AA", seq_name = TRUE)

If seq_name = FALSE, sequence names are not displayed in any case.

ggmsa(protein_sequences, 164, 213, font = NULL, color = "Chemistry_AA", seq_name = FALSE)

Reference



ggmsa documentation built on Feb. 2, 2021, 5:14 p.m.