R package for creating sequence logo plots
gglogo is available from CRAN (version 0.1.4):
install.packages("gglogo")
The development version is available from GitHub (0.1.9000):
# install.packages("devtools")
devtools::install_github("heike/gglogo", build_vignettes = TRUE)
Load the library
library(gglogo)
Load a dataset containing a set of peptide sequences
data(sequences)
head(sequences)
Now plot the sequences in an (almost) traditional sequence logo plot, with the Shannon information shown on the y axis:
library(ggplot2)
ggplot(data = ggfortify(sequences, peptide, method = "shannon")) +
  geom_logo(aes(x = position, y = info, group = element, label = element,
                fill = interaction(Polarity, Water)),
            alpha = 0.6, position = "classic") +
  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
  theme(legend.position = "bottom")
(Sequence) logo plots (Schneider & Stephens 1990) are typically used in bioinformatics to show visually how well a sequence of nucleotides or amino acids is conserved in a certain region.
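For intuition about the quantity on the y axis, the following base R sketch computes a per-position information content in the textbook form, log2 of the alphabet size minus the Shannon entropy, without any small-sample correction. The toy sequences and the helper name `position_info` are made up for illustration, and the values reported by gglogo may differ in detail.

```r
# Toy aligned nucleotide sequences (hypothetical data, not from the package)
toy <- c("ACGT", "ACGA", "ACTT", "AGGT")

# Hypothetical helper: information content (in bits) per position,
# defined here as log2(alphabet size) minus the Shannon entropy
position_info <- function(seqs, alphabet_size = 4) {
  # one row per sequence, one column per position
  chars <- do.call(rbind, strsplit(seqs, ""))
  apply(chars, 2, function(column) {
    p <- table(column) / length(column)  # relative frequencies of observed letters
    entropy <- -sum(p * log2(p))         # Shannon entropy in bits
    log2(alphabet_size) - entropy
  })
}

info <- position_info(toy)
info
```

A fully conserved position (the first one here) reaches the maximum of log2(4) = 2 bits, while positions with more variation score lower.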
A cognitively better version of the plot is the default: when the position parameter is not specified, the plot aligns the largest contributor in each position along the y axis and shows all other variants in that position as a tail hanging below the axis. Longer tails indicate more variability in a position.
ggplot(data = ggfortify(sequences, peptide, method = "shannon")) +
  geom_logo(aes(x = position, y = info, group = element, label = element,
                fill = interaction(Polarity, Water)),
            alpha = 0.6) +
  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
  theme(legend.position = "bottom")
Besides the Shannon information, we can also visualize the frequencies of peptides in each position. We can either set method = "frequency", or calculate the (relative) frequency ourselves as:
ggplot(data = ggfortify(sequences, peptide, method = "shannon")) +
  geom_logo(aes(x = position, y = freq/total, group = element, label = element,
                fill = interaction(Polarity, Water)),
            alpha = 0.6) +
  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
  theme(legend.position = "bottom")
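The relative frequency is each element's count in a position divided by the number of sequences. A minimal base R sketch of that tally on made-up toy sequences follows; the column names `freq` and `total` mirror those used in the plotting call above, but the data frame is built by hand here and is only an assumption about the general shape of the prepared data.

```r
# Toy aligned sequences (hypothetical data, not from the package)
toy <- c("ACGT", "ACGA", "ACTT", "AGGT")
chars <- do.call(rbind, strsplit(toy, ""))

# Long format: one row per (sequence, position) pair
long <- data.frame(
  position = as.vector(col(chars)),
  element  = as.vector(chars),
  stringsAsFactors = FALSE
)

# Count each element by position, then derive the relative frequency
counts <- as.data.frame(table(long), responseName = "freq")
counts$total    <- nrow(chars)                 # number of sequences
counts$rel_freq <- counts$freq / counts$total
counts <- counts[counts$freq > 0, ]            # keep observed elements only
counts
```

Within each position the relative frequencies sum to one, which is why the stacked letters in a frequency-based plot all reach the same total height.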
Using the classic variant of alignment results in a stacked barchart of amino acids by position:
ggplot(data = ggfortify(sequences, peptide, method = "shannon")) +
  geom_logo(aes(x = position, y = info, group = element, label = element,
                fill = interaction(Polarity, Water)),
            alpha = 0.6, position = "classic") +
  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
  theme(legend.position = "bottom")
This implementation of sequence logos is a two-step process of data prepping/wrangling followed by the visualization.
The data prepping happens in the function ggfortify:
library(dplyr)
seq_info <- sequences %>%   # data pipeline for processing
  ggfortify(
    peptide,                # variable in which the sequences are stored
    treatment = class,
    method = "shannon",
    missing_encode = c(".", "*", NA)
  )
sequences specifies the variable that holds the sequences in the data set; treatment is a (list of) grouping variable(s) for which the (Shannon) information is calculated in each position. For peptide sequences, the data set aacids is used to provide additional information on amino acid properties.
head(seq_info)
By specifying the treatment parameter, the information is now calculated separately for each treatment, and we can assess the variability/conservation of the sequence by treatment:
seq_info %>%
  ggplot() +
  geom_logo(aes(x = class, y = info, group = element, label = element,
                fill = interaction(Polarity, Water)),
            alpha = 0.6) +
  scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
  theme(legend.position = "bottom") +
  facet_wrap(~position, ncol = 12)
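Conceptually, a treatment splits the sequences into groups before the per-position calculation is carried out. The base R sketch below illustrates that idea on made-up data. The data frame `toy`, the helper `entropy`, and the choice to report plain Shannon entropy are all assumptions for illustration; this is not a description of how ggfortify is implemented.

```r
# Hypothetical toy data: sequences with a grouping variable
toy <- data.frame(
  peptide = c("ACGT", "ACGA", "ACTT", "AGGT", "TCGT", "TCGT"),
  class   = c("a", "a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Shannon entropy (bits) of the letters observed in one alignment column
entropy <- function(column) {
  p <- table(column) / length(column)
  -sum(p * log2(p))
}

# Split the sequences by class, then compute the entropy for each position
entropy_by_class <- sapply(split(toy$peptide, toy$class), function(seqs) {
  chars <- do.call(rbind, strsplit(seqs, ""))
  apply(chars, 2, entropy)
})

# Rows are positions, columns are classes
entropy_by_class
```

In this toy example the two sequences of class "b" are identical, so every position in that class has zero entropy, whereas class "a" shows variation in positions 2 to 4.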
By default, the font used for logo plots is Helvetica, available as dataset alphabet. Each letter is implemented in the form of a polygon with x and y coordinates. The variable group contains the corresponding letter.
alphabet %>%
  filter(group == "B") %>%
  ggplot(aes(x = x, y = y)) +
  geom_polygon() +
  theme(aspect.ratio = 1)
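Storing letters as polygons makes it straightforward to stretch a letter to an arbitrary height, which is what a logo plot needs when letter height encodes information or frequency. The base R sketch below shows the idea with a crude hand-made "L" shape. The coordinates and the helper `place_letter` are hypothetical, and how geom_logo performs this step internally is not covered here.

```r
# A crude "L" outline in the unit square (made-up coordinates),
# laid out like the font data: x, y, and the letter in `group`
letter_L <- data.frame(
  x = c(0, 1, 1,    0.3,  0.3, 0),
  y = c(0, 0, 0.25, 0.25, 1,   1),
  group = "L",
  stringsAsFactors = FALSE
)

# Hypothetical helper: rescale a letter polygon vertically so that it
# starts at `ymin` and spans `height`
place_letter <- function(poly, ymin, height) {
  unit_y <- (poly$y - min(poly$y)) / diff(range(poly$y))  # map y to [0, 1]
  poly$y <- ymin + unit_y * height
  poly
}

# Stack the letter on top of whatever ends at y = 0.5, with height 2
stacked <- place_letter(letter_L, ymin = 0.5, height = 2)
range(stacked$y)
```

Only the y coordinates change; the x coordinates and the `group` label are left untouched, so the result can still be drawn with a polygon geom.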
Besides the default alphabet, the fonts Comic Sans, xkcd, and braille (for 3D printing) are implemented:
alphabet_comic %>%
  filter(group %in% c(LETTERS, 0:9)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_polygon() +
  theme(aspect.ratio = 1) +
  facet_wrap(~group, ncol = 11) +
  ggtitle("Comic Sans")

alphabet_xkcd %>%
  dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_polygon() +
  theme(aspect.ratio = 1) +
  facet_wrap(~group, ncol = 11) +
  ggtitle("xkcd font")

alphabet_braille %>%
  dplyr::filter(group %in% c(LETTERS, 0:9)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_polygon() +
  theme(aspect.ratio = 1) +
  facet_wrap(~group, ncol = 11) +
  ggtitle("Braille (use in 3d prints)")
Schneider, T. D., & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20), 6097-6100.