We present an elaborate guided tutorial of how to use the Logolas R package.
knitr::opts_chunk$set(dpi = 90,dev = "png")
Logolas offers two new features that makes logo visualization a more generic tool with potential applications in a much wider scope of problems.
Enrichment Depletion Logo (EDLogo) : General logo plotting softwares (seqLogo) primarily highlight only enrichment of symbols at each position of the logo, but Logolas allows the user to highlight both enrichment and depletion of symbols at any position, leading to more parsimonious and visually appealing representation.
String symbols : General logo plot softwares have limited library of symbols, typically restricted to English alphabets. Logolas allows the user to plot symbols for any alphanumeric string, and also allows for combination of strings and charcaters.
The most recent version of Logolas is available from Github using devtools R package.First, you would require to install the following Bioconductor packages.
source("https://bioconductor.org/biocLite.R") biocLite(c("Biostrings","BiocStyle","Biobase","seqLogo","ggseqlogo"))
Then install Logolas as follows
library(devtools) install_github("kkdey/Logolas",build_vignettes = TRUE)
Once you have installed the package, load the package in R by entering
library(Logolas)
To get an overview of the package, enter
help(package = "Logolas")
Logolas accepts two data formats as input
a vector of aligned character sequences (may be DNA, RNA or amino acid sequences), each of same length (see Example 1 below)
a positional frequency (weight) matrix, termed PFM (PWM), with the symbols to be plotted along the rows and the positions of aligned sequences, from which the matrix is generated, along the columns. (see Example 2)
Consider aligned strings of characters
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT", "CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTTTAT", "CTATAGT", "CTATTTT", "CTTATAT", "CTATATT", "CTCATTT", "CTTATTT", "CAATAGT", "CATTTGA", "CTCTTAT", "CTATTAT", "CTTTTAT", "CTATAAT", "CTTAGGT", "CTATTGT", "CTCATGT", "CTATAGT", "CTCGTTA", "CTAGAAT", "CAATGGT")
The standard logo plot, akin to seqLogo R package can be plotted using the
\textbf{logomaker()} function with type=Logo
.
logomaker(sequence, type = "Logo")
To plot the EDLogo plot for the same example, use the type=EDLogo
option instead.
logomaker(sequence, type = "EDLogo")
Instead of DNA.RNA sequence as above, one can also use amino acid character sequences.
library(ggseqlogo) data(ggseqlogo_sample) sequence <- seqs_aa$AKT1 logomaker(sequence, type = "EDLogo")
We now see an example of positional weight matrix (PWM) as input to logomaker().
data("seqlogo_example") seqlogo_example
We plot the logo plots for this PWM matrix.
logomaker(seqlogo_example, type = "Logo", return_heights = TRUE) logomaker(seqlogo_example, type = "EDLogo", return_heights = TRUE)
The r return_heights = TRUE
outputs the information content at each position for the standard logo plot (type = "Logo") and the heights of the stacks along the positive and negative Y axis, along
with the breakdown of the height due to different characters for the EDLogo plot (type = "EDLogo").
The logomaker() function provides three arguments to set the colors for the logos, a color_type specifying the scheme of coloring used,
colors denoting the cohort of colors used and a color_seed argument determining how sampling is done from this cohort.
The color_type argument can be of three types, per_row
, per_column
and per_symbol
.
colors
element is a cohort/library of colors (chosen suitably large) from which distinct colors are chosen based on distinct color_type
. The number of colors chosen is of same length as number of rows in table for per_row
(assigning a color to each string), of same length as number of columns in table for per_column
(assuming a color for each column), or a distinct color for a distinct symbol in per_symbol
. The length of colors
must be as large as the number of colors to be chosen in each scenario.
The default color_type
is per-row
and default colors
comprises of a large cohort of nearly 70 distinct colors from which colors are sampled using the color_seed
argument. The user can ignore the colors
option unless he is not happy with the set of colors used, and play around with the color_seed
instead and choose the one that presents a set of colors of his liking.
An example of using default colors
and just specifying color_seed
.
logomaker(seqlogo_example, type = "EDLogo", color_seed = 2000)
An alternative example of using an user-specified colors
.
logomaker(seqlogo_example, color_type = "per_row", colors = c("#7FC97F", "#BEAED4", "#FDC086", "#386CB0"), type = "EDLogo")
Besides the default style with filled symbols for each character, one can also use
characters with border styles. For the standard logo plot, this is accomplished by the
tofill
control argument.
logomaker(seqlogo_example, type = "Logo", logo_control = list(control = list(tofill= FALSE)), color_seed = 4000)
For an EDLogo plot, the arguments tofill_pos
and tofill_neg
represent the coloring scheme for the positive and the negative axes in an EDLogo plot.
logomaker(seqlogo_example, type = "EDLogo", logo_control = list(control = list(tofill_pos = TRUE, tofill_neg = FALSE)))
Logolas allows the user to scale the data based on a specified background information - specified
through the input bg
. The default value is NULL, in which case equal probability is assigned to each symbol.
Alternatively, the user can specify a vector (equal to in length to the number of symbols) which specifies the background probability for each symbol and assumes this same background probability for each position (column of PWM matrix). See examplw below.
bg <- c(0.05, 0.90, 0.03, 0.05) names(bg) <- c("A", "C", "G", "T") logomaker(seqlogo_example, bg=bg, type = "EDLogo")
As a second alternative, the user can specify bg
as a matrix of same dimension as the positional
freuqncy matrix underlying the input data, which is flexible in having separate background
probabilities for each position.
logomaker(seqlogo_example, bg=(seqlogo_example+1e-02), type = "EDLogo")
Logolas allows the user to plot symbols not just for characters as in previous examples, but for any alphanumeric string. We present two such examples below.
In the first example, demonstrating the relative frequencies of different mutation signatures (mismatch type at the center and bases flanking it), we see how Logolas handles data with a mix of strings and characters, with flanking positions being purely characters and the center position being purely string based (data from Shiraishi et al 2015).
data("mutation_sig") logomaker(mutation_sig, type = "Logo", color_seed = 2000)
The user may want to have distinct colors for distinct symbols.characters in this case
than a string. This is where we make use of the per_symbol
option for color_type.
logomaker(mutation_sig, type = "EDLogo", color_type = "per_symbol", color_seed = 2000)
In the second example, we use EDLogo plots to highlight enrichment and depletion of
histone marks in different regions of the genome, with respect to a specific
background bg
(data from Koch et al 2007).
data("histone_marks") logomaker(histone_marks$mat, bg=histone_marks$bgmat, type = "EDLogo")
Logolas plots can be plotted in multiple panels, as depicted below.
sequence <- c("CTATTGT", "CTCTTAT", "CTATTAA", "CTATTTA", "CTATTAT", "CTTGAAT", "CTTAGAT", "CTATTAA", "CTATTTA", "CTATTAT") Logolas::get_viewport_logo(1, 2, heights.val = 20) library(grid) seekViewport(paste0("plotlogo", 1)) logomaker(sequence, type = "Logo", logo_control = list(newpage = FALSE)) seekViewport(paste0("plotlogo", 2)) logomaker(sequence, type = "EDLogo", logo_control = list(newpage = FALSE))
In the same way, ggplot2 graphics can also be combined with Logolas plots.
While logomaker() takes a PFM, PWM or a set of aligned sequences as input, sometimes, some position specific scores are only available to the user. In this case, one can use the logo_pssm() in Logolas to plot the scoring matrix.
data(pssm) logo_pssm(pssm, control = list(round_off = 0))
The round_off
comtrol argument specifies the number of points after decimal allowed in the axes of the plot.
The authors would like to acknowledge Oliver Bembom, the author of seqLogo
for acting as an inspiration and providing the foundation on which this package is created. We also thank Peter Carbonetto, Edward Wallace and John Blischak for helpful feedback and discussions.
\section{Session Info}
sessionInfo()
Thank you for using Logolas !
If you have any questions, you can either open an issue in our Github page or write to Kushal K Dey (kkdey@uchicago.edu). Also please feel free to contribute to the package. You can contribute by submitting a pull request or by communicating with the said person.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.