msavisr: Multiple sequence alignment (MSA) visualization and...

msavisr    R Documentation

Multiple sequence alignment (MSA) visualization and manipulation thereof.

Description

Given a fasta-formatted MSA, msavisr will attempt to produce a visualization of the MSA with matches, mismatches, gaps, and (optionally) regions of interest (ROIs) highlighted in different colors. The matches, mismatches, and gaps are calculated by comparing every other sequence to a designated reference sequence within the MSA. ROIs are specified manually (see below).

Usage

msavisr(mymsa = NULL, myref = NULL, mypath = NULL,
refontop = TRUE, myroi = NULL, hnon = NULL, hmat = NULL,
hroi = NULL, wnon = NULL, wmat = NULL, wroi = NULL,
anon = NULL, amat = NULL, aroi = NULL, basecolors = NULL,
roicolors = NULL, cbfcols = TRUE)

Arguments

mymsa

(character string, mandatory) the name of the fasta-formatted file containing the multiple sequence alignment (MSA). The full path to the file can also be provided here (in which case, the mypath argument must be set to NULL).

myref

(character string, mandatory) the fasta header (not including the ">") of the sequence in the MSA that acts as the reference sequence.

mypath

(character string, optional) the path to the directory containing the MSA. (Default: looks in the current directory.)

refontop

(boolean, optional) should the reference sequence be placed at the top of the MSA plot (default) or at the bottom? (Set to FALSE for bottom.)

myroi

(list of vectors, optional) regions of interest (ROIs; i.e., positions in the MSA) that the user wishes to highlight can be specified manually via myroi. This must be supplied as a list of vectors wherein each vector is of the format c(seqname, pos, desc), or c(seqname, pos), or c(pos, desc), or c(pos). The seqname (name of a specific sequence in the MSA) and the desc (description of the feature) are optional. pos (i.e., the positions that denote the ROI) can be a single integer (e.g., 100) or a range of integers (e.g., 1:10). Each ROI vector can also hold any combination of single integer pos values and integer range pos values (e.g., c("Seq1", 1:10, 100, "Helix")). Thus, for example, if a SNP at position 100 in "seq2" and the region of nucleotides/amino acids 20:30 representing a domain in all sequences were to be ROIs, they would be supplied like so: list(c("seq2", 100, "SNP"), c(20:30, "domain")). ROIs can be assigned any color by the user (see argument roicolors); colors are assigned automatically otherwise (optionally from a color blind-friendly palette; see argument cbfcols).
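Because an R vector holds a single type, mixing names, positions, and descriptions in c() coerces every element to a character string. The sketch below uses only base R (it does not call msavisr) to show what the ROI vectors from the example above look like once built. How msavisr parses these vectors internally is not covered here.

```r
# Build the ROI list from the example above using base R only.
myroi <- list(
  c("seq2", 100, "SNP"),   # one position in a single sequence
  c(20:30, "domain")       # a range of positions in all sequences
)

# c() coerces mixed input to character, so positions are stored as strings.
snp_roi <- myroi[[1]]
domain_roi <- myroi[[2]]
print(snp_roi)
print(domain_roi)

# Each integer in a range becomes its own element of the vector.
print(length(domain_roi))
```

Inspecting the list with str(myroi) before passing it to msavisr is a quick way to confirm that each ROI vector contains the intended name, positions, and description.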

hnon

(double, optional) the height of the feature "bar(s)" for all mismatches and gaps in the MSA. (Default: 0.4.)

hmat

(double, optional) the height of the feature "bar(s)" for all matches in the MSA. (Default: 0.4.)

hroi

(double, optional) the height of the feature "bar(s)" for all ROIs in the MSA. (Default: 0.4.)

wnon

(double, optional) the width of the feature "bar(s)" for all mismatches and gaps in the MSA. (Default: 2.0.)

wmat

(double, optional) the width of the feature "bar(s)" for all matches in the MSA. (Default: 1.0.)

wroi

(double, optional) the width of the feature "bar(s)" for all ROIs in the MSA. (Default: 4.0.)

anon

(double, optional) the transparency of the feature "bar(s)" for all mismatches and gaps in the MSA. (Default: 1.0.)

amat

(double, optional) the transparency of the feature "bar(s)" for all matches in the MSA. (Default: 1.0.)

aroi

(double, optional) the transparency of the feature "bar(s)" for all ROIs in the MSA. (Default: 1.0.)

basecolors

(vector of 3 character strings, optional) the colors for the matches, mismatches, and gaps (in that order) can optionally be supplied by the user via this argument. Defaults are c("gray", "black", "white") when no colors are supplied and cbfcols (see below) is set to FALSE. If cbfcols == TRUE and no colors are supplied by the user, a palette from viridis is chosen.

roicolors

(vector of n character strings, optional) user-specified colors for the ROI features can be supplied via this argument. As many colors as there are ROIs need to be supplied, and the order of the colors should correspond to the order of the ROIs in the input list. If too few colors are supplied, colors are reused; if too many are supplied, the last few colors will not be used. Defaults are chosen automatically from grDevices::colors() if cbfcols == FALSE, and from viridis otherwise.
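The reuse and truncation behavior described above can be pictured with base R's rep_len(). This is only an illustration of the documented outcome, not msavisr's internal code, and match_roi_colors is a hypothetical helper name introduced for the example.

```r
# Hypothetical helper (not part of seqvisr): fit a color vector to the number of ROIs.
match_roi_colors <- function(colors, n_roi) {
  # rep_len() recycles the input when it is too short and truncates it when too long.
  rep_len(colors, n_roi)
}

# Three ROIs but only two colors: the first color is reused.
too_few <- match_roi_colors(c("red", "blue"), 3)

# Three ROIs but four colors: the last color is dropped.
too_many <- match_roi_colors(c("red", "blue", "green", "orange"), 3)

print(too_few)
print(too_many)
```

Supplying exactly one color per ROI avoids any ambiguity about which feature ends up with which color.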

cbfcols

(boolean, optional) allows for the user to choose whether the automatic coloring scheme used should be color-blind friendly. (Default: TRUE; set to FALSE to use non-color blind-friendly colors.)

Details

msavisr plots the matches as a separate geom_tile() layer, the gaps + mismatches as a geom_tile() layer, and the ROIs as a separate geom_tile() layer (if ROIs are supplied).

The user will have to tweak the values for the widths and heights of these layers (via the arguments outlined above) to achieve the desired visualization "effects". In general, it is advisable to set the widths of the mismatches + gaps (wnon) and/or the ROIs, if any (wroi), larger than that of the matches (wmat). The heights can be increased if necessary. Altering the transparency levels does not seem to be very useful. Note: altering the transparency levels does not update the transparency of the colors shown in the legend.

ROIs are especially useful for visualizing features such as single nucleotide polymorphisms (SNPs) in nucleotide MSAs and other such isolated features that might normally become "buried" within the bulk of the sequence. This can be easily achieved by indicating the SNP's position as an ROI and increasing its width (wroi) and/or height (hroi) values.
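The layering described in this section can be sketched with ggplot2 alone. The example below is not msavisr's source code: the toy data frame, its column names, and the color-free defaults are assumptions made for illustration. It stacks three geom_tile() layers and reuses the documented default widths (1 for matches, 2 for mismatches and gaps, 4 for ROIs) and height (0.4) to show why wider tiles keep isolated features visible.

```r
library(ggplot2)

# Toy alignment summary (assumed layout): one row per sequence and position.
toy <- data.frame(
  seq = rep(c("Ref0", "Seq1"), each = 10),
  pos = rep(1:10, times = 2),
  status = c(rep("match", 10),
             "match", "match", "mismatch", "match", "gap",
             "match", "match", "match", "mismatch", "match")
)

matches <- toy[toy$status == "match", ]
nonmatches <- toy[toy$status != "match", ]

# A single ROI (a hypothetical SNP at position 9 of Seq1).
roi <- data.frame(seq = "Seq1", pos = 9, status = "SNP")

# One geom_tile() layer per feature class; later layers are drawn on top.
p <- ggplot() +
  geom_tile(data = matches, aes(x = pos, y = seq, fill = status),
            width = 1, height = 0.4) +
  geom_tile(data = nonmatches, aes(x = pos, y = seq, fill = status),
            width = 2, height = 0.4) +
  geom_tile(data = roi, aes(x = pos, y = seq, fill = status),
            width = 4, height = 0.4)

print(p)
```

In a real alignment with thousands of columns, a tile of width 1 can shrink to less than a pixel, which is why widening the mismatch, gap, and ROI layers tends to help.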

Value

A ggplot2 object is returned to the parent environment for plotting and/or further downstream processing/manipulation.

Note

The only issue with specifying ROIs in the manner implemented here is that if, for instance, an ROI in "seq2" at position 100 and in "seq4" at the same position need to be highlighted, it cannot be supplied like so: c("seq2", "seq4", 100). Instead, it must be supplied as two separate vectors: c("seq2", 100), c("seq4", 100). In short, specifying common ROIs in a SUBSET of the MSA can be a tedious process. Unfortunately, as of now, no internal workarounds have been implemented.
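Until a workaround exists inside the package, the repetition can be generated on the user's side. The sketch below uses base R only; roi_for_seqs is a hypothetical helper name and is not part of seqvisr. Whether msavisr groups ROIs that share a description into a single legend entry is not stated here, so that should be checked on the resulting plot.

```r
# Hypothetical helper (not part of seqvisr): build one ROI vector per sequence name.
roi_for_seqs <- function(seqnames, pos, desc = NULL) {
  # c() drops a NULL desc, giving c(seqname, pos) when no description is supplied.
  lapply(seqnames, function(s) c(s, pos, desc))
}

# The same SNP position in two sequences of the MSA.
shared_snp <- roi_for_seqs(c("seq2", "seq4"), 100, "SNP")

# Combine with other ROIs into the single list expected by the myroi argument.
myroi <- c(shared_snp, list(c(20:30, "domain")))
str(myroi)
```

The resulting list has the same shape as one written out by hand, i.e. list(c("seq2", 100, "SNP"), c("seq4", 100, "SNP"), c(20:30, "domain")).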

Examples

## Not run: 
#Input data
testmsa <- system.file("extdata", "testaln_mrna.fasta", package = "seqvisr", mustWork = TRUE)

#Basic visualization
msavisr(mymsa = testmsa, myref = "Ref0") #No ROIs

#Defining an example ROI
testroi <- list(c("Ref0", 100:110, "Ref0 Domain1"), c(14, "Pseudouridine"),
                c(20:30, "Domain2"), c("Seq2", 55, "SNP"))

#MSA with ROIs
msavisr(mymsa = testmsa, myref = "Ref0", myroi = testroi)

## End(Not run)


vragh/seqvisr documentation built on April 20, 2024, 10:06 a.m.