Informative and efficient visualization of genomic structural variation (SV) is an important step to evaluate structurally complex regions of the genome and helps us to begin drawing biological conclusions. With an advances in long-read sequencing technologies such as HiFi (high-fidelity) PacBio [@Wenger2019-tn] and ONT (Oxford Nanopore) [@Deamer2002-hq] we are now able to fully assemble even the most complex regions of the genome such as segmental duplications (SDs) [@Vollger2022-bb] and centromeres [@Logsdon2023-dz]. A large part of our understanding of complex biological systems comes from direct visual observations. Therefore, in order to better grasp and understand structurally complex genomic regions an efficient visualization is of paramount importance. We have developed SVbyEye exactly for this purpose such that we are able to directly observe complexity of a human (or other organisms) genome in question with respect to a linear genome reference. SVbyEye is inspired by the previously developed tool called Miropeats [@Parsons1995-nc] and brings its visuals to a popular scripting language R and visualization paradigm using ggplot2 [@ggplot2-book].
\newpage
In order to create an input alignments for SVbyEye visualisation we usually use minimap2 (version >=2.24) [@Li2016-ha], however any sequence-to-sequence aligner that can export alignemnts in PAF format should be sufficient. We note, however, we have tested our tool only using minimap2 alignments.
When running minimap2 alignments please make sure that parameter -a is not set because it would export alignment in SAM format. Also make sure that parameter -c is set in order to export CIGAR strings as 'cg' tag in the output PAF. Lastly, consider to set parameter '--eqx' in order to output CIGAR string using '=/X' operators for sequence match/mismatch.
minimap2 -x asm20 -c --eqx --secondary=no {reference.fasta} {query.fasta} > {output.alignment}
minimap2 -x asm20 -c --eqx --secondary=no {target.fasta} {query.fasta} > {output.alignment}
minimap2 -x asm20 -c --eqx -D -P --dual=no {input.multi.fasta} {input.multi.fasta} > {output.ava.alignment}
r BiocStyle::Biocpkg("SVbyEye")
package expects as an input sequence alignments in PAF format. Such alignments can be read using readPaf()
function. Subsequently, such alignments can be filtered using filterPaf()
function. Lastly, one can also orient PAF alignments based on desired majority orientation (majority.strand = '+' or '-'
) or can force to flip all alignments in the PAF file to opposite orientation with parameter force = TRUE
.
First load the r BiocStyle::Biocpkg("SVbyEye")
package.
## Load the SVbyEye package library(SVbyEye)
## Get PAF file to read paf.file <- system.file("extdata", "test1.paf", package = "SVbyEye" ) ## Read in PAF paf.table <- readPaf( paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg" ) ## Example PAF table, SVbyEye expects given column names paf.table ## Filter alignment based on size filterPaf(paf.table = paf.table, min.align.len = 100000) ## Force to flip orientation of PAF alignments flipPaf(paf.table = paf.table, force = TRUE)
The main function of this package is called plotMiro()
and can be used for visualization of a single sequence aligned to the reference or two any to DNA sequences aligned to each other. SVbyEye is visualizing such alignments in a horizontal layout with the target sequence being on at the top and query at the bottom.
To create a simple miropeat style plot follow instructions below.
## Get PAF to plot paf.file <- system.file("extdata", "test1.paf", package = "SVbyEye" ) ## Read in PAF paf.table <- readPaf( paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg" ) ## Make a plot colored by alignment directionality plotMiro(paf.table = paf.table, color.by = "direction")
User have a control over a number of visual features that are documented within a plotMiro()
function documentation that can be invoked ?plotMiro
.
For instance, user can specify own color scheme to color alignments that are in plus (forward) or minus (reverse) direction.
## Use custom color palette to color alignment directionality plotMiro(paf.table = paf.table, color.palette = c("+" = "azure3", "-" = "yellow3"))
Next, user can decide to color alignments by their identity instead of direction.
## Color alignments by sequence identity plotMiro(paf.table = paf.table, color.by = "identity")
We advise that coloring alignments by sequence identity is the most useful in connection with binning the alignments into separate bins. This is useful when investigating regions of high and low sequence identity between the target and the query sequence. We note that when the parameter 'binsize' is defined the parameter 'color.by' is set to 'identity' by default.
## Color by alignment by sequence identity plotMiro(paf.table = paf.table, binsize = 1000)
Currently there are preset breaks of sequence identity (see ?plotMiro
) that can be changed by setting the parameter 'perc.identity.breaks'.
## Color by alignment by sequence identity plotMiro(paf.table = paf.table, binsize = 1000, perc.identity.breaks = c(99.5, 99.6, 99.7, 99.8, 99.9))
Often it is needed to mark certain regions between the query and target alignments such as gene position, position of segmental duplications (SDs) or other DNA functional elements. This can be done by adding extra layers of annotation ranges either on top of the target or query alignments.
Annotation ranges are expected to be submitted to addAnnotation()
function as a GenomicRanges object and are visualized as either arrowhead (can reflect range orientation) or rectangle.
## Make a plot plt <- plotMiro(paf.table = paf.table) ## Load target annotation file target.annot <- system.file("extdata", "test1_target_annot.txt", package = "SVbyEye") target.annot.df <- read.table(target.annot, header = TRUE, sep = "\t", stringsAsFactors = FALSE) target.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(target.annot.df) ## Add target annotation as arrowhead addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr, coordinate.space = "target") ## Load query annotation file query.annot <- system.file("extdata", "test1_query_annot.txt", package = "SVbyEye") query.annot.df <- read.table(query.annot, header = TRUE, sep = "\t", stringsAsFactors = FALSE) query.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(query.annot.df) ## Add query annotation as rectangle addAnnotation(ggplot.obj = plt, annot.gr = query.annot.gr, shape = "rectangle", coordinate.space = "query")
SVbyEye also offers extra functionalities to lift between query and target coordinates and vice versa based on the underlying PAF alignment file.
## Lift target annotation to query and plot target.gr <- as("target.region:19100000-19150000", "GRanges") lifted.annot.gr <- liftRangesToAlignment(paf.table = paf.table, gr = target.gr, direction = "target2query") plt1 <- addAnnotation( ggplot.obj = plt, annot.gr = target.gr, shape = "rectangle", coordinate.space = "target" ) addAnnotation( ggplot.obj = plt1, annot.gr = lifted.annot.gr, shape = "rectangle", coordinate.space = "query" )
For instance, inversions are often flanked by long segments of SDs that are highly identical to each other. These can be visualized as follows. Direction of these segments can be reflected as well in case the strand is defined in the GenomicRanges object.
## Load segmental duplication annotation sd.annot <- system.file("extdata", "test1.sd.annot.RData", package = "SVbyEye") sd.annot.gr <- get(load(sd.annot)) ## Create a custom discrete levels based on sd.categ <- findInterval(sd.annot.gr$fracMatch, vec = c(0.95, 0.98, 0.99)) sd.categ <- dplyr::recode(sd.categ, "0" = "<95%", "1" = "95-98%", "2" = "98-99%", "3" = ">=99%") sd.categ <- factor(sd.categ, levels = c("<95%", "95-98%", "98-99%", ">=99%")) sd.annot.gr$sd.categ <- sd.categ ## Define a custom color palette color.palette <- c( "<95%" = "gray72", "95-98%" = "gray47", "98-99%" = "#cccc00", ">=99%" = "#ff6700" ) ## Add annotation to the plot addAnnotation( ggplot.obj = plt, annot.gr = sd.annot.gr, fill.by = "sd.categ", color.palette = color.palette, coordinate.space = "target" )
Importantly, each annotation layer can be given its own name such that one can distinguish different layers of information added to the plot.
## Give an annotation layer its own label ## Add target annotation as arrowhead plt1 <- addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr, coordinate.space = "target", annotation.label = 'Region') ## Add SD annotation addAnnotation( ggplot.obj = plt1, annot.gr = sd.annot.gr, fill.by = "sd.categ", color.palette = color.palette, coordinate.space = "target", annotation.label = 'SDs' )
User has also control over the level where an annotation track should appear. By Default each annotation layer is displayed at the level defined by adding 0.05 fraction of the total size of y-axis in the target or query direction. This can of course be increased or decreased to a zero what means the annotation will be added at the same level as the target or query alignments.
## Add target annotation at the increased y-axis level addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr, coordinate.space = "target", annotation.level = 0.2) ## Add target annotation at the zero level addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr, coordinate.space = "target", annotation.level = 0)
Gene annotation such exons or split genomic alignments can be visualized as set of genomic ranges that are connected by a horizontal line. This is defined based on a shared identifier (ID) that marks genomic ranges that are supposed to be grouped by such horizontal line.
## Create gene-like annotation test.gr <- GenomicRanges::GRanges( seqnames = 'target.region', ranges = IRanges::IRanges(start = c(19000000,19030000,19070000), end = c(19010000,19050000,19090000)), ID = 'gene1') ## Add single gene annotation addAnnotation(ggplot.obj = plt, annot.gr = test.gr, coordinate.space = "target", shape = 'rectangle', annotation.group = 'ID', fill.by = 'ID')
Sometimes there is a need to highlight the whole genomic alignments with unique color. This can be done by overlaying the original alignment with an extra alignment.
## Make a plot plt1 <- plotMiro(paf.table = paf.table, add.alignment.arrows = FALSE) ## Highlight alignment as a filled polygon addAlignments(ggplot.obj = plt1, paf.table = paf.table[3,], fill.by = 'strand', fill.palette = c('+' = 'red')) ## Highlight alignment as outlined polygon by dashed line addAlignments(ggplot.obj = plt1, paf.table = paf.table[4,], linetype = 'dashed')
Another important feature of SVbyEye is its ability to break PAF alignments at the positions of insertions and deletions. This can by done by setting the minimum size of insertion and deletion to be reported and setting the way how those are marked within the plot, either outlined or filled. By default, deletions are colored by red and insertion by blue color. Deletions and insertions are defined as a sequence that is either missing or is extra within query with respect to the target.
## Load the data to plot paf.file <- system.file("extdata", "test3.paf", package = "SVbyEye") paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg") ## Make plot and break alignment at the position where there are deletions >=50bp plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "outline") ## Highlight detected deletion by filled polygons plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "fill")
For convenience one can also opt to break the PAF alignments at insertions and/or deletions and just report these as a data table.
## Break PAF alignment at deletions >=50bp alns <- breakPaf(paf.table = paf.table, min.deletion.size = 50) ## Print out detected deletions alns$SVs
SVbyEye also allows to visualize alignments between more than two sequences. This can be done by aligning multiple sequences to each other using so called all-versus-all (AVA) alignments. In such a way alignment are visualized in subsequent order where alignment of first sequence are shown with respect to the second and then second sequence to the third etc. Alternatively, if the order of sequence in such a progressive alignment is know one can avoid AVA alignment and align each sequence to next in a defined order and merged these into a single PAF file. This is useful when one want to visualize alignment relationship between multiple sequences of the same or different species.
## Get PAF to plot paf.file <- system.file("extdata", "test_ava.paf", package = "SVbyEye") ## Read in PAF paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg") ## Make a plot colored by alignment directionality plotAVA(paf.table = paf.table, color.by = "direction")
Many of the same parameter settings as for plotMiro applies to plotAVA as well. Briefly, user can color alignments based on their orientation or identity, define desired color palette, and bin the alignments into user defined bins.
## Color by fraction of matched bases in each alignment plotAVA(paf.table = paf.table, color.by = "identity", perc.identity.breaks = c(85, 90, 95)) ## Use custom color palette to color alignment directionality plotAVA(paf.table = paf.table, color.by = "direction", color.palette = c("+" = "azure3", "-" = "yellow3")) ## Bin PAF alignments into user defined bin and color them by sequence identity (% of matched bases) plotAVA(paf.table = paf.table, binsize = 10000)
In addition to the above listed parameters user can also define desired sequence order in which progressive alignments will be reported. Importantly, only sample defined in 'seqnames.order' parameter will be plotted.
## Define custom sample/sequence order seqnames.order <- c("HG00438_2", "HG01358_2", "HG02630_2", "HG03453_2") plotAVA(paf.table = paf.table, color.by = "direction", seqnames.order = seqnames.order) ## Only samples present in custom sample order are being plotted seqnames.order <- c("HG00438_2", "HG01358_2", "HG03453_2") plotAVA(paf.table = paf.table, color.by = "direction", seqnames.order = seqnames.order)
Adding annotation to AVA alignments slightly differs from what we learned with plotMiro function. Due to the fact that there are more than two sequences annotation levels are defined based on the sequence/sample identifier they belong to. This means user have to define an extra column that contains sequence identifiers to which each genomic range belongs to.
## Add annotation to all-versus-all alignments ## plt <- plotAVA(paf.table = paf.table, color.by = 'direction') annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye") annot.gr <- get(load(annot.file)) addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = 'self', y.label.id = 'ID') ## Add annotation to the same level as the sequence alignments
Again user can set the annotation level to zero in order to plot each annotation directly on the line corresponding to each unique sequence/sample.
## Add annotation to the same level as are the sequence/sample alignments plt <- plotAVA(paf.table = paf.table, color.by = 'direction') annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye") annot.gr <- get(load(annot.file)) addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = 'self', y.label.id = 'ID', annotation.level = 0)
When plotting the annotation user can define a custom color palette that needs to matched to the unique meta column in annotation ranges.
## Add annotation to the same level as are the sequence/sample alignments plt <- plotAVA(paf.table = paf.table, color.by = 'direction') annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye") annot.gr <- get(load(annot.file)) ## Define color palette colors <- setNames(annot.gr$color, annot.gr$Repeat) ## Set fill.by variable to color each range using defined color palette addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, fill.by = 'Repeat', color.palette = colors, coordinate.space = 'self', y.label.id = 'ID', annotation.level = 0)
Lastly, there is again a possibility to break PAF alignments at the insertion and deletions and visualize them as an outlined or filled regions between alignments.
## Break PAF alignments at insertions and deletions and highlight them by an outline plotAVA(paf.table = paf.table, color.by = "direction", min.deletion.size = 1000, min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"), highlight.sv = 'outline') ## Break PAF alignments at insertions and deletions and highlight them by filled polygons plotAVA(paf.table = paf.table, color.by = "direction", min.deletion.size = 1000, min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"), highlight.sv = 'fill')
There are cases when one wants to visualize regions that are homologous to each other we offer functions to visualize alignments of a given sequence to itself. Such self alignments are typical for segmental duplications that are paralogous sequences that have high identity.
## Get PAF to plot paf.file <- system.file("extdata", "test2.paf", package = "SVbyEye") ## Read in PAF paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg") ## Make a plot ## Plot alignment as horizontal dotplots and color by alignment directionality plotSelf(paf.table = paf.table, color.by = "direction", shape = "segment")
## Plot alignment as arcs and color by alignment directionality plotSelf(paf.table = paf.table, color.by = "direction", shape = "arc")
## Plot alignment as arrows and color by alignment directionality plotSelf(paf.table = paf.table, color.by = "direction", shape = "arrow")
Again some of the previous parameter apply for the self-alignments as well. For example user can define the color palette, bin the alignments as well as break alignments at the insertion and deletions.
## Plot alignment as arcs and color by alignment directionality plotSelf(paf.table = paf.table, color.by = "direction", shape = "arc", color.palette = c("+" = "azure3", "-" = "yellow3"))
## Bin PAF alignments into user defined bin and color them by sequence identity (% of matched bases) plotSelf(paf.table = paf.table, binsize = 1000)
When breaking the self-alignments at insertions and deletions proximal duplication is considered as query and the distal duplication as a target. It means that sequence missing in query is considered as deletion and sequence inserted in query with respect to target as insertion.
## Highlight structural variants within self-alignments plotSelf(paf.table = paf.table, min.deletion.size = 50, min.insertion.size = 50, highlight.sv = "outline", shape = "arc")
## Get PAF to plot paf.file <- system.file("extdata", "test_getFASTA.paf", package = "SVbyEye" ) ## Read in PAF paf.table <- readPaf( paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg" ) ## Make a plot colored by alignment directionality plt <- plotMiro(paf.table = paf.table, color.by = "direction") ## Get query sequence composition fa.file <- paf.file <- system.file("extdata", "test_getFASTA_query.fasta", package = "SVbyEye" ) ## Calculate GC content per 5kbp bin gc.content <- fasta2nucleotideContent(fasta.file = fa.file, binsize = 5000, nucleotide.content = 'GC') ## Add GC content as annotation heatmap addAnnotation(ggplot.obj = plt, annot.gr = gc.content, shape = 'rectangle', fill.by = 'GC_nuc.freq', coordinate.space = 'query', annotation.label = 'GCcont') ## Calculate AT dinucleotide frequency dinuc.content <- fasta2nucleotideContent(fasta.file = fa.file, binsize = 5000, nucleotide.content = 'AT') ## Add AT dinucleotide frequency as annotation heatmap addAnnotation(ggplot.obj = plt, annot.gr = dinuc.content, shape = 'rectangle', fill.by = 'AT_nuc.freq', coordinate.space = 'query', annotation.label = 'ATdinuc')
\newpage
Report any issues here:
\newpage
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.