Introduction

Informative and efficient visualization of genomic structural variation (SV) is an important step to evaluate structurally complex regions of the genome and helps us to begin drawing biological conclusions. With an advances in long-read sequencing technologies such as HiFi (high-fidelity) PacBio [@Wenger2019-tn] and ONT (Oxford Nanopore) [@Deamer2002-hq] we are now able to fully assemble even the most complex regions of the genome such as segmental duplications (SDs) [@Vollger2022-bb] and centromeres [@Logsdon2023-dz]. A large part of our understanding of complex biological systems comes from direct visual observations. Therefore, in order to better grasp and understand structurally complex genomic regions an efficient visualization is of paramount importance. We have developed SVbyEye exactly for this purpose such that we are able to directly observe complexity of a human (or other organisms) genome in question with respect to a linear genome reference. SVbyEye is inspired by the previously developed tool called Miropeats [@Parsons1995-nc] and brings its visuals to a popular scripting language R and visualization paradigm using ggplot2 [@ggplot2-book].

\newpage

Generation of input alignements in PAF format

In order to create an input alignments for SVbyEye visualisation we usually use minimap2 (version >=2.24) [@Li2016-ha], however any sequence-to-sequence aligner that can export alignemnts in PAF format should be sufficient. We note, however, we have tested our tool only using minimap2 alignments.

When running minimap2 alignments please make sure that parameter -a is not set because it would export alignment in SAM format. Also make sure that parameter -c is set in order to export CIGAR strings as 'cg' tag in the output PAF. Lastly, consider to set parameter '--eqx' in order to output CIGAR string using '=/X' operators for sequence match/mismatch.

Generate assembly-to-reference minimap alignments

minimap2 -x asm20 -c --eqx --secondary=no {reference.fasta} {query.fasta} > {output.alignment}

Generate sequence-to-sequence minimap alignments

minimap2 -x asm20 -c --eqx --secondary=no {target.fasta} {query.fasta} > {output.alignment}

Generate all-versus-all minimap alignments

minimap2 -x asm20 -c --eqx -D -P --dual=no {input.multi.fasta} {input.multi.fasta} > {output.ava.alignment}

Reading and filtering alignements in PAF format

r BiocStyle::Biocpkg("SVbyEye") package expects as an input sequence alignments in PAF format. Such alignments can be read using readPaf() function. Subsequently, such alignments can be filtered using filterPaf() function. Lastly, one can also orient PAF alignments based on desired majority orientation (majority.strand = '+' or '-') or can force to flip all alignments in the PAF file to opposite orientation with parameter force = TRUE.

First load the r BiocStyle::Biocpkg("SVbyEye") package.

## Load the SVbyEye package
library(SVbyEye)
## Get PAF file to read
paf.file <- system.file("extdata", "test1.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Example PAF table, SVbyEye expects given column names
paf.table

## Filter alignment based on size
filterPaf(paf.table = paf.table, min.align.len = 100000)

## Force to flip orientation of PAF alignments
flipPaf(paf.table = paf.table, force = TRUE)

Visualization of sequence-to-sequence alignemnts

The main function of this package is called plotMiro() and can be used for visualization of a single sequence aligned to the reference or two any to DNA sequences aligned to each other. SVbyEye is visualizing such alignments in a horizontal layout with the target sequence being on at the top and query at the bottom.

To create a simple miropeat style plot follow instructions below.

## Get PAF to plot
paf.file <- system.file("extdata", "test1.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Make a plot colored by alignment directionality
plotMiro(paf.table = paf.table, color.by = "direction")

User have a control over a number of visual features that are documented within a plotMiro() function documentation that can be invoked ?plotMiro.

For instance, user can specify own color scheme to color alignments that are in plus (forward) or minus (reverse) direction.

## Use custom color palette to color alignment directionality
plotMiro(paf.table = paf.table, 
         color.palette = c("+" = "azure3", "-" = "yellow3"))

Next, user can decide to color alignments by their identity instead of direction.

## Color alignments by sequence identity
plotMiro(paf.table = paf.table, color.by = "identity")

We advise that coloring alignments by sequence identity is the most useful in connection with binning the alignments into separate bins. This is useful when investigating regions of high and low sequence identity between the target and the query sequence. We note that when the parameter 'binsize' is defined the parameter 'color.by' is set to 'identity' by default.

## Color by alignment by sequence identity
plotMiro(paf.table = paf.table, binsize = 1000)

Currently there are preset breaks of sequence identity (see ?plotMiro) that can be changed by setting the parameter 'perc.identity.breaks'.

## Color by alignment by sequence identity
plotMiro(paf.table = paf.table, binsize = 1000, 
         perc.identity.breaks = c(99.5, 99.6, 99.7, 99.8, 99.9))

Adding annotations to the miropeat style plot

Often it is needed to mark certain regions between the query and target alignments such as gene position, position of segmental duplications (SDs) or other DNA functional elements. This can be done by adding extra layers of annotation ranges either on top of the target or query alignments. Annotation ranges are expected to be submitted to addAnnotation() function as a GenomicRanges object and are visualized as either arrowhead (can reflect range orientation) or rectangle.

## Make a plot
plt <- plotMiro(paf.table = paf.table)
## Load target annotation file
target.annot <- system.file("extdata", "test1_target_annot.txt", package = "SVbyEye")
target.annot.df <- read.table(target.annot, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
target.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(target.annot.df)
## Add target annotation as arrowhead
addAnnotation(ggplot.obj = plt, 
              annot.gr = target.annot.gr, coordinate.space = "target")

## Load query annotation file
query.annot <- system.file("extdata", "test1_query_annot.txt", package = "SVbyEye")
query.annot.df <- read.table(query.annot, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
query.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(query.annot.df)
## Add query annotation as rectangle
addAnnotation(ggplot.obj = plt, 
              annot.gr = query.annot.gr, shape = "rectangle", coordinate.space = "query")

SVbyEye also offers extra functionalities to lift between query and target coordinates and vice versa based on the underlying PAF alignment file.

## Lift target annotation to query and plot
target.gr <- as("target.region:19100000-19150000", "GRanges")
lifted.annot.gr <- liftRangesToAlignment(paf.table = paf.table,
                                         gr = target.gr, direction = "target2query")
plt1 <- addAnnotation(
    ggplot.obj = plt, annot.gr = target.gr, shape = "rectangle",
     coordinate.space = "target"
 )
 addAnnotation(
     ggplot.obj = plt1, annot.gr = lifted.annot.gr, shape = "rectangle",
     coordinate.space = "query"
 )

For instance, inversions are often flanked by long segments of SDs that are highly identical to each other. These can be visualized as follows. Direction of these segments can be reflected as well in case the strand is defined in the GenomicRanges object.

## Load segmental duplication annotation
sd.annot <- system.file("extdata", "test1.sd.annot.RData", package = "SVbyEye")
sd.annot.gr <- get(load(sd.annot))
## Create a custom discrete levels based on 
sd.categ <- findInterval(sd.annot.gr$fracMatch, vec = c(0.95, 0.98, 0.99))
sd.categ <- dplyr::recode(sd.categ, 
                          "0" = "<95%", "1" = "95-98%", 
                          "2" = "98-99%", "3" = ">=99%")
sd.categ <- factor(sd.categ, levels = c("<95%", "95-98%", "98-99%", ">=99%"))
sd.annot.gr$sd.categ <- sd.categ
## Define a custom color palette
color.palette <- c(
   "<95%" = "gray72", "95-98%" = "gray47", "98-99%" = "#cccc00",
   ">=99%" = "#ff6700"
)
## Add annotation to the plot
addAnnotation(
   ggplot.obj = plt, annot.gr = sd.annot.gr, fill.by = "sd.categ",
   color.palette = color.palette, coordinate.space = "target"
)

Importantly, each annotation layer can be given its own name such that one can distinguish different layers of information added to the plot.

## Give an annotation layer its own label
## Add target annotation as arrowhead
plt1 <- addAnnotation(ggplot.obj = plt, 
              annot.gr = target.annot.gr, coordinate.space = "target", 
              annotation.label = 'Region')
## Add SD annotation
addAnnotation(
   ggplot.obj = plt1, annot.gr = sd.annot.gr, fill.by = "sd.categ",
   color.palette = color.palette, coordinate.space = "target", 
   annotation.label = 'SDs'
)

User has also control over the level where an annotation track should appear. By Default each annotation layer is displayed at the level defined by adding 0.05 fraction of the total size of y-axis in the target or query direction. This can of course be increased or decreased to a zero what means the annotation will be added at the same level as the target or query alignments.

## Add target annotation at the increased y-axis level
addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr, 
              coordinate.space = "target", annotation.level = 0.2)

## Add target annotation at the zero level
addAnnotation(ggplot.obj = plt, annot.gr = target.annot.gr,  
              coordinate.space = "target", annotation.level = 0)

Adding gene annotations

Gene annotation such exons or split genomic alignments can be visualized as set of genomic ranges that are connected by a horizontal line. This is defined based on a shared identifier (ID) that marks genomic ranges that are supposed to be grouped by such horizontal line.

## Create gene-like annotation
test.gr <- GenomicRanges::GRanges(
              seqnames = 'target.region',
              ranges = IRanges::IRanges(start = c(19000000,19030000,19070000),
                                        end = c(19010000,19050000,19090000)),
                                        ID = 'gene1')
## Add single gene annotation
addAnnotation(ggplot.obj = plt, annot.gr = test.gr, coordinate.space = "target",
             shape = 'rectangle', annotation.group = 'ID', fill.by = 'ID')

Highlighting whole alignments

Sometimes there is a need to highlight the whole genomic alignments with unique color. This can be done by overlaying the original alignment with an extra alignment.

## Make a plot
plt1 <- plotMiro(paf.table = paf.table, add.alignment.arrows = FALSE)
## Highlight alignment as a filled polygon
addAlignments(ggplot.obj = plt1, paf.table = paf.table[3,], fill.by = 'strand', 
              fill.palette = c('+' = 'red'))
## Highlight alignment as outlined polygon by dashed line
addAlignments(ggplot.obj = plt1, paf.table = paf.table[4,], linetype = 'dashed')

Detection and visualization of insertions and deletions

Another important feature of SVbyEye is its ability to break PAF alignments at the positions of insertions and deletions. This can by done by setting the minimum size of insertion and deletion to be reported and setting the way how those are marked within the plot, either outlined or filled. By default, deletions are colored by red and insertion by blue color. Deletions and insertions are defined as a sequence that is either missing or is extra within query with respect to the target.

## Load the data to plot
paf.file <- system.file("extdata", "test3.paf", package = "SVbyEye")
paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg")

## Make plot and break alignment at the position where there are deletions >=50bp 
plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "outline")

## Highlight detected deletion by filled polygons  
plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "fill")

For convenience one can also opt to break the PAF alignments at insertions and/or deletions and just report these as a data table.

## Break PAF alignment at deletions >=50bp
alns <- breakPaf(paf.table = paf.table, min.deletion.size = 50)
## Print out detected deletions
alns$SVs

Visualization of all-versus-all sequence alignments

SVbyEye also allows to visualize alignments between more than two sequences. This can be done by aligning multiple sequences to each other using so called all-versus-all (AVA) alignments. In such a way alignment are visualized in subsequent order where alignment of first sequence are shown with respect to the second and then second sequence to the third etc. Alternatively, if the order of sequence in such a progressive alignment is know one can avoid AVA alignment and align each sequence to next in a defined order and merged these into a single PAF file. This is useful when one want to visualize alignment relationship between multiple sequences of the same or different species.

## Get PAF to plot
paf.file <- system.file("extdata", "test_ava.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg")
## Make a plot colored by alignment directionality
plotAVA(paf.table = paf.table, color.by = "direction")

Many of the same parameter settings as for plotMiro applies to plotAVA as well. Briefly, user can color alignments based on their orientation or identity, define desired color palette, and bin the alignments into user defined bins.

## Color by fraction of matched bases in each alignment
plotAVA(paf.table = paf.table, color.by = "identity", perc.identity.breaks = c(85, 90, 95))
## Use custom color palette to color alignment directionality
plotAVA(paf.table = paf.table, color.by = "direction", 
        color.palette = c("+" = "azure3", "-" = "yellow3"))
## Bin PAF alignments into user defined bin and color them by sequence identity (% of matched bases)
plotAVA(paf.table = paf.table, binsize = 10000)

In addition to the above listed parameters user can also define desired sequence order in which progressive alignments will be reported. Importantly, only sample defined in 'seqnames.order' parameter will be plotted.

## Define custom sample/sequence order
seqnames.order <- c("HG00438_2", "HG01358_2", "HG02630_2", "HG03453_2")
plotAVA(paf.table = paf.table, color.by = "direction", seqnames.order = seqnames.order)
## Only samples present in custom sample order are being plotted
seqnames.order <- c("HG00438_2", "HG01358_2", "HG03453_2")
plotAVA(paf.table = paf.table, color.by = "direction", seqnames.order = seqnames.order)

Adding annotation to AVA alignments slightly differs from what we learned with plotMiro function. Due to the fact that there are more than two sequences annotation levels are defined based on the sequence/sample identifier they belong to. This means user have to define an extra column that contains sequence identifiers to which each genomic range belongs to.

## Add annotation to all-versus-all alignments ##
plt <- plotAVA(paf.table = paf.table, color.by = 'direction')
annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye")
annot.gr <- get(load(annot.file))
addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = 'self', 
              y.label.id = 'ID')
## Add annotation to the same level as the sequence alignments

Again user can set the annotation level to zero in order to plot each annotation directly on the line corresponding to each unique sequence/sample.

## Add annotation to the same level as are the sequence/sample alignments
plt <- plotAVA(paf.table = paf.table, color.by = 'direction')
annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye")
annot.gr <- get(load(annot.file))
addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = 'self', 
              y.label.id = 'ID', annotation.level = 0)

When plotting the annotation user can define a custom color palette that needs to matched to the unique meta column in annotation ranges.

## Add annotation to the same level as are the sequence/sample alignments
plt <- plotAVA(paf.table = paf.table, color.by = 'direction')
annot.file <- system.file("extdata", "test_annot_ava.RData", package="SVbyEye")
annot.gr <- get(load(annot.file))
## Define color palette
colors <- setNames(annot.gr$color, annot.gr$Repeat)
## Set fill.by variable to color each range using defined color palette
addAnnotation(ggplot.obj = plt, annot.gr = annot.gr, fill.by = 'Repeat', 
              color.palette = colors, coordinate.space = 'self', y.label.id = 'ID',
              annotation.level = 0)

Lastly, there is again a possibility to break PAF alignments at the insertion and deletions and visualize them as an outlined or filled regions between alignments.

## Break PAF alignments at insertions and deletions and highlight them by an outline
plotAVA(paf.table = paf.table, color.by = "direction", min.deletion.size = 1000, 
        min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"), 
        highlight.sv = 'outline')

## Break PAF alignments at insertions and deletions and highlight them by filled polygons
plotAVA(paf.table = paf.table, color.by = "direction", min.deletion.size = 1000, 
        min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"), 
        highlight.sv = 'fill')

Visualization of self alignments

There are cases when one wants to visualize regions that are homologous to each other we offer functions to visualize alignments of a given sequence to itself. Such self alignments are typical for segmental duplications that are paralogous sequences that have high identity.

## Get PAF to plot
paf.file <- system.file("extdata", "test2.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(paf.file = paf.file, include.paf.tags = TRUE, restrict.paf.tags = "cg")
## Make a plot
## Plot alignment as horizontal dotplots and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "segment")
## Plot alignment as arcs and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "arc")
## Plot alignment as arrows and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "arrow")

Again some of the previous parameter apply for the self-alignments as well. For example user can define the color palette, bin the alignments as well as break alignments at the insertion and deletions.

## Plot alignment as arcs and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "arc", 
         color.palette = c("+" = "azure3", "-" = "yellow3"))
## Bin PAF alignments into user defined bin and color them by sequence identity (% of matched bases)
plotSelf(paf.table = paf.table, binsize = 1000)

When breaking the self-alignments at insertions and deletions proximal duplication is considered as query and the distal duplication as a target. It means that sequence missing in query is considered as deletion and sequence inserted in query with respect to target as insertion.

## Highlight structural variants within self-alignments
plotSelf(paf.table = paf.table, min.deletion.size = 50, min.insertion.size = 50,
         highlight.sv = "outline", shape = "arc")

Summarizing sequence composition as annotation heatmaps

## Get PAF to plot
paf.file <- system.file("extdata", "test_getFASTA.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Make a plot colored by alignment directionality
plt <- plotMiro(paf.table = paf.table, color.by = "direction")
## Get query sequence composition
fa.file <- paf.file <- system.file("extdata", "test_getFASTA_query.fasta",
    package = "SVbyEye"
)
## Calculate GC content per 5kbp bin
gc.content <- fasta2nucleotideContent(fasta.file = fa.file, binsize = 5000, nucleotide.content = 'GC')
## Add GC content as annotation heatmap
addAnnotation(ggplot.obj = plt, annot.gr = gc.content, shape = 'rectangle', fill.by = 'GC_nuc.freq', coordinate.space = 'query', annotation.label = 'GCcont')
## Calculate AT dinucleotide frequency
dinuc.content <- fasta2nucleotideContent(fasta.file = fa.file, binsize = 5000, nucleotide.content = 'AT')
## Add AT dinucleotide frequency as annotation heatmap
addAnnotation(ggplot.obj = plt, annot.gr = dinuc.content, shape = 'rectangle', fill.by = 'AT_nuc.freq', coordinate.space = 'query', annotation.label = 'ATdinuc')

Narrowing alignment to a user defined target region

Create disjoint alignments

\newpage

Report any issues here:

\newpage

References

sessionInfo()

sessionInfo()


daewoooo/SVbyEye documentation built on May 12, 2024, 7:45 a.m.