SVAlignR and Virus-Associated Cancer"
In SVAlignR: Recovering Structure of Long Molecules from Structural Variation Data

knitr::opts_chunk$set(fig.width=8, fig.height=5)
options(width = 88)

cat('
<style type="text/css">
.figure { text-align: center; }
.caption { font-weight: bold; }
</style>
')

Introduction

Some cancers are associated with the presence of viruses that disrupt chromosomal DNA. Among the best known examples are cervical cancers and head-and-neck cancers that are associated with human papilloma virus (HPV). Using whole genome short-read sequencing, it is possible to identify breakpoints, including those where viral DNA is merged with human DNA. Each breakpoint defines a local instance of "structural variation" (SV) where segments of chromosomes are joined together in non-normal ways. It remains an open question whether those fragments are fully integrated into a human chromosome or just exist as circular extrachromosomal DNA. Using newer long-read sequencing technology, one can obtain information about the order of multiple breakpoints in the original DNA molecules in the tumor sample. In this vignette, we describe SVAlignR, an R package intended to use these long-read sequences to reconstruct the original molecules.

Getting Started

We begin by loading the SVAlignR package, and a sample data set.

library(SVAlignR)
data(longreads)
head(longreads[, 1:3])

Each of the r nrow(longreads) rows in this spreadsheet represents a different Oxford Nanopore long read. The qname column is a unique identifier; the qlen column contains the length of the read in bytes; and the connections column lists the breakpoints that the read contains, each of which has been assigned a numeric identifier. We are interested in this column. We have two main goals:

Assemble the long-reads (which should be thought of as fragmentary) into complete molecules.
Start constructing a phylogenetic tree showing the evolution of these HPV-human combining molecules.

As with many sequence alignment problems, we start by removing any duplicate "reads"; here that means duplication of the break-point sequence summaries of the actual reads.

LR0 <- longreads$connection
names(LR0) <- rownames(longreads)
LR <- LR0[!duplicated(LR0)]
weights <- sapply(LR, function(lr) sum(LR0 == lr))
length(LR) # expect 158

Clustering Sequences

We want to cluster the break-point sequences.

Recoding

In order to make use of existing algorithms, we need to convert the numeric identifiers into single characters as part of an "alphabet". The Cipher class takes in a character vector containing the "words" (that is, our break-point sequences) that need to be rewritten. By default, it assumes (as in our example) that "letters" are separated by hyphens, identifies the unique symbols, and creates a tool to convert back-and-forth from one "alphabet" to another.

alphabet <- Cipher(LR)
alphabet

Now we are able to convert the original sequences of numbers into words in our specialized alphabet.

head(LR)
pseudo <- encode(alphabet, LR)
head(pseudo)

Clusters

The recoding step is already included in the SequenceCluster constructor and will happen automatically. So, to cluster the long-read break point sequences, we just need to call the constructor on the (named and de-duplicated) sequences. The core algorithm is the classic Needelman-Wunsch global alignment algorithm, as implemented in the NameNeedle R package. As used in the SVAlignR package, it computes an alignment score between any two sequences based on a match score of 2, a mismatch penalty of -6, and a gap penalty of -2. We fill a square matrix with scores for every pair (I, J) of sequences. The resulting scores can be either positive or negative. For each row X, we rescale into the interval [0,1] by computing $$Y_{IJ} = 1 - \frac{X_{IJ} - min(X_{I.})}{max(X_{I.})-min(X_{I.})}.$$ Since the scores need not be symmetric, we force symmetry by averaging the scaled score matrix and its transpose.

We then perform hierarchical clustering using the symmetrized, rescaled score matrix to define distances. We cut the dendrogam to cluster the sequences into 20 groups. The number 20 is arbitrary. In our data set, it should yield about 8 long-read break-point sequences per cluster. For later steps, we also want to make sure that each cluster contains at leeast 2 sequences.

seqclust <- SequenceCluster(LR0, NC = 15)
table(seqclust@clusters)

Now we see why 20 clusters is appropriate. Each cluster is a relatively small size, but they all contain more than one sequence so that aligning those sequences will be meaningful. Figure 1 displays the resulting dendrogram, with each of the 20 clusters shown in a different color.

plot(seqclust)

Figure 2 contains a simplified view of the same dendrogram, cutting off the clutter below the major splits into clusters.

plot(seqclust, type = "clipped")

Figure 3 displays an alternate view of the dendrogram, emphasizing the relationships between clusters.

plot(seqclust, type = "unrooted")

Both Figure 2 and, more strongly, Figure 3 indicate that there are approximately four major clusters.

We can also show the dendrogram attached to a heatmap containing the distance matrix.

heat(seqclust)

Aligning Clustered Sequences

Each of the clusters uses a small enough set of symbols that we can safely rewrite them as though they are sequences of amino acids. We then use the align function from the SVAlignR package to perform multiple sequence alignments, using the "ClustalW" algorithm on each cluster. Because the components are not amino acids, however, we do not need to use one of the PAM or BLOSUM matrices. Instead, we define a substitution matrix that gives the same match and mismatch scores to each pair of symbols.

if (!requireNamespace("msa", quietly = TRUE)) {
  stop("Cluster alignment only works if the `msa` package is available.")
}
MYSUB <- makeSubsMatrix(match = 2, mismatch = -6)
consensus <- alignAllClusters(seqclust, mysub = MYSUB, gapO = 2, gapE = 0.1)
class(consensus)
length(consensus)

Figure 5 contains an example of a multiple sequence alignment.

opar <- par(mai = c(1.02, 0.82, 0.82, 0.82))
image(consensus[[15]])
par(opar)
rm(opar)

Alignments By Cluster

Figure 6 shows all 20 alignments.

opar <- par(mfrow = c(4,3))
for (K in 1:length(consensus)) {
  image(consensus[[K]], col = SVAlignR:::myColorSet[K])
}
par(opar)

Consensus Sequences

allmodcons <- sapply(consensus, function(X) X@consensus)
allmodcons <- gsub("\\?", "q", allmodcons) # make question mark a letter
allmodcons <- gsub("\\:", "z", allmodcons) # make rare colon a letter
while (length(grep("^[qz]", allmodcons) > 0)) {
  allmodcons <- sub("^q-", "", allmodcons)
  allmodcons <- sub("^z-", "", allmodcons)
}
while (length(grep("[qz]$", allmodcons) > 0)) {
  allmodcons <- sub("-q$", "", allmodcons)
  allmodcons <- sub("-z$", "", allmodcons)
}
names(allmodcons) <- paste("C", c(rep(0,9), rep("", 11)), 1:20, sep = "")

newsub <- makeSubsMatrix(match = 2, mismatch = -6)
ab <- alignCluster(allmodcons, mysub = newsub, gapO = 0.1, gapE = 0.2)
image(ab)

library(dendextend)
myColorSet <- SVAlignR:::myColorSet
NC <- seqclust@NC
mycut <- mean(rev(seqclust@hc$height)[NC + -1:0])
hcd <- as.dendrogram(seqclust@hc)
dend2 <- cut(hcd, h = mycut)
DU <- dend2$upper
blocks <- sapply(consensus, function(X) {
  X@consensus
})
labels(DU) <- paste("   [", 1:NC, "]  ",  blocks, "   ", sep = "")

opar <- par(bg = "gray90")
plot(ape::as.phylo(DU), type = "unrooted", rotate.tree = -20,
     edge.color = "black", edge.width = 2,
     lab4ut = "axial", 
     tip.color = myColorSet[1:NC], cex = 1.2)
par(opar)

resn <- 300
for (K in 1:NK) {
  temp <- consensus[[K]]
  png(file = file.path(paths$figs, paste("cons-", K, ".png", sep = "")),
      width = 6*resn, height = 4*resn, res = resn, bg = "white")
  showme(temp, col = SVAlignR:::myColorSet[K], cex = 0.5)
  dev.off()
}

De Bruijn Graphs

We also provide the ability to compute de Bruijn graphs from any set of sequences. These graphs are frequently used for "de novo" sequencing, and thus can provide an alternative source of information about "consensus sequence" alignments. In order to work with graphs in general, you should load the igraph package.

library("igraph")
DB <- deBruijn(seqclust@rawSequences, 4)
G <- DB@G
loci <- layout_with_fr(G)
plot(G, layout = loci)


count_components(G)
comp <- components(G)
sort(comp$csize)

GC <- largest_component(G)
gloc <- layout_with_gem(GC)
plot(GC, layout = gloc)

sort(table(E(GC)$weight))
GE <- delete_edges(GC, E(GC)[E(GC)$weight  < 6])
GE <- largest_component(GE)
#GE <- set_edge_attr(GE, "width", value = sqrt(edge_attr(G, "weight")))
gloc <- layout_with_fr(GE)
plot(GE, layout = gloc)

dbset <- lapply(1:seqclust@NC, function(J) {
  rs <- seqclust@rawSequences[seqclust@clusters == J]
  db <- deBruijn(rs, 4)
}) 
L <- length(dbset)
for (I in 1:L) {
  G <- dbset[[I]]@G
  lyt <- layout_with_fr(G)
  plot(G, layout =lyt, main = paste("Cluster", I))
}

Appendix

sessionInfo()

Any scripts or data that you put into this service are public.

SVAlignR documentation built on Sept. 4, 2025, 3:01 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

SVAlignR
Recovering Structure of Long Molecules from Structural Variation Data

SVAlignR and Virus-Associated Cancer"
In SVAlignR: Recovering Structure of Long Molecules from Structural Variation Data

Introduction

Getting Started

Clustering Sequences

Recoding

Clusters

Aligning Clustered Sequences

Alignments By Cluster

Consensus Sequences

De Bruijn Graphs

Appendix

Try the SVAlignR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

SVAlignR Recovering Structure of Long Molecules from Structural Variation Data

SVAlignR and Virus-Associated Cancer" In SVAlignR: Recovering Structure of Long Molecules from Structural Variation Data

Introduction

Getting Started

Clustering Sequences

Recoding

Clusters

Aligning Clustered Sequences

Alignments By Cluster

Consensus Sequences

De Bruijn Graphs

Appendix

Try the SVAlignR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

SVAlignR
Recovering Structure of Long Molecules from Structural Variation Data

SVAlignR and Virus-Associated Cancer"
In SVAlignR: Recovering Structure of Long Molecules from Structural Variation Data