knitr::opts_chunk$set(fig.width=8, fig.height=5) options(width = 88)
cat(' <style type="text/css"> .figure { text-align: center; } .caption { font-weight: bold; } </style> ')
Some cancers are associated with the presence of viruses that disrupt chromosomal DNA. Among the best known examples are cervical cancers and head-and-neck cancers that are associated with human papilloma virus (HPV). Using whole genome short-read sequencing, it is possible to identify breakpoints, including those where viral DNA is merged with human DNA. Each breakpoint defines a local instance of "structural variation" (SV) where segments of chromosomes are joined together in non-normal ways. It remains an open question whether those fragments are fully integrated into a human chromosome or just exist as circular extrachromosomal DNA. Using newer long-read sequencing technology, one can obtain information about the order of multiple breakpoints in the original DNA molecules in the tumor sample. In this vignette, we describe SVAlignR, an R package intended to use these long-read sequences to reconstruct the original molecules.
We begin by loading the SVAlignR
package, and a sample data set.
library(SVAlignR) data(longreads) head(longreads[, 1:3])
Each of the r nrow(longreads)
rows in this spreadsheet represents a different Oxford
Nanopore long read. The qname
column is a unique identifier; the qlen
column contains
the length of the read in bytes; and the connections
column lists the breakpoints that
the read contains, each of which has been assigned a numeric identifier. We are interested
in this column. We have two main goals:
As with many sequence alignment problems, we start by removing any duplicate "reads"; here that means duplication of the break-point sequence summaries of the actual reads.
LR0 <- longreads$connection names(LR0) <- rownames(longreads) LR <- LR0[!duplicated(LR0)] weights <- sapply(LR, function(lr) sum(LR0 == lr)) length(LR) # expect 158
We want to cluster the break-point sequences.
In order to make use of existing algorithms,
we need to convert the numeric identifiers into single characters as part of an "alphabet".
The Cipher
class takes in a character vector containing the "words" (that is, our
break-point sequences) that need to be rewritten. By default, it assumes (as in our example)
that "letters" are separated by hyphens, identifies the unique symbols, and creates a tool
to convert back-and-forth from one "alphabet" to another.
alphabet <- Cipher(LR) alphabet
Now we are able to convert the original sequences of numbers into words in our specialized alphabet.
head(LR) pseudo <- encode(alphabet, LR) head(pseudo)
The recoding step is already included in the SequenceCluster
constructor and will happen automatically.
So, to cluster the long-read break point sequences, we just need to call the constructor on the
(named and de-duplicated) sequences. The core algorithm is the classic
Needelman-Wunsch global alignment algorithm, as implemented in the NameNeedle
R package. As
used in the SVAlignR
package, it computes an alignment score between any two sequences based
on a match score of 2, a mismatch penalty of -6, and a gap penalty of -2. We fill a square
matrix with scores for every pair (I, J) of sequences. The resulting scores can be either
positive or negative. For each row X, we rescale into the interval [0,1] by computing
$$Y_{IJ} = 1 - \frac{X_{IJ} - min(X_{I.})}{max(X_{I.})-min(X_{I.})}.$$
Since the scores need not be symmetric, we force symmetry by averaging the scaled score matrix and
its transpose.
We then perform hierarchical clustering using the symmetrized, rescaled score matrix to define distances. We cut the dendrogam to cluster the sequences into 20 groups. The number 20 is arbitrary. In our data set, it should yield about 8 long-read break-point sequences per cluster. For later steps, we also want to make sure that each cluster contains at leeast 2 sequences.
seqclust <- SequenceCluster(LR0, NC = 15) table(seqclust@clusters)
Now we see why 20 clusters is appropriate. Each cluster is a relatively small size, but they all contain more than one sequence so that aligning those sequences will be meaningful. Figure 1 displays the resulting dendrogram, with each of the 20 clusters shown in a different color.
plot(seqclust)
Figure 2 contains a simplified view of the same dendrogram, cutting off the clutter below the major splits into clusters.
plot(seqclust, type = "clipped")
Figure 3 displays an alternate view of the dendrogram, emphasizing the relationships between clusters.
plot(seqclust, type = "unrooted")
Both Figure 2 and, more strongly, Figure 3 indicate that there are approximately four major clusters.
We can also show the dendrogram attached to a heatmap containing the distance matrix.
heat(seqclust)
Each of the clusters uses a small enough set of symbols that we can safely rewrite them as
though they are sequences of amino acids. We then use the align
function from the SVAlignR
package to perform multiple sequence alignments, using the "ClustalW" algorithm on each cluster.
Because the components are not amino acids, however, we do not need to use one of the PAM or
BLOSUM matrices. Instead, we define a substitution matrix that gives the same match and mismatch
scores to each pair of symbols.
if (!requireNamespace("msa", quietly = TRUE)) { stop("Cluster alignment only works if the `msa` package is available.") } MYSUB <- makeSubsMatrix(match = 2, mismatch = -6) consensus <- alignAllClusters(seqclust, mysub = MYSUB, gapO = 2, gapE = 0.1) class(consensus) length(consensus)
Figure 5 contains an example of a multiple sequence alignment.
opar <- par(mai = c(1.02, 0.82, 0.82, 0.82)) image(consensus[[15]]) par(opar) rm(opar)
Figure 6 shows all 20 alignments.
opar <- par(mfrow = c(4,3)) for (K in 1:length(consensus)) { image(consensus[[K]], col = SVAlignR:::myColorSet[K]) } par(opar)
allmodcons <- sapply(consensus, function(X) X@consensus) allmodcons <- gsub("\\?", "q", allmodcons) # make question mark a letter allmodcons <- gsub("\\:", "z", allmodcons) # make rare colon a letter while (length(grep("^[qz]", allmodcons) > 0)) { allmodcons <- sub("^q-", "", allmodcons) allmodcons <- sub("^z-", "", allmodcons) } while (length(grep("[qz]$", allmodcons) > 0)) { allmodcons <- sub("-q$", "", allmodcons) allmodcons <- sub("-z$", "", allmodcons) } names(allmodcons) <- paste("C", c(rep(0,9), rep("", 11)), 1:20, sep = "") newsub <- makeSubsMatrix(match = 2, mismatch = -6) ab <- alignCluster(allmodcons, mysub = newsub, gapO = 0.1, gapE = 0.2) image(ab)
library(dendextend) myColorSet <- SVAlignR:::myColorSet NC <- seqclust@NC mycut <- mean(rev(seqclust@hc$height)[NC + -1:0]) hcd <- as.dendrogram(seqclust@hc) dend2 <- cut(hcd, h = mycut) DU <- dend2$upper blocks <- sapply(consensus, function(X) { X@consensus }) labels(DU) <- paste(" [", 1:NC, "] ", blocks, " ", sep = "") opar <- par(bg = "gray90") plot(ape::as.phylo(DU), type = "unrooted", rotate.tree = -20, edge.color = "black", edge.width = 2, lab4ut = "axial", tip.color = myColorSet[1:NC], cex = 1.2) par(opar)
resn <- 300 for (K in 1:NK) { temp <- consensus[[K]] png(file = file.path(paths$figs, paste("cons-", K, ".png", sep = "")), width = 6*resn, height = 4*resn, res = resn, bg = "white") showme(temp, col = SVAlignR:::myColorSet[K], cex = 0.5) dev.off() }
We also provide the ability to compute de Bruijn graphs from any set of sequences. These graphs are
frequently used for "de novo" sequencing, and thus can provide an alternative source of information
about "consensus sequence" alignments. In order to work with graphs in general, you should load the
igraph
package.
library("igraph") DB <- deBruijn(seqclust@rawSequences, 4) G <- DB@G loci <- layout_with_fr(G) plot(G, layout = loci) count_components(G) comp <- components(G) sort(comp$csize) GC <- largest_component(G) gloc <- layout_with_gem(GC) plot(GC, layout = gloc) sort(table(E(GC)$weight)) GE <- delete_edges(GC, E(GC)[E(GC)$weight < 6]) GE <- largest_component(GE) #GE <- set_edge_attr(GE, "width", value = sqrt(edge_attr(G, "weight"))) gloc <- layout_with_fr(GE) plot(GE, layout = gloc)
dbset <- lapply(1:seqclust@NC, function(J) { rs <- seqclust@rawSequences[seqclust@clusters == J] db <- deBruijn(rs, 4) }) L <- length(dbset) for (I in 1:L) { G <- dbset[[I]]@G lyt <- layout_with_fr(G) plot(G, layout =lyt, main = paste("Cluster", I)) }
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.