Introduction

The trellis package was developed to identify somatic structural variants in tumor-only analyses of whole genome sequencing data. The types of somatic structural alterations identified by trellis include deletions, amplifications, intra- and inter-chromosomal rearrangements, inversions, and in-frame fusions. High copy amplicons are linked using the mate information from read pair analyses, simplifying the interpretation of the likely drivers. Additional packages required for this vignette are listed below, including the r Githubpkg("rscharpf/svfilters.hg19") package that contains various sequence filters for structural variant analyses and the r Rpackage("svbams") package that contains subsampled data used to illustrate various aspects of data processing.

library(Rsamtools)
library(magrittr)
library(Rsamtools)
library(ggplot2)
library(svbams)
library(svfilters.hg19)
library(tidyverse)
library(gridExtra)
library(trellis)
library(ggnet)
library(graph)
library(Rgraphviz)
library(network)
library(sna)
library(kableExtra)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.refGene)

Prerequisites

We assume that a collection of sorted and indexed BAM files are available with PCR duplicates marked. For this vignette, we provide a heavily subsampled bam file in the svbams package that has already been indexed and sorted.

Batch jobs

This section provides code for extracting all read pairs with aberrant orientation or spacing with respect to the reference genome (improper read pairs) and for computing read-depth in non-overlapping bins along the genome. These two aspects of preprocessing can be performed in parallel and are ideally suited for submission as batch jobs on a large computing cluster with one core dedicated to the processing of each sample.

Improperly paired reads

The following code chunk extract all improperly paired reads in a bam file, returning a GAlignmentPairs object where the first and last slots store the alignment of the first read and its mate. The time required to extract these improperly paired reads depends on the size of the bam file.

extdata <- system.file("extdata", package="svbams")
bamfile <- file.path(extdata, "cgov44t_revised.bam")
what <- c("flag", "mrnm", "mpos", "mapq")
iparams <- improperAlignmentParams(what=what)
improper_rp <- getImproperAlignmentPairs(bamfile,
                                         param=iparams,
                                         build="hg19")

Normalized read counts

To compute normalized read depth along the genome, we first define a set of non-overlapping 1kb bins that have a average mappability of at least 0.75. These bins are provided for builds hg18 and hg19 of the reference genome in the \R{} package sv.filters.<build>. For computational speed, we focus on one part of chromosome 15. The function binnedCounts is a wrapper for the countBam function provided in the r Biocpkg("Rsamtools") package.

data(bins1kb, package="svfilters.hg19")
bins <- keepSeqlevels(bins1kb, "chr15",
                      pruning.mode="coarse")
flags <- scanBamFlag(isDuplicate=FALSE,
                     isPaired=TRUE,
                     isUnmappedQuery=FALSE,
                     hasUnmappedMate=FALSE,
                     isProperPair=TRUE)
bviews <- BamViews(bamPaths=bamfile, bamRanges=bins)
bins$cnt <- binnedCounts(bviews)

Bins with zero counts are a consequence of the artificially small bam file included in the svbams package and we exclude these bins from further analysis.

bins <- bins[ bins$cnt > 0 ]

Next, we log-transform the raw counts and correct for GC content by loess. As a random subset of the genome is selected to model GC biases, we suggest setting a seed prior to GC correction.

bins$std_cnt <- binNormalize(bins)
set.seed(123)
bins$log_ratio <- binGCCorrect(bins)

If one or more normal controls is available, the normalized bin-counts can be further corrected to reduce autocorrelation. Since much of the bin-level technical variation is batch-specific, we advise against the use of historical controls for additional normalization. Below, we segment the $\log_2$ ratios by circular binary segmentation (CBS). See ?DNAcopy::segment for a complete list of parameters available in the SegmentParam object. Note, segmentBins assumes a variable called log_ratio in the GRanges object. The object returned by segmentBins is a GRanges object with segment means of log-normalized coverage in the seg.mean column.

g <- segmentBins(bins, param=SegmentParam())
g

As the above steps were restricted to a very small region of the genome, the log ratios obtained from the preceding normalization and GC correction are not interpretable as fold changes from the modal genome ploidy. Below, we load log ratios and the segmented data obtained from processing the full genome and then restrict to chromosomes 5, 8, and 15 for illustrating downstream analyses.

data(bins1kb, package="svfilters.hg19")
ddir <- system.file("extdata", package="svbams",
                    mustWork=TRUE)
lr <- readRDS(file.path(ddir, "preprocessed_coverage.rds"))/1000
seqlevels(bins1kb, pruning.mode="coarse") <- paste0("chr", c(1:22, "X"))
bins1kb$log_ratio <- lr
bins <- keepSeqlevels(bins1kb, c("chr5", "chr8", "chr15"),
                      pruning.mode="coarse")
path <- system.file("extdata", package="svbams")
segs <- readRDS(file.path(path, "cgov44t_segments.rds"))
seqlevels(segs, pruning.mode="coarse") <- seqlevels(bins)

The data for chromosomes 5, 8, and 15:

segs.df <- as_tibble(segs)
as_tibble(bins) %>%
  filter(!is.na(log_ratio)) %>%
  ggplot  +
  geom_point(aes(start/1e6, log_ratio), shape = ".", color="gray") +
  xlab("Coordinate") +
  ylab("log2 normalized counts") +
  coord_cartesian(ylim=c(-4, 2.5)) +
  facet_grid(~seqnames, space="free", scales="free_x") +
  xlab("") +
  theme_bw() +
  geom_segment(data=segs.df, aes(x=start/1e6,
                                 xend=end/1e6,
                                 y=seg.mean,
                                 yend=seg.mean),
               color="black")

Deletions

Candidate somatic deletions are identified from the improperly paired reads and the normalized read depth. Below, we create an object of class DeletionParm containing a list of parameters needed for the deletion analysis. In addition, we delineate a list of the candidate hemizygous and homozygous deletions from the segmented data. For each candidate regions, we extract properly paired reads and improperly paired reads within 2kb of the breakpoints.

dp <- DeletionParam(remove_hemizygous=FALSE)
dp
del.gr <- IRanges::reduce(segs[segs$seg.mean < hemizygousThr(dp)],
                          min.gapwidth=2000)
proper_rp <- properReadPairs(bamfile, gr=del.gr, dp)
improper_rp <- keepSeqlevels(improper_rp, seqlevels(segs),
                             pruning.mode="coarse")
read_pairs <- list(proper_del=proper_rp, improper=improper_rp)

The bin-level summaries ($\log_2$ ratios), the segmentation data, and read pair data are collected in a single object by the function preprocessData.

pdata <- preprocessData(bam.file=bamfile,
                        genome="hg19",
                        bins=bins,
                        segments=segs,
                        read_pairs=read_pairs)

The sv_deletions function creates an object of class StructuralVariant that contains the deletion classification of each segment. Possible classifications include homozygous deletion (homozygous), homozygous deletion supported by improperly paired reads (homozygous+), hemizygous deletion (hemizygous), and hemizygous deletion supported by improperly paired reads (hemizygous+). For identifying somatic deletions without a matched normal, we typically exclude hemizygous deletions not supported by improperly paired reads.

if(FALSE){
    temp <- list(pdata=pdata, dp=dp)
    saveRDS(temp, file="../tests/testthat/temp.rds")
}
deletions <- sv_deletions(preprocess=pdata, param=dp)
variant(deletions)
calls(deletions)

The improperly-paired reads supporting the homozygous+ call can be extracted as a GAlignmentPairs object by the improper function (Figure \@ref(fig:deletionplot)).

improper(deletions[[4]])
roi <- variant(deletions)[4]
roi2 <- roi + 200e3
bins.del <- subsetByOverlaps(bins1kb, roi2)
ylim <- c(-3, 2)
df <- tibble(logr=bins.del$log_ratio,
             start=start(bins.del))
df$logr <- trellis::threshold(df$logr, ylim, amount=0.4)
brks <- pretty(df$start, n=8)
region <- subsetByOverlaps(segs, roi)
region <- region[region$seg.mean < -1]
region <- as.tibble(region)
segs.df <- subsetByOverlaps(segs, roi2) %>%
  as.tibble
xlim <- c(start(roi2), end(roi2))
C <- ggplot(df, aes(start, logr)) +
  geom_point(size=0.7, color="gray50") +
  scale_x_continuous(expand=c(0,0), breaks=brks, labels=brks/1e6)+
  scale_y_continuous(expand=c(0,0)) +
  geom_segment(data=segs.df,
               aes(x=start, xend=end, y=seg.mean, yend=seg.mean),
               size=1) +
  coord_cartesian(xlim=xlim, ylim=ylim) +
  ylab(expression(log[2]~ratio)) +
  geom_rect(data=region,
            aes(xmin=start, xmax=end, ymin=-Inf, ymax=+Inf),
            fill="steelblue", color="transparent", alpha=0.3,
            inherit.aes=FALSE) +
  theme(axis.text=element_text(size=10),
        axis.line=element_line(color="black"),
        panel.background=element_rect(fill="white")) +
  xlab("Mb") +
  annotate("text", x=xlim[1] + 15e3, y=-8, label="chr15", size=3)
rps <- thinReadPairs(deletions[4])
rps <- trellis:::meltReadPairs(rps)
colors <- c("#0072B2", "#009E73")
p <- ggplot(rps, aes(ymin=readpair-0.2, ymax=readpair+0.2,
                xmin=start/1e6, xmax=end/1e6, color=read,
                fill=read, group=readpair)) +
  geom_rect() +
  xlim(c(min(rps$start), max(rps$end))/1e6) +
  geom_line(aes(x=start/1e6, y=readpair), size=0.5) +
  ylab("Read pair index") +
  scale_x_continuous(breaks=scales::pretty_breaks(5)) +
  geom_rect(data=region,
            aes(xmin=start/1e6, xmax=end/1e6, ymin=-Inf, ymax=+Inf),
            fill="steelblue", color="transparent", alpha=0.2,
            inherit.aes=FALSE) +
  scale_color_manual(values=colors) +
  scale_fill_manual(values=colors) +
  xlab("Mb") +
  theme(panel.background=element_rect(fill="white"),
        axis.line=element_line(color="black")) +
  geom_vline(xintercept=
               c(start(roi), end(roi))/1e6,
             linetype="dashed",
             color="gray")
p <- p + guides(fill=FALSE, color=FALSE)
grid.arrange(C, p, ncol=1)

Amplifications

Trellis identifies high copy amplifications and then links the individual amplicons by improperly paired reads previously extracted from the bam file. The sv_amplicons2 function constructs an AmpliconGraph where the nodes are the individual amplicons and an edge between two amplicons indicates multiple paired reads that link the two amplicons in the cancer genome (Figure \@ref(fig:amplicongraph)). The default parameter settings require at least 5 paired reads to support an edge. See ?ampliconParams for customing these settings.

ag <- sv_amplicons2(pdata, params=ampliconParams())
ag

The ampliconRanges accessor can be used to extract the genomic intervals (a GRanges object) of the individual amplicons. The mcols of the GRanges object has the HUGO gene symbols of the genes spanned by the intervals (column hgnc), the amplicon group (column groups), and the names of known drivers (column driver). Note, the genes in hgnc are at the level of the amplicon but the listed drivers are at the level of the amplicon group. The amplicon group for this cancer had the known drivers FGFR4 and MYC.

granges(ampliconRanges(ag))
unique(ampliconRanges(ag)$driver)
n.passengers <- strsplit(ampliconRanges(ag)$hgnc, ", ") %>%
  unlist %>%
  "["(! . %in% c("MYC", "FGFR4")) %>%
  "["(!is.na(.)) %>%
  length

The amplicon groups simplify the interpretation of the likely drivers as the r n.passengers other genes in this network are likely passengers.

plot_amplicons <- function (ag, colors){
    if (length(ag) == 0) {
        message("No amplicons -- nothing to plot")
        return(invisible())
    }
    dark_colors <- c("#332288", "#661100", "#882255", "#AA4499")
    ar <- nodes(ag)
    chroms <- sapply(strsplit(ar, ":"), "[", 1)
    sl <- unique(chroms)
    L <- length(sl)
    L <- ifelse(L > 12, 12, L)
    sl <- factor(chroms, levels = sl)
    ##color_nodes <- col_list[[L]][as.integer(sl)]
    color_nodes <- colors[as.integer(sl)]
    g1 <- graph(ag)
    color_nodes <- setNames(color_nodes, nodes(g1))
    nodenames <- setNames(nodes(g1), nodes(g1))
    text_col <- setNames(rep("black", length(nodes(g1))), nodes(g1))
    text_col[color_nodes %in% dark_colors] <- "gray90"
    nodeRenderInfo(g1) <- list(label = nodenames, fill = color_nodes, 
        textCol = text_col)
    nodeAttrs <- list(fillcolor = color_nodes)
    attrs <- list(node = list(shape = "rectangle", fixedsize = FALSE), 
        graph = list(rankdir = "LR"))
    graph_object <- layoutGraph(g1, attrs = attrs, nodeAttrs = nodeAttrs)
    graph_object
}
colors <- c("#4477AA", "#CC6677")
B <- plot_amplicons(ag, colors)
## adjacency matrix
B1 <- as(B, "graphAM")
am <- B1@adjMat
net <- network(am, directed=FALSE)
chroms <- sapply(strsplit(colnames(am), ":"), "[", 1)
ar <- ampliconRanges(ag)
data(transcripts, package="svfilters.hg19")
hits <- findOverlaps(ar, transcripts, maxgap=5000)
cancer.con <- split(transcripts$cancer_connection[subjectHits(hits)],
                    queryHits(hits))
is.driver <- sapply(cancer.con, any)
is.driver2 <- rep(FALSE, ncol(am))
is.driver2[as.integer(names(is.driver))] <- is.driver
net %v% "chrom" <- chroms
net %v% "driver" <- is.driver2
B <- ggnet2(net, color="chrom",
            palette="Dark2",
            shape="driver",
            size="degree") +
            ##legend.size=1)  +
  guides(size=FALSE, shape=FALSE,
         color=guide_legend(title="",
                            override.aes=list(size=5))) 
print(B)

Rearrangement analysis

Candidate somatic rearrangements

The function findCandidates2 identifies clusters of reads belonging to improper pairs and links these clusters using the mate-pair information of the reads. Parameters for identifying improperly paired read clusters are specified in the RearrangementParams function. With 30x coverage and a clonal ovarian cancer cell line, we set this value to 5. For the identification of likely somatic variants without a matched normal, we required the read pairs to be at least 10kb apart with respect to the reference genome. Using these parameters, we identify two candidate rearrangements. The data supporting these rearrangements are encapsulated in a RearrangementList object called rlist.

rparam <- RearrangementParams(min_number_tags_per_cluster=5,
                              rp_separation=10e3)
minNumberTagsPerCluster(rparam)
rpSeparation(rparam) %>% "/"(1000) %>% paste0("kb")
rlist <- findCandidates2(pdata, rparam)
rlist

Each improperly paired read flanking a candidate rearrangement is typed according to the orientation and spacing of the paired reads and the type of rearrangement (inversion, translocation, etc.) is classified according to the modal type. To extract all the improperly paired reads supporting the first rearrangement, we use the function improper, and the "[[" method defined for subsetting a RearrangementList:

improper(rlist[[1]])

The rearranged read pair clusters link potentially distant regions within a chromosome (intra-chromosomal) or between chromosomes (inter-chromosomal). The regions linked in the cancer genome to create the novel adjacency can be accessed by the linkedBins accessor:

linkedBins(rlist)

The linked.to value in the mcols of the linked bins is also a GRanges object.

Below, we use helper functions reduceGenomeFilters and rFilters to assemble data needed to exclude possible germline rearrangements as well as somatic copy number alterations identified above. The GRangesList of filters includes somatic deletions and amplifications identified from the read-depth analyses, outliers in read depth coverage, germline CNVs, and germline rearrangements. For rearrangements, the two regions that are joined in the cancer genome to create a novel adjacency are stored in a GRanges object. The outliers, germline CNVs, and germline rearrangements were identified from a set of 10 lymphoblastoid cell lines and are provided by the svfilters.hg19 package in the germline_filters object.

data(germline_filters, package="svfilters.hg19")
genome_filters <- reduceGenomeFilters(germline_filters,
                                      seqlevels(bins1kb))
rf <- rFilters(amplicons=ampliconRanges(ag),
               deletions=variant(deletions),
               rear=germline_rear,
               germline=genome_filters)
rdat <- rearrangementData(rlist=rlist,
                          read_pairs=list(improper=improper_rp),
                          filters=rf)
rlist <- filterRear(rdat, rparam)

Confirmation by BLAT

The rearangements identified thus far are required to have a physical separation of at least rp_separation with respect to the reference genome, must be supported by at least min_number_tags_per_cluster paired reads, at least prop_modal_rearrangement of the read pairs must be consistent with the modal rearrangement type, and the size of the read clusters must be at least min_cluster_size basepairs. In this section, we realign all improperly paired reads supporting a candidate rearrangement (mapped-mapped) with BLAT and extract unmapped reads with aligned mate near candidate sequence junctions (mapped-unmapped) that may correspond to split reads. Since split reads contain the actual sequence junction, these reads can usually determine the exact breakpoint within a few basepairs.

Mapped-mapped

For each rearrangement identified in the above analyses, we have saved the set of improperly paired reads supporting the new sequence junction. Because both reads belonging to the improper pairs have been mapped to the reference genome, this section describes the realignment of 'mapped-mapped' pairs. In the following code chunk, we extract the tag sequence of all improperly paired reads in a RearrangementList and write these sequences to file in fasta format. Note, the MAX argument to getSequenceOfReads below indicates the maximum number of sequences for a specific rearrangement to extract from the BAM file. For example, the first rearrangement in the rlist object had r length(improper(rlist[[1]])) improperly paired reads spanning the sequence junction. Setting the MAX parameter to 25 (MAX=25), only 25 of the 50 read pairs are randomly selected for re-alignment by BLAT. Again, we suggest setting a seed for reproducibility.

set.seed(123)
tags <- getSequenceOfReads(rlist, bamfile,
                           MAX=25L, build = "hg19")
dir <- tempdir()
fa.file <- file.path(dir, paste0(basename(bamfile), ".fa"))
writeToFasta(tags, fa.file)

The unevaluated code below is a system call to run the command-line version of BLAT. In addition to requiring installation of BLAT, we must also have a copy of the appropriate reference genome available.

blatpath <- "~/bin/x86_64/blat"
refgenome <- "~/Dropbox/reference_genome/hg19.fa"
outfile <- tempfile()
cmd <- paste(blatpath, refgenome, fa.file, outfile)
system(cmd)
file.copy(outfile, "../../svalignments/inst/extdata/blat_alignment.txt")

Here, we read the previously saved BLAT alignments for this data. We require that each read have only one near perfect match in the genome (score > 90) and, if there is a near-perfect match, the near-perfect match must be consistent with the whole genome aligner (Figure \@ref("blatscores")).

extdata <- system.file("extdata", package="svbams")
blat.file <- file.path(extdata, "blat_alignment.txt")
blat_aln <- readBlat(blat.file)
records <- annotateBlatRecords(blat_aln, tags)
blatScores(records, tags, id="CGOV44T",
           min.tags=5, prop.pass=0.8)
ggplot(records, aes(Qname, match)) +
  geom_jitter(width=0.05, aes(color=is_overlap), size=0.5) +
  geom_hline(yintercept=90, linetype="dashed") +
  scale_color_manual(values=c("gray", "black")) +
  facet_wrap(~rid, ncol=1, scales="free_x") +
  theme(axis.text.x=element_text(angle=90, size=5),
        panel.background=element_rect(fill="white")) +
  guides(color=guide_legend(title="Concordance of BLAT\nand original algnment")) +
  ylab("BLAT score")

Mapped-unmapped

In addition to using BLAT to confirm the whole genome aligner for improperly paired reads supporting a rearrangement, we also use BLAT to identify split read alignments that directly span the sequence junction. Here, we query the BAM file for all read pairs in which a read was mapped near the putative rearrangement but its mate was not aligned (mapped-unmapped). If the unmapped mate spans the sequence junction, we will assess whether BLAT aligns a subsequence of the read to one cluster and the complement of the subsequence to the second cluster. Such reads further improve the specificity of the rearrangement and establish basepair resolution of the sequence junction. First, we construct a GRanges object of all the linked read clusters identified in our rearrangement analyses. This GRanges object will be used to query the BAM file for read pairs in which only one read of a pair was aligned near the putative rearrangement. Again, we write the sequence of the unmapped reads to a fasta file and call BLAT by a system call.

query <- uncouple(linkedBins(rlist))
unmapped <- unmapped_read(bamfile, query, yield_size=200000)
length(unmapped)
dir <- tempdir()
mapped_unmapped.fa <- file.path(dir, "mapped-unmapped.fa")
writeUnmappedToFasta(unmapped, mapped_unmapped.fa)

The object unmapped is a GRanges object of 87 reads that were not mapped by ELAND to the reference genome and that have a mate mapped to one of the intervals in query. Again, we use a system call in the following unevaluated code chunk to realign the unmapped reads:

outfile <- tempfile()
cmd <- paste(blatpath, refgenome, mapped_unmapped.fa, outfile)
system(cmd)
file.copy(outfile, "../../svalignments/inst/extdata/blat_unmapped.txt")

TODO: create a script in data-raw/ that creates the above blat_unmapped.txt object.

unmap.file <- file.path(extdata, "blat_unmapped.txt")
blat_unmap <- readBlat(unmap.file)

Recall that our QC analysis of the mapped-mapped read pairs focused on whether the original whole genome alignment was the only high-scoring BLAT alignment. Here, our goal is to assess whether BLAT splits the alignment of reads that are initially unmapped by the whole genome aligner and have a mate that is aligned to the approximate location of the rearrangement. The locations in the first rearrangement (with respect to the reference genome) that we think are joined in the somatic genome are given by the linked bins of the RearrangementList object. We want to assess whether BLAT aligns part of a subsequence of a read to the interval given by

granges(linkedBins(rlist)[1])

and the complement of the subsequence to the interval given by

linkedTo(rlist)[1]

We use the function rearrangedReads to identify BLAT records that correspond to split reads supporting a rearrangement. Depending on the size of the sequenced DNA fragments, the improperly paired reads can only approximate the location of a sequence junction (likely to within 100 basepairs). Finally, we check that each rearrangement is supported by one or more reads that map to both sides of the sequence junction using the function is_valid_splits and we assign the split reads to the RearrangementList object using the splitReads<- method.

split_reads <- rearrangedReads(linkedBins(rlist), blat_unmap, 500)
elementNROWS(split_reads)
split_reads
splitReads(rlist) <- split_reads
is_valid <- is_valid_splits(rlist, maxgap=50)
at_least_one <- lengths(splitReads(rlist)) >= 1
rlist2 <- rlist[ is_valid & at_least_one ]

5- to 3-prime orientation

n.split <- elementNROWS(splitReads(rlist2))

The above BLAT analysis identifies r n.split[1] and r n.split[2] split reads for the two rearrangements named 1-2 and 3-4, respectively. Each of these rearrangements have two possible 5-prime to 3-prime orientations. The function fiveTo3List places the linked bins in their 5-prime to 3-prime orientation. Note, in the code below that rlist2 is now twice the length of the original rlist object as each rearrangement has been placed in two possible orientations. Because some of the split reads may be filtered when evaluating the orientation of the sequence junction, we again filter rearrangements that are not supported by at least one split read.

rlist3 <- fiveTo3List(rlist2, build="hg19")
rlist3 <- rlist3[ is_valid_splits(rlist3, maxgap=50) ]
rlist3

To visualize the improperly paired reads and the split reads supporting a rearrangement, we first collect the supporting reads in a tibble. The function rearDataFrameList takes a RearrangementList as input and extracts the supporting reads belonging to the first two elements of the list that correspond to the two possible 5-prime to 3-prime orientations. Next, we use ggRearrange, a wrapper to ggplot, to visualize the supporting reads. Because sequence junctions do not overlap a transcript, we have arbitarily labeled the two genomic regions that are joined in the somatic genome as noncoding1 and noncoding2.

df <- rearDataFrame(rlist3[[1]], build="hg19")
grobs <- ggRearrange(df)
grobs[["arranged.grobs"]]

The second rearrangement in the original rlist object now corresponds to elements 3 and 4 of the rlist3 object. Again, we call rearDataFrame and ggRearrange to organize and then plot the supporting reads.

df2 <- rearDataFrame(rlist3[[3]], build="hg19")
ggRearrange(df2)[["arranged.grobs"]]

Fusions

We begin by selecting only the rearrangements in which both ends of a sequence junction are within 5kb of a known transcript.

near.coding <- seqJunctionNearTx(rlist=rlist3, build='hg19')
rlist4 <- rlist3[ near.coding ]
length(rlist4)

Next, we make the definition of the sequence junctions more precise.

jxns <- seqJunctions_Rlist(rlist4)
jxns

In addition, we require the 3-prime genomic region of the junction to lie strictly within a transcript, while the 5-prime genomic region forming the junction can occur either in the promoter or within the 5-prime transcript. Operationally, we define the promoter as the genomic region 5kb upstream of the transcription start site of the 5-prime gene. We refer to the remaining junctions as coding_jxns. As the coding junctions are named by the rearrangement ids, we can use these names to subset the rlist object, excluding those rearrangements that do not meet the above criteria.

coding_jxns <- codingJunctions(jxns, "hg19")
coding_jxns
rlist4 <- rlist4[ names(coding_jxns) ]

Next, we use the fuseCDS_Rlist function to extract for each rearrangement the full CDS of the fused sequence in the somatic genome (fusions), the partial CDS of the 5-prime (tum.5p) and 3-prime (tum.3p) transcripts that are involved in the fusion, and the full CDS of the 5-prime (ref.5p) and 3-prime transcripts in the reference genome. Note, we use a single bracket [ when subsetting the RearrangementList so that the resulting object is still an instance of RearrangementList (albeit a length-one list).

cds.list <- fuseCDS_Rlist(rlist4, coding_jxns)
cds.list

For each fusion, we translate the CDS in each of the three possible reading frames and partition the amino acid sequence of the fusion protein into the sequences derived from the 5-prime and 3-prime transcripts. By partioning the amino acid sequence of the fusion into its 5-prime and 3-prime parts, we can compare the 5-prime partition to the amino acid sequence of the 5-prime reference protein and the 3-prime partition to the amino acid sequence of the 3-prime reference protein. The function translateCDS translates the CDS in the 3 possible reading frames.

proteins <- translateCDS(cds.list)
proteins

The amino acid sequence obtained by translating the CDS involved in the 5-prime and 3-prime transcripts are represented as AAStringSet objects. Since a single gene can have multiple transcripts, we translate every possible combination of 5-prime and 3-prime transcripts in each of the 3 frames. Here, we show the amino acid sequences for the two possible combinations of transcripts involving TLN2 and TPM1:

proteins$fusion[[1]]

We can also list the amino acid sequences derived from the 5-prime gene and the 3-prime gene:

proteins$tum5p[[1]]
proteins$tum3p[[1]]

The function partitionAASequence reorganizes the list of amino acid sequences by frame to facilitate downstream computation.

partition.fusion <- partitionAASequence(proteins)
partition.fusion

For a fusion to be in-frame, we require that the amino acid sequence derived from the 5- and 3-prime transcripts to be subsequences of the full amino acid sequence of the reference genome that was translated in the same frame. In addition, we require that there be no premature stop codons. The following code accomplishes these two tasks, with the ref.frames object below containing the full amino acid sequence of the transcripts associated with IKAROS and ERBB4.

nostop.list <- noPrematureStop(partition.fusion)
nostop.list
ref.frames <- organizeReferenceByFrame(proteins)
inframe.list <- inFrameList(fusion.frames=partition.fusion,
                            ref.frames=ref.frames)
inframe.list

Each element of the nostop.list and inframe.list lists are an object of class LogicalList, each evaluated from a different reading frame. The LogicalList objects are named by the rearrangement identifier. The helper function inFrameNoStop combines these to lists to create a single list that is TRUE for rearrangements that are in-frame and have no premature stop codons.

inframe.nostop <- inFrameNoStop(nostop.list, inframe.list)
inframe.nostop

Finally, validFusions returns a list object containing only the fusions that are in-frame and have no premature stops. For each such fusion, the amino acid sequence, CDS of the fusion, partial CDS of the 5-prime and 3-prime transcripts, and the full CDS of the reference transcripts are available.

valid.fusions <- validFusions(partition.fusion,
                              cds.list,
                              inframe.nostop,
                              ref.frames)
valid.fusions

The above data can be summarized in tabular format using the fusionTable2 function:

tab <- fusionTable2(valid.fusions)
tab %>%
  kable("html",  format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options=c("striped", "hover"))

Note, this table provides both the genomic coordinates of the fusion (junction.5prime, junction.3prime), but also the portion of the amino acid sequence that was retained by the fusion (aa.5prime.start, aa.5prime.end, aa.3prime.start, aa.3prime.end). In the following code chunk, we construct amino acid ranges for the genes involved in the fusion and the protein domains for these genes in the reference genome.

tumor_aa_ranges <- aa_granges(tab)
extdata2 <- system.file("extdata", package="svbams")
up <- readRDS(file.path(extdata2, "uniprot.rds"))
up2 <- uniprotFeatures(up, tab, strwrap.width=20)
domain_aa_ranges <- GRanges(up2$hugo, IRanges(up2$start, up2$end),
                            chromosome=up2$seqnames,
                            description=up2$description,
                            short.desc=up2$short.desc,
                            aa_len=up2$aa_len)
domains <- subsetByOverlaps(domain_aa_ranges, tumor_aa_ranges, type="within")
domains

Session information

sessionInfo()


cancer-genomics/trellis documentation built on Feb. 2, 2023, 7:04 p.m.