knitr::opts_chunk$set(collapse = TRUE, fig.width = 8, fig.height = 8, message = FALSE, warning = FALSE) library(gTrack) library(rtracklayer) library(kableExtra) library(magrittr) library(tidyr)
The gGnome package provides a flexible, queriable R
interface to graphs and walks of genomic intervals. gGnome is written in the R6
object oriented standard and built around a powerful GenomicRanges
, data.table
, and igraph
backend, and thus supports agile interaction with graphs consisting of hundreds of thousands of nodes and edges. Because gGnome classes are written in R6, their methods, variables, and "active bindings" are referenced using the $
symbol. This includes methods that (similar to other object oriented languages) enable the object to be modified "in place". Please see here for more details about the R6
standard.
Our key interest in developing gGnome is to develop a framework to analyze the sorts of graphs that arise in the study of whole genome cancer structural variation. However, we believe that the package can be useful in any context where graphs and paths of reference genomic intervals are generated, such as through the analysis of complex germline haplotypes, population genetics of structural variation, partially assembled genome drafts, comparative genomics, and transcriptome assembly.
For installation instructions, please visit the gGnome github page. For background, it may help to have some familiarity with data.table
, GenomicRanges
, gUtils
, igraph
packages.
gGnome is currently in an alpha release and a manuscript in preparation. If you use gGnome in your work, please contact us so that we can update you when the citation becomes available.
The key classes in gGnome comprise the following: the Junction
, the genome graph
(gGraph
) (and its nodes (gNode
) and edges (gEdge
)), and the genomic walk
(gWalk
). The nodes in a genome graph represent reference genomic intervals, also known in
BioConductor parlance as GRanges
. Each gNode
corresponds to a chunk of reference
genome, and represents a double-stranded DNA sequence. Each gEdge
represents a pair of complementary 3-5' phosphodiester bonds (+/- some foreign
sequence insertion), each termed an adjacency.
Since DNA is double stranded, every gNode
in the gGraph
is actually a pair of
stranded GRanges
, one on the "+" and the other on the "-" reference strand.
Similarly, every gEdge
is a pair of directed adjacencies. This makes the
gGraph
a skew-symmetric directed graph.
In this directed graph, every interval has an "anti-interval", every adjacency has an
"anti-adjacency", and every path (i.e. sequence of intervals or adjacencies) has
an "anti-path".
The vertices in a standard directed graph do not formally have left and right "sides" that edges can both enter and exit. This limits the applicability of standard directed graphs to the modeling of DNA, since double-stranded DNA segments can be rearranged with each other in any of four (left/left, right/left, right/right, left/left) possible orientations. Though, in a directed graph, we may arbitrarily label nodes that send edges to our vertex as being to its "left", and edges that receive edges from our vertex to its "right", it is impossible to then connect two vertices in our directed graph on their mutual left.
However, since gGraph
nodes represent interval / anti-interval pairs, we
can talk about their "right-side" (i.e. the side with adjacencies leaving the
"+" interval and entering the "-" interval) and their "left side" (i.e. the side with adjacencies leaving the
"-" interval and entering the "+" interval). Adjacencies associated with the
left side of one node can connect to the left or right side of any other node.
Though we retain the underlying structure of a directed graph (convenient for
applying standard graph algorithms) we are able to model the "sidedness" of DNA
sequence. We contrast this skew-symmetric structure to the traditional
"bidirected graph" formulation
of DNA assemblies, which uses nodes to represent double-stranded sequence "ends"
and two flavors of (intrasegment, intersegment) edges to represent sequences and their
adjacencies, respectively.
Just to review, strands of any (double-stranded) reference genome are arbitrarily labeled "+" and "-", where the "+" strand goes from 5' to 3' in the direction of increasing genomic coordinates. Conversely, the "-" strand of the genome goes from 5' to 3' in the direction of decreasing genomic coordinates. As a convention we refer to the direction of increasing coordinates on the reference as "right" and the direction of decreasing coordinates as "left". In the human genome, the 1q telomere is thus on the "right" side of chromosome 1, and the 1p telomere is on its "left".
Using this system of orientation, there is an intrinsic direction associated
with every node, i.e. there are edges associated with its "left" and "right"
side. Some of those edges may be "REF" edges, i.e. connecting to that node's
reference-adjacent neighbor. The REF edges connecting a node to its right
reference neighbor consist of two adjacencies: one leaving the "+" GRanges
associated with the node and entering the "+" GRanges
of its neighor, and a
second (complementary) adjacency leaving the "-" GRanges
of the neighbor and
entering the node. Other ("ALT") edges may connect our node to a distant locus
that is non-adjacent in the reference, i.e. an adjacency created through
rearrangement.
Real genomes are not graphs, but consist of linear (and possibly circular)
alleles. Each allele is a string of nodes and edges, which we term a gWalk
. When
these alleles are superimposed (i.e. added), the resulting structure is a graph.
The gGraph
is thus only an abstraction, representing a summary of our (best)
knowledge about the latent allelic state of the genomic sequence (also called
its phase). When the nodes and edges on the gGraph
are associated with an
allelic dosage, this notion becomes quantitative - i.e. the graph represents an
estimate of the total dosage of sequences and adjacencies in the given genome.
Such an inference is produced by a handful of rearrangement callers, including
Weaver
,
PREGO
, RemiXT
,
CouGaR
, and (our tool)
JaBbA
Though a gGraph
may be our "best guess" at the variant structure of a genome
(i.e. from short-read whole genome sequencing (WGS)), certain biological
questions require us to make inferences about linear (or circular) alleles.
These require us to query features of gWalk
s generated from our gGraph
,
representing possible alleles in the genome. For example, we may want to
query the genome for the existence of a possible high-copy allele (i.e. path)
joining EML4 and ALK resulting in a protein-coding gene fusion. We may want
to determine whether a particular regulatory element (i.e. "super-enhancer") is
connected via rearrangement to the gene MYC, resulting in its overexpression.
We would like to identify the frequency with which EGFR amplified through the
formation of circular extra-chromosomal (episomal) DNA. We may want to
identify rearrangement signatures (e.g. BFBs) creating recurrent patterns in a
compendium of gGraph
s.
We may also be able to further deconvolve (i.e. phase) alleles in the graph
using long-range sequence or mapping data (e.g. from optical mapping,
linked-read sequencing, or long-read sequencing). Given the complexity of
certain aneuploid cancer genomes this deconvolution will most likely be partial,
i.e. local and probabilistic. gWalk
s and gGraph
s can help us represent and
analyze partially phased genomes, either as probability distributions over
gWalk
s or complex gGraphs
with "branches" and "bubbles" representing
alternate paths over the same reference intervals.
Similarly, the intermediate stages of *de novo * (i.e. bottom up) genome
assembly pipelines produce graphs, which are often broken or arbitrarily
linearized at some stage of analysis, losing possibly important genomic signal.
Alignment, visualization, and analysis of assembly graphs in reference genomic
coordinates can yield significant biological insight and technical intuition,
without the need to generate a "reference-grade" linear assembly. We believe
that gGraph
and gWalk
s can serve as useful abstraction to help annotate and
refine these partial views of altered genome structure.
Our goals in building gGnome are to enable agile analysis of structural variation (rearrangement and copy number alterations) in complex and highly rearranged genomes. Though our work relates to the recent movement in the sequence mapping community to build reference-grade "graph genomes" and "variation graphs", our goal is to serve a different (though related) purpose. Firstly, these frameworks have been designed primarily with the goal of improving alignment and (single or multi-nucleotide) variant calling. Secondly, these frameworks aim to replace the linear reference genome with a novel graph based data strucutre.
In contrast, the vision for gGnome is to sit downstream of read mapping,
assembly, and basic variant calling. We have designed gGnome to enable
interactive exploration, annotation, and visualization of complex variants for
biological discovery by the genomic data scientist. In addition, gGraph
s are
still firmly tethered to "linear reference" coordinates, where many / most
datasets (epigenetic, transcriptome) live. The goal is to enable vetting of
inferred patterns against raw alignment data, both to verify the quality of
variant calls / assemblies / phases and interrogate their impact on biology. gGnome
achieves this by leveraging several excellent packages in the R / Bioconductor
universe for its back-end (GenomicRanges
, data.table
, igraph
) as well as
several useful mskilab
(gTrack
, gUtils
) packages for data manipulation and
visualization. We demonstrate examples below.
## Load the package library(gGnome)
Before we get to graph, we have to define the Junction
. A Junction
is a
signed pair of reference genomic locations that are brought together through
rearrangement. Each junction connecting locations a
and b
has one of four
possible orientations (a-|b+
, a+|b+
, a-|b-
, a+|b-
) that represent the
"sides" of the breakpoint that are being joined. By convention we refer to -
orientation as connecting the left side of the given breakpoint, while the +
orientation connects the right side of the breakpoint.
Junctions calls are usually stored and shared in .bedpe
or .vcf
files.
Though most callers usually provide additional "event" annotations to each
junction (e.g. DEL, DUP, INV, TRA), such annotations lose their meaning in
complex genomes where many junctions cluster near each other on the reference
genome. Many complex (and even some simple variants e.g. inversions, balanced
translocations) arise through multiple junctions, therefore the mapping of
Junctions and "events" is likely not 1-1. Since "event calling" is an open
question in structural variant analysis, the most fundamental unit of structural
variant calling is the Junction - a signed genomic location pair.
We can load Junctions into gGnome from a variety of input formats from
common junction / SV callers, including SvaBa, DELLY, Novobreak using the
function jJ
.
## SvAbA svaba = jJ(system.file('extdata', "HCC1143.svaba.somatic.sv.vcf", package = "gGnome")) ## DELLY delly = jJ(system.file('extdata', "delly.final.vcf.gz", package = "gGnome")) ## novobreak novobreak = jJ(system.file('extdata', "novoBreak.pass.flt.vcf", package = "gGnome")) ## BEDPE bedpe = jJ(system.file('extdata', "junctions.bedpe", package = "gGnome"))
In general we support BND
style .vcf
files and .bedpe
files, though many
callers vary in their details of implementing these formats.
In gGnome, a Junction
is a vectorized object that can be subsetted and
queries on the basis of its metadata using convenient data.table
style (with
= TRUE
) syntax. Junction metadata is caller dependent.
## can use both row and column subsetting on Junction metadata head(novobreak[1:2, 1:10]) ## can use data.table style expressions on metadata to subset Junctions ## here, filter novobreak translocations with quality greater than 50 novobreak[ALT == "<TRA>" & QUAL>50, 1:10][1:2, 1:5] ## subsetting SvAbA junctions with >10 bases of homologous sequence (short templated-sequence insertions or STSI as defined in the SvABA paper) svaba[nchar(INSERTION)>10, ][1:2, 1:5] ## subsetting SVabA junctions with same sign and nearby breakpoints (i.e. small $span) svaba[svaba$sign>0 & svaba$span<1e5][1:2, 1:5] ## subsetting junctions with infinite span (ie different chromosome) and homology length >5 delly[is.infinite(delly$span) & HOMLEN>5, ][1:2,1:5]
Note that we are showing off the $span
and sign
active bindings of the
Junction
class, which return the "span" (i.e. distance between breakpoints)
and "sign" (i.e. product of the breakpoint signs, where "+" = 1, "-" = 1) as a
vector.
Yes, a Junction
is just a glorified GRangesList
, which we can extract using the $grl
accessor
svaba$grl[1:2]
However, unlike GRangesList
we can perform junction specific queries which
only allow for overlap between junctions if both breakpoint locations and
orientations match.
## subset svaba by those intersect with DELLY using gUtils subset %&% operator length(svaba %&% delly) ## increase the overlap substanntially by padding delly calls with 100bp (using + operator) length(svaba %&% (delly+100)) ## basic set operations also work length(setdiff(svaba, delly+100)) ## length(union(svaba, delly+100))
We can do more comprehensive coordinate-based merges / joins of junctions using
the merge
function. The output is a "consensus" junction set, i.e. a
Junction
object representing the union of the input junctions with logical
metadata columns prefixed by $seen.by
that keep track of which junction was "seen" by
which caller.
## any names work in the arguments to merge, these will be reflected in metadata as a $seen.by column ## (using padding of 1kb and c() to remove existing metadata) res = merge(svaba = svaba[, c()], delly = delly[, c()], novo = novobreak[, c()], anynameworks = bedpe[,c()], pad = 1e3) head(res) ## here we can use the $dt (data.table) accessor quickly make an UpSetR plot library(UpSetR) ## munge res$dt into upset friendly format df = as.data.frame(sign(as.matrix(res$dt[,.(seen.by.anynameworks, seen.by.delly, seen.by.novo, seen.by.svaba)]))) upset(df)
We can create a gGraph
from several input sources, including junctions, node
and edges, and a a host of "graph SV callers" (Weaver
,
PREGO
, RemiXT
,
CouGaR
, and
JaBbA
) using the gG
constructor.
This constructor will "break" the genome (defined using seqinfo
of the
Junction
object) at each connection end and create an ALT
gEdge
from 2 of
the 4 resulting interval pairs. It will also leave a REF
gEdge
connecting
the reference adjacent nodes.
### gGraph from svaba input gg = gG(juncs = svaba)
The gGraph
contains 1059 nodes and 1475 edges. The majority of edges are REF
edges connecting reference adjacent nodes, while the remaining are ALT
edges
that inherit metadata from the input SvAbA junctions. The graph also contains 168 "loose ends",
of which all are terminal. Loose ends represent sequences "ends" that lack a left or right
neighbor. These occur at the ends of the contigs that make up the Seqinfo
or seqlengths
of this object. Since this hg19
derived seqinfo(gg)
has 84 "chromosomes", this gGnome
object
has 168 terminal loose ends ends. In theory, loose ends can represent terminal 3' and 5' phosphates at
real DNA "ends" (i.e. teleomeres). More generally, loose ends represent possible sites of attachment
of missing or annotated sequence, e.g. unknown or repetitive sequence. Technically, even the annotated
telomere represents an "end".
We can plot the resulting graph as well as the junctions using gTrack
, a useful Marcin Imielinski Lab
package for visualizing GRanges
-based genomic tracks. Each gGraph
object
has an accessor $gt
that generates a gTrack
object, which can be
concatenated with other gTrack
objects, e.g. those representing genome-wide
coverage, ChIP-seq, gene annotations, etc.
## we use gTrack to plot the gTrack associated with this gGraph ## the second argument to gTrack plot is a string or GRanges representing the ## window to plot, the links argument enables drawing of junctions from GRangesList plot(gg$gt, '1', links = svaba$grl)
Each sequence on chromosome 1 is drawn as a gray rectangle. Rectangles are
connected by gray REF
and red ALT
edges. Most red ALT
connect different
rectangles on chromosome 1 (i.e. are intrachromosomal), though some edges go off the graph
and only one end is visible. (are interchromosomal). The topmost track demonstrates the input SvAbA junctions,
which distributed across the gGraph
. If you look close you can see two "hooked"
blue loose ends, one at the left terminus and the other at the right terminus of chromosome 1.
In this plot, the y axis position of each rectangle has no meaning - the vertical stacking is purely for visual purposes, i.e. to minimize collisions.
We can create a graph with additional REF
edges by supplying a "breaks"
GRanges
to the gG
constructor. This will break the genome and add new gray REF
edges (but no ALT
edges) at each location.
### generate breaks using gUtils function to tile genome at evenly spaced 1MB intervals breaks = gr.tile(seqinfo(svaba), 1e6) ### gGraph from svaba input gg2 = gG(breaks = breaks, juncs = svaba) ### set gGraph metadata gg2$set(name = 'with breaks') ### compare graphs towards the beginning of chromosome 1 ### (gTracks can be concatenated to plot multiple tracks) plot(c(gg$gt, gg2$gt), '1:1-5e7', links = svaba$grl)
To create custom graphs we provide a constructor that takes a GRanges
of interval nodes
and
a data.table
of edges
specifying their connectivity. The edges data.table
contains 4
essential columns: n1
and n2
specifies the nodes
indices that are being joined and n1.side
and n2.side
specify which side of each node is being joined. The value of side can be integer
(0
for left, 1
for right) or character (left
, right
).
## tiling of chromosome 1 nodes = gr.tile(seqlengths(svaba)["1"], 1e7) ## generate 20 random edges (n1, n2, n1.side, n2.side) edges = data.table( n1 = sample(length(nodes), 20, replace = TRUE), n2 = sample(length(nodes), 20, replace = TRUE)) edges[, n1.side := ifelse(runif(.N)>0.5, 'right', 'left')] edges[, n2.side := ifelse(runif(.N)>0.5, 'right', 'left')] gg3 = gG(nodes = nodes, edges = edges) plot(gg3$gt, '1')
Note that all edges in the graph by default are `ALT` (red) and loose ends (blue) are automatically
placed at any interval end that lacks an edge. In this case all the nodes are
non-overlapping and equal size but nothing in gGraph requires this.
wzxhzdk:11
### From external tools
We support several "graph SV caller" formats for `gGraph` import.
wzxhzdk:12
These are graphs built from the analysis of the Her2+ breast cancer cell line
HCC1954. These plots, unlike the previous, have a y-axis, which in this case
represents copy number. The objects / tracks named by the caller that generated them.
These`gGraph` metadata features were set during object instantiation and can be
accessed using the `$meta`active binding and set using using the `$set` method.
These meta fields are used as defaults for certain gGraph methods e.g. `$gt`,
`$simplify`, `$reduce`, and others (some of which are described below).
The `y.field` meta data field, in particular, tells the `$gt` active binding to plot
the `gNode` metadata `"cn"` on the y.axis. As a result the y axis position of
each node tells us how many copies of that interval is predicted by
that algorithm to exist in HCC1954.
If we set `y.field` to `NULL` then `$gt` will not create a y axis and stack the
intervals, similar to the previous examples where created a `gGraph` from
`breaks` or directly from `nodes` and `edges`.
wzxhzdk:13
## Manipulating and browsing gGraphs
One useful feature of the `gGraph`, enabled by the `R6` system, is the ability
to easily navigate nodes and edges of the graph by subsetting, querying on metadata, and
quickly selecting neighboring nodes and edges of a given query.
### Navigating nodes and edges
Every `gGraph` object contains `gNode` object (accessed by `$nodes`) and a `gEdge` object
accessed by `$edges`. A `gNode` can be converted to `GRanges` via the `$gr` acitve
binding and to `data.table` via the `dt` active binding. Similarly `gEdge` can be
converted to `GRangesList` via the `$grl` active binding and to `data.table`
via the `dt` active binding. A `gEdge` can also be converted to a `Junction`
via the `$junctions` active binding.
wzxhzdk:14
Though nodes and edges can be subsetted using numeric indices, it is often useful to subset them
using metadata. For example we may be interested in a high copy interval in the JaBbA
`gGraph` or an aberrant junction annotated with a long range sequence insertion. `gNode` and `gEdge`
both support subsetting using expressions based on metadata.
Two default metadata that exist for gEdge is `type` and `class`.
Edges can be of `type = "REF"` (reference) or `"ALT"` (variant). `"ALT"` edges are further
classified (according to the `class` metadata field) as `"DEL-like"`, `"DUP-like"`,
`"INV-like"`, and `"TRA-like"` depending on the junction orientation and reference
chromosome locations. We use the suffix `-like`
to emphasize the fact that we don't actually know whether a `DEL-like` junction represents a deletion
without further modeling / analysis. Depending on a given of model of locus evolution and
phase, a duplication event may appear as a `DEL` on the reference, and a late duplication may appear
as a `TRA-like`, i.e. translocation-like. Nevertheless, this initial classification is useful
for subsetting and analysis, using a `data.table` style syntax.
wzxhzdk:15
We can use `gNode` `$left` and `$right` accessors to navigate the "left" and "right" neighbors of `gGraph`
nodes and edges. (See "The Basics" section above for left / right defintitions). We can
also use the `$nodes` accessor on a `gEdge` object to get all of the nodes connecting to any of the subsetted
edges, and (vice-versa), we can use the `$edges` accessor of a `gNode` object to get all the edges connected to
at least one of those nodes.
wzxhzdk:16
Since `gTrack::plot` takes `GRanges` as well as character strings for it's second argument, we can
easily plot the JaBbA and RemiXT model in the vicinity of the `highcopy` nodes. We also
use the `track.gencode()` function in `gTrack` to plot the GENCODE genes in these regions.
wzxhzdk:17
We can see here that JaBbA finds the known high-level amplification of `ERBB2` aka HER2 in
this cell line, which appears to be driven by a cluster of fold back inversions known as
BFBs.
One useful feature of `gNode` and `gEdge` is that they take `gUtils` operations such as `%&%`
(strand-agnostic subset by overlaps). For example, we may want to locate the high-copy JaBbA
nodes in the RemiXT graph.
wzxhzdk:18
### Marking edges and node metadata
Since `gNode` and `gEdge` objects are pointers to part of a `gGraph`, their methods can be used to
modify the metadata of the `gGRaph` in place. This is done using the `$mark()` method.
It may be useful for example for us to highlight where the junctions harboring templated insertions
in the JaBbA graph sit in the ReMiXT graph.
wzxhzdk:19
We can see this plot shows two blue edges in three (reference) discontiguous
windows. One of the blue edges (on the left) is a `biginsert` edge that is
found only in the JaBbA model, while the the second blue edge (in the right
window on chromosome 4 is found by both JaBbA and ReMiXT). The nodes attached
to these edges have been highlighted, which in the bottom track mark the nodes
where we would expect to find the first blue `biginsert` edge.
### Trimming and subsetting gGraphs
In certain cases we may want to isolated pieces of `gGraphs` for further
analysis. For example, we may want to pull the subgraph associated with a
particular subset of nodes, trim a graph around a set of `GRanges`, or pull a
subgraph that is within some (base pair) distance of a seed region on the
`gGraph`.
wzxhzdk:20
The first two (bottom) tracks generate identical graphs, subsetted to neighbors
of `highcopy` nodes. The third track provides a window that includes 100kbp on each
side of high copy nodes. The `trim()` function trims the boundaries of the nodes in the graph to the
provided window. The fourth track does the same to the RemiXT graph.
Another useful task in `gGraph` browsing is to define a subgraph in a given "neighborhood" around a given seed.
Though this can be sort of done using the combinations of `$nodes` and `$edges` accessors on `gNodes`,
this limits us to seed that are themselves nodes (or edges) in the graph only allows us to expand a certain
order (i.e. `ego` from the given seed).
Though we provide an `$ego` query on nodes (shown below), a more useful way to define local subgraph is around a
specific `GRanges` seed and some *base pair distance* away from that seed. Let's go back to the `highcopy` region
around *ERBB2* and try to zoom out a certain distance in the graph.
wzxhzdk:21
Here, we see several low copy junctions to regions of chromosome 1, 11, and 12 internal to locations internal to
the *ERBB2* amplicon (rightmost window). This suggests late incorporation of distant genetic material into this
(likely episomal) amplicion, or (alternatively) possible chromosomal integration sites
of 1 or more copies of this amplicon e.g. into chromosome 12.
## Advanced gGraph manipulation
We can edit `gGraph` objects by adding edges
or nodes, replicating nodes or paths (i.e. introducing "bubbles"), and merging
graphs with each other. We may want to "chop up" (or "disjoin") the nodes of
a graph across implicit internal reference connections, or simplifying the
graph by merging reference adjacent nodes (+/- if they share a particular
metadata feature). We may also want to combine graphs that represent different
(partially phased) haplotypes into a union graph, and reduce or disjoin this
graph into one that contains a single set of non-overlapping nodes.
Let's begin with a simple graph
wzxhzdk:22
### Concatenate, disjoin, and simplify graphs
We can concatenate any graphs, which takes the union of their nodes and junctions. We can also apply
"disjoin", "simplify", and "reduce" to split, collapse, and merge graphs
wzxhzdk:23
### Add nodes and junctions
Disjoining by `GRanges` is a quick way to add intervals to a graph without changing
any connectivity (since we are just "instantiating" the reference edges that are
internal to a node and implicitly exist).
wzxhzdk:24
We can use the `$add` method to add edges to the graph
wzxhzdk:25
A simpler syntax is to use `$connect`.
wzxhzdk:26
We can also `$add` junctions, however since a `Junction` is a pair of coordinates, they they create
ALT edges between every intersecting node pair in the graph. If we want to add these same locations
to just one "haplotype" of the input graph, then we will need to use `$connect` or `$add` with a specific
node pair.
We illustrate using `gg3`.
wzxhzdk:27
### Replicate nodes and paths
An important concept in representing individual or population variation in genome graphs
is the concept of a "bubble", or a variant / ALT path through the graph that contains one or
more sequence differences like SNV or indels. We can instantiate such paths in an existing graph using
the `$rep` operation, which "replicates" a given node or path in the graph.
wzxhzdk:28
In addition to replicating individual nodes we can create bubbles from paths in the graph.
wzxhzdk:29
Now that we have a complex haplotype structure with alternate paths, we may want to add
an allele specific rearrangement edges, in this case a shortcut from the right side of node 5
to the left side of node 10.
wzxhzdk:30
## Components and communities
### Weakly and strongly connected components of nodes
We can label weakly and strongly connected components in a `gGraph` with the `$clusters` method. These allow us to group nodes and define genomic subgraphs based on the connectivity patterns of nodes. This method modifies the `$cluster` node metadata data field, which can then be used to query and subset the gGraph.
wzxhzdk:31
Strongly connected components require that every node is reachable from every other node via a directed path. As seen above, multi-node strongly connected components result in cyclic structures. In genome graphs, these represent complex amplicons.
Weakly connected components may not reveal interesting structures for a full genome graph representing a highly rearranged cancer genome, but may reveal interesting structure once we remove very wide nodes. The `gNode` version of the `$clusters` method allows us to conveniently mark up clusters in our original graph after clustering the subgraph only involving the selected set of nodes
wzxhzdk:32
### Edge clusters
Clusters of quasi-reciprocal ALT edge can reveal "cycles" or "chains" of rearrangements in genome graphs.
Such patterns have been dubbed "chromoplexy" (multi-way balanced rearrangements) and also linked to
TIC (templated insertion chains). We enable detection of these edge clusters with the `gGraph` method `$eclusters`. This takes
an argument `thresh` which is a distance threshold with which to group nearby quasi-reciprocal junctions - i.e.
if `thresh=0` then we only consider clusters of exactly reciprocal junctions. The default value of `thresh` is
1 Kbp.
wzxhzdk:33
# The gWalk
The `gWalk` allows us to represent and analyze paths of genomic intervals in gGraphs, and annotate these paths with important metadata.
Most fundamentally, a `gWalk` represents a possible linear
(or circular) allele that can be inferred from our current graph model. More generally, the analysis of gWalks
can help us study and annotate the connectivity of the graph, with the goal of identifying protein-coding gene fusions
and non-coding regulatory element "proximities" (see Applications section below). Once we generate a `gWalk` object, it is very
easy to query the `gGraph` that it belongs to and extract information about nodes and edges belonging to that `gWalk` or more loosely associated
with it.
## (Shortest) paths between node pairs
A simple way to generate `gWalk` objects is to search for paths connecting distant nodes in our graphs. These paths represent *possible* alleles or haplotypes that may exist in the genome that our graph is modeling. The `$paths` method of `gGraph` takes as input vectors of signed node ids or `gNode` objects corresponding the start (i.e. "source") and terminus (i.e. "sink") of paths that we are searching for. If no such paths exist, `$paths` will return an empty `gWalk` object. The default behavior of `$paths` is to operate in a strand-inspecific fashion (demonstrated below), but if we set `ignore.strand = FALSE` then the query will only return paths that start and end on the specified strand. This is important when examining connections between single-stranded genetic features (e.g. exons).
wzxhzdk:34
The `gWalk` object is vectorized, and can be subsetted much in the same way as a `gEdge` or `gWalk` is, either by
using indices or `data.table` style expressions on `gWalk` metadata. Since every gWalk represents a sequence of
double-stranded genomic intervals in a given orientation (+ vs -) it is biologically equivalent to its reverse complement
(i.e. the reverse sequence of intervals in the opposite orientation). To reverse complement a `gWalk` we just
index it using a negative integer (this is different from the standard interpretation of R vector indexing,
where negative integers means removal of the provided indices).
wzxhzdk:35
## Walk decompositions of (sub)graphs (TBC)
Genome graphs represent a summary of the all the possible linear or circular alleles that are possible
given our model of a rearranged genome. We can generate such walks for a gGraph using the `$walks` method,
though for a large and complicated graph the number of such possible walks can become untenable to compute
due to combinatoric issues. However, we can do this reliably for subgraphs, e.g. those defined through the analysis
of clusters and commmunities (see previous section).
wzxhzdk:36
## Creating gWalks *de novo*
Underlying each`gWalk` object is a list of signed node ids and a list of signed edge ids in a `gGraph`.
These can be retrieved using the `$snode.id` and `$sedge.id` active binding. Each
list item represents an ordered set of nodes and edges comprising the walk.
A `gWalk` can be instantiated
directly from signed node id or signed edge id lists and a `gGraph` input via the `gW` function.
The provided node or edge sequence will have to exist in the provided `gGraph`, otherwise `gW` will
error out or (if `drop = TRUE` is set) will ignore invalid inputs and return a `gWalk` object whose
length is smaller than the provided lists.
wzxhzdk:37
Alternatively, we can use the `gW` function to instanatiate a `gWalk` from a `GRangesList`. This can be done on top of
a provided `gGraph`, however the connectivity of the `gGraph` must support the input `GRangesList`. As above, this
requires that all reference and non-reference edges implied by the provided `GRangesList` must exist in the graph.
However this does not require the intervals in the input `GRangesList` exactly match those in the graph.
For example the `GRangesList` can contain a hyper-segmented or hypo-segmented set of nodes relative to the
`gGraph`, i.e. one that explicitly instantiates the implicit reference edges that are internal to an interval,
or collapses one or more reference adjacent nodes into a larger node.
wzxhzdk:38
A `gGraph` does not have to be provided during `gWalk` instantiation from `GRangesList`. However, every `gWalk` object
must be associated with a `gGraph`. If a `gGraph` is not provided, a brand new `gGraph` is created either by treating
each interval in each `GRangesList` as a separate node (if `disjoin = FALSE`) or by collapsing all input
intervals (if `disjoin = TRUE`) into a single `gGraph` of non-overlapping nodes.
wzxhzdk:39
## Disjoin and simplify gWalks
Just like a `gGraph`, a `gWalk` can be simplified and disjoined. These operations are first applied to the underlying `gGraph`
and then the walks are re-threaded on this altered graph. Since simplify, disjoin, and reduce do not restrict the connectivity of the graph
(i.e. every path that exists in the original graph will exist in the altered graph), all the walks on the original graph will be "legal" on the altered graph.
wzxhzdk:40
## Analyzing gWalks
Each gWalk refers to a string of `gNode` and `gEdge` objects, each which contain their own metadata. When analyzing a collection
of walks (e.g. those representing rearranged haplotypes connecting regulatory elements and genes), we may want to annotate the walks
on the basis of the features of the nodes and edges that contribute to them. This is enabled through the `gWalk` method `$eval`, which
takes a scalar-valued expression on node or edge metadata (default is to test both) which will return a vector whose length is the number of walks
where that expression has been evaluated once per walk. You can think of this as a fast lapply. It useful for doing "feature-engineering"
of walks to identify those with interesting or biologically relevant properties.
wzxhzdk:41
# Applications
OK, now we've gotten through all the basics of `gGnome`, including a survey of the key functionality of
`gGraph`, `gWalk`, `gEdge`, `gNode`, and `Junction`.
Now we're ready to do some analysis with `gGnome`. We present two use cases: the
first involves searching for protein-coding gene fusions, and the second
involves identifying regulatory element-gene proximities induced through genomic rearrangement. Each involve the analysis of a `gGraph` object and
and return a `gWalk` object for the analyst to inspect, plot, annotate, and/or manipulate.
## Protein fusions
A very important reason to be interested in (cancer) genomic rearrangements is that they may
bringing together two (or more) coding sequence (CDS) regions into a neomorphic protein fusion.
Such fusions may result from simple (i.e. single ALT junction) alleles but may employ more than one
junction in *cis*. We can derive gene fusions as a path search on a `gGraph` using the `fusions()`
function.
Each output `gWalk` represents the *genomic* coordinates of a possible protein-coding gene fusion.
The metadata field `$gene.pc` is a character vector, which describes each protein fusion as a string
of protein coordinates delimited by ';', where protein interval is representing using the character
notation for a `GRanges` (seqnames:start-pos). Since `fusions()` does careful book-keeping of protein frame,
certain nodes in the walk may represent "out-of-frame" regions. These are represented by
brackets around the gene name, and a `fs` suffix (e.g. `[BRAF]fs`). The `$gene.pc` string can be
converted to a bona-fide protein coordinate `GRanges` via the `gUtils::parse.grl` function.
Some fusions may duplicate or delete genetic material. The protein coordinates of any deleted or duplicated
material is encapsulated in `$amp.pc` and `$del.pc` metadata fields of the output. Only a subset of fusions
outputted by `gGnome` are completely in-frame (denoted by metadata field `$in.frame == TRUE`). Others may
start and end at in-frame node, but have one or more out-of-frame regions in the middle - we label these
using the metadata field `$frame.rescue == TRUE`.
Finally, though we usually identify protein fusions by their gene, fusion call is actually made on the coordinates
of a specific GENCODE transcript. The "transcript analogue" of the `$gene.pc` metadata field is `tx.cc`, which
represents the fusion as a string of CDS coordinates, with the same `GRanges` style character string that
can be converted to a cds `GRangesList` via `grl.string`. The same naming convention
is used to denote in-frame and out-of-frame segments of the neomorphic transcript. This field can be
used to reconcile fusion predictions between RNA-seq and WGS.
The output of `fusions()` also contains out-of-frame fusions (i.e. that have both `$in.frame == FALSE` and `$frame.rescue == FALSE`).
We include these because searching for in-frame fusions is not perfectly sensitive for complex multi-junction
gene fusions. When a fusion `gWalk` includes a piece of intergenic sequence, the (fast, greedy) graph search
algorithm underlying `fusions()` may not be able to optimize for "frameness". We thus include these
"out-of-frame" fusions to allow users to look harder for more optimal paths that might be in-frame and also
to allow for the (perhaps exotic) possibility that out-of-frame fusions may undergo RNA editing
or alternate splicing to yield an alternate fusion product. Finally, some users may have interest in *bona-fide* truncating fusions, though many of these can be discovered through a simpler analysis (e.g. somatic copy number analysis through the search for recurrent deletions).
Though the `fusions()` function may take many minutes to run in "unbiased mode" on a highly rearranged
genome like HCC1954, we can also use it in a targeted fashion using the `genes=` argument to quickly query for
specific gene fusions. The examples below employ this usage for speed, though most users will want to do
this analysis transcriptome-wide (i.e. without setting the `genes=` argument). Use of `mc.cores` will also
speed up computation considerably for those with multicore machines.
wzxhzdk:42
We can also find frame-rescues that begin at *ASAP1*, then incorporate a frameshifted chunk of the gene *NSD1*, and end up back at *ASAP1* in frame. Such variants are a bit strange to interpret, since the frameshifted
chunk may likely contain a stop codon (currently not annotated), and may require further investigation
to determine whether they produce functioning protein products.
wzxhzdk:43
## Proximity analysis
Rearrangements can bring regulatory elements in (1D and 3D) proximity with each other and with genes, perturbing gene expression and chromatin states in *cis*. To detect / annotate such noncoding consequences of complex rearrangeents, we have developed `proximity()` analysis, which takes two sets of `GRanges` (`query` and `subject`) and identifies pairs of these that have been brought near each other through rearrangement. For each such "proximity" it emits a `gWalk` representing the path connecting these genetic elements. We illustrate this analysis by identifying proximities between super-enhancers (also called locus control regions) and certain genes in HCC1954.
The `gWalk` object outputted by the `proximity()` function will have several standard metadata fields. These
include: `$reldist` which is the fractional distance of the distance in the `ALT` graph (i.e. the `$altdist`)
relative to the reference distance (i.e. `$refdist`). The remaining metadata features are inherited from the
`GRanges` metadata from the pairs of subject and query arguments provided to `proximity()`. The returned `gWalk`
object is sorted by its `$altdist`.
wzxhzdk:44
In this case, the superenhancer is on the same chromosome as BRD9 (chromosome 5).
We may be specifically interested in proximities involving superenhancers that were previously
very distant to their target gene (e.g. `Inf` base pairs away on the reference). In
addition we may want to identify those that use many ALT junctions.
We can use the `gWalk` subsetting and analysis features to quickly find these.
wzxhzdk:45
## Classifying SV events
There are many mutagenesis pathways producing various patterns of SVs in the cancer genomes. By identifying the specific motif that are recurrent in the cancer genoem graphs, we can classify subgraphs that satisfy the set of criteria corresponding to a particular class. These classes range from simple like deletions and tandem duplications, to complex like chromothripsis and breakage-fusion-bridge cycles. As of [Hadi et al, Cell 2020](https://www.biorxiv.org/content/10.1101/836296v2), we support the identification of 5 simple event and 8 complex:
* simple
+ [Deletion (del)](#deletions-and-rigma)
+ [Tandem duplication (dup)](#tandem-duplications-and-pyrgo)
+ [Inversion (inv)](#simple)
+ [Inverted duplication (invdup)](#simple)
+ [Translocations (tra)](#simple)
* complex
+ [Rigma](#deletions-and-rigma)
+ [Pyrgo](#tandem-duplications-and-pyrgo)
+ [Templated insertion chains (tic)](cp-and-tic)
+ [Chromoplexy](#cp-and-tic)
+ [Chromothripsis](#chromothripsis)
+ [Breakage-fusion-bridge cycles (bfb)](#amplicons)
+ [Double minute (dm)](#amplicons)
+ [Tyfonas](#amplicons)
For ease of use, all the callers are wrapped into a single function `events`:
wzxhzdk:46
Below we describe each function in detail.
### Deletions and Rigma
When simple deletions accumulate in the same locus more than expected from a null distribution (explained below), their cluster is labeled rigma.
wzxhzdk:47
### Tandem duplications and pyrgo
Similar to deletions and rigma, when a particular genomic region attain more than expected tandem duplications, the cluster of duplications is classified as a pyrgo.
wzxhzdk:48
### Chromothripsis
Chromothripsis has been a stereotypical example of complex SV event, since its inception in [Stephens et al., Cell 2011](https://doi.org/10.1016/j.cell.2010.11.055), in which it is proposed to originate from a single catastrophic shattering of a relatively focused chromosome region (e.g. arm), then randomly ligated by DNA repair pathway, resulting in high numbers of junctions clustered, and oscillating copy numbers due to the random loss/retention of DNA shards.
wzxhzdk:49
### Complex amplicons (bfb, dm, tyfonas)
Long has been known that genomic amplifications can drive carcinogenesis, yet the characterization of the exact structures beyond the increase in dosage are limited. Two mechanisms are well known, double minute (DM, also known as extrachromosomal circular DNA, eccDNA) and breakage-fusion-bridge cycles (BFBC). Genome graphs provides an oppurtunity to systematically differentiate and discover the spectrum of amplicon structures. In Hadi et al., Cell 2020, We describe a genome graph-derived feature space for these subgraphs that contain amplified junctions (JCN>7) and stably clustered into at least 3 classes, two of which correspond to DM and BFBC, but a third class with high numbers of junctions and high numbers of fold-back inversions, which is named *tyfonas* based on its wide reaching, extensively rearranged appearance.
The function `amp` identifies
wzxhzdk:50
### Chromoplexy and TICs
A series of long-range junctions (jumping over >=10Mbp on reference genome) can form a chain-like topology, when two breakends from two adjacent junctions in the chain are close enough to each other (by default <= 10 kbp).
wzxhzdk:51
### Simple
We also support identification of 3 other simple event types, namely inversion, inverted duplication, and ranslocation. Here are some examples:
wzxhzdk:52
### Make your own caller
Besides running our current implemented event callers, `h2228nome` serves as a generalized framework for users to write their own event callers. A general template procedure is:
- Filter the nodes/edges of interest by their marginal features, e.g. copy number, width, junction span (reference distance between the breakends), orientations
- Cluster by the node or edge based on graph topologies to get candidate subgraphs
- Or walk the graph and look for paths or cycles as candidates
- Summarize a set of features of the candidate subgraphs or walks
## Balance
`balance` is a function for inferring copy number of nodes and edges of genome graphs by integrating read depth together with graph topology and applying the junction balance constraint. `balance` is the core function used in our package [JaBbA](https://github.com/mskilab/JaBbA). To learn more about junction balance analysis, please refer to the [Hadi et al.](https://doi.org/10.1016/j.cell.2020.08.006) publication.
`balance` takes as input a gGraph with segment copy number estimates stored in the node metadata and solves a Mixed Integer Program (MIP) to infer integer node and edge copy numbers. By default, a mixed-integer linear program (MILP) optimization is performed. JaBbA used a mixed integer quadratic program (MIQP) at the time of publication, but has since been modified to use linear programming due to improved speed and performance. To run `balance`, users will need to install [CPLEX](https://www.ibm.com/analytics/cplex-optimizer).
### What's the difference between JaBbA and balance?
JaBbA is basically a wrapper for the balance function and accepts as input coverage depth, segmentation and junction calls. JaBbA performs various steps prior to generating and balancing a gGraph (for more details, refer to the [publication](https://doi.org/10.1016/j.cell.2020.08.006)). JaBbA is intended for estimating segment and junction copy numbers for a whole genome.
The `balance` could be used for fine-tuning a gGraph or for estimating values for a subgraph, as well as for analysis of more general cases of graphs such as long reads, segmentation data without coverage depth, or any case in which preliminary copy number estimates are available and the junction balance analysis is applied to find the closest balanced graph with integer cn values.
Notice that the usage of `balance` outside of JaBbA is an experimental functionality, please feel free to [reach out to us](mailto:lab@mskilab.org) if you are interested in exploring this functionality and need help.
### balancing an example subgraph
Here we demonstrate how to run `balance` on a small region of the CCLE [HCC1954](https://depmap.org/portal/cell_line/HCC1954_BREAST?tab=mutation) genome. `balance` can also be applied genome-wide to fit whole genome `gGraph`s.
First, we will read the example subgraph and coverage file.
wzxhzdk:53
Note that in the plotted image, the "y" location of the subgraph graph nodes is not reflective of copy number.
### adding node copy number estimates via binstats
`binstats` is a function for using read coverage from your coverage file and sample purity/ploidy to produce node copy number estimates. It returns a new `gGraph` with the fields `cn` and `weight` added to the node metadata. `cn` represents a node copy number estimate. `weight` is calculated by dividing the number of coverage bins within a node by the coverage variance and reflects our certainty about the node CN estimate (e.g. we are more confident about the copy number estimate of nodes with higher weight).
A few important parameters in binstats:
- `bins` is a `GRanges` with coverage data in the metadata column specified in `field`. In this case, the coverage information is stored in a column named "ratio"
- `field` is the name of the column containing coverage data.
- `purity` is the sample purity. It should be between 0 and 1, specifying the proportion of tumor cells.
- `ploidy` is the sample ploidy (width-weighted average copy number).
wzxhzdk:54
The `cn` and `weight` fields of `binstats.sg` are now populated:
### balancing the graph
Now that the subgraph has CN estimates, we are ready to run balance.
There are a few parameters in `balance` to be aware of:
- `lp` controls whether MILP (`TRUE`) or MIQP (`FALSE`) is used and is `TRUE` by default. We recommend this default because it improves runtime and convergence.
- `lambda` controls the loose end slack penalty. Higher values of lambda make it less likely for `balance` to incorporate a copy number change without an associated junction. Suggested values are between 10 and 1000.
- `epgap` controls the MIP optimization relative gap tolerance. Optimization will stop when this number is reached. We suggest values of 1e-4 and below.
- `tilim` is the alotted time for MIP optimization in seconds. Optimization will stop when this number is reached.
- `verbose` controls how much output is printed. Setting to 0 will print nothing, setting to 1 will print helpful debugging messages, and setting to 2 will also print the CPLEX output.
wzxhzdk:55
Note that the "y" location of the subgraph nodes now represent their integer copy numbers as inferred by `balance`:
# Interactive visualization with gGnome.js
We built [`gGnome.js`](https://github.com/mskilab/gGnome.js), an interactive web application that allows browsing the genome graph at arbitrary windows along the reference genome, rich graph annotation options, cohort metadata filtering and more. In addition, the interface allows you to visualize coverage and RPKM data alongside your graphs/walks. We have made public the 2778 genome graphs in our paper [here](http://mskilab.com/gGraph/) so you can start exploring the interface right away.
The gGnome package includes some methods for preparing the data to be visualized with gGnome.js. We will describe these methods below as a practical guide of how to get your data visualized.
The best way to get familiar with the features and configuration of gGnome.js is to read the [gGnome.js documentation](https://github.com/mskilab/ggnome.js).
In order to visualize your own `gGraph` and `gWalk` in the browser, first clone the repository from github and follow the instructions in its [README](https://github.com/mskilab/gGnome.js/blob/master/README.md).
Once you have gGnome.js installed on your machine, you can add your files to the gGnome.js folder in order to visualize your own data.
## Generating gGnome.js instance
The easiest way to visualize your own data is to use the `gGnome.js()` function in the `gGnome` package.
This function accepts a table with paths to RDS files containing the genome graphs (gGraph objects) and coverage files.
Notice that the coverage files can be provided either as an RDS containing a GRanges object or as any file that can be parsed into a GRanges (e.g. txt, tsv, bed, bw, or wig).
As an example, we can use this function to generate a gGnome.js project for the hcc1954 and h526 graphs from above. First we need to save these graphs into RDS files:
wzxhzdk:56
Let's load the path to a sample coverage file for HCC1954 (this file only includes coverage for chr8 just for demonstration).
wzxhzdk:57
We then generate the table required by `gGnome.js`:
wzxhzdk:58
We are now ready to generate the gGnome.js project. We would need to provide a path to the location where we want the project to be created. Notice that this must be a new directory. If you wish to add more files to an existing directory you must use `append = TRUE`.
wzxhzdk:59
That's it, you are now ready to launch the gGnome.js interface. You can do so by running the BASH script in your gGnome.js project dir (in our example: `./demo_gGnome.js/start.sh`). This should open the following adress in your browser: [http://localhost:8080/index.html](http://localhost:8080/index.html).
And this is what you should see:
wzxhzdk:60
Here is an outline of the steps done by `gGnome.js()`:
1. Clone the [github repository of gGnome.js](https://github.com/mskilab/ggnome.js) (unless `append` is set to `TRUE`).
2. Generate a JSON format version of the genome graph using the `json` method of `gGraph` and store it in the `json` subdirectory of the gGnome.js directory.
3. Save a CSV formatted version of the coverage file of each sample to the `scatterPlot` subdirectory of the gGnome.js directory.
4. Generate the [sample description file: `datafiles.csv`](https://github.com/mskilab/gGnome.js#adding-description-of-your-samples).
## How are files generated
To generate the genome graph JSON files and the coverage CSV files `gGnome.js()` uses `gGnome` functions/methods. You can use these functions directly to add files to your gGnome.js folder.
JSON files are generated using the `json` methods of `gGraphs`:
wzxhzdk:61
`gWalks` also have a `json` method that generates a JSON file that could be visualized by gGnome.js.
We plan to soon add an option to `gGnome.js()` to automatically output certain walks along with the genome graph.
Coverage CSV files are generated using the `cov2csv` function. For example:
wzxhzdk:62
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.