knitr::opts_chunk$set(collapse = TRUE, fig.width = 8, fig.height = 8, message = FALSE, warning = FALSE)
library(gTrack)
library(rtracklayer)
library(kableExtra)    
library(magrittr)
library(tidyr)

Introduction

The gGnome package provides a flexible, queriable R interface to graphs and walks of genomic intervals. gGnome is written in the R6 object oriented standard and built around a powerful GenomicRanges, data.table, and igraph backend, and thus supports agile interaction with graphs consisting of hundreds of thousands of nodes and edges. Because gGnome classes are written in R6, their methods, variables, and "active bindings" are referenced using the $ symbol. This includes methods that (similar to other object oriented languages) enable the object to be modified "in place". Please see here for more details about the R6 standard.

Our key interest in developing gGnome is to develop a framework to analyze the sorts of graphs that arise in the study of whole genome cancer structural variation. However, we believe that the package can be useful in any context where graphs and paths of reference genomic intervals are generated, such as through the analysis of complex germline haplotypes, population genetics of structural variation, partially assembled genome drafts, comparative genomics, and transcriptome assembly.

For installation instructions, please visit the gGnome github page. For background, it may help to have some familiarity with data.table, GenomicRanges, gUtils, igraph packages.

gGnome is currently in an alpha release and a manuscript in preparation. If you use gGnome in your work, please contact us so that we can update you when the citation becomes available.

The Basics

The key classes in gGnome comprise the following: the Junction, the genome graph (gGraph) (and its nodes (gNode) and edges (gEdge)), and the genomic walk (gWalk). The nodes in a genome graph represent reference genomic intervals, also known in BioConductor parlance as GRanges. Each gNode corresponds to a chunk of reference genome, and represents a double-stranded DNA sequence. Each gEdge represents a pair of complementary 3-5' phosphodiester bonds (+/- some foreign sequence insertion), each termed an adjacency.

gGraph anatomy

Since DNA is double stranded, every gNode in the gGraph is actually a pair of stranded GRanges, one on the "+" and the other on the "-" reference strand. Similarly, every gEdge is a pair of directed adjacencies. This makes the gGraph a skew-symmetric directed graph.
In this directed graph, every interval has an "anti-interval", every adjacency has an "anti-adjacency", and every path (i.e. sequence of intervals or adjacencies) has an "anti-path".

The vertices in a standard directed graph do not formally have left and right "sides" that edges can both enter and exit. This limits the applicability of standard directed graphs to the modeling of DNA, since double-stranded DNA segments can be rearranged with each other in any of four (left/left, right/left, right/right, left/left) possible orientations. Though, in a directed graph, we may arbitrarily label nodes that send edges to our vertex as being to its "left", and edges that receive edges from our vertex to its "right", it is impossible to then connect two vertices in our directed graph on their mutual left.

However, since gGraph nodes represent interval / anti-interval pairs, we can talk about their "right-side" (i.e. the side with adjacencies leaving the "+" interval and entering the "-" interval) and their "left side" (i.e. the side with adjacencies leaving the "-" interval and entering the "+" interval). Adjacencies associated with the left side of one node can connect to the left or right side of any other node. Though we retain the underlying structure of a directed graph (convenient for applying standard graph algorithms) we are able to model the "sidedness" of DNA sequence. We contrast this skew-symmetric structure to the traditional "bidirected graph" formulation of DNA assemblies, which uses nodes to represent double-stranded sequence "ends" and two flavors of (intrasegment, intersegment) edges to represent sequences and their adjacencies, respectively.

Orientation is fundamental

Just to review, strands of any (double-stranded) reference genome are arbitrarily labeled "+" and "-", where the "+" strand goes from 5' to 3' in the direction of increasing genomic coordinates. Conversely, the "-" strand of the genome goes from 5' to 3' in the direction of decreasing genomic coordinates. As a convention we refer to the direction of increasing coordinates on the reference as "right" and the direction of decreasing coordinates as "left". In the human genome, the 1q telomere is thus on the "right" side of chromosome 1, and the 1p telomere is on its "left".

Using this system of orientation, there is an intrinsic direction associated with every node, i.e. there are edges associated with its "left" and "right" side. Some of those edges may be "REF" edges, i.e. connecting to that node's reference-adjacent neighbor. The REF edges connecting a node to its right reference neighbor consist of two adjacencies: one leaving the "+" GRanges associated with the node and entering the "+" GRanges of its neighor, and a second (complementary) adjacency leaving the "-" GRanges of the neighbor and entering the node. Other ("ALT") edges may connect our node to a distant locus that is non-adjacent in the reference, i.e. an adjacency created through rearrangement.

Genomes are not graphs

Real genomes are not graphs, but consist of linear (and possibly circular) alleles. Each allele is a string of nodes and edges, which we term a gWalk. When these alleles are superimposed (i.e. added), the resulting structure is a graph. The gGraph is thus only an abstraction, representing a summary of our (best) knowledge about the latent allelic state of the genomic sequence (also called its phase). When the nodes and edges on the gGraph are associated with an allelic dosage, this notion becomes quantitative - i.e. the graph represents an estimate of the total dosage of sequences and adjacencies in the given genome. Such an inference is produced by a handful of rearrangement callers, including Weaver, PREGO, RemiXT, CouGaR, and (our tool) JaBbA

Though a gGraph may be our "best guess" at the variant structure of a genome (i.e. from short-read whole genome sequencing (WGS)), certain biological questions require us to make inferences about linear (or circular) alleles. These require us to query features of gWalks generated from our gGraph, representing possible alleles in the genome. For example, we may want to query the genome for the existence of a possible high-copy allele (i.e. path) joining EML4 and ALK resulting in a protein-coding gene fusion. We may want to determine whether a particular regulatory element (i.e. "super-enhancer") is connected via rearrangement to the gene MYC, resulting in its overexpression. We would like to identify the frequency with which EGFR amplified through the formation of circular extra-chromosomal (episomal) DNA. We may want to identify rearrangement signatures (e.g. BFBs) creating recurrent patterns in a compendium of gGraphs.

We may also be able to further deconvolve (i.e. phase) alleles in the graph using long-range sequence or mapping data (e.g. from optical mapping, linked-read sequencing, or long-read sequencing). Given the complexity of certain aneuploid cancer genomes this deconvolution will most likely be partial, i.e. local and probabilistic. gWalks and gGraphs can help us represent and analyze partially phased genomes, either as probability distributions over gWalks or complex gGraphs with "branches" and "bubbles" representing alternate paths over the same reference intervals.

Similarly, the intermediate stages of *de novo * (i.e. bottom up) genome assembly pipelines produce graphs, which are often broken or arbitrarily linearized at some stage of analysis, losing possibly important genomic signal. Alignment, visualization, and analysis of assembly graphs in reference genomic coordinates can yield significant biological insight and technical intuition, without the need to generate a "reference-grade" linear assembly. We believe that gGraph and gWalks can serve as useful abstraction to help annotate and refine these partial views of altered genome structure.

gGoals

Our goals in building gGnome are to enable agile analysis of structural variation (rearrangement and copy number alterations) in complex and highly rearranged genomes. Though our work relates to the recent movement in the sequence mapping community to build reference-grade "graph genomes" and "variation graphs", our goal is to serve a different (though related) purpose. Firstly, these frameworks have been designed primarily with the goal of improving alignment and (single or multi-nucleotide) variant calling. Secondly, these frameworks aim to replace the linear reference genome with a novel graph based data strucutre.

In contrast, the vision for gGnome is to sit downstream of read mapping, assembly, and basic variant calling. We have designed gGnome to enable interactive exploration, annotation, and visualization of complex variants for biological discovery by the genomic data scientist. In addition, gGraphs are still firmly tethered to "linear reference" coordinates, where many / most datasets (epigenetic, transcriptome) live. The goal is to enable vetting of inferred patterns against raw alignment data, both to verify the quality of variant calls / assemblies / phases and interrogate their impact on biology. gGnome achieves this by leveraging several excellent packages in the R / Bioconductor universe for its back-end (GenomicRanges, data.table, igraph) as well as several useful mskilab (gTrack, gUtils) packages for data manipulation and visualization. We demonstrate examples below.

## Load the package
library(gGnome)

The Junction

The atomic unit of genomic structural variation

Before we get to graph, we have to define the Junction. A Junction is a signed pair of reference genomic locations that are brought together through rearrangement. Each junction connecting locations a and b has one of four possible orientations (a-|b+, a+|b+, a-|b-, a+|b-) that represent the "sides" of the breakpoint that are being joined. By convention we refer to - orientation as connecting the left side of the given breakpoint, while the + orientation connects the right side of the breakpoint.

Junctions calls are usually stored and shared in .bedpe or .vcf files. Though most callers usually provide additional "event" annotations to each junction (e.g. DEL, DUP, INV, TRA), such annotations lose their meaning in complex genomes where many junctions cluster near each other on the reference genome. Many complex (and even some simple variants e.g. inversions, balanced translocations) arise through multiple junctions, therefore the mapping of Junctions and "events" is likely not 1-1. Since "event calling" is an open question in structural variant analysis, the most fundamental unit of structural variant calling is the Junction - a signed genomic location pair.

Importing Junctions

We can load Junctions into gGnome from a variety of input formats from common junction / SV callers, including SvaBa, DELLY, Novobreak using the function jJ.

## SvAbA
svaba = jJ(system.file('extdata', "HCC1143.svaba.somatic.sv.vcf", package = "gGnome"))

## DELLY
delly = jJ(system.file('extdata', "delly.final.vcf.gz", package = "gGnome"))

## novobreak
novobreak = jJ(system.file('extdata', "novoBreak.pass.flt.vcf", package = "gGnome"))

## BEDPE
bedpe = jJ(system.file('extdata', "junctions.bedpe", package = "gGnome"))

In general we support BND style .vcf files and .bedpe files, though many callers vary in their details of implementing these formats.

Subsetting junctions

In gGnome, a Junction is a vectorized object that can be subsetted and queries on the basis of its metadata using convenient data.table style (with = TRUE) syntax. Junction metadata is caller dependent.

## can use both row and column subsetting on Junction metadata
head(novobreak[1:2, 1:10])

## can use data.table style expressions on metadata to subset Junctions
## here, filter novobreak translocations with quality greater than 50
novobreak[ALT == "<TRA>" & QUAL>50, 1:10][1:2, 1:5]

## subsetting SvAbA junctions with >10 bases of homologous sequence (short templated-sequence insertions or STSI as defined in the SvABA paper)
svaba[nchar(INSERTION)>10, ][1:2, 1:5]

## subsetting SVabA junctions with same sign and nearby breakpoints (i.e. small $span)
svaba[svaba$sign>0 & svaba$span<1e5][1:2, 1:5]

## subsetting junctions with infinite span (ie different chromosome) and homology length >5
delly[is.infinite(delly$span) & HOMLEN>5, ][1:2,1:5]

Note that we are showing off the $span and sign active bindings of the Junction class, which return the "span" (i.e. distance between breakpoints) and "sign" (i.e. product of the breakpoint signs, where "+" = 1, "-" = 1) as a vector.

Yes, a Junction is just a glorified GRangesList, which we can extract using the $grl accessor

svaba$grl[1:2]

Junction overlaps and merges

However, unlike GRangesList we can perform junction specific queries which only allow for overlap between junctions if both breakpoint locations and orientations match.

## subset svaba by those intersect with DELLY using gUtils subset %&% operator
length(svaba %&% delly)

## increase the overlap substanntially by padding delly calls with 100bp (using + operator)
length(svaba %&% (delly+100))

## basic set operations also work
length(setdiff(svaba, delly+100))

## length(union(svaba, delly+100))

We can do more comprehensive coordinate-based merges / joins of junctions using the merge function. The output is a "consensus" junction set, i.e. a Junction object representing the union of the input junctions with logical metadata columns prefixed by $seen.by that keep track of which junction was "seen" by which caller.

## any names work in the arguments to merge, these will be reflected in metadata as a $seen.by column
## (using padding of 1kb and c() to remove existing metadata)
res = merge(svaba = svaba[, c()], delly = delly[, c()], 
            novo = novobreak[, c()], anynameworks = bedpe[,c()], pad = 1e3)
head(res)

## here we can use the $dt (data.table) accessor quickly make an UpSetR plot
library(UpSetR)

## munge res$dt into upset friendly format
df = as.data.frame(sign(as.matrix(res$dt[,.(seen.by.anynameworks, seen.by.delly, seen.by.novo, seen.by.svaba)])))
upset(df)

The gGraph

Instantiation

We can create a gGraph from several input sources, including junctions, node and edges, and a a host of "graph SV callers" (Weaver, PREGO, RemiXT, CouGaR, and JaBbA) using the gG constructor.

From junctions and breaks

This constructor will "break" the genome (defined using seqinfo of the Junction object) at each connection end and create an ALT gEdge from 2 of the 4 resulting interval pairs. It will also leave a REF gEdge connecting the reference adjacent nodes.

### gGraph from svaba input
gg = gG(juncs = svaba)

The gGraph contains 1059 nodes and 1475 edges. The majority of edges are REF edges connecting reference adjacent nodes, while the remaining are ALT edges that inherit metadata from the input SvAbA junctions. The graph also contains 168 "loose ends", of which all are terminal. Loose ends represent sequences "ends" that lack a left or right neighbor. These occur at the ends of the contigs that make up the Seqinfo or seqlengths of this object. Since this hg19 derived seqinfo(gg) has 84 "chromosomes", this gGnome object has 168 terminal loose ends ends. In theory, loose ends can represent terminal 3' and 5' phosphates at real DNA "ends" (i.e. teleomeres). More generally, loose ends represent possible sites of attachment of missing or annotated sequence, e.g. unknown or repetitive sequence. Technically, even the annotated telomere represents an "end".

We can plot the resulting graph as well as the junctions using gTrack, a useful Marcin Imielinski Lab package for visualizing GRanges-based genomic tracks. Each gGraph object has an accessor $gt that generates a gTrack object, which can be concatenated with other gTrack objects, e.g. those representing genome-wide coverage, ChIP-seq, gene annotations, etc.

## we use gTrack to plot the gTrack associated with this gGraph
## the second argument to gTrack plot is a string or GRanges representing the
## window to plot, the links argument enables drawing of junctions from GRangesList
plot(gg$gt, '1', links = svaba$grl)

Each sequence on chromosome 1 is drawn as a gray rectangle. Rectangles are connected by gray REF and red ALT edges. Most red ALT connect different rectangles on chromosome 1 (i.e. are intrachromosomal), though some edges go off the graph and only one end is visible. (are interchromosomal). The topmost track demonstrates the input SvAbA junctions, which distributed across the gGraph. If you look close you can see two "hooked" blue loose ends, one at the left terminus and the other at the right terminus of chromosome 1.

In this plot, the y axis position of each rectangle has no meaning - the vertical stacking is purely for visual purposes, i.e. to minimize collisions.

We can create a graph with additional REF edges by supplying a "breaks" GRanges to the gG constructor. This will break the genome and add new gray REF edges (but no ALT edges) at each location.

### generate breaks using gUtils function to tile genome at evenly spaced 1MB intervals
breaks = gr.tile(seqinfo(svaba), 1e6)

### gGraph from svaba input
gg2 = gG(breaks = breaks, juncs = svaba)

### set gGraph metadata
gg2$set(name = 'with breaks')

### compare graphs towards the beginning of chromosome 1
### (gTracks can be concatenated to plot multiple tracks)
plot(c(gg$gt, gg2$gt), '1:1-5e7', links = svaba$grl)

From nodes and edges

To create custom graphs we provide a constructor that takes a GRanges of interval nodesand a data.table of edges specifying their connectivity. The edges data.table contains 4 essential columns: n1 and n2 specifies the nodes indices that are being joined and n1.side and n2.side specify which side of each node is being joined. The value of side can be integer (0 for left, 1 for right) or character (left, right).

## tiling of chromosome 1 
nodes = gr.tile(seqlengths(svaba)["1"], 1e7)

## generate 20 random edges (n1, n2, n1.side, n2.side)
edges = data.table(
           n1 = sample(length(nodes), 20, replace = TRUE),
           n2 = sample(length(nodes), 20, replace = TRUE))
edges[, n1.side := ifelse(runif(.N)>0.5, 'right', 'left')]
edges[, n2.side := ifelse(runif(.N)>0.5, 'right', 'left')]

gg3 = gG(nodes = nodes, edges = edges)

plot(gg3$gt, '1')

Note that all edges in the graph by default are `ALT` (red) and loose ends (blue) are automatically placed at any interval end that lacks an edge. In this case all the nodes are non-overlapping and equal size but nothing in gGraph requires this. wzxhzdk:11 ### From external tools We support several "graph SV caller" formats for `gGraph` import. wzxhzdk:12 These are graphs built from the analysis of the Her2+ breast cancer cell line HCC1954. These plots, unlike the previous, have a y-axis, which in this case represents copy number. The objects / tracks named by the caller that generated them. These`gGraph` metadata features were set during object instantiation and can be accessed using the `$meta`active binding and set using using the `$set` method. These meta fields are used as defaults for certain gGraph methods e.g. `$gt`, `$simplify`, `$reduce`, and others (some of which are described below). The `y.field` meta data field, in particular, tells the `$gt` active binding to plot the `gNode` metadata `"cn"` on the y.axis. As a result the y axis position of each node tells us how many copies of that interval is predicted by that algorithm to exist in HCC1954. If we set `y.field` to `NULL` then `$gt` will not create a y axis and stack the intervals, similar to the previous examples where created a `gGraph` from `breaks` or directly from `nodes` and `edges`. wzxhzdk:13 ## Manipulating and browsing gGraphs One useful feature of the `gGraph`, enabled by the `R6` system, is the ability to easily navigate nodes and edges of the graph by subsetting, querying on metadata, and quickly selecting neighboring nodes and edges of a given query. ### Navigating nodes and edges Every `gGraph` object contains `gNode` object (accessed by `$nodes`) and a `gEdge` object accessed by `$edges`. A `gNode` can be converted to `GRanges` via the `$gr` acitve binding and to `data.table` via the `dt` active binding. Similarly `gEdge` can be converted to `GRangesList` via the `$grl` active binding and to `data.table` via the `dt` active binding. A `gEdge` can also be converted to a `Junction` via the `$junctions` active binding. wzxhzdk:14 Though nodes and edges can be subsetted using numeric indices, it is often useful to subset them using metadata. For example we may be interested in a high copy interval in the JaBbA `gGraph` or an aberrant junction annotated with a long range sequence insertion. `gNode` and `gEdge` both support subsetting using expressions based on metadata. Two default metadata that exist for gEdge is `type` and `class`. Edges can be of `type = "REF"` (reference) or `"ALT"` (variant). `"ALT"` edges are further classified (according to the `class` metadata field) as `"DEL-like"`, `"DUP-like"`, `"INV-like"`, and `"TRA-like"` depending on the junction orientation and reference chromosome locations. We use the suffix `-like` to emphasize the fact that we don't actually know whether a `DEL-like` junction represents a deletion without further modeling / analysis. Depending on a given of model of locus evolution and phase, a duplication event may appear as a `DEL` on the reference, and a late duplication may appear as a `TRA-like`, i.e. translocation-like. Nevertheless, this initial classification is useful for subsetting and analysis, using a `data.table` style syntax. wzxhzdk:15 We can use `gNode` `$left` and `$right` accessors to navigate the "left" and "right" neighbors of `gGraph` nodes and edges. (See "The Basics" section above for left / right defintitions). We can also use the `$nodes` accessor on a `gEdge` object to get all of the nodes connecting to any of the subsetted edges, and (vice-versa), we can use the `$edges` accessor of a `gNode` object to get all the edges connected to at least one of those nodes. wzxhzdk:16 Since `gTrack::plot` takes `GRanges` as well as character strings for it's second argument, we can easily plot the JaBbA and RemiXT model in the vicinity of the `highcopy` nodes. We also use the `track.gencode()` function in `gTrack` to plot the GENCODE genes in these regions. wzxhzdk:17 We can see here that JaBbA finds the known high-level amplification of `ERBB2` aka HER2 in this cell line, which appears to be driven by a cluster of fold back inversions known as BFBs. One useful feature of `gNode` and `gEdge` is that they take `gUtils` operations such as `%&%` (strand-agnostic subset by overlaps). For example, we may want to locate the high-copy JaBbA nodes in the RemiXT graph. wzxhzdk:18 ### Marking edges and node metadata Since `gNode` and `gEdge` objects are pointers to part of a `gGraph`, their methods can be used to modify the metadata of the `gGRaph` in place. This is done using the `$mark()` method. It may be useful for example for us to highlight where the junctions harboring templated insertions in the JaBbA graph sit in the ReMiXT graph. wzxhzdk:19 We can see this plot shows two blue edges in three (reference) discontiguous windows. One of the blue edges (on the left) is a `biginsert` edge that is found only in the JaBbA model, while the the second blue edge (in the right window on chromosome 4 is found by both JaBbA and ReMiXT). The nodes attached to these edges have been highlighted, which in the bottom track mark the nodes where we would expect to find the first blue `biginsert` edge. ### Trimming and subsetting gGraphs In certain cases we may want to isolated pieces of `gGraphs` for further analysis. For example, we may want to pull the subgraph associated with a particular subset of nodes, trim a graph around a set of `GRanges`, or pull a subgraph that is within some (base pair) distance of a seed region on the `gGraph`. wzxhzdk:20 The first two (bottom) tracks generate identical graphs, subsetted to neighbors of `highcopy` nodes. The third track provides a window that includes 100kbp on each side of high copy nodes. The `trim()` function trims the boundaries of the nodes in the graph to the provided window. The fourth track does the same to the RemiXT graph. Another useful task in `gGraph` browsing is to define a subgraph in a given "neighborhood" around a given seed. Though this can be sort of done using the combinations of `$nodes` and `$edges` accessors on `gNodes`, this limits us to seed that are themselves nodes (or edges) in the graph only allows us to expand a certain order (i.e. `ego` from the given seed). Though we provide an `$ego` query on nodes (shown below), a more useful way to define local subgraph is around a specific `GRanges` seed and some *base pair distance* away from that seed. Let's go back to the `highcopy` region around *ERBB2* and try to zoom out a certain distance in the graph. wzxhzdk:21 Here, we see several low copy junctions to regions of chromosome 1, 11, and 12 internal to locations internal to the *ERBB2* amplicon (rightmost window). This suggests late incorporation of distant genetic material into this (likely episomal) amplicion, or (alternatively) possible chromosomal integration sites of 1 or more copies of this amplicon e.g. into chromosome 12. ## Advanced gGraph manipulation We can edit `gGraph` objects by adding edges or nodes, replicating nodes or paths (i.e. introducing "bubbles"), and merging graphs with each other. We may want to "chop up" (or "disjoin") the nodes of a graph across implicit internal reference connections, or simplifying the graph by merging reference adjacent nodes (+/- if they share a particular metadata feature). We may also want to combine graphs that represent different (partially phased) haplotypes into a union graph, and reduce or disjoin this graph into one that contains a single set of non-overlapping nodes. Let's begin with a simple graph wzxhzdk:22 ### Concatenate, disjoin, and simplify graphs We can concatenate any graphs, which takes the union of their nodes and junctions. We can also apply "disjoin", "simplify", and "reduce" to split, collapse, and merge graphs wzxhzdk:23 ### Add nodes and junctions Disjoining by `GRanges` is a quick way to add intervals to a graph without changing any connectivity (since we are just "instantiating" the reference edges that are internal to a node and implicitly exist). wzxhzdk:24 We can use the `$add` method to add edges to the graph wzxhzdk:25 A simpler syntax is to use `$connect`. wzxhzdk:26 We can also `$add` junctions, however since a `Junction` is a pair of coordinates, they they create ALT edges between every intersecting node pair in the graph. If we want to add these same locations to just one "haplotype" of the input graph, then we will need to use `$connect` or `$add` with a specific node pair. We illustrate using `gg3`. wzxhzdk:27 ### Replicate nodes and paths An important concept in representing individual or population variation in genome graphs is the concept of a "bubble", or a variant / ALT path through the graph that contains one or more sequence differences like SNV or indels. We can instantiate such paths in an existing graph using the `$rep` operation, which "replicates" a given node or path in the graph. wzxhzdk:28 In addition to replicating individual nodes we can create bubbles from paths in the graph. wzxhzdk:29 Now that we have a complex haplotype structure with alternate paths, we may want to add an allele specific rearrangement edges, in this case a shortcut from the right side of node 5 to the left side of node 10. wzxhzdk:30 ## Components and communities ### Weakly and strongly connected components of nodes We can label weakly and strongly connected components in a `gGraph` with the `$clusters` method. These allow us to group nodes and define genomic subgraphs based on the connectivity patterns of nodes. This method modifies the `$cluster` node metadata data field, which can then be used to query and subset the gGraph. wzxhzdk:31 Strongly connected components require that every node is reachable from every other node via a directed path. As seen above, multi-node strongly connected components result in cyclic structures. In genome graphs, these represent complex amplicons. Weakly connected components may not reveal interesting structures for a full genome graph representing a highly rearranged cancer genome, but may reveal interesting structure once we remove very wide nodes. The `gNode` version of the `$clusters` method allows us to conveniently mark up clusters in our original graph after clustering the subgraph only involving the selected set of nodes wzxhzdk:32 ### Edge clusters Clusters of quasi-reciprocal ALT edge can reveal "cycles" or "chains" of rearrangements in genome graphs. Such patterns have been dubbed "chromoplexy" (multi-way balanced rearrangements) and also linked to TIC (templated insertion chains). We enable detection of these edge clusters with the `gGraph` method `$eclusters`. This takes an argument `thresh` which is a distance threshold with which to group nearby quasi-reciprocal junctions - i.e. if `thresh=0` then we only consider clusters of exactly reciprocal junctions. The default value of `thresh` is 1 Kbp. wzxhzdk:33 # The gWalk The `gWalk` allows us to represent and analyze paths of genomic intervals in gGraphs, and annotate these paths with important metadata. Most fundamentally, a `gWalk` represents a possible linear (or circular) allele that can be inferred from our current graph model. More generally, the analysis of gWalks can help us study and annotate the connectivity of the graph, with the goal of identifying protein-coding gene fusions and non-coding regulatory element "proximities" (see Applications section below). Once we generate a `gWalk` object, it is very easy to query the `gGraph` that it belongs to and extract information about nodes and edges belonging to that `gWalk` or more loosely associated with it. ## (Shortest) paths between node pairs A simple way to generate `gWalk` objects is to search for paths connecting distant nodes in our graphs. These paths represent *possible* alleles or haplotypes that may exist in the genome that our graph is modeling. The `$paths` method of `gGraph` takes as input vectors of signed node ids or `gNode` objects corresponding the start (i.e. "source") and terminus (i.e. "sink") of paths that we are searching for. If no such paths exist, `$paths` will return an empty `gWalk` object. The default behavior of `$paths` is to operate in a strand-inspecific fashion (demonstrated below), but if we set `ignore.strand = FALSE` then the query will only return paths that start and end on the specified strand. This is important when examining connections between single-stranded genetic features (e.g. exons). wzxhzdk:34 The `gWalk` object is vectorized, and can be subsetted much in the same way as a `gEdge` or `gWalk` is, either by using indices or `data.table` style expressions on `gWalk` metadata. Since every gWalk represents a sequence of double-stranded genomic intervals in a given orientation (+ vs -) it is biologically equivalent to its reverse complement (i.e. the reverse sequence of intervals in the opposite orientation). To reverse complement a `gWalk` we just index it using a negative integer (this is different from the standard interpretation of R vector indexing, where negative integers means removal of the provided indices). wzxhzdk:35 ## Walk decompositions of (sub)graphs (TBC) Genome graphs represent a summary of the all the possible linear or circular alleles that are possible given our model of a rearranged genome. We can generate such walks for a gGraph using the `$walks` method, though for a large and complicated graph the number of such possible walks can become untenable to compute due to combinatoric issues. However, we can do this reliably for subgraphs, e.g. those defined through the analysis of clusters and commmunities (see previous section). wzxhzdk:36 ## Creating gWalks *de novo* Underlying each`gWalk` object is a list of signed node ids and a list of signed edge ids in a `gGraph`. These can be retrieved using the `$snode.id` and `$sedge.id` active binding. Each list item represents an ordered set of nodes and edges comprising the walk. A `gWalk` can be instantiated directly from signed node id or signed edge id lists and a `gGraph` input via the `gW` function. The provided node or edge sequence will have to exist in the provided `gGraph`, otherwise `gW` will error out or (if `drop = TRUE` is set) will ignore invalid inputs and return a `gWalk` object whose length is smaller than the provided lists. wzxhzdk:37 Alternatively, we can use the `gW` function to instanatiate a `gWalk` from a `GRangesList`. This can be done on top of a provided `gGraph`, however the connectivity of the `gGraph` must support the input `GRangesList`. As above, this requires that all reference and non-reference edges implied by the provided `GRangesList` must exist in the graph. However this does not require the intervals in the input `GRangesList` exactly match those in the graph. For example the `GRangesList` can contain a hyper-segmented or hypo-segmented set of nodes relative to the `gGraph`, i.e. one that explicitly instantiates the implicit reference edges that are internal to an interval, or collapses one or more reference adjacent nodes into a larger node. wzxhzdk:38 A `gGraph` does not have to be provided during `gWalk` instantiation from `GRangesList`. However, every `gWalk` object must be associated with a `gGraph`. If a `gGraph` is not provided, a brand new `gGraph` is created either by treating each interval in each `GRangesList` as a separate node (if `disjoin = FALSE`) or by collapsing all input intervals (if `disjoin = TRUE`) into a single `gGraph` of non-overlapping nodes. wzxhzdk:39 ## Disjoin and simplify gWalks Just like a `gGraph`, a `gWalk` can be simplified and disjoined. These operations are first applied to the underlying `gGraph` and then the walks are re-threaded on this altered graph. Since simplify, disjoin, and reduce do not restrict the connectivity of the graph (i.e. every path that exists in the original graph will exist in the altered graph), all the walks on the original graph will be "legal" on the altered graph. wzxhzdk:40 ## Analyzing gWalks Each gWalk refers to a string of `gNode` and `gEdge` objects, each which contain their own metadata. When analyzing a collection of walks (e.g. those representing rearranged haplotypes connecting regulatory elements and genes), we may want to annotate the walks on the basis of the features of the nodes and edges that contribute to them. This is enabled through the `gWalk` method `$eval`, which takes a scalar-valued expression on node or edge metadata (default is to test both) which will return a vector whose length is the number of walks where that expression has been evaluated once per walk. You can think of this as a fast lapply. It useful for doing "feature-engineering" of walks to identify those with interesting or biologically relevant properties. wzxhzdk:41 # Applications OK, now we've gotten through all the basics of `gGnome`, including a survey of the key functionality of `gGraph`, `gWalk`, `gEdge`, `gNode`, and `Junction`. Now we're ready to do some analysis with `gGnome`. We present two use cases: the first involves searching for protein-coding gene fusions, and the second involves identifying regulatory element-gene proximities induced through genomic rearrangement. Each involve the analysis of a `gGraph` object and and return a `gWalk` object for the analyst to inspect, plot, annotate, and/or manipulate. ## Protein fusions A very important reason to be interested in (cancer) genomic rearrangements is that they may bringing together two (or more) coding sequence (CDS) regions into a neomorphic protein fusion. Such fusions may result from simple (i.e. single ALT junction) alleles but may employ more than one junction in *cis*. We can derive gene fusions as a path search on a `gGraph` using the `fusions()` function. Each output `gWalk` represents the *genomic* coordinates of a possible protein-coding gene fusion. The metadata field `$gene.pc` is a character vector, which describes each protein fusion as a string of protein coordinates delimited by ';', where protein interval is representing using the character notation for a `GRanges` (seqnames:start-pos). Since `fusions()` does careful book-keeping of protein frame, certain nodes in the walk may represent "out-of-frame" regions. These are represented by brackets around the gene name, and a `fs` suffix (e.g. `[BRAF]fs`). The `$gene.pc` string can be converted to a bona-fide protein coordinate `GRanges` via the `gUtils::parse.grl` function. Some fusions may duplicate or delete genetic material. The protein coordinates of any deleted or duplicated material is encapsulated in `$amp.pc` and `$del.pc` metadata fields of the output. Only a subset of fusions outputted by `gGnome` are completely in-frame (denoted by metadata field `$in.frame == TRUE`). Others may start and end at in-frame node, but have one or more out-of-frame regions in the middle - we label these using the metadata field `$frame.rescue == TRUE`. Finally, though we usually identify protein fusions by their gene, fusion call is actually made on the coordinates of a specific GENCODE transcript. The "transcript analogue" of the `$gene.pc` metadata field is `tx.cc`, which represents the fusion as a string of CDS coordinates, with the same `GRanges` style character string that can be converted to a cds `GRangesList` via `grl.string`. The same naming convention is used to denote in-frame and out-of-frame segments of the neomorphic transcript. This field can be used to reconcile fusion predictions between RNA-seq and WGS. The output of `fusions()` also contains out-of-frame fusions (i.e. that have both `$in.frame == FALSE` and `$frame.rescue == FALSE`). We include these because searching for in-frame fusions is not perfectly sensitive for complex multi-junction gene fusions. When a fusion `gWalk` includes a piece of intergenic sequence, the (fast, greedy) graph search algorithm underlying `fusions()` may not be able to optimize for "frameness". We thus include these "out-of-frame" fusions to allow users to look harder for more optimal paths that might be in-frame and also to allow for the (perhaps exotic) possibility that out-of-frame fusions may undergo RNA editing or alternate splicing to yield an alternate fusion product. Finally, some users may have interest in *bona-fide* truncating fusions, though many of these can be discovered through a simpler analysis (e.g. somatic copy number analysis through the search for recurrent deletions). Though the `fusions()` function may take many minutes to run in "unbiased mode" on a highly rearranged genome like HCC1954, we can also use it in a targeted fashion using the `genes=` argument to quickly query for specific gene fusions. The examples below employ this usage for speed, though most users will want to do this analysis transcriptome-wide (i.e. without setting the `genes=` argument). Use of `mc.cores` will also speed up computation considerably for those with multicore machines. wzxhzdk:42 We can also find frame-rescues that begin at *ASAP1*, then incorporate a frameshifted chunk of the gene *NSD1*, and end up back at *ASAP1* in frame. Such variants are a bit strange to interpret, since the frameshifted chunk may likely contain a stop codon (currently not annotated), and may require further investigation to determine whether they produce functioning protein products. wzxhzdk:43 ## Proximity analysis Rearrangements can bring regulatory elements in (1D and 3D) proximity with each other and with genes, perturbing gene expression and chromatin states in *cis*. To detect / annotate such noncoding consequences of complex rearrangeents, we have developed `proximity()` analysis, which takes two sets of `GRanges` (`query` and `subject`) and identifies pairs of these that have been brought near each other through rearrangement. For each such "proximity" it emits a `gWalk` representing the path connecting these genetic elements. We illustrate this analysis by identifying proximities between super-enhancers (also called locus control regions) and certain genes in HCC1954. The `gWalk` object outputted by the `proximity()` function will have several standard metadata fields. These include: `$reldist` which is the fractional distance of the distance in the `ALT` graph (i.e. the `$altdist`) relative to the reference distance (i.e. `$refdist`). The remaining metadata features are inherited from the `GRanges` metadata from the pairs of subject and query arguments provided to `proximity()`. The returned `gWalk` object is sorted by its `$altdist`. wzxhzdk:44 In this case, the superenhancer is on the same chromosome as BRD9 (chromosome 5). We may be specifically interested in proximities involving superenhancers that were previously very distant to their target gene (e.g. `Inf` base pairs away on the reference). In addition we may want to identify those that use many ALT junctions. We can use the `gWalk` subsetting and analysis features to quickly find these. wzxhzdk:45 ## Classifying SV events There are many mutagenesis pathways producing various patterns of SVs in the cancer genomes. By identifying the specific motif that are recurrent in the cancer genoem graphs, we can classify subgraphs that satisfy the set of criteria corresponding to a particular class. These classes range from simple like deletions and tandem duplications, to complex like chromothripsis and breakage-fusion-bridge cycles. As of [Hadi et al, Cell 2020](https://www.biorxiv.org/content/10.1101/836296v2), we support the identification of 5 simple event and 8 complex: * simple + [Deletion (del)](#deletions-and-rigma) + [Tandem duplication (dup)](#tandem-duplications-and-pyrgo) + [Inversion (inv)](#simple) + [Inverted duplication (invdup)](#simple) + [Translocations (tra)](#simple) * complex + [Rigma](#deletions-and-rigma) + [Pyrgo](#tandem-duplications-and-pyrgo) + [Templated insertion chains (tic)](cp-and-tic) + [Chromoplexy](#cp-and-tic) + [Chromothripsis](#chromothripsis) + [Breakage-fusion-bridge cycles (bfb)](#amplicons) + [Double minute (dm)](#amplicons) + [Tyfonas](#amplicons) For ease of use, all the callers are wrapped into a single function `events`: wzxhzdk:46 Below we describe each function in detail. ### Deletions and Rigma When simple deletions accumulate in the same locus more than expected from a null distribution (explained below), their cluster is labeled rigma. wzxhzdk:47 ### Tandem duplications and pyrgo Similar to deletions and rigma, when a particular genomic region attain more than expected tandem duplications, the cluster of duplications is classified as a pyrgo. wzxhzdk:48 ### Chromothripsis Chromothripsis has been a stereotypical example of complex SV event, since its inception in [Stephens et al., Cell 2011](https://doi.org/10.1016/j.cell.2010.11.055), in which it is proposed to originate from a single catastrophic shattering of a relatively focused chromosome region (e.g. arm), then randomly ligated by DNA repair pathway, resulting in high numbers of junctions clustered, and oscillating copy numbers due to the random loss/retention of DNA shards. wzxhzdk:49 ### Complex amplicons (bfb, dm, tyfonas) Long has been known that genomic amplifications can drive carcinogenesis, yet the characterization of the exact structures beyond the increase in dosage are limited. Two mechanisms are well known, double minute (DM, also known as extrachromosomal circular DNA, eccDNA) and breakage-fusion-bridge cycles (BFBC). Genome graphs provides an oppurtunity to systematically differentiate and discover the spectrum of amplicon structures. In Hadi et al., Cell 2020, We describe a genome graph-derived feature space for these subgraphs that contain amplified junctions (JCN>7) and stably clustered into at least 3 classes, two of which correspond to DM and BFBC, but a third class with high numbers of junctions and high numbers of fold-back inversions, which is named *tyfonas* based on its wide reaching, extensively rearranged appearance. The function `amp` identifies wzxhzdk:50 ### Chromoplexy and TICs A series of long-range junctions (jumping over >=10Mbp on reference genome) can form a chain-like topology, when two breakends from two adjacent junctions in the chain are close enough to each other (by default <= 10 kbp). wzxhzdk:51 ### Simple We also support identification of 3 other simple event types, namely inversion, inverted duplication, and ranslocation. Here are some examples: wzxhzdk:52 ### Make your own caller Besides running our current implemented event callers, `h2228nome` serves as a generalized framework for users to write their own event callers. A general template procedure is: - Filter the nodes/edges of interest by their marginal features, e.g. copy number, width, junction span (reference distance between the breakends), orientations - Cluster by the node or edge based on graph topologies to get candidate subgraphs - Or walk the graph and look for paths or cycles as candidates - Summarize a set of features of the candidate subgraphs or walks ## Balance `balance` is a function for inferring copy number of nodes and edges of genome graphs by integrating read depth together with graph topology and applying the junction balance constraint. `balance` is the core function used in our package [JaBbA](https://github.com/mskilab/JaBbA). To learn more about junction balance analysis, please refer to the [Hadi et al.](https://doi.org/10.1016/j.cell.2020.08.006) publication. `balance` takes as input a gGraph with segment copy number estimates stored in the node metadata and solves a Mixed Integer Program (MIP) to infer integer node and edge copy numbers. By default, a mixed-integer linear program (MILP) optimization is performed. JaBbA used a mixed integer quadratic program (MIQP) at the time of publication, but has since been modified to use linear programming due to improved speed and performance. To run `balance`, users will need to install [CPLEX](https://www.ibm.com/analytics/cplex-optimizer). ### What's the difference between JaBbA and balance? JaBbA is basically a wrapper for the balance function and accepts as input coverage depth, segmentation and junction calls. JaBbA performs various steps prior to generating and balancing a gGraph (for more details, refer to the [publication](https://doi.org/10.1016/j.cell.2020.08.006)). JaBbA is intended for estimating segment and junction copy numbers for a whole genome. The `balance` could be used for fine-tuning a gGraph or for estimating values for a subgraph, as well as for analysis of more general cases of graphs such as long reads, segmentation data without coverage depth, or any case in which preliminary copy number estimates are available and the junction balance analysis is applied to find the closest balanced graph with integer cn values. Notice that the usage of `balance` outside of JaBbA is an experimental functionality, please feel free to [reach out to us](mailto:lab@mskilab.org) if you are interested in exploring this functionality and need help. ### balancing an example subgraph Here we demonstrate how to run `balance` on a small region of the CCLE [HCC1954](https://depmap.org/portal/cell_line/HCC1954_BREAST?tab=mutation) genome. `balance` can also be applied genome-wide to fit whole genome `gGraph`s. First, we will read the example subgraph and coverage file. wzxhzdk:53 Note that in the plotted image, the "y" location of the subgraph graph nodes is not reflective of copy number. ### adding node copy number estimates via binstats `binstats` is a function for using read coverage from your coverage file and sample purity/ploidy to produce node copy number estimates. It returns a new `gGraph` with the fields `cn` and `weight` added to the node metadata. `cn` represents a node copy number estimate. `weight` is calculated by dividing the number of coverage bins within a node by the coverage variance and reflects our certainty about the node CN estimate (e.g. we are more confident about the copy number estimate of nodes with higher weight). A few important parameters in binstats: - `bins` is a `GRanges` with coverage data in the metadata column specified in `field`. In this case, the coverage information is stored in a column named "ratio" - `field` is the name of the column containing coverage data. - `purity` is the sample purity. It should be between 0 and 1, specifying the proportion of tumor cells. - `ploidy` is the sample ploidy (width-weighted average copy number). wzxhzdk:54 The `cn` and `weight` fields of `binstats.sg` are now populated: ### balancing the graph Now that the subgraph has CN estimates, we are ready to run balance. There are a few parameters in `balance` to be aware of: - `lp` controls whether MILP (`TRUE`) or MIQP (`FALSE`) is used and is `TRUE` by default. We recommend this default because it improves runtime and convergence. - `lambda` controls the loose end slack penalty. Higher values of lambda make it less likely for `balance` to incorporate a copy number change without an associated junction. Suggested values are between 10 and 1000. - `epgap` controls the MIP optimization relative gap tolerance. Optimization will stop when this number is reached. We suggest values of 1e-4 and below. - `tilim` is the alotted time for MIP optimization in seconds. Optimization will stop when this number is reached. - `verbose` controls how much output is printed. Setting to 0 will print nothing, setting to 1 will print helpful debugging messages, and setting to 2 will also print the CPLEX output. wzxhzdk:55 Note that the "y" location of the subgraph nodes now represent their integer copy numbers as inferred by `balance`: # Interactive visualization with gGnome.js We built [`gGnome.js`](https://github.com/mskilab/gGnome.js), an interactive web application that allows browsing the genome graph at arbitrary windows along the reference genome, rich graph annotation options, cohort metadata filtering and more. In addition, the interface allows you to visualize coverage and RPKM data alongside your graphs/walks. We have made public the 2778 genome graphs in our paper [here](http://mskilab.com/gGraph/) so you can start exploring the interface right away. The gGnome package includes some methods for preparing the data to be visualized with gGnome.js. We will describe these methods below as a practical guide of how to get your data visualized. The best way to get familiar with the features and configuration of gGnome.js is to read the [gGnome.js documentation](https://github.com/mskilab/ggnome.js). In order to visualize your own `gGraph` and `gWalk` in the browser, first clone the repository from github and follow the instructions in its [README](https://github.com/mskilab/gGnome.js/blob/master/README.md). Once you have gGnome.js installed on your machine, you can add your files to the gGnome.js folder in order to visualize your own data. ## Generating gGnome.js instance The easiest way to visualize your own data is to use the `gGnome.js()` function in the `gGnome` package. This function accepts a table with paths to RDS files containing the genome graphs (gGraph objects) and coverage files. Notice that the coverage files can be provided either as an RDS containing a GRanges object or as any file that can be parsed into a GRanges (e.g. txt, tsv, bed, bw, or wig). As an example, we can use this function to generate a gGnome.js project for the hcc1954 and h526 graphs from above. First we need to save these graphs into RDS files: wzxhzdk:56 Let's load the path to a sample coverage file for HCC1954 (this file only includes coverage for chr8 just for demonstration). wzxhzdk:57 We then generate the table required by `gGnome.js`: wzxhzdk:58 We are now ready to generate the gGnome.js project. We would need to provide a path to the location where we want the project to be created. Notice that this must be a new directory. If you wish to add more files to an existing directory you must use `append = TRUE`. wzxhzdk:59 That's it, you are now ready to launch the gGnome.js interface. You can do so by running the BASH script in your gGnome.js project dir (in our example: `./demo_gGnome.js/start.sh`). This should open the following adress in your browser: [http://localhost:8080/index.html](http://localhost:8080/index.html). And this is what you should see: wzxhzdk:60 Here is an outline of the steps done by `gGnome.js()`: 1. Clone the [github repository of gGnome.js](https://github.com/mskilab/ggnome.js) (unless `append` is set to `TRUE`). 2. Generate a JSON format version of the genome graph using the `json` method of `gGraph` and store it in the `json` subdirectory of the gGnome.js directory. 3. Save a CSV formatted version of the coverage file of each sample to the `scatterPlot` subdirectory of the gGnome.js directory. 4. Generate the [sample description file: `datafiles.csv`](https://github.com/mskilab/gGnome.js#adding-description-of-your-samples). ## How are files generated To generate the genome graph JSON files and the coverage CSV files `gGnome.js()` uses `gGnome` functions/methods. You can use these functions directly to add files to your gGnome.js folder. JSON files are generated using the `json` methods of `gGraphs`: wzxhzdk:61 `gWalks` also have a `json` method that generates a JSON file that could be visualized by gGnome.js. We plan to soon add an option to `gGnome.js()` to automatically output certain walks along with the genome graph. Coverage CSV files are generated using the `cov2csv` function. For example: wzxhzdk:62

mskilab/gGnome documentation built on May 8, 2024, 4:25 p.m.