knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Introduction

In this vignette, we demonstrate how to use annoFuse to select putative oncogenic fusions from a filtered fusion list, and then plot recurrent fusions, recurrently fused genes, and a summary of the calls.

Step by step analysis

We start by loading annoFuse and the other required packages in the chunk below:

library("annoFuse")
suppressPackageStartupMessages(library("readr"))
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("qdapRegex"))

Overview of the package

Here, we present annoFuse, an R package developed to annotate and filter expressed gene fusions, and to highlight novel fusions that remain after artifact filtering.

Standard fusion call format requirements

For OpenPBTA samples, we also require some project-specific fields to visualize and explore the data.

fusion_calls <- read_tsv(system.file("extdata", "FilteredFusionAnnoFuse.tsv", package = "annoFuseData"))

# Distances are removed here so that all intergenic fusions are captured in the counts.
# Distances within () made otherwise identical fusions count as unique fusions.
fusion_calls$FusionName <- unlist(lapply(
  fusion_calls$FusionName, 
  function(x) rm_between(x, "(", ")", extract = FALSE)
))

cols_fusioncalls <- c("LeftBreakpoint", "RightBreakpoint", "FusionName",
                      "Gene1A", "Gene1B", "Sample")

head(fusion_calls[, cols_fusioncalls])
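
To make the effect of this cleanup concrete, the following minimal sketch applies a comparable base R substitution to a few made-up fusion names. The gene names and distances are hypothetical, and `gsub()` is used here only as a stand-in for `rm_between()` so that the example runs without extra packages.

```r
# Hypothetical fusion names; intergenic partners carry a distance in parentheses
fusion_names <- c(
  "GENEA(1200),GENEB(3400)--GENEC",
  "GENEA(1350),GENEB(2900)--GENEC",
  "GENED--GENEE"
)

# Drop every "(...)" group (a base R stand-in for the rm_between() call above)
stripped <- gsub("\\([^)]*\\)", "", fusion_names)

# Before stripping, the first two names are counted as different fusions
length(unique(fusion_names))

# After stripping, they collapse to a single fusion name
length(unique(stripped))
table(stripped)
```

Without this step, the same intergenic fusion called at slightly different distances would be tallied separately in the recurrence counts.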

We consider a fusion call to be known oncogenic or putative oncogenic if the fusion has been previously reported in TCGA, or if at least one of its gene partners is a known oncogene, tumor suppressor gene, COSMIC gene, and/or transcription factor.

# Add reference gene list containing known oncogenes, tumor suppressors, kinases, and transcription factors
geneListReferenceDataTab <- read.delim(
  system.file("extdata", "genelistreference.txt", package = "annoFuseData"),
  stringsAsFactors = FALSE
)

# Add fusion list containing previously reported oncogenic fusions.
fusionReferenceDataTab <- read.delim(
  system.file("extdata", "fusionreference.txt", package = "annoFuseData"),
  stringsAsFactors = FALSE
)

# filter for driver fusions
putative_driver_fusions <- fusion_driver(
  standardFusioncalls = fusion_calls, 
  annotated = TRUE, 
  geneListReferenceDataTab = geneListReferenceDataTab, 
  fusionReferenceDataTab = fusionReferenceDataTab,
  checkDomainStatus = TRUE
)

# aggregate caller
putative_driver_fusions <- aggregate_fusion_calls(putative_driver_fusions, removeother = FALSE)
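
The selection rule described above can be illustrated on toy data. This is a minimal sketch under stated assumptions, not the implementation of `fusion_driver()`: the data frame, the gene symbols, and the two reference vectors (`reported_fusions`, `annotated_genes`) are hypothetical placeholders.

```r
# Hypothetical fusion calls (gene symbols are placeholders, not real annotations)
calls <- data.frame(
  FusionName = c("GENEA--GENEB", "GENEC--GENED", "GENEE--GENEF"),
  Gene1A = c("GENEA", "GENEC", "GENEE"),
  Gene1B = c("GENEB", "GENED", "GENEF"),
  stringsAsFactors = FALSE
)

# Hypothetical reference lists
reported_fusions <- c("GENEA--GENEB") # previously reported fusions
annotated_genes <- c("GENED")         # e.g. oncogenes, tumor suppressors

# A call is flagged if the fusion was previously reported,
# or if either gene partner appears in the annotated gene list
is_putative <- calls$FusionName %in% reported_fusions |
  calls$Gene1A %in% annotated_genes |
  calls$Gene1B %in% annotated_genes

calls[is_putative, ]
```

In this toy input, the first call is kept because the fusion itself is in the reference list, the second because one partner is annotated, and the third is dropped.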

Putative driver fusions found in more than four distinct histologies were filtered out, as these fusions were considered likely to be artifacts.

# Check whether fusions are called in multiple histologies;
# this suggests that these fusion calls are either artifactual
# or commonly occurring
found_in_morethan_4 <- groupcount_fusion_calls(
  putative_driver_fusions, 
  group = "broad_histology", 
  numGroup = 4
) %>% 
  arrange(desc(group.ct))

found_in_morethan_4
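
The idea behind this check can be sketched in base R on toy data. This assumes, based on the variable name and comments in this vignette, that the check flags fusions observed in more than `numGroup` distinct groups; the data, the threshold of 2, and the name `num_group` are hypothetical, and this is not the internal code of `groupcount_fusion_calls()`.

```r
# Hypothetical calls: one row per fusion call, with the histology of its sample
calls <- data.frame(
  FusionName = c("GENEA--GENEB", "GENEA--GENEB", "GENEA--GENEB",
                 "GENEC--GENED", "GENEC--GENED"),
  broad_histology = c("Histology 1", "Histology 2", "Histology 3",
                      "Histology 1", "Histology 1"),
  stringsAsFactors = FALSE
)

# Count the number of distinct histologies in which each fusion is seen
group_ct <- tapply(calls$broad_histology, calls$FusionName,
                   function(x) length(unique(x)))

# Flag fusions seen in more than `num_group` histologies (hypothetical threshold)
num_group <- 2
flagged <- names(group_ct)[as.vector(group_ct > num_group)]
flagged
```

Repeated calls within the same histology do not raise the count, so a fusion that is recurrent in a single histology is not flagged by this criterion.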

Other non-oncogenic fusions (not annotated as oncogene, transcription factor, tumor suppressor gene, or kinase) that are recurrent, called by both callers, and unique to a specific histology in this filtered dataset are of interest as well.

However, non-oncogenic genes that are fused more than five times per sample were removed, as manual review of our dataset indicated that they were artifactual.

# To scavenge back non-oncogenic recurrent/unique per broad histology fusions
fusion_calls <- aggregate_fusion_calls(
  fusion_calls,
  removeother = TRUE,
  filterAnnots = "LOCAL_REARRANGEMENT|LOCAL_INVERSION"
)

# Keep
# 1. Called by at least n callers
fusion_calls.summary <- called_by_n_callers(fusion_calls, 
                                            numCaller = 2)

# OR
# 2. Found in at least n samples in each group
sample.count <- samplecount_fusion_calls(fusion_calls, 
                                         numSample = 2, 
                                         group = "broad_histology")

# Remove
# 1. non-oncogenic fusions that are in > numGroup
group.count <- groupcount_fusion_calls(fusion_calls, 
                                       group = "broad_histology", 
                                       numGroup = 1)

# 2. non-oncogenic multi-fused genes
fusion_recurrent5_per_sample <- fusion_multifused(fusion_calls, 
                                                  limitMultiFused = 5)

# filter fusion_calls to keep recurrent fusions from above sample.count and fusion_calls.summary
QCGeneFiltered_recFusion <- fusion_calls %>%
  dplyr::filter(FusionName %in% unique(c(sample.count$FusionName, fusion_calls.summary$FusionName)))

# filter QCGeneFiltered_recFusion to remove fusions found in more than 1 group and genes that are multi-fused within a sample
QCGeneFiltered_recFusionUniq <- QCGeneFiltered_recFusion %>%
  dplyr::filter(!FusionName %in% group.count$FusionName) %>%
  dplyr::filter(!Gene1A %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene2A %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene1B %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene2B %in% fusion_recurrent5_per_sample$GeneSymbol)

# Check for domain retention
QCGeneFiltered_recFusionUniq <- fusion_driver(
  standardFusioncalls = QCGeneFiltered_recFusionUniq, 
  geneListReferenceDataTab = geneListReferenceDataTab, 
  fusionReferenceDataTab = fusionReferenceDataTab,
  # don't filter because these are non-oncogenic fusions
  filterPutativeDriver = FALSE,
  checkDomainStatus = TRUE,
  annotated = TRUE
)
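
The keep and remove steps on fusion names in this chunk amount to simple set operations. The following minimal sketch uses hypothetical fusion names (`F1` to `F5`) and hypothetical vector names, and deliberately leaves out the multi-fused gene filter and the domain retention check.

```r
# Hypothetical fusion names produced by each filtering step
all_fusions <- c("F1", "F2", "F3", "F4", "F5")
by_two_callers <- c("F1", "F2")     # called by at least 2 callers
recurrent_in_group <- c("F2", "F3") # found in at least 2 samples in a group
in_many_groups <- c("F3")           # found in more than 1 group

# Keep: fusions that satisfy either of the two "keep" criteria
keep <- union(by_two_callers, recurrent_in_group)

# Remove: fusions that are not unique to a single group
result <- setdiff(intersect(all_fusions, keep), in_many_groups)
result
```

`F4` and `F5` never enter the result because they satisfy neither keep criterion, while `F3` is kept at first and then removed for appearing in more than one group.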

Here, we visualize the distribution of fusions between two different genes, which are annotated as "Genic" in our standard fusion format. In-frame fusions are predicted to translate into protein, so we also keep only in-frame fusions for visualization.

putative_driver_fusions <- putative_driver_fusions %>%
  dplyr::bind_rows(QCGeneFiltered_recFusionUniq[, colnames(putative_driver_fusions)]) %>%
  # remove fusions found in more than 4 histologies
  dplyr::filter(
    !FusionName %in% found_in_morethan_4$FusionName,
    # keep only genic fusions, which removes intergenic and intragenic fusions
    BreakpointLocation == "Genic",
    Fusion_Type == "in-frame"
  ) %>%
  as.data.frame()


head(putative_driver_fusions)

A summary of the called fusions gives an overview of the genomic rearrangements within a cohort in terms of intra- or interchromosomal changes and the biotypes of the fused genes. We also show the distribution of kinase domains retained in the 5' and 3' genes, as well as the overall distribution of annotations in each group. Here, we use broad_histology to group our cohort.

# plot summary
plot_summary(standardFusioncalls = putative_driver_fusions, 
             groupby = "broad_histology")

Recurrent fusions and recurrently fused genes provide insight into subtypes of samples within a cohort. Here, we choose broad_histology as the grouping variable and plot the number of participants (Kids_First_Participant_ID) with these fusions.

# recurrently fused genes
plot_recurrent_genes(standardFusioncalls = putative_driver_fusions, 
                     groupby = "broad_histology", 
                     countID = "Kids_First_Participant_ID", 
                     plotn = 20)
# recurrent fusions
plot_recurrent_fusions(standardFusioncalls = putative_driver_fusions, 
                       groupby = "broad_histology", 
                       countID = "Kids_First_Participant_ID", 
                       plotn = 20)
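
The counts behind such plots can be approximated by hand. This minimal base R sketch assumes that each participant is counted once per fusion and group, which is inferred from the `countID` argument name rather than confirmed from the plotting code; the participant IDs, gene names, and the column name `n_participants` are hypothetical.

```r
# Hypothetical calls; participant PT_1 has the same fusion called twice
calls <- data.frame(
  FusionName = c("GENEA--GENEB", "GENEA--GENEB", "GENEA--GENEB", "GENEC--GENED"),
  Kids_First_Participant_ID = c("PT_1", "PT_1", "PT_2", "PT_3"),
  broad_histology = c("Histology 1", "Histology 1", "Histology 1", "Histology 2"),
  stringsAsFactors = FALSE
)

# Count each participant only once per fusion and histology
unique_calls <- unique(
  calls[, c("FusionName", "broad_histology", "Kids_First_Participant_ID")]
)

# Number of participants per fusion and histology
counts <- aggregate(
  Kids_First_Participant_ID ~ FusionName + broad_histology,
  data = unique_calls,
  FUN = length
)
names(counts)[3] <- "n_participants"

# Most recurrent fusions first
counts[order(-counts$n_participants), ]
```

Counting by participant rather than by row keeps multiple samples or duplicate calls from the same individual from inflating the apparent recurrence of a fusion.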

Session info

sessionInfo()


d3b-center/annoFuse documentation built on Feb. 21, 2023, 1:06 a.m.