knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The R package splice2neo provides multiple functions for the analysis of splice junctions. The main focus is on the association of splice junctions with somatic mutations. The package integrates the output of several tools that predict splicing effects from mutations. It annotates these mutation effects with the resulting splicing event (alternative 3’/5’ splice sites, exons skipping, intron retentions) and the resulting splice junctions in a standardized format. Expressed splice junctions detected by external tools from RNA-seq data can be parsed and integrated. Splice junctions can be filtered against canonical splice junctions and annotated with the altered transcript sequences, CDS, and resulting peptide sequences. Splice2neo was mainly developed to support the detection of neoantigen candidates from tumor-specific splice junctions. Neoantigens are tumor-specific mutated peptides recognized by T-cells and utilized in immunotherapies against tumors [@lang_identification_2022]. However, most functions from the splice2neo package are also useful for other types of splicing analysis.

In this vignette, we use use some toy example data to demonstrate how splice2neo can be used to analyze splice-junctions which are derived from mutation predictions as tumor-specific neoantigen candidates [@lang_prediction_2023].

Installation

The R package splice2neo is not yet on CRAN or Bioconductor. Therefore, you need to install splice2neo from github.

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("TRON-Bioinformatics/splice2neo")

Example workflow to predict mutation-associated alternative splicing

Integrating splice2neo functions and detection rules based on splice effect scores and RNA-seq support facilitates the identification of mutation-associated splice junctions which are tumor-specific and can encode neoantigen candidates. The following workflow explains how to use splice2neo to predict neoantigen candidates from mutation-associated splice junctions using example data.

library(splice2neo)
library(dplyr)

Building transcript database

A database of transcripts is required for transcript annotation with splice2neo. Here, we show how to build the database from a GTF file that is a subset of GENCODE annotations, but users can build these databases also from other resources.

# use gtf file of choice and transform into transcript database
gtf_file <- system.file("extdata/example/gencode.v34lift37.annotation.subset.gtf.gz", package="splice2neo")

# parse GTF file as txdb object
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file)
# optionally save database for later re-use 
# saveDb(txdb, file = "/path/to/transripts/txdb.sqlite")
# Build transcript database
transcripts <-
  GenomicFeatures::exonsBy(txdb, by = c("tx"), use.names = TRUE)
transcripts_gr <- GenomicFeatures::transcripts(txdb, columns = c("gene_id", "tx_id", "tx_name"))

# Build cds database
cds <- GenomicFeatures::cdsBy(txdb, by = c("tx"), use.name = TRUE)

Canonical junctions

We can retrieve the canonical splice junctions between all adjacent exons from a list of transcripts with canonical_junctions().

ref_junc <- canonical_junctions(transcripts)

str(ref_junc)

Splice junction id

In splice2neo, splice junctions are represented by a junc_id, which is defined by the genomic coordinates of the exon positions (1-based) that are combined through the junction and the transcription strand of the transcript: chr:pos1-pos2:strand.

This format allows direct conversion to GRangs objects from the GenomicRanges package.

gr <- GenomicRanges::GRanges(ref_junc)

gr

A given GRanges object with ranges of splice junctions can be converted into a vector of `junc_id``s:

junc_id <- as.character(gr)

head(junc_id)

Annotation of RNA-seq derived splice junctions

Splice2neo transforms splice events identified in RNA-seq into a standardized junction format. We show here how to parse the tools LeafCutter and SplAdder.

LeafCutter

The output of the tool LeafCutter [@li_annotation-free_2018] can be transformed using leafcutter_transform():

leafcutter_path <- system.file("extdata/example/leafcutter", package="splice2neo")

leafcutter <-
  leafcutter_transform(path = leafcutter_path)


head(leafcutter)

SplAdder

The output of the tool SplAdder [@kahles_spladder_2016] can be transformed using spladder_transform():

spladder_path <- system.file("extdata/example/spladder", package="splice2neo")

spladder <-
  spladder_transform(path = spladder_path)


head(spladder)

Combine RNA-seq derived splice junctions

Now we want to get a single data.frame of all splice junctions identified by RNA-seq tools and information by which tool they were identified.

rna_junctions <-
  generate_combined_dataset(list("spladder_juncs" = spladder, "leafcutter_juncs" = leafcutter))

head(rna_junctions)

Annotation of mutation effect on splicing

Deep learning tools can predict the effect of a mutation on the disruption or creation of canonical splicing sequence motifs. This is typically provided as gain or loss of splicing acceptor or donor sites. Splice2neo can transform and annotate such mutation effects into a unified splice junction format. Currently, the transformation of predictions by SpliceAI [@jaganathan_predicting_2019], MMSplice[@cheng_mmsplice_2019] and Pangolin[@zeng_predicting_2022] are supported.
Here, we show how to annotate results from SpliceAI and MMSplice. The annotation of Pangolin results works analogously with SpliceAI.

SpliceAI

Effect predictions by SpliceAI can be read with parse_spliceai()

spliceai_file <- system.file("extdata/example/spliceAI.vcf", package="splice2neo")

# parse SpliceAI results 
spliceai <-
  parse_spliceai(spliceai_file)

head(spliceai)

Next, SpliceAi effects can be transformed into a standardized effect dataframe with format_spliceai(). Here, we can optionally provide a data.frame (gene_table) which relates gene symbols with GENCODE gene ids. If gene_table is provided, the formatted data.frame contains the GENCODE gene ids. While this is advised to better select relevant transcripts, it is not required.

gene_table <-
  readr::read_tsv(system.file("extdata/example/gene_table.tsv", package =
                                "splice2neo"))

head(gene_table)
# format SpliceAI results
splicai_formatted <-
  format_spliceai(spliceai, gene_table = gene_table)

head(splicai_formatted)

Predicted SpliceAI effects can be annotated with all potential resulting junctions and transcripts with annotate_mut_effect(). If the input was formatted using gene_table, the user may want to choose gene_mapping = TRUE which removes unlikely events from the results. Otherwise, the default gene_mapping = FALSE should be used. The resulting dataframe contains all potential alternative transcripts per row, meaning that there can be multiple annotated transcripts per junction and also multiple junctions per mutation effect. At this point, no filtering based on effect size has been applied meaning that the dataframe contains unlikely events.

# annotate SpliceAI junctions with all potential junctions and transcripts
spliceai_annotated <-
  annotate_mut_effect(
    effect_df = splicai_formatted,
    transcripts = transcripts,
    transcripts_gr = transcripts_gr,
    gene_mapping = TRUE
  )


spliceai_annotated %>% 
  dplyr::select(mut_id, junc_id, gene_id, tx_id, effect, score) %>% 
  head()

It can happen that multiple effects lead to the same altered transcripts (defined by mut_id, junc_id, tx_id ). The function unique_mut_junc() can be used to filter for unique altered transcripts by keeping the effect with the highest effect score.

# unique altered transcripts
spliceai_annotated_unique <- unique_mut_junc(spliceai_annotated)

nrow(spliceai_annotated_unique) < nrow(spliceai_annotated)

MMSplice

We can read the effect predictions by MMSplice with parse_mmsplice()

mmsplice_file <- system.file("extdata/example/MMsplice.csv", package="splice2neo")

# parse MMSplice results 
mmsplice <-
  parse_mmsplice(mmsplice_file)

head(mmsplice)

The MMSplice data can be annotated with all potential splice junctions using annotate_mmsplice()

mmsplice_annotated <- mmsplice %>%
  annotate_mmsplice(transcripts = transcripts)

head(mmsplice_annotated)

Junctions predicted from MMSplice results can contain duplicated altered transcripts for exon inclusion events. We use the function unique_junc_mmsplice() to filter for unique altered transcripts by keeping the effect with the highest effect score.

# unique altered transcripts
mmsplice_annotated_unique <- unique_junc_mmsplice(mmsplice_annotated)

nrow(mmsplice_annotated_unique) < nrow(mmsplice_annotated)

Combine mutation-retrieved splice junctions

Next, we combine splice junctions from SpliceAI and MMSplice into one dataframe using combine_mut_junc(). This dataframe will contain unique altered transcripts (defined by mut_id, junc_id, tx_id) in rows. Tool specific information are defined in columns starting with the respective tool name.

mut_junctions <- combine_mut_junc(list(
    "spliceai" = spliceai_annotated_unique, 
    "mmsplice" = mmsplice_annotated_unique
    ))


mut_junctions %>%
  dplyr::select(mut_id, junc_id, tx_id, spliceai_event_type, spliceai_score) %>% 
  head()

Integrative analysis of mutation-retrieved splice junctions

Annotate canonical junctions

To exclude junctions expressed in healthy tissues, we here want to filter predicted junctions for non-canonical splice junctions. We can use for instance exon-exon junctions from GENCODE and junctions found in healthy tisssue samples. Splice2neo also provides the function bed_to_junc() which allows to transform a bed file into a list of splice junctions.

# this is just an example file with a few canonical junctions --> incomplete
canonical_juncs_file <- system.file("extdata/example/test_canonical.bed", package="splice2neo")

canonical_junctios <- bed_to_junc(bed_file = canonical_juncs_file, type = "exon-exon")

Annotate mutation-retrieved splice junction whether they are canonical splice junctions.

# annotated canonical junctions
mut_junctions <- mut_junctions %>%
  mutate(is_canonical = is_canonical(junc_id, ref_junc = canonical_junctios, exons_gr = transcripts))

mut_junctions %>% 
  count(is_canonical)

# remove canonical junctions 
noncan_mut_junctions <- mut_junctions %>% 
  filter(!is_canonical)

Annotate support in RNA-seq derived splice junctions

Annotate whether mutation-retrieved splice junction were also detected by a RNA-seq tool.

noncan_mut_junctions <- noncan_mut_junctions %>%
  mutate(is_in_rnaseq = is_in_rnaseq(junc_id, rna_juncs = rna_junctions$junc_id))

Annotate transcript context sequence

Splice2neo can annotate the resulting transcript sequence for each altered transcript with add_context_seq(). Here, we set the size parameter to 400 to extract the sequence of +/-200 nucleotides around the splice junctions as context sequence.

noncan_mut_junctions <- noncan_mut_junctions %>%
  add_context_seq(
    size = 400,
    bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    transcripts = transcripts
  )

noncan_mut_junctions %>%
  dplyr::select(mut_id, junc_id, tx_id, cts_id, cts_seq) %>%
  head()

Annotate peptide context sequence

Splice2neo can annotate the resulting peptide sequence for each altered transcript with add_peptide(). This is relevant to analyze which altered transcript could qualify as neoantigens. We can define by the parameter flanking_size the number of wild-type amino acids flanking the junction or novel sequence caused by the splice junction to the left and to the right. If the junction leads to a reading frame shift, sequences are flanked by wild-type amino acids to the left and are annotated until the first stop codon.

noncan_mut_junctions <- noncan_mut_junctions %>%
  add_peptide(flanking_size = 13,
              bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
              cds = cds)

noncan_mut_junctions %>%
  dplyr::select(mut_id,
                junc_id,
                tx_id,
                junc_in_orf,
                peptide_context,
                frame_shift) %>%
  head()

Annotate if predicted retained introns are covered by an exon of another transcript

Relevant intron retention events are particular challenging to predict. Splice2neo provides the function exon_in_intron() to annotate whether the exon of another transcript is in the genomic region of a predicted retained introns. We do not want to consider them in the downstream analysis as it is difficult to differentiate if the read support from RNA-seq relates to the retained intron or to the exon of the other transcript.

# annotate exons in introns of other transcript
noncan_mut_junctions <-
  noncan_mut_junctions %>%
  exon_in_intron(transcripts = transcripts)

Re-quantification of altered transcripts with EasyQuant

Altered transcript from predicted splice junctions can be re-quantified in RNA-seq in a targeted manner with EasyQuant (available at https://github.com/TRON-Bioinformatics/easyquant). EasyQuant maps all RNA-seq reads to to a custom reference build from the context sequence around splice junctions. We can generate the input for EasyQuant using the splice2neo function transform_for_requant().
EasyQuant should be run in interval mode for the re-quantification of splice junction and it is advised to set the BP_distannce parameter to 10 (see EasyQuant README).

# transform for EasyQuant
easyquant_input <- noncan_mut_junctions %>%
  transform_for_requant()

head(easyquant_input)

After running EasyQuant (which is not shown here), we can import the EasyQuant results with the splice2neo function map_requant().

# path to easyquant folder 
easyquant_path <- system.file("extdata/example/easyquant", package="splice2neo")

# merge easyquant results 
requant_noncan_mut_junctions <-
  map_requant(path_to_easyquant_folder = easyquant_path,
              junc_tib = noncan_mut_junctions)

Here, we are interested in different result columns depending on the type of splice event. In case of alternative splice sites or exon skipping events, want use the number of junction reads that map on the splice junction (junc_interval_start) and/or the number of spanning reads that bridge the splice junction (span_interval_start).

# re-quantification results
requant_noncan_mut_junctions %>% 
  filter(spliceai_event_type != "intron retention") %>% 
  dplyr::select(mut_id, junc_id, tx_id, junc_interval_start, span_interval_start) %>% 
  head()

In case of intron retentions, we may want to consider the following columns:

# re-quantification results
requant_noncan_mut_junctions %>%
  filter(spliceai_event_type == "intron retention") %>%
  dplyr::select(
    mut_id,
    junc_id,
    tx_id,
    within_interval,
    coverage_perc,
    coverage_median,
    coverage_mean
  ) %>%
  head()
# alternative to only import EasyQuant results without merging 

requantification_results  <-
  read_requant(path_folder = easyquant_path)

head(requantification_results)

Prediction of mutation-associated splice targets

In the analysis described above we identified mutation-retrieved splice junctions supported by RNA-seq. We have developed a stringent detection rule to predict tumor-specific targets from mutation-associated splice targets based on the mutation effect and the RNA-seq support (See @lang_prediction_2023). Currently, we suggest to focus on events from alternative splice sites and exon skipping and define such splice junctions as targets that have a |splice effect score| $\geq$ 0.35 and are found by LeafCutter/SplAdder or have at least 3 junction reads.

# exclude intron retention
potential_targets <- requant_noncan_mut_junctions %>%
  filter(spliceai_event_type != "intron retention")

potential_targets %>% 
  select(junc_id, tx_id, spliceai_score, mmsplice_delta_logit_psi, is_in_rnaseq, junc_interval_start)
# apply detection rule 
targets <- potential_targets %>% 
  filter(spliceai_score >= 0.35 | mmsplice_delta_logit_psi < -0.35) %>% 
  filter(is_in_rnaseq | junc_interval_start >= 3)

The data in targets can contain several altered transcripts per junctions which may lead to several neoantigen candidate sequences per target splice junction.

# apply detection rule 
targets %>% 
  distinct(junc_id, peptide_context) %>% 
  count(junc_id, name = "# neoantigen candidates") %>% 
  arrange(-`# neoantigen candidates`)

Other Splice2neo functions

Annotation of transcipts

Here, we have a list of splice junctions and want to know all potential affected transcripts.

Given a data.frame of splice junctions (with a column junc_id), the function add_tx() annotates all possible transcripts which overlap with the splice junctions.

junc_df <- dplyr::tibble(
  junc_id = toy_junc_id[c(1, 6, 10)]
)


junc_df <- junc_df %>% 
  add_tx(toy_transcripts)

junc_df

This can lead to large data sets containing many highly unlikely junction transcript combinations. We can select a subset of transcripts per junction that are more likely to be affected by a junction with choose_tx(). Please note that this function may loose relevant or keep irrelevant junction-transcripts in particular in regions with multiple isoforms with distinct splicing pattern.

selected_junc_df <- junc_df %>% 
  choose_tx()

selected_junc_df

References

Session info

sessionInfo()


TRON-Bioinformatics/splice2neo documentation built on July 1, 2024, 7:57 p.m.