In luannnguyen/hmfGeneAnnotation: Determines gene amplifications and biallelic losses from HMF pipeline output

hmfGeneAnnotation is an R package designed to determine the amplification/biallelic loss status of a set of genes (provided as a bed file) based on copy number and SNV/indel data generated by the HMF variant calling pipeline.

Getting started

Generated by HMF pipeline

Germline SNV/indel vcf (*.annotated.vcf.gz)
Somatic SNV/indel vcf (*.purple.somatic.vcf.gz)
Copy number info per gene (*.purple.cnv.gene.tsv)
Copy number info (*.purple.cnv.somatic.tsv)

Gene list

Bed file with the chromosome, start/end genome coordinates, and ENSEMBL gene IDs of the desired genes. Below are the first few lines of the default bed file.

default_bed <- read.delim(
   file=system.file('misc/cosmic_cancer_gene_census_20200225.bed',package='hmfGeneAnnotation'),
   check.names=FALSE
)
head(default_bed)

Usage

First install the package and its dependencies.

## Install dependencies
install.packages('seqminer')

## Install hmfGeneAnnotation
install.packages('devtools'); library(devtools)
install_github('UMCUGenetics/hmfGeneAnnotation')

detGeneStatuses() is the main function of the package. If bed.file is not specified, the bed file included in this package is used. The user may also optionally specify the path to the java binary (java.path; defaults to the one installed on the system) and the path to the SnpSift jar (snpsift.path; defaults to the jar included at inst/dep/SnpSift.jar).

detGeneStatuses(
   out.dir='/path/to/write/output/files/', 
   hmf.pl.output.paths=c(
     germ_vcf='/path/to/annotated.vcf.gz', 
     som_vcf='/path/to/purple.somatic.vcf.gz', 
     gene_cnv='/path/to/purple.cnv.gene.tsv', 
     cnv='/path/to/purple.cnv.somatic.tsv'
   ), 
   sample.name='sample_name',

   ## Optional arguments
   bed.file='/path/to/bed/file', 
   java.path='/path/to/java/binary', 
   snpsift.path='/path/to/snpsift/jar',

   verbose=TRUE
)
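
When processing several samples, it can be convenient to build the named hmf.pl.output.paths vector programmatically. The sketch below is not part of the package: buildHmfPaths() is a hypothetical helper, and it assumes that all four files for a sample sit in one directory and are named <sample_name> followed by the suffixes listed under "Generated by HMF pipeline". Adjust it if your pipeline output is laid out differently.

```r
## Hypothetical helper (not part of hmfGeneAnnotation).
## Assumes files are named <sample_name><suffix> inside sample_dir.
buildHmfPaths <- function(sample_dir, sample_name){
   suffixes <- c(
      germ_vcf='.annotated.vcf.gz',
      som_vcf='.purple.somatic.vcf.gz',
      gene_cnv='.purple.cnv.gene.tsv',
      cnv='.purple.cnv.somatic.tsv'
   )
   paths <- file.path(sample_dir, paste0(sample_name, suffixes))
   ## paste0() drops names, so restore the names expected by detGeneStatuses()
   names(paths) <- names(suffixes)
   paths
}

paths <- buildHmfPaths('/path/to/sample_dir', 'sample_name')
```

The resulting vector can then be passed as hmf.pl.output.paths=paths. Checking the paths with file.exists(paths) before calling detGeneStatuses() helps catch naming mismatches early.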

The output is a table where each row contains (1) data about copy number gains at the chromosome arm level relative to the genome ploidy, and local copy number gains relative to the chromosome arm ploidy; (2) data about losses/mutations of allele 1 and allele 2, with each variant being given an impact score from 0-5 based on ClinVar annotations (has priority) or SnpEff variant type annotations. Below is a schematic overview of the output table.

      || gene_metadata || CN_gain_info || allele_1_losses             || allele_2_losses             ||
      ||               ||              || variant_type | impact_score || variant_type | impact_score ||
------------------------------------------------------------------------------------------------------
gene1 ||               ||              ||              |              ||              |              ||
gene2 ||               ||              ||              |              ||              |              ||
 ...

Package workflow

Pre-processing HMF pipeline outputs

Calculate ploidy for each chromosome arm
Subset gene cnv table for genes of interest
Subset germline and somatic vcfs using SnpSift for regions of genes of interest

Assign scores for CN loss events

If max copy number < 0.3: flag as deep deletion. Assign score of 5+5
Else if min copy number < 0.3: flag as truncation. Assign score of 5+5
Else if min minor allele ploidy < 0.2: flag as LOH. Assign score of 5 to allele 1
Else flag as no copy number variant
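
The decision rules for copy number losses can be sketched as a small R function. This is an illustration only, not the package's actual code: classifyCnLoss(), its argument names, and the event labels are hypothetical, while the thresholds are the ones listed in this section. Because a gene's minimum copy number can never exceed its maximum, the sketch tests the maximum first (whole gene lost) and the minimum second (only part of the gene lost), so that both branches are reachable. This ordering is an assumption of the sketch.

```r
## Hypothetical helper illustrating the CN loss rules (not the package's code).
## Returns the event label and the score assigned to each allele.
classifyCnLoss <- function(min_cn, max_cn, min_minor_allele_ploidy){
   if(max_cn < 0.3){
      ## Whole gene has (near) zero copies
      list(event='deep_deletion', score_a1=5, score_a2=5)
   } else if(min_cn < 0.3){
      ## Only part of the gene has (near) zero copies
      list(event='truncation', score_a1=5, score_a2=5)
   } else if(min_minor_allele_ploidy < 0.2){
      ## One allele lost; only allele 1 receives a score
      list(event='loh', score_a1=5, score_a2=0)
   } else {
      list(event='none', score_a1=0, score_a2=0)
   }
}

classifyCnLoss(min_cn=1.9, max_cn=2.1, min_minor_allele_ploidy=0.05)$event
```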

Assign scores to SNV/indels

Flag origin of variant (i.e. germline or somatic)
Assign score to mutations in each allele based on ClinVar or SnpEff annotations:

score | ClinVar           | SnpEff
-----------------------------------------------------------
  5   | pathogenic        | frameshift
  4   | likely_pathogenic | nonsense
  3   | VUS               | missense, splice, inframe indel
  2   | likely_benign     | other variants
  1   | benign            | other variants
  0   | no data available | other variants
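
As a rough illustration of the table above, the scoring can be written as two lookup vectors with ClinVar taking priority. This is a sketch under assumptions, not the package's implementation: scoreVariant() and both lookup vectors are hypothetical, the keys are the simplified labels from the table rather than the raw annotation strings found in a vcf, and variant types not listed in rows 5 to 3 simply receive 0 here.

```r
## Hypothetical lookup vectors mirroring the score table (labels simplified)
clinvar_scores <- c(
   pathogenic=5, likely_pathogenic=4, VUS=3,
   likely_benign=2, benign=1
)
snpeff_scores <- c(
   frameshift=5, nonsense=4,
   missense=3, splice=3, inframe_indel=3
)

## Hypothetical scoring function: ClinVar is used when available,
## otherwise the SnpEff variant type; anything else scores 0 (assumption)
scoreVariant <- function(clinvar_sig=NA, snpeff_type=NA){
   if(!is.na(clinvar_sig) && clinvar_sig %in% names(clinvar_scores)){
      return(unname(clinvar_scores[clinvar_sig]))
   }
   if(!is.na(snpeff_type) && snpeff_type %in% names(snpeff_scores)){
      return(unname(snpeff_scores[snpeff_type]))
   }
   0
}

scoreVariant(clinvar_sig='benign', snpeff_type='frameshift')
scoreVariant(snpeff_type='missense')
```

In the first call the ClinVar annotation overrides the SnpEff variant type, which reflects the "has priority" rule described for the output table.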

Combine monoallelic events:

If deep deletion or truncation, assign gene CNV output to both allele 1 and 2
Else make pairs of the following events: LOH, germline mut, somatic mut
Determine the variant pair with the highest hit score (i.e. combined score). The order in which biallelic event types are prioritized is described below.

biallel_event | allele1_event | allele2_event 
---------------------------------------------
CN loss       | deep deletion | deep deletion
CN loss       | truncation    | truncation
LOH+som       | LOH           | SNV/indel
LOH+germ      | LOH           | SNV/indel
som+som       | SNV/indel     | SNV/indel
germ+som      | SNV/indel     | SNV/indel
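
Selecting the best pair for one gene can be sketched as follows. This is a hypothetical illustration rather than the package's code: pickBestPair(), the column names, and the example scores are made up, and using the priority order only to break ties between pairs with equal hit scores is an assumption of the sketch.

```r
## Priority of biallelic event types, highest first (from the table above)
event_priority <- c('CN loss','LOH+som','LOH+germ','som+som','germ+som')

## Hypothetical candidate pairs for a single gene; scores are illustrative only
pairs <- data.frame(
   biallel_event=c('germ+som','LOH+som','LOH+germ'),
   a1_score=c(5,5,5),
   a2_score=c(4,4,4),
   stringsAsFactors=FALSE
)

## Hypothetical helper: highest hit score wins, ties go to the
## event type that appears earlier in the priority list (assumption)
pickBestPair <- function(pairs, priority=event_priority){
   pairs$hit_score <- pairs$a1_score + pairs$a2_score
   ord <- order(-pairs$hit_score, match(pairs$biallel_event, priority))
   pairs[ord[1], , drop=FALSE]
}

best <- pickBestPair(pairs)
best
```

All three candidate pairs have the same hit score in this example, so the pair whose event type ranks highest in the priority list is returned.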

Output

A table containing for each gene: (1) the maximum impact variant pair; and (2) the amplification data
