README.md

Identifying gene amplification and biallelic losses

hmfGeneAnnotation is an R package that determines the amplification/biallelic loss status of a set of genes (provided as a BED file) from the copy number and SNV/indel data generated by the HMF variant calling pipeline.

Getting started

Generated by HMF pipeline

Gene list

default_bed <- read.delim(
   file=system.file('misc/cosmic_cancer_gene_census_20200225.bed',package='hmfGeneAnnotation'),
   check.names=FALSE  ## keep the '#chrom' column name as-is
)
head(default_bed)
##   #chrom    start      end hgnc_id hgnc_symbol ensembl_gene_id
## 1      1  2160134  2241558   10896         SKI ENSG00000157933
## 2      1  2487078  2496821   11912    TNFRSF14 ENSG00000157873
## 3      1  2985732  3355185   14000      PRDM16 ENSG00000142611
## 4      1  6241329  6269449   10315       RPL22 ENSG00000116251
## 5      1  6845384  7829766   18806      CAMTA1 ENSG00000171735
## 6      1 11166592 11322564    3942        MTOR ENSG00000198793
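
A custom gene list can presumably follow the same column layout as the default BED file shown above. The sketch below writes a two-gene BED file using rows copied from that output. The file name custom_genes.bed is hypothetical, and whether detGeneStatuses() requires every one of these columns is an assumption, not something this README states.

```r
## Minimal sketch: build a small gene list with the same columns as the default BED file.
## The two rows are copied from the head(default_bed) output above.
custom_bed <- data.frame(
   chrom=c('1','1'),
   start=c(2160134L, 11166592L),
   end=c(2241558L, 11322564L),
   hgnc_id=c(10896L, 3942L),
   hgnc_symbol=c('SKI','MTOR'),
   ensembl_gene_id=c('ENSG00000157933','ENSG00000198793'),
   stringsAsFactors=FALSE
)
## The default file names its first column '#chrom'
colnames(custom_bed)[1] <- '#chrom'

## Hypothetical output location; any writable path would do
custom_bed_path <- file.path(tempdir(), 'custom_genes.bed')
write.table(custom_bed, custom_bed_path, sep='\t', quote=FALSE, row.names=FALSE)

## Read the file back the same way the default BED file is read
custom_bed_check <- read.delim(custom_bed_path, check.names=FALSE)
print(custom_bed_check)
```

The resulting path could then be passed as bed.file to detGeneStatuses().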

Usage

First install the package and its dependencies.

## Install dependencies
install.packages('seqminer')

## Install hmfGeneAnnotation
install.packages('devtools'); library(devtools)
install_github('https://github.com/UMCUGenetics/hmfGeneAnnotation/')

detGeneStatuses() is the main function of the package. The bed.file argument is optional; if it is not specified, the gene list included in this package is used. The user may also specify the path to the Java binary (java.path; defaults to the one installed on the system) and the path to the SnpSift jar (snpsift.path; defaults to the jar included at inst/dep/SnpSift.jar).

detGeneStatuses(
   out.dir='/path/to/write/output/files/', 
   hmf.pl.output.paths=c(
     germ_vcf='/path/to/annotated.vcf.gz', 
     som_vcf='/path/to/purple.somatic.vcf.gz', 
     gene_cnv='/path/to/purple.cnv.gene.tsv', 
     cnv='/path/to/purple.cnv.somatic.tsv'
   ), 
   sample.name='sample_name',

   ## Optional arguments
   bed.file='/path/to/bed/file', 
   java.path='/path/to/java/binary', 
   snpsift.path='/path/to/snpsift/jar',

   verbose=TRUE
)
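
Because hmf.pl.output.paths is a named vector, a misspelled name or a wrong path is easy to miss. The sketch below shows one way to check the vector before calling detGeneStatuses(). checkInputPaths() is a hypothetical helper, not part of hmfGeneAnnotation, and it assumes that the four names shown in the call above are all required.

```r
## Hypothetical helper (not part of hmfGeneAnnotation).
## Assumption: the four entries used in the call above are all required.
required_names <- c('germ_vcf','som_vcf','gene_cnv','cnv')

checkInputPaths <- function(paths){
   ## Every required entry must be present by name
   missing_names <- setdiff(required_names, names(paths))
   if(length(missing_names) > 0){
      stop('Missing entries: ', paste(missing_names, collapse=', '))
   }

   ## Every required path must point to an existing file
   required_paths <- paths[required_names]
   missing_files <- required_paths[ !file.exists(required_paths) ]
   if(length(missing_files) > 0){
      stop('Files not found: ', paste(missing_files, collapse=', '))
   }

   invisible(TRUE)
}

## Demonstration with empty placeholder files in a temporary directory
demo_paths <- c(
   germ_vcf=tempfile(fileext='.vcf.gz'),
   som_vcf=tempfile(fileext='.vcf.gz'),
   gene_cnv=tempfile(fileext='.tsv'),
   cnv=tempfile(fileext='.tsv')
)
invisible(file.create(demo_paths))
checkInputPaths(demo_paths)
```

Calling checkInputPaths() on the real input vector first would make a path problem surface as a single clear error message.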

The output is a table with one row per gene. Each row contains (1) data about copy number gains at the chromosome arm level relative to the genome ploidy, and local copy number gains relative to the chromosome arm ploidy; and (2) data about losses/mutations of allele 1 and allele 2. Each variant is given an impact score from 0 to 5 based on ClinVar annotations (which take priority) or, when these are unavailable, SnpEff variant type annotations. Below is a schematic overview of the output table.

      || gene_metadata || CN_gain_info || allele_1_losses             || allele_2_losses             ||
      ||               ||              || variant_type | impact_score || variant_type | impact_score ||
------------------------------------------------------------------------------------------------------
gene1 ||               ||              ||              |              ||              |              ||
gene2 ||               ||              ||              |              ||              |              ||
 ...

Package workflow

Pre-processing HMF pipeline outputs

Assign scores for CN loss events

Assign scores to SNV/indels

score | ClinVar           | SnpEff
-----------------------------------------------------------
  5   | pathogenic        | frameshift
  4   | likely_pathogenic | nonsense
  3   | VUS               | missense, splice, inframe indel
  2   | likely_benign     | other variants
  1   | benign            | other variants
  0   | no data available | other variants
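
The scoring table above can be sketched as two lookup vectors plus a function that prefers the ClinVar score when one exists. This is an illustration only, not the package's implementation. scoreVariant() and the vector names are hypothetical, the labels are the simplified ones from the table rather than actual ClinVar/SnpEff annotation strings, and scoring SnpEff "other variants" as 0 is an assumption, since the table lists that label under scores 2, 1 and 0.

```r
## Hypothetical lookup vectors built from the table above
clinvar_scores <- c(
   pathogenic=5, likely_pathogenic=4, VUS=3, likely_benign=2, benign=1
)
snpeff_scores <- c(
   frameshift=5, nonsense=4, missense=3, splice=3, inframe_indel=3
)

## Hypothetical helper: score one variant, with ClinVar taking priority
scoreVariant <- function(clinvar=NA, snpeff=NA){
   ## Use the ClinVar annotation whenever it is a known label
   if(!is.na(clinvar) && clinvar %in% names(clinvar_scores)){
      return(unname(clinvar_scores[clinvar]))
   }
   ## Otherwise fall back to the SnpEff variant type
   if(!is.na(snpeff) && snpeff %in% names(snpeff_scores)){
      return(unname(snpeff_scores[snpeff]))
   }
   ## Assumption: anything else ("other variants", no data) scores 0
   0
}

scoreVariant(clinvar='benign', snpeff='frameshift')  ## ClinVar takes priority
scoreVariant(clinvar=NA, snpeff='frameshift')        ## falls back to SnpEff
```

Under this reading, a frameshift annotated as benign in ClinVar receives the ClinVar score rather than the SnpEff score.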

Combine monoallelic events:

biallel_event | allele1_event        | allele2_event
-----------------------------------------------------------
CN loss       | deep deletion        | deep deletion
CN loss       | truncation           | truncation
LOH+som       | LOH                  | SNV/indel (somatic)
LOH+germ      | LOH                  | SNV/indel (germline)
som+som       | SNV/indel (somatic)  | SNV/indel (somatic)
germ+som      | SNV/indel (germline) | SNV/indel (somatic)
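
The combinations in the table above can be sketched as a small classifier. classifyBiallelic() is a hypothetical function, not the package's own code. It assumes that the "som" and "germ" parts of each label name the origin (somatic or germline) of the SNV/indel on the corresponding allele, and it returns NA for any combination not listed in the table.

```r
## Hypothetical sketch of the table above.
## Assumption: 'som'/'germ' in the label give the origin of the SNV/indel.
classifyBiallelic <- function(a1_event, a2_event, a1_origin=NA, a2_origin=NA){
   ## Both alleles lost by the same copy number event
   if(a1_event == a2_event && a1_event %in% c('deep deletion','truncation')){
      return('CN loss')
   }

   ## LOH on one allele plus an SNV/indel on the other
   if(a1_event == 'LOH' && a2_event == 'SNV/indel'){
      if(identical(a2_origin, 'som')) return('LOH+som')
      if(identical(a2_origin, 'germ')) return('LOH+germ')
   }

   ## An SNV/indel on each allele
   if(a1_event == 'SNV/indel' && a2_event == 'SNV/indel'){
      if(identical(a1_origin, 'som') && identical(a2_origin, 'som')) return('som+som')
      if(identical(a1_origin, 'germ') && identical(a2_origin, 'som')) return('germ+som')
   }

   ## Combination not listed in the table
   NA_character_
}

classifyBiallelic('LOH', 'SNV/indel', a2_origin='germ')
classifyBiallelic('SNV/indel', 'SNV/indel', a1_origin='germ', a2_origin='som')
```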

Output



luannnguyen/hmfGeneAnnotation documentation built on May 6, 2020, 1:07 p.m.