knitr::opts_knit$set(root.dir = ".");
knitr::opts_chunk$set(echo = T, eval = T, cache = T, warning = F, message = F)

This is a vignette intended to introduce new users to clinvaR (https://github.com/jamesdiao/clinvaR).

Purpose

For clinvaR, the first goal is to go from genes to a vcf of related variants, along with classifications from any version of ClinVar.
This involves 3 steps:

  1. Select a list of genes that you are interested in. clinvaR makes it easy to download and import relevant variants from 1KG.
  2. Select a version of ClinVar to assign pathogenicity labels with. All previous ClinVar VCFs are stored as binary files and can be retrieved with a "closest date" function.
  3. Select an analysis method. These include frequency, submission, and assertion counts over time, and aggregate frequency by ancestral group.

Dependencies

clinvaR has a number of file dependencies that are generally useful for these types of projects. I have attached them here. remotes and devtools are interchangeable.

pkg_list <- c("remotes","pander","ggplot2","tibble","tidyr","dplyr", "stringr")
installed <- installed.packages()[,"Package"]
is.installed <- pkg_list %in% installed
if (!all(is.installed))
  install.packages(pkg_list[!is.installed])
print(names(sapply(pkg_list, require, character.only = T)), quote = F)
if (!("clinvaR" %in% installed))
  remotes::install_github('jamesdiao/clinvaR')
require(clinvaR)

Select a List of Genes

The first step is to have a list of genes that you are interested in.

Method 1: Direct assignment

The easiest way is to directly save your genes of interest in gene_panel as a character vector (all caps). We have taken the example of the 20 genes on the Laboratory of Molecular Medicine's gene testing panel for hypertrophic cardiomyopathy.

hcm_panel <- c("ACTC1", "ACTN2", "CSRP3", "GLA", "LAMP2", "MYBPC3", "MYH7", 
"MYL2", "MYL3", "MYOZ2", "NEXN", "PLN", "PRKAG2", "PTPN11", "RAF1", 
"TNNC1", "TNNI3", "TNNT2", "TPM1", "TTR")

Method 2: Import from text file

From own file

Another way is to save your gene list in a text file (1 gene per line) and import it.

path <- system.file("extdata/MacArthur_Gene_Lists/lists/lmm_hcm.tsv", package = "clinvaR")
print(path)
hcm_panel <- get_genes(path)
print(hcm_panel)

From clinvaR stored lists

The MacArthur Lab has assembled a compilation of gene lists (https://github.com/macarthur-lab/gene_lists) that may be a useful starting point. We have included all sets (and a few of our own) with abbreviated descriptions. Please visit the GitHub repo for citation information.

| List | File Name | Count | Description | | ---------------- | ------- | ------- | ------------------------------------------------------- | | Universe | universe.tsv | 18,991 | Approved symbols for 18,991 protein-coding genes according to HGNC. All other lists are subsets. | | Hypertrophic cardiomyopathy | lmm_hcm.tsv | 20 | Gene testing panel for hypertrophic cardiomyopathy from the Laboratory of Molecular Medicine | | ACMG secondary findings recommendations v2.0 | acmg_59.tsv | 59 | The American College of Medical Genetics and Genomics has recommended that sequencing laboratories seek and report secondary findings in a minimum list of 59 genes. | | BROCA - Cancer Risk Panel | BROCA_Cancer_Risk_Panel.tsv | 66 | Suspected hereditary cancer predisposition, with a focus on breast or ovarian cancers. May co-occur with other cancer types (such as colorectal, endometrial, pancreatic, endocrine, or melanoma). | | DNA Repair Genes, KangJ| DRG_KangJ.tsv | 151 | 151 DNA repair genes from DNA repair pathways: ATM, BER, FA/HR, MMR, NHEJ, NER, TLS, XLR, RECQ, and other. | | ClinGen haploinsufficient genes | clingen_level3_genes_2015_02_27.tsv | 221 | Genes with sufficient evidence for dosage pathogenicity (level 3) as determined by the ClinGen Dosage Sensitivity Map | | Genes with any disease association reported in ClinVar | clinvar_path_likelypath.tsv | 3078 | All gene symbols for which there is at least one variant with an assertion of pathogenic or likely pathogenic in ClinVar. | | FDA-approved drug targets | fda_approved_drug_targets.tsv | 286 | Genes whose protein products are known to be the mechanistic targets of FDA-approved drugs. | | Drug targets by Nelson et al 2012 | drug_targets_nelson.tsv | 201 | Drug targets according to Nelson et al 2012. | | X-linked ClinVar genes | x-linked_clinvar.tsv | 61 | X chromosome genes in the August 6, 2015 ClinVar release that have at least 3 reportedly pathogenic, non-conflicted variants in ClinVar with at least one submitter other than OMIM or GeneReviews.| | All dominant genes | all_ad.tsv | 709 | Currently the union of the Berg and Blekhman autosomal dominant OMIM gene lists, may add more lists later. | | All recessive genes | all_ar.tsv | 1183 | Currently the union of the Berg and Blekhman autosomal recessive OMIM gene lists, may add more lists later. | | Essential in culture | core_essentials_hart.tsv | 285 | Genes deemed essential in multiple cultured cell lines based on shRNA screen data | | Essential in mice | mgi_essential.tsv | 2,454 | Genes where homozygous knockout in mice results in pre-, peri- or post-natal lethality. | | Genes nearest to GWAS peaks | gwascatalog.tsv | 3,762 | Closest gene 3' and 5' of GWAS hits in the NHGRI GWAS catalog as of Feb 9, 2015 | | Kinases | kinases.tsv | 351 | From UniProt's pkinfam list | | G-protein-coupled receptors | gpcr.tsv | 1705 | GPCR list from guidetopharmacology.org | | Natural product targets| natural_product_targets.tsv | 37 | List of hand-curated targets of natural products |

These are easily imported using get_genes() by directly referring to the file name. Use the get_genes() command without arguments to list possibilities.

get_genes()
hcm_panel <- get_genes('lmm_hcm.tsv')
print(hcm_panel)

The last way is to search OMIM for related genes. This has not been added as a full feature yet.

#omim_table <- read.table('clinvaR/storage/gene_lists/other_data/omim.full.tsv', 
#                         sep = '\t', fill = T, header = T)
#search_terms <- c("Cardiomyopathy")

Download and Import 1000 Genomes VCF

#hcm_download_log <- download_1000g(genes = hcm_panel)
vcf <- import_file_1000g(genes = hcm_panel)
vcf[1:3,1:8]

Note that there are another 2504+ columns, including individual-level data (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).

Import ClinVar VCF

This must be annotated with some version of ClinVar. This can be done either using the latest saved version (July 5, 2017), or by specifying a date from the list.

get_date_list()

Use get_clinvar() to collect from a certain date (if date is not present, the closest date is used).

newest_clinvar <- get_clinvar(Sys.Date())
newest_clinvar[1:4,] #%>% select(-VAR_ID, -CLNDBN)

Annotate Gene-Variant VCF

The ClinVar VCF can be used to annotate the Gene-Variant VCF. Note that this removes all variants without pathogenic assertions.

hcm_vcf <- annotate_1000g(vcf = vcf, clinvar = newest_clinvar, conflicts = TRUE)
hcm_vcf[1:3,1:12]

If we wanted to do this in fewer lines of code (with all the defaults used above), we could have done this more quickly by simply specifying the genes in annotate. We can also specify whether we want to exclude the (likely) pathogenic variants with any (likely) benign conflicts as well. These are included by default.

Simple Analyses

Plot variant frequency by ancestry

You can use the plot function (plot.annotated_vcf) to look at the frequency of these variants across ancestral groups.

  1. Fraction = TRUE returns the fraction of the population with at least 1 of the genes.
  2. Fraction = FALSE returns the average number of variants per person.
plot(hcm_vcf, fraction = TRUE)
plot(hcm_vcf, fraction = FALSE)

Similar plots can be generated for the un-annotated VCF using a slightly different plot function (plot.plain_vcf).

plot(vcf, fraction = FALSE)

Plot ClinVar Time-Series: over_time(vcf)

over_time() works for unannotated vcfs. It generates a plot of aggregated allele frequency over time.

over_time(vcf, verbose = FALSE) -> output
plot(output$Frequencies)
plot(output$Submissions)
plot(output$Significances)
#output$Frequencies %>% kable
#output$Submissions %>% kable

To add penetrance computations, specify the prevalence and case frequency.

over_time(vcf, prevalence = 0.002, case_freq = 0.4, verbose = FALSE) -> output
plot(output$Penetrance)


jamesdiao/clinvaR documentation built on May 18, 2019, 11:19 a.m.