knitr::opts_knit$set(root.dir = "."); knitr::opts_chunk$set(echo = T, eval = T, cache = T, warning = F, message = F)
This is a vignette intended to introduce new users to clinvaR
(https://github.com/jamesdiao/clinvaR).
For clinvaR
, the first goal is to go from genes to a vcf of related variants, along with classifications from any version of ClinVar.
This involves 3 steps:
clinvaR has a number of file dependencies that are generally useful for these types of projects. I have attached them here. remotes
and devtools
are interchangeable.
pkg_list <- c("remotes","pander","ggplot2","tibble","tidyr","dplyr", "stringr") installed <- installed.packages()[,"Package"] is.installed <- pkg_list %in% installed if (!all(is.installed)) install.packages(pkg_list[!is.installed]) print(names(sapply(pkg_list, require, character.only = T)), quote = F) if (!("clinvaR" %in% installed)) remotes::install_github('jamesdiao/clinvaR') require(clinvaR)
The first step is to have a list of genes that you are interested in.
The easiest way is to directly save your genes of interest in gene_panel as a character vector (all caps). We have taken the example of the 20 genes on the Laboratory of Molecular Medicine's gene testing panel for hypertrophic cardiomyopathy.
hcm_panel <- c("ACTC1", "ACTN2", "CSRP3", "GLA", "LAMP2", "MYBPC3", "MYH7", "MYL2", "MYL3", "MYOZ2", "NEXN", "PLN", "PRKAG2", "PTPN11", "RAF1", "TNNC1", "TNNI3", "TNNT2", "TPM1", "TTR")
Another way is to save your gene list in a text file (1 gene per line) and import it.
path <- system.file("extdata/MacArthur_Gene_Lists/lists/lmm_hcm.tsv", package = "clinvaR") print(path) hcm_panel <- get_genes(path) print(hcm_panel)
The MacArthur Lab has assembled a compilation of gene lists (https://github.com/macarthur-lab/gene_lists) that may be a useful starting point. We have included all sets (and a few of our own) with abbreviated descriptions. Please visit the GitHub repo for citation information.
| List | File Name | Count | Description | | ---------------- | ------- | ------- | ------------------------------------------------------- | | Universe | universe.tsv | 18,991 | Approved symbols for 18,991 protein-coding genes according to HGNC. All other lists are subsets. | | Hypertrophic cardiomyopathy | lmm_hcm.tsv | 20 | Gene testing panel for hypertrophic cardiomyopathy from the Laboratory of Molecular Medicine | | ACMG secondary findings recommendations v2.0 | acmg_59.tsv | 59 | The American College of Medical Genetics and Genomics has recommended that sequencing laboratories seek and report secondary findings in a minimum list of 59 genes. | | BROCA - Cancer Risk Panel | BROCA_Cancer_Risk_Panel.tsv | 66 | Suspected hereditary cancer predisposition, with a focus on breast or ovarian cancers. May co-occur with other cancer types (such as colorectal, endometrial, pancreatic, endocrine, or melanoma). | | DNA Repair Genes, KangJ| DRG_KangJ.tsv | 151 | 151 DNA repair genes from DNA repair pathways: ATM, BER, FA/HR, MMR, NHEJ, NER, TLS, XLR, RECQ, and other. | | ClinGen haploinsufficient genes | clingen_level3_genes_2015_02_27.tsv | 221 | Genes with sufficient evidence for dosage pathogenicity (level 3) as determined by the ClinGen Dosage Sensitivity Map | | Genes with any disease association reported in ClinVar | clinvar_path_likelypath.tsv | 3078 | All gene symbols for which there is at least one variant with an assertion of pathogenic or likely pathogenic in ClinVar. | | FDA-approved drug targets | fda_approved_drug_targets.tsv | 286 | Genes whose protein products are known to be the mechanistic targets of FDA-approved drugs. | | Drug targets by Nelson et al 2012 | drug_targets_nelson.tsv | 201 | Drug targets according to Nelson et al 2012. | | X-linked ClinVar genes | x-linked_clinvar.tsv | 61 | X chromosome genes in the August 6, 2015 ClinVar release that have at least 3 reportedly pathogenic, non-conflicted variants in ClinVar with at least one submitter other than OMIM or GeneReviews.| | All dominant genes | all_ad.tsv | 709 | Currently the union of the Berg and Blekhman autosomal dominant OMIM gene lists, may add more lists later. | | All recessive genes | all_ar.tsv | 1183 | Currently the union of the Berg and Blekhman autosomal recessive OMIM gene lists, may add more lists later. | | Essential in culture | core_essentials_hart.tsv | 285 | Genes deemed essential in multiple cultured cell lines based on shRNA screen data | | Essential in mice | mgi_essential.tsv | 2,454 | Genes where homozygous knockout in mice results in pre-, peri- or post-natal lethality. | | Genes nearest to GWAS peaks | gwascatalog.tsv | 3,762 | Closest gene 3' and 5' of GWAS hits in the NHGRI GWAS catalog as of Feb 9, 2015 | | Kinases | kinases.tsv | 351 | From UniProt's pkinfam list | | G-protein-coupled receptors | gpcr.tsv | 1705 | GPCR list from guidetopharmacology.org | | Natural product targets| natural_product_targets.tsv | 37 | List of hand-curated targets of natural products |
These are easily imported using get_genes()
by directly referring to the file name. Use the get_genes()
command without arguments to list possibilities.
get_genes() hcm_panel <- get_genes('lmm_hcm.tsv') print(hcm_panel)
The last way is to search OMIM for related genes. This has not been added as a full feature yet.
#omim_table <- read.table('clinvaR/storage/gene_lists/other_data/omim.full.tsv', # sep = '\t', fill = T, header = T) #search_terms <- c("Cardiomyopathy")
#hcm_download_log <- download_1000g(genes = hcm_panel) vcf <- import_file_1000g(genes = hcm_panel) vcf[1:3,1:8]
Note that there are another 2504+ columns, including individual-level data (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).
This must be annotated with some version of ClinVar. This can be done either using the latest saved version (July 5, 2017), or by specifying a date from the list.
get_date_list()
Use get_clinvar()
to collect from a certain date (if date is not present, the closest date is used).
newest_clinvar <- get_clinvar(Sys.Date()) newest_clinvar[1:4,] #%>% select(-VAR_ID, -CLNDBN)
The ClinVar VCF can be used to annotate the Gene-Variant VCF. Note that this removes all variants without pathogenic assertions.
hcm_vcf <- annotate_1000g(vcf = vcf, clinvar = newest_clinvar, conflicts = TRUE) hcm_vcf[1:3,1:12]
If we wanted to do this in fewer lines of code (with all the defaults used above), we could have done this more quickly by simply specifying the genes in annotate. We can also specify whether we want to exclude the (likely) pathogenic variants with any (likely) benign conflicts as well. These are included by default.
You can use the plot function (plot.annotated_vcf) to look at the frequency of these variants across ancestral groups.
plot(hcm_vcf, fraction = TRUE) plot(hcm_vcf, fraction = FALSE)
Similar plots can be generated for the un-annotated VCF using a slightly different plot function (plot.plain_vcf).
plot(vcf, fraction = FALSE)
over_time()
works for unannotated vcfs. It generates a plot of aggregated allele frequency over time.
over_time(vcf, verbose = FALSE) -> output plot(output$Frequencies) plot(output$Submissions) plot(output$Significances) #output$Frequencies %>% kable #output$Submissions %>% kable
To add penetrance computations, specify the prevalence and case frequency.
over_time(vcf, prevalence = 0.002, case_freq = 0.4, verbose = FALSE) -> output plot(output$Penetrance)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.