knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=7 )
To search for clustering and burden signals across multiple genes, vectorised versions of the BIN_test(), burden_test() and combined_test() functions are available; BIN_test_WES(), burden_test_WES() and ClusterBurden_WES().
Controls can be generated automatically using the function collect_gnomad_controls() which will also generate the coverage file for the chosen cohort i.e. exomes or genomes (v2).
The parameters to collect_gnomad_controls() include;
For example purposes, there is a further function collect_example_cases() which will make an example cases with disease-genes from ClinVAR (defined by "clndn_regex") and null genes from GnomAD exomes or genoems dependent on what was used for the control group.
Below;
library(ClusterBurden) n_random_genes = 200 disease = "cardiomyopathy" genes = find_genes(n_random_genes, disease) controls = collect_gnomad_controls(genenames=genes) # control dataset summary(controls) cases = collect_example_cases(controls, disease) # case dataset summary(cases) # disease genes paste(cases[group=="cln", unique(symbol)], collapse=", ")
Of course in practice, at least the case dataset would be supplied by the user. This should be filtered in the same manner as the control dataset supplied by the user of collect_gnomad_controls(). The dataset must contain the columns aff (1=cases, 0=controls), symbol (HGNC genename), protein_position (variant residue index) and ac (allele count for variant).
Notes on using real data
Once the datasets are filtered and in the correct format. Calculating p-values across all genes is as simple as;
pvals = ClusterBurden_WES(cases, controls, covstats = T) # BIN_test_WES can calculate only clustering p-values # burden_WES can be used for only burden p-values # covstats=T provides more coverage details for each gene including the protein ranges excluded from the analysis # Coverage data is automatically supplied with cases and controls generated by the collect_* functions # "pvals" output head(pvals[order(ClusterBurden)])
Default output includes flags that may indicate potential confounders included coverage and protein length. These may represent alternative explanations to significant results other an association with affection status.
pl_flag (PL = protein length):
cov_flag1 (mcov = Mean coverage over region):
cov_flag2 (covbins = Proportion of bins with less than 80% coverage):
ClusterBurden contains a few functions for plotting the results from this analysis. For example manhattan() plots a simple manhattan plot that takes the pvals object two additional arguments; "test" which represents the test for which to plot p-values ('BIN.test_pvalue', 'burden_pvalue' or 'ClusterBurden') and "n_genes" which represents how many of the most significant genes to display gene symbols for. The "SCALE" argument just helps improve visualisation for smaller outputs e.g. to make it look good in Rmarkdown it needs to be scaled down.
manhattan(pvals, "BIN-test", n_genes=10, SCALE=0.5) manhattan(pvals, "ClusterBurden", n_genes=10, SCALE=0.5)
The function plot_signif_distribs() can display distributions of variant positions and coverage in the "n_genes" most significant genes.
plot_signif_distribs(pvals, cases, controls, n_genes = 9, SCALE=0.5, test="BIN-test")
To further investigate a clustering signal of interest the variants can be plotted against sequence features from uniprot.
# Top clustering signals pvals[order(BIN.test_pvalue)][,.(symbol, BIN.test_pvalue, pl_flag, cov_flag1, cov_flag2)][1:5] # Choose a gene to investigate further plot_features("MYH7", pvals, cases, controls, SCALE=0.7)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.