knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=7
)

Automatic GnomAD controls and example cases

To search for clustering and burden signals across multiple genes, vectorised versions of the BIN_test(), burden_test() and combined_test() functions are available; BIN_test_WES(), burden_test_WES() and ClusterBurden_WES().

Controls can be generated automatically using the function collect_gnomad_controls() which will also generate the coverage file for the chosen cohort i.e. exomes or genomes (v2).

The parameters to collect_gnomad_controls() include;

For example purposes, there is a further function collect_example_cases() which will make an example cases with disease-genes from ClinVAR (defined by "clndn_regex") and null genes from GnomAD exomes or genoems dependent on what was used for the control group.

Below;

  1. a small gene set for analysis is defined by cardiomyopathy disease genes and an additional 200 random 'null' genes
  2. A control group is made with the default dataset GnomAD exomes (dataset="exome"), missense variants including inframes up to 3 residues (inframes=T, max_inframe_size=3) and MAF filtering is done based on a strict popmax threshold of 0.1% (filtertype="strict", maxmaf=0.0001).
  3. A case group is made using the same geneset, the same filtering options (defaults) and will automatically select GnomAD genomes for the null genes (ClinVAR-based data for the disease genes).
library(ClusterBurden)

n_random_genes = 200
disease = "cardiomyopathy"

genes = find_genes(n_random_genes, disease)

controls = collect_gnomad_controls(genenames=genes)

# control dataset
summary(controls)

cases = collect_example_cases(controls, disease)

# case dataset
summary(cases)

# disease genes
paste(cases[group=="cln", unique(symbol)], collapse=", ")

Formatting real data

Of course in practice, at least the case dataset would be supplied by the user. This should be filtered in the same manner as the control dataset supplied by the user of collect_gnomad_controls(). The dataset must contain the columns aff (1=cases, 0=controls), symbol (HGNC genename), protein_position (variant residue index) and ac (allele count for variant).

Notes on using real data

Calculating p-values

Once the datasets are filtered and in the correct format. Calculating p-values across all genes is as simple as;

pvals = ClusterBurden_WES(cases, controls, covstats = T)
# BIN_test_WES can calculate only clustering p-values
# burden_WES can be used for only burden p-values 
# covstats=T provides more coverage details for each gene including the protein ranges excluded from the analysis
# Coverage data is automatically supplied with cases and controls generated by the collect_* functions

# "pvals" output 
head(pvals[order(ClusterBurden)])

Default output includes flags that may indicate potential confounders included coverage and protein length. These may represent alternative explanations to significant results other an association with affection status.

pl_flag (PL = protein length):

cov_flag1 (mcov = Mean coverage over region):

cov_flag2 (covbins = Proportion of bins with less than 80% coverage):

Visualising results

ClusterBurden contains a few functions for plotting the results from this analysis. For example manhattan() plots a simple manhattan plot that takes the pvals object two additional arguments; "test" which represents the test for which to plot p-values ('BIN.test_pvalue', 'burden_pvalue' or 'ClusterBurden') and "n_genes" which represents how many of the most significant genes to display gene symbols for. The "SCALE" argument just helps improve visualisation for smaller outputs e.g. to make it look good in Rmarkdown it needs to be scaled down.

manhattan(pvals, "BIN-test", n_genes=10, SCALE=0.5)
manhattan(pvals, "ClusterBurden", n_genes=10, SCALE=0.5)

The function plot_signif_distribs() can display distributions of variant positions and coverage in the "n_genes" most significant genes.

plot_signif_distribs(pvals, cases, controls, n_genes = 9, SCALE=0.5, test="BIN-test")

To further investigate a clustering signal of interest the variants can be plotted against sequence features from uniprot.

# Top clustering signals 
pvals[order(BIN.test_pvalue)][,.(symbol, BIN.test_pvalue, pl_flag, cov_flag1, cov_flag2)][1:5]

# Choose a gene to investigate further
plot_features("MYH7", pvals, cases, controls, SCALE=0.7)


adamwaring/ClusterBurden documentation built on July 29, 2020, 9:50 p.m.