BIN_test: Split data into bins and carry out a two-sample...


View source: R/BIN_test.r

Description

Split data into bins and carry out a two-sample goodness-of-fit test. Calculates a p-value for positional differences in rare missense-variant residue positions between cases and controls.

Usage

BIN_test(
  case_residues,
  control_residues,
  case_coverage = NULL,
  control_coverage = NULL,
  cov_threshold = 0.5,
  pval = T,
  method = "mann-wald",
  nbins = NULL,
  plot_resids = F
)

Arguments

case_residues

vector of case variant residue positions

control_residues

vector of control variant residue positions

case_coverage

optional coverage data for cases in format: data.table(protein_position, over_10)

control_coverage

optional coverage data for controls in format: data.table(protein_position, over_10)

cov_threshold

threshold at which to exclude a residue from the analysis (choose 0 to keep all residues)

pval

if TRUE, return only the p-value; if FALSE, return the full chi-squared test output

method

method used to bin the data: either "mann-wald" or "nbins"

nbins

number of bins to use if method == "nbins"

plot_resids

should chi-squared residuals be plotted? Defaults to FALSE

Details

The function takes vectors of case and control missense-variant residue positions (aggregated over a protein-coding region) as input and returns a p-value representing the significance of variant clustering within the gene. The linear sequence of the protein is split into 'bins', and the variant counts within each bin for each cohort are used to construct a k x 2 contingency table, where k is the number of bins and 2 is the number of cohorts (cases and controls).

The binning method is either:
- "mann-wald", where the number of bins k is determined from the total number of observed variants n by k ~ n^(2/5)
- "nbins", where the user selects a specific number of bins (reasonable values would be ~10-20 bins)

Setting "plot_resids" to TRUE plots the residuals for each cell in the k x 2 contingency table, which lets the user see which cells (protein regions) contribute towards the significance of the test.
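The binning and contingency-table steps described above can be sketched in base R. This is an illustration only and not the ClusterBurden source: the equal-width bin layout, the rounding of n^(2/5) to the nearest integer, and the use of stats::chisq.test() are all assumptions made for the sketch, as are the simulated inputs.

```r
# Illustrative sketch only; not the package implementation.
set.seed(1)
nresidues <- 1000  # assumed protein length
case_residues    <- sample(seq_len(nresidues), 100, replace = TRUE)
control_residues <- sample(seq_len(nresidues), 100, replace = TRUE)

# Mann-Wald heuristic: k ~ n^(2/5), where n is the total number of variants.
# Rounding to the nearest integer is an assumption of this sketch.
n <- length(case_residues) + length(control_residues)
k <- round(n^(2/5))

# Assumed bin layout: k equal-width bins along the linear protein sequence.
breaks <- seq(0, nresidues, length.out = k + 1)
case_bins    <- cut(case_residues, breaks = breaks)
control_bins <- cut(control_residues, breaks = breaks)

# k x 2 contingency table: one row per bin, one column per cohort.
tab <- cbind(
  cases    = as.vector(table(case_bins)),
  controls = as.vector(table(control_bins))
)
rownames(tab) <- levels(case_bins)

# Chi-squared test on the table; the residuals show which bins
# (protein regions) deviate most from the expected counts.
test <- chisq.test(tab)
test$p.value
round(test$residuals, 2)
```

Large positive or negative residuals in a row indicate a bin where one cohort carries more variants than expected under the null, which is the kind of information the "plot_resids" option is described as displaying.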

When coverage files are supplied, regions with 10X coverage below "cov_threshold" (default = 0.5) are excluded from the analysis. For the remaining regions, cell counts are adjusted by the reciprocal of the mean coverage across the bin.
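One possible reading of the coverage handling is sketched below for a single cohort. It is a hedged illustration and not the package code: a base data.frame stands in for the data.table the function expects, the four equal-width bins are arbitrary, and computing the mean coverage over the retained residues only is an assumption.

```r
# Illustrative sketch only; not the package implementation.
set.seed(2)
nresidues <- 200  # assumed protein length

# Hypothetical coverage table with the documented columns
# (a data.frame is used here in place of a data.table).
coverage <- data.frame(
  protein_position = seq_len(nresidues),
  over_10 = c(runif(170, 0.8, 1), runif(30, 0.1, 0.4))  # poorly covered tail
)
residues <- sample(seq_len(nresidues), 60, replace = TRUE)
cov_threshold <- 0.5

# 1. Exclude variant positions whose 10X coverage is below the threshold.
well_covered <- coverage$over_10 >= cov_threshold
kept <- residues[residues %in% coverage$protein_position[well_covered]]

# 2. Count the remaining variants per bin (bin layout is arbitrary here).
breaks <- seq(0, nresidues, length.out = 5)
counts <- as.vector(table(cut(kept, breaks = breaks)))

# 3. Mean coverage per bin, computed over retained residues (an assumption).
bin_of_position <- cut(coverage$protein_position[well_covered], breaks = breaks)
mean_cov <- as.vector(tapply(coverage$over_10[well_covered], bin_of_position, mean))

# 4. Adjust counts by the reciprocal of the mean coverage in each bin.
adjusted <- counts / mean_cov
data.frame(bin = levels(bin_of_position), counts, mean_cov, adjusted)
```

Because the mean coverage is at most 1, the adjustment can only scale counts upwards, so bins with weaker coverage are compensated more strongly.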

Value

Returns either a p-value or an object of class htest, depending on the value of the "pval" argument.
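For readers unfamiliar with htest objects, the sketch below shows how such an object is typically inspected. It uses stats::chisq.test() on a small made-up table as a stand-in; which components BIN_test actually fills in is not confirmed here.

```r
# Illustrative sketch only: a made-up 3 x 2 table of bin counts.
tab <- matrix(
  c(12, 30, 18,   # cases
    15, 25, 20),  # controls
  ncol = 2,
  dimnames = list(paste0("bin", 1:3), c("cases", "controls"))
)

res <- chisq.test(tab)

class(res)      # "htest"
res$p.value     # the p-value on its own
res$statistic   # chi-squared statistic
res$parameter   # degrees of freedom: (3 - 1) * (2 - 1) = 2
res$residuals   # Pearson residuals, one per cell
```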

Author(s)

Adam Waring - adam.waring@msdtc.ox.ac.uk

Examples

library(ClusterBurden)

# The essential inputs are case_residues and control_residues

# Example 1: Simulated NULL data
# Bin by the Mann-Wald heuristic; n ^ (2/5) where n = length(case_residues) + length(control_residues)
# simulate case-control residue positions from the same distribution

nresidues = 1000 # length of the protein
probs = rexp(nresidues)^2 # probability of a missense variant at each residue

case_residues = sample(1:nresidues, 100, replace = TRUE, prob = probs)
control_residues = sample(1:nresidues, 100, replace = TRUE, prob = probs)

BIN_test(case_residues, control_residues)

# Example 2: Simulated DISEASE data
# simulate case-control residue positions from different distributions

nresidues = 1000 # length of the protein
probs = rexp(nresidues)^2
case_probs = probs * rep(c(1, 3, 1), c(200, 200, 600))
control_probs = probs * rep(c(2, 1, 2), c(200, 200, 600))

case_residues = sample(1:nresidues, 100, replace = TRUE, prob = case_probs)
control_residues = sample(1:nresidues, 100, replace = TRUE, prob = control_probs)

plot_distribs(case_residues, control_residues)

BIN_test(case_residues, control_residues, plot_resids = TRUE)

adamwaring/ClusterBurden documentation built on July 29, 2020, 9:50 p.m.