prob.hits: Find Probability of Locus Hit

View source: R/prob.hits.R

prob.hits R Documentation

Find Probability of Locus Hit

Description

Computes the probability that each genomic locus (e.g., gene or regulatory region) is affected by one or more types of genomic lesions. This function estimates statistical significance for lesion enrichment using a convolution of independent but non-identical Bernoulli distributions.

Usage

prob.hits(hit.cnt, chr.size = NULL)

Arguments

hit.cnt

A list returned by the count.hits() function, containing the number of subjects and hits affecting each locus by lesion type.

chr.size

A data.frame containing chromosome sizes for all 22 autosomes and the X and Y chromosomes. It must include two columns: "chrom" for the chromosome identifier (1 to 22, X, or Y), and "size" for the chromosome length in base pairs.
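As a rough illustration of the expected layout, the sketch below builds a three-row table and checks its structure. The helper check.chr.size is hypothetical and not part of GRIN2. The lengths are those commonly reported for the GRCh38 assembly and are shown for illustration only. The bare chromosome labels (no "chr" prefix) are an assumption; whatever convention is used should match the lesion and gene data.

```r
# Illustrative chr.size table, truncated to three chromosomes for brevity.
# Labels without a "chr" prefix are an assumption, not a documented rule.
chr.size.example <- data.frame(
  chrom = c("1", "X", "Y"),
  size  = c(248956422, 156040895, 57227415),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a GRIN2 function): stops with an error if the
# table lacks the two required columns or holds implausible values.
check.chr.size <- function(chr.size) {
  stopifnot(
    is.data.frame(chr.size),
    all(c("chrom", "size") %in% names(chr.size)),
    is.numeric(chr.size$size),
    all(chr.size$size > 0),
    !anyDuplicated(chr.size$chrom)
  )
  invisible(TRUE)
}

check.chr.size(chr.size.example)
```

In practice the bundled hg38_chrom_size data set used in the Examples section already has this shape.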

Details

This function estimates a p-value for each locus: the probability of seeing at least the observed number of hits by chance, under a model where lesion events are treated as independent Bernoulli trials.

For each lesion type, the model considers heterogeneity in lesion probability across loci based on their genomic context (e.g., locus size, chromosome size). These probabilities are then combined using a convolution of Bernoulli distributions to estimate the likelihood of observing the actual hit counts.
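The convolution idea can be sketched in a few lines of R. The helper pois.binom.tail is hypothetical and is not the GRIN2 implementation; it only shows how the distribution of a sum of independent Bernoulli trials with unequal success probabilities is built up one trial at a time, and how an upper-tail p-value is read off. The per-subject probabilities are made-up illustrative values; how GRIN2 derives them from locus and chromosome sizes is not shown here.

```r
# Hypothetical helper: P(X >= k) where X is a sum of independent
# Bernoulli(p[i]) trials with possibly different p[i].
pois.binom.tail <- function(k, p) {
  # pmf[j + 1] holds P(X = j) for the trials processed so far
  pmf <- 1
  for (p.i in p) {
    # Convolve the running PMF with the two-point PMF (1 - p.i, p.i)
    pmf <- c(pmf * (1 - p.i), 0) + c(0, pmf * p.i)
  }
  if (k <= 0) return(1)
  if (k > length(p)) return(0)
  sum(pmf[(k + 1):length(pmf)])
}

# Three subjects with unequal hit probabilities (illustrative values only)
p.subj <- c(0.1, 0.2, 0.3)

# Probability that at least one of the three subjects has a hit
p.value <- pois.binom.tail(1, p.subj)
```

When all probabilities are equal, the result reduces to the ordinary binomial tail, which gives a convenient sanity check against pbinom().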

In addition, the function calculates:

  • FDR-adjusted q-values using the method of Pounds and Cheng (2006), which estimates the proportion of true null hypotheses.

  • p- and q-values for multi-lesion constellation hits, i.e., the probability that a locus is affected by one (p1), two (p2), or more types of lesions simultaneously.
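To give a feel for the q-value step, here is a simplified sketch. It assumes the simplest form of a null-proportion estimator, min(1, 2 * mean(p)), and scales Benjamini-Hochberg adjusted p-values by it. The helper sketch.qvalues is hypothetical; the package's own routine may treat discrete p-values and other details differently, so this should not be read as the GRIN2 implementation.

```r
# Hypothetical helper: q-values as (estimated null proportion) x (BH-adjusted p).
# The pi0 estimator below is an assumed simplification.
sketch.qvalues <- function(p) {
  pi0 <- min(1, 2 * mean(p))             # assumed estimate of the null proportion
  q <- pi0 * p.adjust(p, method = "BH")  # scale the BH-adjusted p-values
  pmin(q, 1)                             # keep q-values within [0, 1]
}

# Illustrative p-values only
p.example <- c(0.001, 0.01, 0.02, 0.5, 0.9)
q.example <- sketch.qvalues(p.example)
```

Because the estimated null proportion is at most 1, these q-values are never larger than the plain Benjamini-Hochberg adjusted values.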

Value

A list with the following components:

gene.hits

A data.frame containing GRIN statistical results. Includes gene annotations, the number of subjects and hits by lesion type, and the computed p-values and FDR-adjusted q-values for lesion enrichment across one or more lesion types.

lsn.data

Original input lesion data.

gene.data

Original input gene annotation data.

gene.lsn.data

A data.frame in which each row corresponds to a gene overlapped by a specific lesion. Includes columns for Ensembl gene ID (gene) and patient/sample ID (ID).

chr.size

Chromosome size information used in the computation.

gene.index

A data.frame indexing rows in gene.lsn.data corresponding to each chromosome.

lsn.index

A data.frame indexing rows in gene.lsn.data corresponding to each lesion.

Author(s)

Abdelrahman Elsayed <abdelrahman.elsayed@stjude.org> and Stanley Pounds <stanley.pounds@stjude.org>

References

Pounds, S. et al. (2013). A genomic random interval model for statistical analysis of genomic lesion data.

Cao, X., Elsayed, A. H., & Pounds, S. B. (2023). Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics.

See Also

prep.gene.lsn.data, find.gene.lsn.overlaps, count.hits

Examples

data(lesion_data)
data(hg38_gene_annotation)
data(hg38_chrom_size)

# 1) Prepare gene and lesion data:
prep.gene.lsn <- prep.gene.lsn.data(lesion_data, hg38_gene_annotation)

# 2) Identify overlapping gene-lesion events:
gene.lsn.overlap <- find.gene.lsn.overlaps(prep.gene.lsn)

# 3) Count number of subjects and lesions affecting each gene:
count.subj.hits <- count.hits(gene.lsn.overlap)

# 4) Compute p- and q-values for lesion enrichment per gene:
hits.prob <- prob.hits(count.subj.hits, hg38_chrom_size)

GRIN2 documentation built on June 17, 2025, 9:11 a.m.