pathway_vote: Pathway Voting-Based Enrichment Analysis

View source: R/pathway-vote.R

pathway_voteR Documentation

Pathway Voting-Based Enrichment Analysis

Description

Performs pathway enrichment analysis using a voting-based framework that integrates CpG–gene regulatory information from expression quantitative trait methylation (eQTM) data. For a grid of top-ranked CpGs and filtering thresholds, gene sets are generated and refined using an entropy-based pruning strategy that balances information richness, stability, and probe bias correction. In particular, gene lists dominated by genes with disproportionately high numbers of CpG mappings are penalized to mitigate active probe bias—a common artifact in methylation data analysis. Enrichment results across parameter combinations are then aggregated using a voting scheme, prioritizing pathways that are consistently recovered under diverse settings and robust to parameter perturbations.

Usage

pathway_vote(
  ewas_data,
  eQTM,
  databases = c("Reactome"),
  k_grid = NULL,
  stat_grid = NULL,
  distance_grid = NULL,
  fixed_prune = NULL,
  grid_size = 5,
  min_genes_per_hit = 2,
  overlap_threshold = 0.7,
  readable = FALSE,
  workers = NULL,
  verbose = FALSE
)

Arguments

ewas_data

A data.frame containing CpG-level association results. The first column must contain CpG probe IDs, which will be matched against the eQTM object. The second column should contain a numeric ranking metric, such as a p-value, t-statistic, or feature importance score.

eQTM

An eQTM object containing CpG–gene linkage information, created by the create_eQTM() function. This object provides the CpG-to-gene mapping used for pathway inference.

databases

A character vector of pathway databases. Supporting: "Reactome", "KEGG", and "GO".

k_grid

A numeric vector of top-k CpGs used for gene set construction. If NULL, the grid is automatically inferred using a log-scaled range guided by the number of CpGs passing FDR < 0.05. Note: This requires that ewas_data contains raw p-values (second column) from an association analysis; for other metrics (e.g., t-statistic or importance scores), k_grid must be provided manually.

stat_grid

A numeric vector of eQTM statistic thresholds. If NULL, generated based on quantiles of the observed distribution.

distance_grid

A numeric vector of CpG-gene distance thresholds (in base pairs). If NULL, generated similarly.

fixed_prune

Integer or NULL. Minimum number of votes to retain a pathway. If NULL, will use cuberoot(N) where N is the number of enrichment runs.

grid_size

Integer. Number of values in each grid when auto-generating. Default is 5.

min_genes_per_hit

Minimum number of genes ('Count') a pathway must include to be considered. Default is 2.

overlap_threshold

Numeric between 0 and 1. Controls the maximum allowed Jaccard similarity between gene lists during redundancy filtering. Defualt is 0.7, which provides robust and stable results across a variety of simulation scenarios.

readable

Logical. whether to convert Entrez IDs to gene symbols in enrichment results.

workers

Optional integer. Number of parallel workers. If NULL, use 2 logical cores.

verbose

Logical. whether to print progress messages.

Value

A named list of data.frames, each corresponding to a selected pathway database (e.g., 'Reactome', 'KEGG', 'GO'). Each data.frame contains enriched pathways with columns: 'ID', 'p.adjust', 'Description', and 'geneID'.

Examples

set.seed(123)

# Simulated EWAS result: a mix of signal and noise
n_cpg <- 500
ewas <- data.frame(
  cpg = paste0("cg", sprintf("%08d", 1:n_cpg)),
  p_value = c(runif(n_cpg*0.1, 1e-9, 1e-5), runif(n_cpg*0.2, 1e-3, 0.05), runif(n_cpg*0.7, 0.05, 1))
)

# Corresponding eQTM mapping (some of these CpGs have gene links)
signal_genes <- c("5290", "673", "1956", "7157", "7422")
background_genes <- as.character(1000:9999)
entrez_signal <- sample(signal_genes, n_cpg * 0.1, replace = TRUE)
entrez_background <- sample(setdiff(background_genes, signal_genes), n_cpg * 0.9, replace = TRUE)

eqtm_data <- data.frame(
  cpg = ewas$cpg,
  statistics = rnorm(n_cpg, mean = 2, sd = 1),
  p_value = runif(n_cpg, min = 0.001, max = 0.05),
  distance = sample(1000:100000, n_cpg, replace = TRUE),
  entrez = c(entrez_signal, entrez_background),
  stringsAsFactors = FALSE
)
eqtm_obj <- create_eQTM(eqtm_data)

# Run pathway voting with minimal settings
## Not run: 
results <- pathway_vote(
  ewas_data = ewas,
  eQTM = eqtm_obj,
  databases = c("GO", "KEGG", "Reactome"),
  readable = TRUE,
  verbose = TRUE
)
head(results$GO)
head(results$KEGG)
head(results$Reactome)

## End(Not run)


PathwayVote documentation built on June 25, 2025, 5:09 p.m.