proxReg: Run Proximity Regulation test on a set of narrow genomic...
In chipenrich: Gene Set Enrichment For ChIP-seq Peak Data

Description Usage Arguments Value Regulatory locations Method Examples

This method is designed for a set of narrow genomic regions (e.g. TF peaks) and is used to test whether the genomic regions assigned to genes in a gene set are closer to regulatory locations (i.e. promoters or enhancers) than by chance.

proxReg(peaks, out_name = "proxReg", out_path = getwd(),
  genome = supported_genomes(), reglocation = "tss",
  genesets = c("GOBP", "GOCC", "GOMF"), randomization = NULL,
  qc_plots = TRUE, min_geneset_size = 15, max_geneset_size = 2000,
  n_cores = 1)

`peaks`	Either a file path or a `data.frame` of peaks in BED-like format. If a file path, the following formats are fully supported via their file extensions: .bed, .broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph or .bdg. BED3 through BED6 files are supported under the .bed extension. Files without these extensions are supported under the conditions that the first 3 columns correspond to 'chr', 'start', and 'end' and that there is either no header column, or it is commented out. If a `data.frame` A BEDX+Y style `data.frame`. See `GenomicRanges::makeGRangesFromDataFrame` for acceptable column names.
`out_name`	Prefix string to use for naming output files. This should not contain any characters that would be illegal for the system being used (Unix, Windows, etc.) The default value is "proxReg", and a file "proxReg_results.tab" is produced. If `qc_plots` is set, then a file "proxReg_qcplots.pdf" is produced containing a number of quality control plots. If `out_name` is set to NULL, no files are written, and results then must be retrieved from the list returned by `proxReg`.
`out_path`	Directory to which results files will be written out. Defaults to the current working directory as returned by `getwd`.
`genome`	One of the `supported_genomes()`. If reglocation = enhancer, genome MUST be 'hg19'.
`reglocation`	One of: 'tss', 'enhancer'. Details in the "Regulatory locations" section
`genesets`	A character vector of geneset databases to be tested for enrichment. See `supported_genesets()`. Alternately, a file path to a tab-delimited text file with header and first column being the geneset ID or name, and the second column being Entrez Gene IDs. For an example custom gene set file, see the vignette.
`randomization`	One of: 'shuffle', 'unif', 'bylength', 'byenh'. These were used to test for Type I error under the null hypothesis. A general user will never have to use these.
`qc_plots`	A logical variable that enables the automatic generation of plots for quality control.
`min_geneset_size`	Sets the minimum number of genes a gene set may have to be considered for testing.
`max_geneset_size`	Sets the maximum number of genes a gene set may have to be considered for testing.
`n_cores`	The number of cores to use for testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore testing.

A list, containing the following items:

opts

A data frame containing the arguments/values passed to polyenrich.

peaks

A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS.

The columns are:

peak_id: an ID given to unique combinations of chromosome, peak start, and peak end.
chr: the chromosome the peak originated from.
peak_start: start position of the peak.
peak_end: end position of the peak.
gene_id: the Entrez ID of the gene to which the peak was assigned.
gene_symbol: the official gene symbol for the gene_id (above).
gene_locus_start: the start position of the locus for the gene to which the peak was assigned (specified by the locus definition used.)
gene_locus_end: the end position of the locus for the gene to which the peak was assigned (specified by the locus definition used.)
nearest_tss: the closest TSS to this peak (for any gene, not necessarily the gene this peak was assigned to.)
dist_to_tss: the distance in bp to the closest TSS to this peak.
nearest_tss_gene: the gene having the closest TSS to the peak (should be the same as gene_id when using the nearest TSS locus definition.)
nearest_tss_gene_strand: the strand of the gene with the closest TSS.
log_dtss: log of dist_to_tss
log_gene_ll: the log of length of the gene locus in bp
scaled_dtss: the adjusted distance to TSS, used in the calculations. Shown if reglocation = "tss"
dist_to_enh: the distance to the nearest enhancer. Shown if reglocation = "enhancer"
avg_denh: the empirical average for distance to the nearest enhancer for the gene the peak is assigned to. Shown if reglocation = enhancer
scaled_denh: the adjusted distance to the nearest enhancer. Shown if reglocation = enhancer

results

A data frame of the results from performing the proxReg test on each geneset that was requested (all genesets are merged into one final data frame.) The columns are:

Geneset.ID: the identifier for a given gene set from the selected database. For example, GO:0000003.
Geneset.Type: specifies from which database the Geneset.ID originates. For example, "Gene Ontology Biological Process."
Description: gives a definition of the geneset. For example, "reproduction."
P.Value: the probability of observing the proxmity of genomic regions in the gene set given the null hypothesis that peaks are not closer or farther in the gene set.
FDR: the false discovery rate proposed by Bejamini \& Hochberg for adjusting the p-value to control for family-wise error rate.
Effect: the signed Wilcoxon statistic, with positive values meaning the gene set has closer genomic regions than expected by chance.
Status: specifies if the peaks in the gene set tend to be closer or farther than those not in the gene set.
Odds.Ratio: the estimated odds that peaks are associated with a given gene set compared to the odds that peaks are associated with other gene sets, after controlling for locus length and/or mappability. An odds ratio greater than 1 indicates enrichment, and less than 1 indicates depletion.
N.Geneset.Genes: the number of genes in the gene set.
N.Geneset.Peak.Genes: the number of genes in the genes set that were assigned at least one peak.
Geneset.Peak.Genes: the list of genes from the gene set that had at least one peak assigned.

Current supported regulatory locations are gene transcription start sites (tss) or enhancer locations (hg19 only)

ProxReg first calculates the distance between each peak midpoint and regulatory location in base pairs. For gene transcription start sites, since parts of the chromosome are more sparse than others, there is an association with gene locus length that needs to be adjusted for. When using tss as the regulatory location, the peak distances are adjusted for this confounding variable based on an average of 90 ENCODE ChIP-seq experiments (details in citation pending). Similarly, for enhancers, distances depend on the density of enhancers within a gene locus, so distance to enhancer is adjusted using an empirical average of 90 ChIP-seq ENCODE experiments.

For each gene set of interest, the genomic regions are divided into two groups indicating the gene with the nearest tss is in the gene set or not. A Wilcoxon Rank-Sum test is then done to test for a difference in the adjusted distances (either to tss or enhancer).

# Run proxReg using an example dataset, assigning peaks to the nearest TSS,
# and on a small custom geneset
data(peaks_E2F4, package = 'chipenrich.data')
peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1')
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = proxReg(peaks_E2F4, reglocation = 'tss',
			genome = 'hg19', genesets=gs_path, out_name=NULL)

# Get the list of peaks that were assigned to genes and their distances to 
# regulatory regions.
assigned_peaks = results$peaks

# Get the results of enrichment testing.
enrich = results$results