hybridenrich: Running Hybrid test, either from scratch or using two results...


View source: R/hybrid.R

Description

The Hybrid test is designed for users who are unsure whether to use ChIP-Enrich or Poly-Enrich. It runs both tests and combines their information into a single adjusted p-value for each gene set. For more about ChIP-Enrich and Poly-Enrich, consult their corresponding documentation.

Usage

hybridenrich(peaks, out_name = "hybridenrich", out_path = getwd(),
  genome = supported_genomes(), genesets = c("GOBP", "GOCC", "GOMF"),
  locusdef = "nearest_tss", methods = c("chipenrich", "polyenrich"),
  weighting = NULL, mappability = NULL, qc_plots = TRUE,
  min_geneset_size = 15, max_geneset_size = 2000,
  num_peak_threshold = 1, randomization = NULL, n_cores = 1)

Arguments

peaks

Either a file path or a data.frame of peaks in BED-like format. If a file path, the following formats are fully supported via their file extensions: .bed, .broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph or .bdg. BED3 through BED6 files are supported under the .bed extension. Files without these extensions are supported provided that the first 3 columns correspond to 'chr', 'start', and 'end' and that there is either no header row, or it is commented out. If a data.frame, it should be a BEDX+Y style data.frame. See GenomicRanges::makeGRangesFromDataFrame for acceptable column names.

out_name

Prefix string to use for naming output files. This should not contain any characters that would be illegal for the system being used (Unix, Windows, etc.). The default value is "hybridenrich", and a file "hybridenrich_results.tab" is produced. If qc_plots is set, then a file "hybridenrich_qcplots.pdf" is produced containing a number of quality control plots. If out_name is set to NULL, no files are written, and results must then be retrieved from the object returned by hybridenrich.

out_path

Directory to which results files will be written out. Defaults to the current working directory as returned by getwd.

genome

One of the supported_genomes().

genesets

A character vector of geneset databases to be tested for enrichment. See supported_genesets(). Alternatively, a file path to a tab-delimited text file with a header, whose first column is the geneset ID or name and whose second column is the Entrez Gene ID. For an example custom gene set file, see the vignette.

locusdef

One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb', '1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream', '10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each, see the vignette or supported_locusdefs. Alternatively, a file path for a custom locus definition. NOTE: A custom locus definition must be for one of the supported_genomes(), and must have columns 'chr', 'start', 'end', and 'gene_id' or 'geneid'. For an example custom locus definition file, see the vignette.

methods

A character vector specifying the methods to use for enrichment testing. This argument is currently unused: the methods are always one chipenrich test and one polyenrich test.

weighting

A character string specifying the weighting method. Method name will automatically be "polyenrich_weighted" if given weight options. Current options are: 'signalValue', 'logsignalValue', and 'multiAssign'.

mappability

One of NULL, a file path to a custom mappability file, or an integer for a valid read length given by supported_read_lengths. If a file, it should contain a header with two columns named 'gene_id' and 'mappa'. Gene IDs should be Entrez IDs, and mappability values should range from 0 to 1. For an example custom mappability file, see the vignette. Default value is NULL.

qc_plots

A logical variable that enables the automatic generation of plots for quality control.

min_geneset_size

Sets the minimum number of genes a gene set may have to be considered for enrichment testing.

max_geneset_size

Sets the maximum number of genes a gene set may have to be considered for enrichment testing.

num_peak_threshold

Sets the threshold for how many peaks a gene must have to be considered as having a peak. Defaults to 1. Only relevant for Fisher's exact test and ChIP-Enrich methods.

randomization

One of NULL, 'complete', 'bylength', or 'bylocation'. See the Randomizations section of the chipenrich documentation.

n_cores

The number of cores to use for enrichment testing. We recommend using only up to the maximum number of physical cores present, as virtual cores do not significantly decrease runtime. Default number of cores is set to 1. NOTE: Windows does not support multicore enrichment.
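
As an illustration of the custom gene set file described under genesets, the following is a minimal sketch in base R. The header labels (gs_id, gene_id), the set names, and the gene IDs are hypothetical placeholders chosen for this example; the vignette's example file remains the authoritative reference for the layout.

```r
# Illustrative gene sets. The set names and header labels are hypothetical,
# and the Entrez Gene IDs are used purely as sample values.
custom_genesets <- data.frame(
  gs_id   = c("my_set_A", "my_set_A", "my_set_B"),
  gene_id = c(7157, 672, 675),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row:
# geneset ID in the first column, Entrez Gene ID in the second.
gs_file <- file.path(tempdir(), "custom_genesets.txt")
write.table(custom_genesets, file = gs_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

# Read the file back to confirm the layout.
check <- read.delim(gs_file, header = TRUE, stringsAsFactors = FALSE)
```

The resulting path could then be supplied as genesets = gs_file. Note that sets this small would fall below the default min_geneset_size of 15, so a real file would need larger sets or an adjusted size limit.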

Value

A data.frame containing:

results

A data frame of the results from performing the gene set enrichment test on each geneset that was requested (all genesets are merged into one final data frame). The columns are:

Geneset.ID

is the identifier for a given gene set from the selected database. For example, GO:0000003.

P.Value.x

is the probability of observing the degree of enrichment of the gene set given the null hypothesis that peaks are not associated with any gene sets, for the first test.

P.Value.y

is the same as above except for the second test.

P.Value.Hybrid

The calculated Hybrid p-value from the two tests

FDR.Hybrid

is the Hybrid p-value adjusted by the method of Benjamini & Hochberg to control the false discovery rate.

Other variables given will also be included, see the corresponding methods' documentation for their details.

Hybrid p-values

Given n tests of the same hypothesis, each with the same Type I error rate and each yielding a p-value, p_1, ..., p_n, the Hybrid p-value is computed as: n*min(p_1, ..., p_n). The hybrid test has at most the same Type I error rate as any individual test, and if any one of the tests has 100% power as the sample size goes to infinity, then so does the hybrid test.
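
The formula above can be sketched in base R as follows. The helper name hybrid_pvalue and the p-values are made up for illustration, and capping the result at 1 is an assumption of this sketch rather than something stated here. The FDR step uses the standard p.adjust function with the Benjamini & Hochberg method.

```r
# Hybrid p-value for n tests: n * min(p_1, ..., p_n), computed per gene set.
# Capping at 1 is an assumption of this sketch, not something the text states.
hybrid_pvalue <- function(...) {
  p <- cbind(...)                       # one column per test, one row per gene set
  pmin(1, ncol(p) * apply(p, 1, min))
}

p_chip <- c(0.001, 0.20, 0.60)  # made-up p-values from the first test
p_poly <- c(0.030, 0.04, 0.90)  # made-up p-values from the second test

p_hybrid   <- hybrid_pvalue(p_chip, p_poly)
fdr_hybrid <- p.adjust(p_hybrid, method = "BH")  # Benjamini & Hochberg adjustment
```

With two tests the multiplier is 2, so a gene set is only as significant as twice its smaller p-value.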

Function inputs

Every input in hybridenrich is the same as in chipenrich and polyenrich. The only input unique to chipenrich is num_peak_threshold, and the only input unique to polyenrich is weighting. Currently the test only supports running chipenrich and polyenrich together, but future versions are planned to allow any number of supported tests.

Joining two results files

Combines two existing results files and returns one results file with hybrid p-values and FDR included. Currently allowed inputs are objects from any of the supplied enrichment tests, or a data.frame with at least the following columns: P.value, Geneset.ID. Optional columns include: Status. Currently only two results files can be joined, but future versions are planned to allow joining any number of results files.
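
To make the joining step concrete, here is a minimal base R sketch of combining two results data frames on Geneset.ID. The function name join_two_results, the gene set labels, and the p-values are hypothetical; this is not the package's own implementation. The sketch assumes that only gene sets present in both inputs are kept and that the hybrid p-value is capped at 1.

```r
# Hypothetical helper: join two results tables and add hybrid columns.
join_two_results <- function(res1, res2) {
  # merge() suffixes the clashing P.value columns with .x and .y.
  # The default inner join keeps only gene sets found in both inputs (an assumption).
  joined <- merge(res1[, c("Geneset.ID", "P.value")],
                  res2[, c("Geneset.ID", "P.value")],
                  by = "Geneset.ID")
  # Two tests, so the hybrid p-value is 2 * min(p_x, p_y), capped at 1 (an assumption).
  joined$P.Value.Hybrid <- pmin(1, 2 * pmin(joined$P.value.x, joined$P.value.y))
  joined$FDR.Hybrid <- p.adjust(joined$P.Value.Hybrid, method = "BH")
  joined[order(joined$P.Value.Hybrid), ]
}

# Made-up results from two tests.
res1 <- data.frame(Geneset.ID = c("set_A", "set_B", "set_C"),
                   P.value = c(0.01, 0.50, 0.20), stringsAsFactors = FALSE)
res2 <- data.frame(Geneset.ID = c("set_B", "set_A", "set_D"),
                   P.value = c(0.02, 0.30, 0.90), stringsAsFactors = FALSE)

joined <- join_two_results(res1, res2)
```

Here set_C and set_D are dropped because each appears in only one input. A different join policy (for example, all = TRUE in merge) would need a rule for the missing p-values.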


chipenrich documentation built on Nov. 8, 2020, 8:11 p.m.