peaks2genes: Run the test process up to, but not including the enrichment...


View source: R/peaks2genes.R

Description

This function creates the *_peaks and *_peaks-per-gene files, so that these files do not need to be remade whenever one just wants to test enrichment methods.

Usage

peaks2genes(peaks, out_name = "readyToEnrich", out_path = getwd(),
  genome = supported_genomes(), locusdef = "nearest_tss",
  weighting = NULL, mappability = NULL, qc_plots = TRUE,
  num_peak_threshold = 1, randomization = NULL)

Arguments

peaks

Either a file path or a data.frame of peaks in BED-like format. If a file path, the following formats are fully supported via their file extensions: .bed, .broadPeak, .narrowPeak, .gff3, .gff2, .gff, and .bedGraph or .bdg. BED3 through BED6 files are supported under the .bed extension. Files without these extensions are supported provided that the first 3 columns correspond to 'chr', 'start', and 'end' and that there is either no header line or it is commented out. If a data.frame, it should be in BEDX+Y style; see GenomicRanges::makeGRangesFromDataFrame for acceptable column names.

out_name

Prefix string to use for naming output files. This should not contain any characters that would be illegal for the system being used (Unix, Windows, etc.). The default value is "readyToEnrich", as shown in Usage, and the *_peaks and *_peaks-per-gene files are written with this prefix. If qc_plots is set, a file with this prefix followed by "_qcplots.pdf" is also produced, containing a number of quality control plots. If out_name is set to NULL, no files are written, and results must then be retrieved from the list returned by peaks2genes.

out_path

Directory to which results files will be written out. Defaults to the current working directory as returned by getwd.

genome

One of the supported_genomes().

locusdef

One of: 'nearest_tss', 'nearest_gene', 'exon', 'intron', '1kb', '1kb_outside', '1kb_outside_upstream', '5kb', '5kb_outside', '5kb_outside_upstream', '10kb', '10kb_outside', '10kb_outside_upstream'. For a description of each, see the vignette or supported_locusdefs. Alternatively, a file path for a custom locus definition. NOTE: The custom definition must be for one of the supported_genomes(), and must have columns 'chr', 'start', 'end', and 'gene_id' or 'geneid'. For an example custom locus definition file, see the vignette.

weighting

(Poly-Enrich only) character string specifying the weighting method if method is chosen to be 'polyenrich_weighted'. Current options are: 'signalValue', 'logsignalValue', and 'multiAssign'.

mappability

One of NULL, a file path to a custom mappability file, or an integer for a valid read length given by supported_read_lengths. If a file, it should contain a header with two columns named 'gene_id' and 'mappa'. Gene IDs should be Entrez IDs, and mappability values should range from 0 to 1. For an example custom mappability file, see the vignette. Default value is NULL.

qc_plots

A logical variable that enables the automatic generation of plots for quality control.

num_peak_threshold

(ChIP-Enrich only) Sets the threshold for how many peaks a gene must have to be considered as having a peak. Defaults to 1. Only relevant for Fisher's exact test and ChIP-Enrich methods.

randomization

One of NULL, 'complete', 'bylength', or 'bylocation'. See the Randomizations section below.
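
As a minimal sketch of the input formats described above, the following base R code builds a BED3-style data.frame of peaks and a custom mappability table. All coordinates, gene IDs, and mappability values are made up for illustration, and writing the mappability file as tab-delimited text is an assumption rather than something stated on this page; the vignette has the authoritative example.

```r
# Hypothetical peaks in BED3 layout; the coordinates are invented.
peaks_df <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(10500L, 250000L, 48000L),
  end   = c(10900L, 250600L, 48450L),
  stringsAsFactors = FALSE
)

# Basic sanity checks before handing the data.frame to peaks2genes().
stopifnot(
  all(c("chr", "start", "end") %in% names(peaks_df)),
  all(peaks_df$start < peaks_df$end)
)

# Hypothetical custom mappability table: Entrez gene IDs with values in [0, 1].
mappa_df <- data.frame(
  gene_id = c(1L, 2L, 9L),
  mappa   = c(0.95, 0.80, 1.00)
)
stopifnot(all(mappa_df$mappa >= 0 & mappa_df$mappa <= 1))

# Write the table with a header row. Tab-delimited text is an assumption here.
mappa_file <- tempfile(fileext = ".txt")
write.table(mappa_df, mappa_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

The resulting peaks_df could then be passed as the peaks argument and mappa_file as the mappability argument.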

Value

A list, containing the following items:

opts

A data frame containing the arguments/values passed to peaks2genes.

peaks

A data frame containing peak assignments to genes. Peaks which do not overlap a gene locus are not included. Each peak that was assigned to a gene is listed, along with the peak midpoint or peak interval coordinates (depending on which was used), the gene to which the peak was assigned, the locus start and end position of the gene, and the distance from the peak to the TSS.

The columns are:

peak_id

is an ID given to unique combinations of chromosome, peak start, and peak end.

chr

is the chromosome the peak originated from.

peak_start

is the start position of the peak.

peak_end

is the end position of the peak.

peak_midpoint

is the midpoint of the peak.

gene_id

is the Entrez ID of the gene to which the peak was assigned.

gene_symbol

is the official gene symbol for the gene_id (above).

gene_locus_start

is the start position of the locus for the gene to which the peak was assigned (specified by the locus definition used).

gene_locus_end

is the end position of the locus for the gene to which the peak was assigned (specified by the locus definition used).

nearest_tss

is the closest TSS to this peak (for any gene, not necessarily the gene this peak was assigned to).

nearest_tss_gene

is the gene having the closest TSS to the peak (should be the same as gene_id when using the nearest TSS locus definition).

nearest_tss_gene_strand

is the strand of the gene with the closest TSS.

peaks_per_gene

A data frame of the count of peaks per gene. The columns are:

gene_id

is the Entrez Gene ID.

length

is the length of the gene's locus (depending on which locus definition you chose).

log10_length

is the log10(locus length) for the gene.

num_peaks

is the number of peaks that were assigned to the gene, given the current locus definition.

peak

is whether or not the gene has a peak.
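
To illustrate how the peaks_per_gene columns relate to one another, here is a small base R sketch that derives them from a toy peak-to-gene table. The data are hypothetical, and treating a gene as having a peak when num_peaks >= num_peak_threshold is an assumption based on the argument description, not a copy of the package's internal code.

```r
# Hypothetical peak-to-gene assignments (a tiny stand-in for results$peaks).
peaks <- data.frame(
  peak_id = 1:5,
  gene_id = c(101L, 101L, 202L, 101L, 303L)
)

# Hypothetical locus lengths; gene 404 has no assigned peak.
loci <- data.frame(
  gene_id = c(101L, 202L, 303L, 404L),
  length  = c(10000, 1000, 100000, 5000)
)

num_peak_threshold <- 1

# Count peaks per gene, keeping genes with zero peaks.
counts <- table(factor(peaks$gene_id, levels = loci$gene_id))

ppg <- data.frame(
  gene_id      = loci$gene_id,
  length       = loci$length,
  log10_length = log10(loci$length),
  num_peaks    = as.integer(counts)
)

# Assumed rule: a gene "has a peak" once it reaches the threshold.
ppg$peak <- as.integer(ppg$num_peaks >= num_peak_threshold)
```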

Randomizations

Randomization of locus definitions allows for the assessment of Type I Error under the null hypothesis. The randomization codes are:

NULL:

No randomizations, the default.

'complete':

Shuffle the gene_id and symbol columns of the locusdef together, without regard for the chromosome location, or locus length. The null hypothesis is that there is no true gene set enrichment.

'bylength':

Shuffle the gene_id and symbol columns of the locusdef together within bins of 100 genes sorted by locus length. The null hypothesis is that there is no true gene set enrichment, but with preserved locus length relationship.

'bylocation':

Shuffle the gene_id and symbol columns of the locusdef together within bins of 50 genes sorted by genomic location. The null hypothesis is that there is no true gene set enrichment, but with preserved genomic location.
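
The 'bylength' scheme above can be sketched in base R as follows. The locus definition here is simulated, the column names are hypothetical, and the code is only an illustration of shuffling within length-sorted bins of 100 genes, not the package's own implementation.

```r
set.seed(1)  # only so the illustration is reproducible

# Hypothetical locus definition with 250 genes.
n <- 250
ldef <- data.frame(
  gene_id = seq_len(n),
  symbol  = paste0("GENE", seq_len(n)),
  length  = sample(500:200000, n)
)

# Sort by locus length, then cut into consecutive bins of 100 genes
# (the last bin holds the remaining 50).
ldef <- ldef[order(ldef$length), ]
bin <- ceiling(seq_len(n) / 100)

# Permute row indices within each bin.
shuffled_idx <- ave(seq_len(n), bin,
                    FUN = function(i) i[sample.int(length(i))])

# gene_id and symbol move together; the loci themselves stay in place.
randomized <- ldef
randomized$gene_id <- ldef$gene_id[shuffled_idx]
randomized$symbol  <- ldef$symbol[shuffled_idx]
```

A 'bylocation' analogue would sort by genomic position and use bins of 50 instead, while 'complete' would permute all rows in a single bin.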

The return value with a selected randomization is the same list as without. To assess the Type I error, the alpha level for the particular data set can be calculated by dividing the total number of gene sets with p-value < alpha by the total number of tests. Users may want to perform multiple randomizations for a set of peaks and take the median of the alpha values.
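
The Type I error calculation described above amounts to a simple proportion. The sketch below uses simulated uniform p-values as a stand-in for the gene set p-values that a downstream enrichment test would produce on randomized data; peaks2genes itself does not return p-values, so the vectors here are purely hypothetical.

```r
set.seed(42)  # only so the illustration is reproducible

alpha <- 0.05

# Hypothetical p-values from one randomized run, one per gene set tested.
pvals <- runif(1000)

# Empirical Type I error: gene sets with p < alpha over total tests.
type1_rate <- sum(pvals < alpha) / length(pvals)

# With several randomizations, summarize by the median of the rates.
rates <- replicate(5, {
  p <- runif(1000)  # stand-in for p-values from another randomization
  mean(p < alpha)
})
median_rate <- median(rates)
```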

Poly-Enrich Weighting Options

Poly-Enrich also allows weighting of individual peaks. Currently the options are:

'signalValue':

weighs each peak based on the Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.

'logsignalValue':

weighs each peak based on the log Signal Value given in the narrowPeak format or a user-supplied column, normalized to have mean 1.

'multiAssign':

weighs each peak by the inverse of the number of genes it is assigned to.
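
As a rough illustration of the three weighting options, the sketch below computes weights for four hypothetical peaks. It assumes that "normalized to have mean 1" means dividing by the mean and that the logarithm is the natural log; neither detail is confirmed on this page.

```r
# Hypothetical signal values for four peaks.
signal <- c(2, 4, 6, 8)

# 'signalValue': scale so the weights average to 1 (assumed normalization).
w_signal <- signal / mean(signal)

# 'logsignalValue': same idea on the log scale (natural log assumed).
w_log <- log(signal) / mean(log(signal))

# 'multiAssign': hypothetical number of genes each peak is assigned to.
genes_per_peak <- c(1, 2, 1, 4)
w_multi <- 1 / genes_per_peak
```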

Examples

# Run peaks2genes using an example dataset, assigning peaks to the nearest TSS
data(peaks_E2F4, package = 'chipenrich.data')
peaks_E2F4 = subset(peaks_E2F4, peaks_E2F4$chrom == 'chr1')
gs_path = system.file('extdata', package='chipenrich')
results = peaks2genes(peaks_E2F4, locusdef='nearest_tss',
			genome = 'hg19', out_name=NULL)

# Get the list of peaks that were assigned to genes.
assigned_peaks = results$peaks

chipenrich documentation built on Nov. 8, 2020, 8:11 p.m.