defineRegions: Define Genomic Regions Based on Gene Annotations
In steveped/chipExtra: Additional functions for working with ChIP-Seq data

defineRegions

R Documentation

Define Genomic Regions Based on Gene Annotations

Description

Use gene, transcript and exon annotations to define genomic regions

Usage

defineRegions(
  genes,
  transcripts,
  exons,
  promoter = c(2500, 500),
  upstream = 5000,
  intron = TRUE,
  proximal = 10000,
  simplify = FALSE,
  cols = c("gene_id", "gene_name", "transcript_id", "transcript_name"),
  ...
)

Arguments

`genes`, `transcripts`, `exons`	GRanges objects defining each level of annotation
`promoter`	Numeric vector defining upstream and/or downstream distances for a promoter. Passing a single value will define a symmetrical promoter The first value represents the upstream range
`upstream`	The distance from a TSS defining an upstream promoter
`intron`	logical(1) Separate gene bodies into introns and exons. If `intron = FALSE` gene bodies will simply be defined as gene bodies
`proximal`	Distance from a gene to be considered a proximal intergenic region. If set to 0, intergenic regions will simply be considered as uniformly intergenic
`simplify`	Passed internally to `reduceMC` and `setdiffMC`
`cols`	Column names to be retained from the supplied annotations
`...`	Not used

Details

Using GRanges annotated as genes, transcripts and exons this function will define ranges uniquely assigned to a region type using a hierarchical process. By default, these region types will be (in order) 1) Promoters, 2) Upstream Promoters, 3) Exons, 4) Introns, 5) Proximal Intergenic and 6) Distal Intergenic.

Setting intron = FALSE will replace introns and exons with a generic "Gene Body" annotation. Setting proximal = 0 will return all intergenic regions (not previously annotated as promoters or upstream promoters) to an "Intergenic" category

Notably, once a region has been defined, it is excluded from all subsequent candidate regions.

Any columns matching the names provided in cols will be returned, and it is assumed that the gene/transcript/exon ranges will contain informative columns in the mcols() element.

Value

A GRangesList

Examples


## Define two exons for two transcripts
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
e <- c("chr1:20001-21000", "chr1:29001-29950", "chr1:22001-23000", "chr1:29001-30000")
e <- GRanges(e, seqinfo = sq)
mcols(e) <- DataFrame(
  gene_id = "Gene1", transcript_id = paste0("Trans", c(1, 1, 2, 2))
)

## Define the transcript ranges
t <- unlist(endoapply(split(e, e$transcript_id), range))
t$gene_id <- "Gene1"
t$transcript_id <- names(t)
names(t) <- NULL

## Summarise to gene level
g <- reduceMC(t)
g$transcript_id <- NA_character_

## Now annotate the regions
regions <- defineRegions(genes = g, transcripts = t, exons = e)
sort(unlist(regions))

## Alternatively, collpse gene body and intergenic ranges
regions <- defineRegions(
  genes = g, transcripts = t, exons = e, intron = FALSE, proximal = 0
)
sort(unlist(regions))

steveped/chipExtra documentation built on June 1, 2025, 7:05 p.m.