defineRegions: Define Genomic Regions Based on Gene Annotations

View source: R/defineRegions.R

defineRegionsR Documentation

Define Genomic Regions Based on Gene Annotations

Description

Use gene, transcript and exon annotations to define genomic regions

Usage

defineRegions(
  genes,
  transcripts,
  exons,
  promoter = c(2500, 500),
  upstream = 5000,
  intron = TRUE,
  proximal = 10000,
  simplify = FALSE,
  cols = c("gene_id", "gene_name", "transcript_id", "transcript_name"),
  ...
)

Arguments

genes, transcripts, exons

GRanges objects defining each level of annotation

promoter

Numeric vector defining upstream and/or downstream distances for a promoter. Passing a single value will define a symmetrical promoter The first value represents the upstream range

upstream

The distance from a TSS defining an upstream promoter

intron

logical(1) Separate gene bodies into introns and exons. If intron = FALSE gene bodies will simply be defined as gene bodies

proximal

Distance from a gene to be considered a proximal intergenic region. If set to 0, intergenic regions will simply be considered as uniformly intergenic

simplify

Passed internally to reduceMC and setdiffMC

cols

Column names to be retained from the supplied annotations

...

Not used

Details

Using GRanges annotated as genes, transcripts and exons this function will define ranges uniquely assigned to a region type using a hierarchical process. By default, these region types will be (in order) 1) Promoters, 2) Upstream Promoters, 3) Exons, 4) Introns, 5) Proximal Intergenic and 6) Distal Intergenic.

Setting intron = FALSE will replace introns and exons with a generic "Gene Body" annotation. Setting proximal = 0 will return all intergenic regions (not previously annotated as promoters or upstream promoters) to an "Intergenic" category

Notably, once a region has been defined, it is excluded from all subsequent candidate regions.

Any columns matching the names provided in cols will be returned, and it is assumed that the gene/transcript/exon ranges will contain informative columns in the mcols() element.

Value

A GRangesList

Examples


## Define two exons for two transcripts
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
e <- c("chr1:20001-21000", "chr1:29001-29950", "chr1:22001-23000", "chr1:29001-30000")
e <- GRanges(e, seqinfo = sq)
mcols(e) <- DataFrame(
  gene_id = "Gene1", transcript_id = paste0("Trans", c(1, 1, 2, 2))
)

## Define the transcript ranges
t <- unlist(endoapply(split(e, e$transcript_id), range))
t$gene_id <- "Gene1"
t$transcript_id <- names(t)
names(t) <- NULL

## Summarise to gene level
g <- reduceMC(t)
g$transcript_id <- NA_character_

## Now annotate the regions
regions <- defineRegions(genes = g, transcripts = t, exons = e)
sort(unlist(regions))

## Alternatively, collpse gene body and intergenic ranges
regions <- defineRegions(
  genes = g, transcripts = t, exons = e, intron = FALSE, proximal = 0
)
sort(unlist(regions))



steveped/chipExtra documentation built on May 2, 2024, 12:11 p.m.