Home

/

GitHub

/

ZW-xjtlu/WhistleR

/

genomeDerivedFeatures: Extract genome-derived predictive features for range-based...

genomeDerivedFeatures: Extract genome-derived predictive features for range-based...
In ZW-xjtlu/WhistleR: Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/genomeDerivedFeatures.R

A function to extract genome-derived features from the input GRanges object and the annotations specified by TxDb-like objects.

genomeDerivedFeatures(
  x,
  transcriptdb,
  sequence = NULL,
  gscores = NULL,
  clusteringY = NULL,
  extraRegions = NULL,
  flankSizes = (25 * 2^(0:6)),
  ambiguityMethod = c("auto", " mean", "sum", "min", "max"),
  nomapValue = c("0", "NA", "nearest"),
  annotSeqnames = FALSE,
  annotBiotype = FALSE
)

`x`	A GRanges object for the genomic ranges to be extracted.
`transcriptdb`	A TxDb or EnsDb object for the transcript annotation.
`sequence`	A BSgenome or XStringSet object for the genome sequence. See the available.genomes function for how to install a genome.
`gscores`	A GScores object specifying the genomic scores to be extracted, or a `list` of `GScores` for creating properties using more than one type of genomic scores. See the vignette for more details: `vignette("GenomicScores", package = "GenomicScores")`.
`clusteringY`	A GRanges or GRangesList specifying the object for clustering annotation, or a `list` of `GRanges`/`GRangesList` for creating clustering properties on more than one annotations.
`extraRegions`	A GRanges or GRangesList providing the additional region to extract properties, can be a `list` of `GRanges` or `GRangesList` to supplement more than one regions.
`flankSizes`	An `integer` vector indicating the number of bases flanked by `x` when calculating the `x` centered properties; default`=(25*2^(0:6))`.
`ambiguityMethod`	By default, except the overlapping features, if `x[i]` overlaps multiple regions, the returned property will be the average of the properties over all overlapped regions by `x[i]`. For the overlapping features, as long as x overlaps any range of the region, the returned value is 1 . If `ambiguityMethod` is `"mean"`, `"sum"`, `"min"`, or `"max"`, then the mean, sum, minimum, and maximum values of the >1 mapping will be returned.
`nomapValue`	When `nomapValue` is `"NA"` or `"0"`, the `x` that do not match the region will return `NA` and `0` respectively. If `nomapValue` is `"nearest"`, the not matched `x` will be set to be the properties on its nearest region.
`annotSeqnames`	`TRUE` or `FALSE`. If `TRUE`, the seqnames in `x` are output as features, with each level represented by a dummy variable column.
`annotBiotype`	`TRUE` or `FALSE`. If `TRUE`, the transcript biotypes defined in the `EnsDb` are output as features, with each biotype represented by a dummy variable column.

The function first extract multiple genomic properties/attributes on the 14 region types defined by the TxDb-like object, and then assign the attributes to x through the region closest to the instances of x. The specific behaviors of the assignments can be modified through the parameters of ambiguityMethod and nomapValue.

Particularly, the genome region types include x (self), exons, introns, 5'UTR (exons only), full 5'UTR, CDS (exons only), full CDS, 3'UTR (exons only), full 3'UTR, mature transcripts, full transcripts, genes (exons only), full genes, and promoters.

For each region type, the function can generate 5 types of basic properties accordingly: overlap with region, region length, relative position of x on the region, distance to region 5'end, and distance to region 3'end.

In addition, when the corresponding annotation objects are provided, the function can add 5 more types of properties. Specifically, GC contents of regions are extracted if sequence is provided; Genomic Scores of the regions are extracted if gscores is provided; and Clustering effects of annotations on regions, including Count, Densities, and the Distance to the nearest annotation are calculated if clusteringY is provided.

More region types and properties can be extracted if the list of objects are supplied at the corresponding arguments extraRegions, gscores, and clusteringY. The names of the list elements will correspond to the labeling of the regions or properties in the column names of the returned table.

The returned properties and region types can be subsetted and augmented by with the arguments properties, regions, supRegion.

Finally, the function generate 3 extra features to describe the unique properties at the gene's level: gene's exon number, gene's transcript isoform number, and meta-tx topology.

A data.frame object whose number of rows is the length of x. The column types in the data.frame are all numeric.

Zhen Wei

The extractRegionProperty for crafting of various properties on genomic regions.
The sequenceDerivedFeatures for extraction of sequence-derived features under different encoding schema.
The topologyOnTranscripts for calculation of the meta-tx topology on transcripts of genes.

## ---------------------------------------------------------------------
## SIMPLE EXAMPLE
## ---------------------------------------------------------------------

## Load the hg19 TxDb object for human transcript annotation (UCSC hg19):
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## Define the Granges to be extracted:
set.seed(01)

X <- GRanges(rep(c("chr1", "chr2"), c(15, 15)),
             IRanges(c(sample(11874:12127, 15), sample(38814:41527,15)), width=5),
             strand=Rle(c("+", "-"), c(15, 15)))

## Extract the basic set of properties using the genomic regions defined in TxDb:
gfeatures <- genomeDerivedFeatures(X,
                                   transcriptdb=TxDb.Hsapiens.UCSC.hg19.knownGene)
str(gfeatures)
             
## ---------------------------------------------------------------------
## DEFINE MORE PROPERTIES AND REGIONS
## ---------------------------------------------------------------------
## Not run: 
library(BSgenome.Hsapiens.UCSC.hg19)
library(phastCons100way.UCSC.hg19)

## Extract more genomic properties derived from genome sequence and phastCons score:
gfeatures_expanded <- genomeDerivedFeatures(X,
                                            transcriptdb=TxDb.Hsapiens.UCSC.hg19.knownGene,
                                            sequence=BSgenome.Hsapiens.UCSC.hg19,
                                            gscores=phastCons100way.UCSC.hg19,
                                            clusteringY=X) #quantify clustering properties of x itself
str(gfeatures_expanded) 
 
## Extract region properties features defined in Ensemble transcript annotation (EnsDb):
library(EnsDb.Hsapiens.v75)

gfeatures_ensdb <- genomeDerivedFeatures(X,
                                         transcriptdb=EnsDb.Hsapiens.v75,
                                         sequence=BSgenome.Hsapiens.UCSC.hg19,
                                         gscores=phastCons100way.UCSC.hg19,
                                         clusteringY=X,
                                         annotBiotype=TRUE) #extract transcript biotypes in EnsDb
str(gfeatures_ensdb)

## End(Not run)

ZW-xjtlu/WhistleR documentation built on March 13, 2021, 10:50 a.m.

ZW-xjtlu/WhistleR index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ZW-xjtlu/WhistleR
Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

genomeDerivedFeatures: Extract genome-derived predictive features for range-based...
In ZW-xjtlu/WhistleR: Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Related to genomeDerivedFeatures in ZW-xjtlu/WhistleR...

R Package Documentation

Browse R Packages

We want your feedback!

ZW-xjtlu/WhistleR Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

genomeDerivedFeatures: Extract genome-derived predictive features for range-based... In ZW-xjtlu/WhistleR: Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Related to genomeDerivedFeatures in ZW-xjtlu/WhistleR...

R Package Documentation

Browse R Packages

We want your feedback!

ZW-xjtlu/WhistleR
Comprehensive Annotation of Predictive Features for Interval-based Genomic Data

genomeDerivedFeatures: Extract genome-derived predictive features for range-based...
In ZW-xjtlu/WhistleR: Comprehensive Annotation of Predictive Features for Interval-based Genomic Data