genomeDerivedFeatures: Extract genome-derived predictive features for range-based...

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/genomeDerivedFeatures.R

Description

A function to extract genome-derived features from the input GRanges object and the annotations specified by TxDb-like objects.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
genomeDerivedFeatures(
  x,
  transcriptdb,
  sequence = NULL,
  gscores = NULL,
  clusteringY = NULL,
  extraRegions = NULL,
  flankSizes = (25 * 2^(0:6)),
  ambiguityMethod = c("auto", " mean", "sum", "min", "max"),
  nomapValue = c("0", "NA", "nearest"),
  annotSeqnames = FALSE,
  annotBiotype = FALSE
)

Arguments

x

A GRanges object for the genomic ranges to be extracted.

transcriptdb

A TxDb or EnsDb object for the transcript annotation.

sequence

A BSgenome or XStringSet object for the genome sequence. See the available.genomes function for how to install a genome.

gscores

A GScores object specifying the genomic scores to be extracted, or a list of GScores for creating properties using more than one type of genomic scores. See the vignette for more details: vignette("GenomicScores", package = "GenomicScores").

clusteringY

A GRanges or GRangesList specifying the object for clustering annotation, or a list of GRanges/GRangesList for creating clustering properties on more than one annotations.

extraRegions

A GRanges or GRangesList providing the additional region to extract properties, can be a list of GRanges or GRangesList to supplement more than one regions.

flankSizes

An integer vector indicating the number of bases flanked by x when calculating the x centered properties; default=(25*2^(0:6)).

ambiguityMethod

By default, except the overlapping features, if x[i] overlaps multiple regions, the returned property will be the average of the properties over all overlapped regions by x[i]. For the overlapping features, as long as x overlaps any range of the region, the returned value is 1 .

If ambiguityMethod is "mean", "sum", "min", or "max", then the mean, sum, minimum, and maximum values of the >1 mapping will be returned.

nomapValue

When nomapValue is "NA" or "0", the x that do not match the region will return NA and 0 respectively.

If nomapValue is "nearest", the not matched x will be set to be the properties on its nearest region.

annotSeqnames

TRUE or FALSE. If TRUE, the seqnames in x are output as features, with each level represented by a dummy variable column.

annotBiotype

TRUE or FALSE. If TRUE, the transcript biotypes defined in the EnsDb are output as features, with each biotype represented by a dummy variable column.

Details

The function first extract multiple genomic properties/attributes on the 14 region types defined by the TxDb-like object, and then assign the attributes to x through the region closest to the instances of x. The specific behaviors of the assignments can be modified through the parameters of ambiguityMethod and nomapValue.

Particularly, the genome region types include x (self), exons, introns, 5'UTR (exons only), full 5'UTR, CDS (exons only), full CDS, 3'UTR (exons only), full 3'UTR, mature transcripts, full transcripts, genes (exons only), full genes, and promoters.

For each region type, the function can generate 5 types of basic properties accordingly: overlap with region, region length, relative position of x on the region, distance to region 5'end, and distance to region 3'end.

In addition, when the corresponding annotation objects are provided, the function can add 5 more types of properties. Specifically, GC contents of regions are extracted if sequence is provided; Genomic Scores of the regions are extracted if gscores is provided; and Clustering effects of annotations on regions, including Count, Densities, and the Distance to the nearest annotation are calculated if clusteringY is provided.

More region types and properties can be extracted if the list of objects are supplied at the corresponding arguments extraRegions, gscores, and clusteringY. The names of the list elements will correspond to the labeling of the regions or properties in the column names of the returned table.

The returned properties and region types can be subsetted and augmented by with the arguments properties, regions, supRegion.

Finally, the function generate 3 extra features to describe the unique properties at the gene's level: gene's exon number, gene's transcript isoform number, and meta-tx topology.

Value

A data.frame object whose number of rows is the length of x. The column types in the data.frame are all numeric.

Author(s)

Zhen Wei

See Also

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
## ---------------------------------------------------------------------
## SIMPLE EXAMPLE
## ---------------------------------------------------------------------

## Load the hg19 TxDb object for human transcript annotation (UCSC hg19):
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## Define the Granges to be extracted:
set.seed(01)

X <- GRanges(rep(c("chr1", "chr2"), c(15, 15)),
             IRanges(c(sample(11874:12127, 15), sample(38814:41527,15)), width=5),
             strand=Rle(c("+", "-"), c(15, 15)))

## Extract the basic set of properties using the genomic regions defined in TxDb:
gfeatures <- genomeDerivedFeatures(X,
                                   transcriptdb=TxDb.Hsapiens.UCSC.hg19.knownGene)
str(gfeatures)
             
## ---------------------------------------------------------------------
## DEFINE MORE PROPERTIES AND REGIONS
## ---------------------------------------------------------------------
## Not run: 
library(BSgenome.Hsapiens.UCSC.hg19)
library(phastCons100way.UCSC.hg19)

## Extract more genomic properties derived from genome sequence and phastCons score:
gfeatures_expanded <- genomeDerivedFeatures(X,
                                            transcriptdb=TxDb.Hsapiens.UCSC.hg19.knownGene,
                                            sequence=BSgenome.Hsapiens.UCSC.hg19,
                                            gscores=phastCons100way.UCSC.hg19,
                                            clusteringY=X) #quantify clustering properties of x itself
str(gfeatures_expanded) 
 
## Extract region properties features defined in Ensemble transcript annotation (EnsDb):
library(EnsDb.Hsapiens.v75)

gfeatures_ensdb <- genomeDerivedFeatures(X,
                                         transcriptdb=EnsDb.Hsapiens.v75,
                                         sequence=BSgenome.Hsapiens.UCSC.hg19,
                                         gscores=phastCons100way.UCSC.hg19,
                                         clusteringY=X,
                                         annotBiotype=TRUE) #extract transcript biotypes in EnsDb
str(gfeatures_ensdb)

## End(Not run)

ZW-xjtlu/WhistleR documentation built on March 13, 2021, 10:50 a.m.