predictors_annot: Generate predictors/features for a range based RNA...


View source: R/predictors_annot.R

Description

predictors_annot generates features for the RNA modification sites / targets stored in a SummarizedExperiment object.

Usage

predictors_annot(se, txdb, bsgnm, fc = NULL, pc = NULL,
  struct_hybridize = NULL, feature_lst = NULL, motif = c("AAACA",
  "GAACA", "AGACA", "GGACA", "AAACT", "GAACT", "AGACT", "GGACT", "AAACC",
  "GAACC", "AGACC", "GGACC"), motif_clustering = "DRACH",
  annot_clustering = NULL, hk_genes_list = NULL,
  isoform_ambiguity_method = c("longest_tx", "average"),
  genes_ambiguity_method = c("drop_overlap", "average"),
  standardization = TRUE)

Arguments

se

A SummarizedExperiment object containing the rowRanges of the modification sites. colData and assay do not need to be specified for this function.

txdb

TxDb object used to annotate the rowRanges. It is either obtained from Bioconductor or built from an annotation file with GenomicFeatures::makeTxDbFromGFF.

bsgnm

BSgenome object providing the genomic sequence; it should be downloaded from Bioconductor.

fc, pc

Optional; GScores objects providing standardized fitness consequences scores (fc) and UCSC phastCons conservation scores (pc).

Gulko B, Hubisz MJ, Gronau I and Siepel A (2015). “Probabilities of fitness consequences for point mutations across the human genome.” Nature Genetics, 47, pp. 276-283.

Siepel A et al. (2005). “Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.” Genome Research, 15, pp. 1034-1050.

struct_hybridize

Optional; a GRanges or GRangesList object indicating the hybridized regions on the transcribed or exonic regions.

Precomputed MEA secondary structures can be found in the data attached to this package: Struc_hg19 and Struc_mm10.

feature_lst

Optional; a named list of GRanges for user-defined features. The names of the list are used as the names of the features.

motif

A character vector of motifs centered on the modified nucleotide. The motif features are not added unless the rowRanges of se are at single-nucleotide resolution (all widths equal to 1).

By default, the motifs are the instances of RRACH: c("AAACA","GAACA","AGACA","GGACA","AAACT","GAACT","AGACT","GGACT","AAACC","GAACC","AGACC","GGACC").
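The default vector is simply the IUPAC expansion of RRACH (R = A/G, H = A/C/T). As a rough base-R sketch, a consensus can be expanded into concrete motifs as below; `expand_motif` and the `iupac` lookup are hypothetical helpers written for illustration and are not functions of this package.

```r
# Subset of the IUPAC nucleotide codes, enough for RRACH and DRACH.
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), H = c("A", "C", "T"), D = c("A", "G", "T"))

# Hypothetical helper: expand an IUPAC consensus into all concrete motifs.
expand_motif <- function(consensus) {
  codes <- strsplit(consensus, "")[[1]]
  choices <- unname(iupac[codes])
  names(choices) <- paste0("p", seq_along(choices))  # unique column names
  combos <- expand.grid(choices, stringsAsFactors = FALSE)
  sort(unname(apply(combos, 1, paste0, collapse = "")))
}

rrach <- expand_motif("RRACH")  # the 12 default motifs
drach <- expand_motif("DRACH")  # adds the motifs starting with T
```

The resulting character vector could then be passed as the motif argument.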

motif_clustering

A character vector giving the motif used to generate the clustering index features. Default: "DRACH".

annot_clustering

A GRanges object used to generate the clustering features. Default: NULL.

The resulting clustering features are named clust_f100, clust_f1000, dist_nearest_p200, and dist_nearest_p2000.

hk_genes_list

Optional; a character vector of the gene IDs of the housekeeping genes. The gene IDs should correspond to the gene IDs used by the provided TxDb object.

The Entrez gene IDs of the housekeeping genes of mm10 and hg19 are included in this package: HK_hg19_eids and HK_mm10_eids.

isoform_ambiguity_method

Can be "longest_tx" or "average". The former keeps only the longest transcript as the transcript annotation. The latter averages the feature entries when a site maps to multiple transcript isoforms.

genes_ambiguity_method

Can be "drop_overlap" or "average". The former does not annotate modification sites that overlap more than one gene (NA is returned). The latter averages the feature entries when a site maps to multiple genes.
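The two ambiguity arguments can be illustrated on toy data. This is only a sketch of the rules described above, not the package's internal code: the data frame, its numbers, and the `resolve_genes` helper are made up for illustration.

```r
# Toy feature values for one site that maps to three isoforms (made-up numbers).
isoforms <- data.frame(tx_id     = c("tx1", "tx2", "tx3"),
                       tx_length = c(1200, 3400, 2100),
                       pos_Tx    = c(0.2, 0.5, 0.9))

# "longest_tx": keep only the annotation from the longest transcript.
longest_tx_value <- isoforms$pos_Tx[which.max(isoforms$tx_length)]

# "average": average the feature over all isoforms the site maps to.
average_value <- mean(isoforms$pos_Tx)

# Hypothetical helper mimicking genes_ambiguity_method for one site:
# values holds one feature entry per overlapping gene.
resolve_genes <- function(values, method = c("drop_overlap", "average")) {
  method <- match.arg(method)
  if (length(values) <= 1) return(values)       # unambiguous site
  if (method == "drop_overlap") NA_real_ else mean(values)
}
```

With "drop_overlap" the ambiguous sites carry NA in the gene-level features, so they may need to be filtered or imputed before model fitting.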

standardization

A logical indicating whether to standardize the continuous features. Default: TRUE.
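Standardization here refers to z scores (see the z score features listed under Details). A minimal base-R sketch of that transformation, using made-up gene lengths and assuming the usual sample standard deviation:

```r
# Made-up gene lengths, used only to illustrate the z-score transformation.
gene_length <- c(500, 1500, 2500, 3500)

# z score by hand: subtract the mean, divide by the sample standard deviation.
z_manual <- (gene_length - mean(gene_length)) / sd(gene_length)

# The same result using base R's scale().
z_scale <- as.numeric(scale(gene_length))
```

After the transformation the feature has mean 0 and standard deviation 1, which puts continuous predictors on a comparable scale.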

Details

This function retrieves transcript-related features that are previously known to be related to m6A modification, based on the rowRanges of the provided SummarizedExperiment, and returns the features as metadata columns of the SummarizedExperiment.

The features that must be included:

###1. Transcript regions ### The entries are logical / dummy variables.

- UTR5: 5'UTR.

- UTR3: 3'UTR.

- cds: Coding Sequence.

- Stop_codons: Stop codon (301 bp window centered on the stop codon).

- Start_codons: Start codon (201 bp window centered on the start codon).

- m6Am: 5' cap m6Am (TSS whose underlying sequence is A).

- Exons: Exonic regions.

- last_exons_50bp: First 50 bp of the last exon of a transcript.
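The codon features above are described as centered windows. As a sketch, and assuming "301 bp center" means 150 bp on either side of the codon position, the dummy variable can be computed as follows; `in_centered_window` is a hypothetical helper, and chromosome and strand handling are deliberately left out.

```r
# Hypothetical helper: is a site inside an odd-width window centered on a position?
in_centered_window <- function(site, center, width) {
  stopifnot(width %% 2 == 1)        # a centered window needs an odd width
  half <- (width - 1) / 2
  abs(site - center) <= half
}

sites <- c(1000, 1150, 1151, 900)   # made-up site coordinates
codon_center <- 1000                # made-up codon coordinate

stop_flag  <- in_centered_window(sites, codon_center, 301)  # Stop_codons-like
start_flag <- in_centered_window(sites, codon_center, 201)  # Start_codons-like
```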

###2. Relative positions ### The entries fall in the range [0,1]. If the site is not mapped to the corresponding region, the value is set to 0 (this can be viewed as an interaction term on top of the region model).

- pos_UTR5: Relative positioning on 5'UTR.

- pos_UTR3: Relative positioning on 3'UTR.

- pos_cds: Relative positioning on Coding Sequence.

- pos_Tx: Relative positioning on Transcript.

- pos_exons: Relative positioning on exons.
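A relative position of this kind can be sketched in base R as below. The exact formula used by the package is not stated here, so this assumes a linear scale from 0 at the 5' end to 1 at the 3' end of the region; `relative_position` is a hypothetical helper.

```r
# Hypothetical helper: relative position of a site within a region,
# 0 at the 5' end and 1 at the 3' end; 0 if the site is outside the region.
relative_position <- function(site, start, end, strand = "+") {
  stopifnot(all(end > start))
  rel <- (site - start) / (end - start)
  rel <- ifelse(strand == "-", 1 - rel, rel)   # flip for minus-strand regions
  inside <- site >= start & site <= end
  ifelse(inside, rel, 0)
}

relative_position(150, 100, 200)        # midpoint of the region
relative_position(125, 100, 200, "-")   # minus strand: measured from the other end
relative_position(50, 100, 200)         # not mapped to the region
```

Because unmapped sites are also coded as 0, the relative position is best read together with the matching region dummy variable.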

###3. Region length ###

- long_UTR3: Long 3'UTR (length > 400bp).

- long_exon: Long exon (length > 400bp).

- Gene_length_ex: standardized gene length of exonic regions (z score).

- Gene_length_all: standardized gene length of all regions (z score).

#####=============== The following features are optional ===============#####

###4. Motif ###

By default, the following motifs are searched, i.e. the instances of RRACH: c("AAACA","GAACA","AGACA","GGACA","AAACT","GAACT","AGACT","GGACT","AAACC","GAACC","AGACC","GGACC").

###5. Evolutionary fitness ###

- PC 1bp: standardized PC score 1 nt.

- PC 201bp: standardized PC score 101 nt.

- FC 1bp: standardized Fitness consequences scores 1bp.

- FC 5nt: standardized Fitness consequences scores 101bp.

###6. User specified features by argument feature_lst ###

The entries are logical / dummy variables indicating whether the site overlaps each GRanges or GRangesList.
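The idea of one dummy column per list element can be sketched without Bioconductor. The example below uses plain start/end data frames on a single chromosome instead of GRanges, so it only illustrates the shape of the output; `overlaps_any` and the toy intervals are assumptions made for this illustration.

```r
# Hypothetical helper: does each site fall inside any of the intervals?
overlaps_any <- function(site, starts, ends) {
  vapply(site, function(s) any(s >= starts & s <= ends), logical(1))
}

sites <- c(105, 250, 990)  # made-up site coordinates

# Stand-in for feature_lst: a named list of interval sets.
toy_feature_lst <- list(
  featureA = data.frame(start = c(100, 900), end = c(200, 1000)),
  featureB = data.frame(start = 240, end = 260)
)

# One logical column per list element, named after the list names.
feature_matrix <- sapply(toy_feature_lst,
                         function(f) overlaps_any(sites, f$start, f$end))
```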

###7. Gene attribute ###

- sncRNA: small noncoding RNA (<= 200bp)

- lncRNA: long noncoding RNA (> 200bp)

- Isoform_num: Transcript isoform numbers standardized by z score.

- HK_genes: mapped to housekeeping genes, as defined in the paper below.

Eisenberg E, Levanon EY (October 2013). "Human housekeeping genes, revisited". Trends in Genetics. 29

###8. Batch effect ###

- GC_cont_genes: GC content of each gene.

- GC_cont_101bp: GC content of 101bp local region of the sites.
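GC content is the fraction of G and C bases in a sequence. A minimal base-R sketch on plain character strings follows; it assumes the 101 bp local region is the site plus 50 bp on either side, truncated at the sequence ends. `gc_content` and `local_gc` are hypothetical helpers, not package functions.

```r
# Fraction of G/C bases in a sequence given as a single string.
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# GC content of the window site +/- flank, truncated at the sequence ends.
local_gc <- function(sequence, site, flank = 50) {
  start <- max(1, site - flank)
  end <- min(nchar(sequence), site + flank)
  gc_content(substr(sequence, start, end))
}

toy_seq <- paste(rep("AC", 100), collapse = "")  # made-up 200 nt sequence
gc_motif <- gc_content("GGACT")
gc_local <- local_gc(toy_seq, site = 100)
```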

Value

This function returns a SummarizedExperiment object whose mcols contain the feature (design) matrix.

See Also

glm_bas, glm_multinomial, and glm_regular, which perform model selection, statistics calculation, and visualization across multiple samples.

Examples

### ==== For hg19 ==== ###

library(m6ALogisticModel)
library(SummarizedExperiment)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)
library(fitCons.UCSC.hg19)
library(phastCons100way.UCSC.hg19)

# The TxDb object exported by the annotation package.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# User-defined features; the *_gr / *_grl objects and hg19_miCLIP_gr
# are assumed to already exist in the session.
Feature_List_hg19 <- list(
  HNRNPC_eCLIP = eCLIP_HNRNPC_gr,
  YTHDC1_TREW = YTHDC1_TREW_gr,
  YTHDF1_TREW = YTHDF1_TREW_gr,
  YTHDF2_TREW = YTHDF2_TREW_gr,
  miR_targeted_genes = miR_targeted_genes_grl,
  #miRanda = miRanda_hg19_gr,
  TargetScan = TargetScan_hg19_gr,
  Verified_miRtargets = verified_targets_gr
)

SE_features_added <- predictors_annot(
  se = SummarizedExperiment(rowRanges = hg19_miCLIP_gr),
  txdb = txdb,
  bsgnm = Hsapiens,
  fc = fitCons.UCSC.hg19,
  pc = phastCons100way.UCSC.hg19,
  struct_hybridize = Struc_hg19,
  feature_lst = Feature_List_hg19,
  hk_genes_list = HK_hg19_eids,
  # The 18 instances of DRACH.
  motif = c("AAACA","AGACA","AAACT","AGACT","AAACC","AGACC",
            "GAACA","GGACA","GAACT","GGACT","GAACC","GGACC",
            "TAACA","TGACA","TAACT","TGACT","TAACC","TGACC"),
  motif_clustering = "DRACH",
  standardization = FALSE,
  genes_ambiguity_method = "average"
)


mcols(SE_features_added) ### Check the generated feature matrix.

ZhenWei10/m6ALogisticModel documentation built on May 17, 2019, 10:11 p.m.