predictors_annot_old: Generate predictors/features for a range based RNA...


View source: R/predictors_annot_old.R

Description

predictors_annot_old generates predictors (features) for the RNA modification sites or targets stored in the rowRanges of a SummarizedExperiment object.

Usage

predictors_annot_old(se, txdb, bsgnm, fc = NULL, pc = NULL,
  struct_hybridize = NULL, feature_lst = NULL, motif = c("AAACA",
  "GAACA", "AGACA", "GGACA", "AAACT", "GAACT", "AGACT", "GGACT", "AAACC",
  "GAACC", "AGACC", "GGACC"), HK_genes_list = NULL)

Arguments

se

A SummarizedExperiment object containing the rowRanges for modifications. colData and assay do not necessarily have to be specified for this function.

txdb

TxDb object for annotating the corresponding rowRanges. It is either obtained from Bioconductor or converted from annotation files by GenomicFeatures::makeTxDbFromGFF.

bsgnm

BSgenome object for genomic sequence annotation. It should be downloaded from Bioconductor.

fc, pc

Optional; GScores objects providing standardized fitness consequences (fc) scores and UCSC phastCons conservation (pc) scores, respectively.

Gulko B, Hubisz MJ, Gronau I and Siepel A (2015). "Probabilities of fitness consequences for point mutations across the human genome." Nature Genetics, 47, pp. 276-283.

Siepel A et al. (2005). "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes." Genome Research, 15, pp. 1034-1050.

struct_hybridize

Optional; a GRanges or GRangesList object indicating the hybridized regions on the transcribed or exonic regions.

Precomputed MEA secondary structures are available in the data attached to this package: Struc_hg19 and Struc_mm10.

feature_lst

Optional; a named list of GRanges for user-defined features. The names of the list are used as the names of the features.

motif

A character vector of motifs centered on the modified nucleotide. Motif features are not attached unless the rowRanges of se are at single-nucleotide resolution (all widths = 1).

By default, the motif selected is RRACH: c("AAACA","GAACA","AGACA","GGACA","AAACT","GAACT","AGACT","GGACT","AAACC","GAACC","AGACC","GGACC").

HK_genes_list

Optional; a character vector of the gene IDs of the housekeeping genes. The gene IDs should correspond to the gene IDs used by the provided TxDb object.

The Entrez gene IDs of the housekeeping genes of mm10 and hg19 are included in this package: HK_hg19 and HK_mm10.

Details

This function retrieves transcript-related features that were previously reported to be associated with m6A modification, based on the rowRanges of the provided SummarizedExperiment. The features are returned as metadata columns of the SummarizedExperiment.

The following features are always included:

###1. Transcript regions ### --- The entries are logical / dummy variables.

- UTR5: 5'UTR.

- UTR3: 3'UTR.

- CDS: Coding Sequence.

- Stop_codons: Stop codon (301 bp window centered on the stop codon).

- Start_codons: Start codon (201 bp window centered on the start codon).

- m6Am: 5' cap m6Am (TSS whose underlying nucleotide is A).

- Exons: Exonic regions.

- Last_exons_50bp: First 50 bp of the last exon of a transcript.

###2. Relative positions ### --- The entries fall within [0,1]. If the site does not map to the corresponding region, the value is set to 0. (These can be viewed as interaction terms on top of the region model.)

- Pos_UTR5: Relative positioning on 5'UTR.

- Pos_UTR3: Relative positioning on 3'UTR.

- Pos_CDS: Relative positioning on Coding Sequence.

- Pos_Tx: Relative positioning on Transcript.

- Pos_exons: Relative positioning on exons.
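
The relative-position features can be illustrated with a minimal base R sketch. It assumes the relative position is the strand-aware offset from the 5' end of the region divided by the region length minus one, and that unmapped sites get 0; the helper `relative_position` is hypothetical and the package's actual computation may differ.

```r
# Hypothetical helper (not part of the package): strand-aware relative
# position of a site within a region, scaled to [0, 1].
relative_position <- function(site, start, end, strand = "+") {
  strand <- rep_len(strand, length(site))
  width  <- end - start + 1
  # Offset is measured from the 5' end, which is `end` on the minus strand.
  offset <- ifelse(strand == "-", end - site, site - start)
  rel    <- offset / pmax(width - 1, 1)
  # Sites that fall outside the region (or have no region) are set to 0.
  mapped <- !is.na(start) & !is.na(end) & site >= start & site <= end
  rel[!mapped] <- 0
  rel
}

# Three sites inside a region spanning 100..200 and one outside it.
pos_plus  <- relative_position(c(100, 150, 200, 500), rep(100, 4), rep(200, 4))
pos_minus <- relative_position(100, 100, 200, strand = "-")
```

Because unmapped sites are encoded as 0, these columns are best interpreted together with the matching region dummy (for example Pos_UTR3 with UTR3).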

###3. Region length ###

- long_UTR3: Long 3'UTR (length > 400bp).

- long_exon: Long exon (length > 400bp).

- Gene_length_ex: standardized gene length of exonic regions (z score).

- Gene_length_all: standardized gene length of all regions (z score).
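
As an illustration of the region-length features, the sketch below derives a thresholded dummy and a z score in base R. The input lengths are made-up values, and whether the package applies any transformation before standardizing is not stated here, so treat this as an assumption.

```r
# Made-up 3'UTR and gene lengths (in bp) for four sites.
utr3_len <- c(120, 450, 980, 400)
gene_len <- c(1500, 23000, 8000, 640)

# Dummy variable: 3'UTR strictly longer than 400 bp.
long_UTR3 <- utr3_len > 400

# z score: subtract the mean and divide by the sample standard deviation.
Gene_length_all <- as.numeric(scale(gene_len))
```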

#####=============== The following features are optional ===============#####

###4. Motif ###

By default, the following motifs are searched, i.e. all instances of RRACH: c("AAACA","GAACA","AGACA","GGACA","AAACT","GAACT","AGACT","GGACT","AAACC","GAACC","AGACC","GGACC").
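
The default motif vector can be reproduced by expanding RRACH (R = A or G; H = A, T or C) in base R. The second half of the sketch shows one possible way to turn motifs into dummy columns; the example 5-mers are made up, and how the package encodes motif matches internally is an assumption.

```r
# Expand RRACH: two purine positions, the fixed "AC" core, and H.
purines <- c("A", "G")
h_bases <- c("A", "T", "C")
grid <- expand.grid(r1 = purines, r2 = purines, h = h_bases,
                    stringsAsFactors = FALSE)
motifs <- paste0(grid$r1, grid$r2, "AC", grid$h)

# Made-up 5-mers centered on two sites; one logical column per motif.
site_5mer  <- c("GGACT", "TTACA")
motif_hits <- outer(site_5mer, motifs, "==")
colnames(motif_hits) <- motifs
```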

###5. Evolutionary fitness ###

- PC 1nt: standardized phastCons (PC) score, 1 nt.

- PC 201nt: standardized phastCons (PC) score, 101 nt.

- FC 1nt: standardized fitness consequences (FC) score, 1 nt.

- FC 5nt: standardized fitness consequences (FC) score, 101 nt.

###6. User specified features by argument feature_lst ###

The entries are logical / dummy variables indicating whether each site overlaps the corresponding GRanges or GRangesList.
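
The overlap dummies can be pictured with a base R stand-in that ignores chromosome and strand. The helper `overlaps_any` and the toy intervals are hypothetical; with real GRanges input the overlap would be computed on genomic ranges rather than plain integers.

```r
# Hypothetical helper: TRUE if a site falls inside any [start, end] interval.
overlaps_any <- function(site, starts, ends) {
  vapply(site, function(s) any(s >= starts & s <= ends), logical(1))
}

sites <- c(5, 15, 25)

# Toy stand-in for feature_lst: a named list of interval tables.
feature_lst <- list(
  featA = data.frame(start = c(1, 20), end = c(10, 30)),
  featB = data.frame(start = 14, end = 16)
)

# One logical column per list element, named after the list names.
feature_matrix <- as.data.frame(
  lapply(feature_lst, function(f) overlaps_any(sites, f$start, f$end))
)
```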

###7. Gene attributes ###

- sncRNA: small noncoding RNA (<= 200bp)

- lncRNA: long noncoding RNA (> 200bp)

- Isoform_num: Number of transcript isoforms, standardized as a z score.

- HK_genes: mapped to housekeeping genes, such as those defined by the paper below.

Eisenberg E, Levanon EY (October 2013). "Human housekeeping genes, revisited". Trends in Genetics. 29

###8. Batch effect ###

- GC_cont_genes: GC content of each gene.

- GC_cont_101bp: GC content of 101bp local region of the sites.
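
GC content is the fraction of G and C bases in a sequence. The base R sketch below computes it for plain character strings and extracts a 101 bp window around a site; `gc_content` and `local_window` are hypothetical helpers, whereas the package works on sequences drawn from the BSgenome object.

```r
# Hypothetical helper: fraction of G/C bases in each sequence.
gc_content <- function(seqs) {
  vapply(strsplit(toupper(seqs), ""),
         function(b) mean(b %in% c("G", "C")),
         numeric(1))
}

# Hypothetical helper: window of up to 2 * flank + 1 bases around `pos`,
# truncated at the ends of the sequence.
local_window <- function(seq, pos, flank = 50) {
  substr(seq, max(1, pos - flank), min(nchar(seq), pos + flank))
}

gc_values <- gc_content(c("GGCCAATT", "AAAA", "gcgc"))
window    <- local_window(strrep("ACGT", 100), pos = 200)
```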

Value

This function returns a SummarizedExperiment object whose mcols contain the feature (design) matrix.

See Also

logistic.modeling to perform model selection, statistics calculation, and visualization across multiple samples.

Examples

### ==== For hg19 ==== ###

# m6ALogisticModel is the package documented here; it provides
# predictors_annot_old().
library(m6ALogisticModel)
library(SummarizedExperiment)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)
library(fitCons.UCSC.hg19)
library(phastCons100way.UCSC.hg19)

# User-defined features: a named list of GRanges / GRangesList objects.
Feature_List_hg19 <- list(
  HNRNPC_eCLIP = eCLIP_HNRNPC_gr,
  YTHDC1_TREW = YTHDC1_TREW_gr,
  YTHDF1_TREW = YTHDF1_TREW_gr,
  YTHDF2_TREW = YTHDF2_TREW_gr,
  miR_targeted_genes = miR_targeted_genes_grl,
  # miRanda = miRanda_hg19_gr,
  TargetScan = TargetScan_hg19_gr,
  Verified_miRtargets = verified_targets_gr
)

SE_features_added <- predictors_annot_old(
  se = SE_example,
  txdb = TxDb.Hsapiens.UCSC.hg19.knownGene,
  bsgnm = Hsapiens,
  fc = fitCons.UCSC.hg19,
  pc = phastCons100way.UCSC.hg19,
  struct_hybridize = Struc_hg19,
  feature_lst = Feature_List_hg19,
  HK_genes_list = HK_hg19_eids
)

# Check the generated feature matrix.
mcols(SE_features_added)

# ToDo1: add argument Reduce_GenomicFeature_Colinearity.
# ToDo2: add argument Reduce_GenomicResponse_Dependency.
# ToDo3: the sample_names_coldata is very very confusing.
# ToDo4: must support the input format of matrix and TRUE/FALSE for logistic regression.
# ToDo5: Response could be ordinary, binomial, and poisson.

# Features need to change into:
# 1. change fc and pc into z scores.
# 2. change last exon 50 bp into last exon relative position centered at 0.
# 3. flag transcripts whose stop codon falls in the last exon.
# 4. add last exon dummy.
# 5. add relative exonic rank 0-1.
# 6. add introns.
# 7. add relative intronic positions.
# 8. add relative intronic rank 0-1.
# 9. add splicing junction 5' 50bp exons.
# 10. add splicing junction 3' 50bp exons.
# 11. add splicing junction 5' 50bp introns.
# 12. add splicing junction 3' 50bp introns.
# 13. add all relative positions in MAD standardized absolute bp 5' end, absolute bp 3' end.
#
# add another 30 features.

ZhenWei10/m6ALogisticModel documentation built on May 17, 2019, 10:11 p.m.