vignettes/riboHMM2_vignette.md

title: "Package riboHMM2" author: "Jiangrong Ouyang" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Package ‘riboHMM2’} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )


The package **riboHMM2** is an R package designed to find the locations of upstream Open Reading Frames (uORFs) and the main coding region in RNA.  

Given the ribo-seq counts, together with the RNA information, one cane use **riboHMM2** to estimate the paramters of the assumed hidden Markov model, and further to obtain the the locations of uORFs and the main coding region in RNA.  


## Installation  
Start R and run:   
```R
library(devtools)  
install_github("shimlab/riboHMM2")

Workflow of riboHMM2

We demonstrate 3 main functions in this package.

To see the performance of estimation and prediction on this model,

How to use the package

Set the true parameters and then use these parameters to generate the data.

We set the true parametersthe as follow:

Expected lengths for 5 main parts of a transcript:

  • 20 for 5'U (front-end untranslated region)
  • 36 for uORF
  • 20 for 5'U2 (middle untranslated region)
  • 96 for main coding region
  • 10 for 3'U (back-end untranslated region)

Expected heights for all 21 states:

  • 5 for 5'U, 5'U2 and 3'U
  • 45, 30, 15 for uI's (3 bases of initiation of uORF)
  • 40, 20, 10 for uE's (3 bases of elongation of uORF)
  • 45, 30, 15 for uT's (3 bases of termination of uORF)
  • 90, 75, 30 I's (3 bases of initiation of main coding region)
  • 60, 50, 20 E's (3 bases of elongation of main coding region)
  • 90, 75, 30 T's (3 bases of termination of main coding region)

Shape parameters used in Gamma distribution for all 21 states are all 10.

In the following example, we generate num=500 transcripts, 20% of the transcripts do not contain uORF.

For the normalizing constant E:

  • 1 for 1-100th transcripts
  • 5 for 101-200th transcripts
  • 10 for 201-300th transcripts
  • 15 for 301-400th transcripts
  • 20 for 401-500th transcripts
library(riboHMM2)

```{r, eval=FALSE} set.seed(2019)

setting parameters

num <- 500 E <- c(rep(1,num/5), rep(5,num/5), rep(10,num/5), rep(15,num/5), rep(20,num/5)) r <- 4 # Ratio of chians with uORF and non-uORF eL <- c(20, 36, 20, 96, 10) eH <- c(5, 45,30,15, 40,20,10, 45,30,15, 5, 90,75,30, 60,50,20, 90,75,30, 5) v <- rep(10, 21)

generate uORF data list

uORF <- list() for (i in 1:num){ uORF[[i]] <- generate_uORF(eL, eH, v, E[i], r) }


The data list can also found in the package by calling `uORF`.  
```{r, eval=FALSE}
df <- uORF
for(i in seq(1,length(uORF), by=100)){
   barplot(uORF[[i]]$x, names.arg=uORF[[i]]$z)
}

Estimate the parameters, and check against the true parameters

For demonstration, we only select part of the transcripts

k <- 6  # number of transcripts from each group
ii <- c()
for (i in 1:5){
  ii <- c(ii, ((i-1)*100+1):((i-1)*100+k))
}
print(ii)

Extract count data, RNA and normalizing constant from data list

df <- uORF[ii]
X <- RNA <- list(); E <- c()
for (i in 1:length(df)){
  X[[i]] <-  df[[i]]$x
  RNA[[i]] <- df[[i]]$RNA
  E[i] <- df[[i]]$E
}

Make an initial guess of the parameters

init <- init_generate(df, cutoff=10, r=4)

Start to estimate. For huge data, we can set the option detail=TRUE, so that we can keep track of the process of the estimation.

res <- estimate(X, RNA, E, init, tol=0.001, detail=TRUE)
print(res)

The result is save as a list, one can obtain parameters by ```{r, eval=FALSE} res$v_hat res$m_hat res$trans_hat res$v_history res$m_histroy res$trans_history

Compare with true parameters
```{r}
trans_hat <- res$trans_hat
trans <- df[[1]]$trans
print((trans_hat-trans)/trans)

m <- df[[1]]$m; v <- df[[1]]$v
m_hat <- res$m_hat; v_hat <- res$v_hat
cbind(c(m[1:10],NA), c(m_hat[1:10],NA), c(m_hat[1:10],NA)-c(m[1:10],NA),
      m[11:21],      m_hat[11:21],      m_hat[11:21]-m[11:21])

Reveal the hidden states, and compare against the true hidden states

The function optimalPath uses the estimated parameters to find the optimal hidden states ```{r, eval=FALSE} optimalPath(x, RNA, E, v_hat, m_hat, trans_hat)


We summarise the accuracy of estimation in a matrix `res`. For each transcript,  

  *  1st column: percentage of correct prediction  
  *  2nd column: number of correct prediction  
  *  3rd column: number of incorrect prediction

```{r}
num <- k*5  # number of transcripts used
res <- matrix(0, num ,3)
colnames(res) <- c("rate", "TRUE", "FASLE")
for (i in 1:num){
  n <- length(df[[i]]$x)
  optPth <- optimalPath(df[[i]]$x, df[[i]]$RNA, E[i], 
                        v_hat, m_hat, trans_hat)$optPth
  res[i,2] <- sum(optPth == df[[i]]$z)
  res[i,3] <- n - res[i,2]
  res[i,1] <- res[i,2]/n
}
print(res)

All functions

{width=360px}



shimlab/riboHMM2 documentation built on May 19, 2019, 6:23 p.m.