riboHMM2: Analysis tool for Ribo-seq

title: "Package riboHMM2" author: "Jiangrong Ouyang" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Package ‘riboHMM2’} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )


The package **riboHMM2** is an R package designed to find the locations of upstream Open Reading Frames (uORFs) and the main coding region in RNA.  

Given the ribo-seq counts, together with the RNA information, one cane use **riboHMM2** to estimate the paramters of the assumed hidden Markov model, and further to obtain the the locations of uORFs and the main coding region in RNA.  


## Installation  
Start R and run:   
```R
library(devtools)  
install_github("shimlab/riboHMM2")

We demonstrate 3 main functions in this package.

generate_uORF() would generate a sequence of hidden chain ${Z_i}{i\ge 1}$ and the corresponding observed count data ${X_i}{i\ge 1}$.
estimate() would take the observed count data and the corresponding RNA information, and then estimate the parameters of the underlying hidden Markov model.
optimalPath() would reveal the hidden chain based on the observed count data and the estimated parameters.

To see the performance of estimation and prediction on this model,

we first set the true parameters and then use these parameters to generate the data.
Now we have the data coming from a known underlying model, we can estimate the parameters, and check against the true parameters.
Futher we use the estimated parameters to reveal the hidden states, and compare against the true hidden states.

We set the true parametersthe as follow:

Expected lengths for 5 main parts of a transcript:

20 for 5'U (front-end untranslated region)

36 for uORF

20 for 5'U2 (middle untranslated region)

96 for main coding region

10 for 3'U (back-end untranslated region)

Expected heights for all 21 states:

5 for 5'U, 5'U2 and 3'U

45, 30, 15 for uI's (3 bases of initiation of uORF)

40, 20, 10 for uE's (3 bases of elongation of uORF)

45, 30, 15 for uT's (3 bases of termination of uORF)

90, 75, 30 I's (3 bases of initiation of main coding region)

60, 50, 20 E's (3 bases of elongation of main coding region)

90, 75, 30 T's (3 bases of termination of main coding region)

Shape parameters used in Gamma distribution for all 21 states are all 10.

In the following example, we generate num=500 transcripts, 20% of the transcripts do not contain uORF.

For the normalizing constant E:

1 for 1-100th transcripts

5 for 101-200th transcripts

10 for 201-300th transcripts

15 for 301-400th transcripts

20 for 401-500th transcripts

library(riboHMM2)

```{r, eval=FALSE} set.seed(2019)

num <- 500 E <- c(rep(1,num/5), rep(5,num/5), rep(10,num/5), rep(15,num/5), rep(20,num/5)) r <- 4 # Ratio of chians with uORF and non-uORF eL <- c(20, 36, 20, 96, 10) eH <- c(5, 45,30,15, 40,20,10, 45,30,15, 5, 90,75,30, 60,50,20, 90,75,30, 5) v <- rep(10, 21)

uORF <- list() for (i in 1:num){ uORF[[i]] <- generate_uORF(eL, eH, v, E[i], r) }


The data list can also found in the package by calling `uORF`.  
```{r, eval=FALSE}
df <- uORF

for(i in seq(1,length(uORF), by=100)){
   barplot(uORF[[i]]$x, names.arg=uORF[[i]]$z)
}

For demonstration, we only select part of the transcripts

k <- 6  # number of transcripts from each group
ii <- c()
for (i in 1:5){
  ii <- c(ii, ((i-1)*100+1):((i-1)*100+k))
}
print(ii)

Extract count data, RNA and normalizing constant from data list

df <- uORF[ii]
X <- RNA <- list(); E <- c()
for (i in 1:length(df)){
  X[[i]] <-  df[[i]]$x
  RNA[[i]] <- df[[i]]$RNA
  E[i] <- df[[i]]$E
}

Make an initial guess of the parameters

init <- init_generate(df, cutoff=10, r=4)

Start to estimate. For huge data, we can set the option detail=TRUE, so that we can keep track of the process of the estimation.

res <- estimate(X, RNA, E, init, tol=0.001, detail=TRUE)
print(res)

The result is save as a list, one can obtain parameters by ```{r, eval=FALSE} res$v_hat res$m_hat res$trans_hat res$v_history res$m_histroy res$trans_history

Compare with true parameters
```{r}
trans_hat <- res$trans_hat
trans <- df[[1]]$trans
print((trans_hat-trans)/trans)

m <- df[[1]]$m; v <- df[[1]]$v
m_hat <- res$m_hat; v_hat <- res$v_hat
cbind(c(m[1:10],NA), c(m_hat[1:10],NA), c(m_hat[1:10],NA)-c(m[1:10],NA),
      m[11:21],      m_hat[11:21],      m_hat[11:21]-m[11:21])

The function optimalPath uses the estimated parameters to find the optimal hidden states ```{r, eval=FALSE} optimalPath(x, RNA, E, v_hat, m_hat, trans_hat)


We summarise the accuracy of estimation in a matrix `res`. For each transcript,  

  *  1st column: percentage of correct prediction  
  *  2nd column: number of correct prediction  
  *  3rd column: number of incorrect prediction

```{r}
num <- k*5  # number of transcripts used
res <- matrix(0, num ,3)
colnames(res) <- c("rate", "TRUE", "FASLE")
for (i in 1:num){
  n <- length(df[[i]]$x)
  optPth <- optimalPath(df[[i]]$x, df[[i]]$RNA, E[i], 
                        v_hat, m_hat, trans_hat)$optPth
  res[i,2] <- sum(optPth == df[[i]]$z)
  res[i,3] <- n - res[i,2]
  res[i,1] <- res[i,2]/n
}
print(res)

{width=360px}