title: "Package riboHMM2"
author: "Jiangrong Ouyang"
date: "r Sys.Date()
"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Package ‘riboHMM2’}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The package **riboHMM2** is an R package designed to find the locations of upstream Open Reading Frames (uORFs) and the main coding region in RNA.
Given the ribo-seq counts, together with the RNA information, one cane use **riboHMM2** to estimate the paramters of the assumed hidden Markov model, and further to obtain the the locations of uORFs and the main coding region in RNA.
## Installation
Start R and run:
```R
library(devtools)
install_github("shimlab/riboHMM2")
We demonstrate 3 main functions in this package.
generate_uORF()
would generate a sequence of hidden chain ${Z_i}{i\ge 1}$ and the corresponding observed count data ${X_i}{i\ge 1}$. estimate()
would take the observed count data and the corresponding RNA information, and then estimate the parameters of the underlying hidden Markov model. optimalPath()
would reveal the hidden chain based on the observed count data and the estimated parameters. To see the performance of estimation and prediction on this model,
we first set the true parameters and then use these parameters to generate the data.
Now we have the data coming from a known underlying model, we can estimate the parameters, and check against the true parameters.
We set the true parametersthe as follow:
Expected lengths for 5 main parts of a transcript:
- 20 for 5'U (front-end untranslated region)
- 36 for uORF
- 20 for 5'U2 (middle untranslated region)
- 96 for main coding region
- 10 for 3'U (back-end untranslated region)
Expected heights for all 21 states:
- 5 for 5'U, 5'U2 and 3'U
- 45, 30, 15 for uI's (3 bases of initiation of uORF)
- 40, 20, 10 for uE's (3 bases of elongation of uORF)
- 45, 30, 15 for uT's (3 bases of termination of uORF)
- 90, 75, 30 I's (3 bases of initiation of main coding region)
- 60, 50, 20 E's (3 bases of elongation of main coding region)
- 90, 75, 30 T's (3 bases of termination of main coding region)
Shape parameters used in Gamma distribution for all 21 states are all 10.
In the following example, we generate num=500
transcripts, 20% of the transcripts do not contain uORF.
For the normalizing constant E:
- 1 for 1-100th transcripts
- 5 for 101-200th transcripts
- 10 for 201-300th transcripts
- 15 for 301-400th transcripts
- 20 for 401-500th transcripts
library(riboHMM2)
```{r, eval=FALSE} set.seed(2019)
num <- 500 E <- c(rep(1,num/5), rep(5,num/5), rep(10,num/5), rep(15,num/5), rep(20,num/5)) r <- 4 # Ratio of chians with uORF and non-uORF eL <- c(20, 36, 20, 96, 10) eH <- c(5, 45,30,15, 40,20,10, 45,30,15, 5, 90,75,30, 60,50,20, 90,75,30, 5) v <- rep(10, 21)
uORF <- list() for (i in 1:num){ uORF[[i]] <- generate_uORF(eL, eH, v, E[i], r) }
The data list can also found in the package by calling `uORF`.
```{r, eval=FALSE}
df <- uORF
for(i in seq(1,length(uORF), by=100)){
barplot(uORF[[i]]$x, names.arg=uORF[[i]]$z)
}
For demonstration, we only select part of the transcripts
k <- 6 # number of transcripts from each group
ii <- c()
for (i in 1:5){
ii <- c(ii, ((i-1)*100+1):((i-1)*100+k))
}
print(ii)
Extract count data, RNA and normalizing constant from data list
df <- uORF[ii]
X <- RNA <- list(); E <- c()
for (i in 1:length(df)){
X[[i]] <- df[[i]]$x
RNA[[i]] <- df[[i]]$RNA
E[i] <- df[[i]]$E
}
Make an initial guess of the parameters
init <- init_generate(df, cutoff=10, r=4)
Start to estimate. For huge data, we can set the option detail=TRUE
, so that we can keep track of the process of the estimation.
res <- estimate(X, RNA, E, init, tol=0.001, detail=TRUE)
print(res)
The result is save as a list, one can obtain parameters by ```{r, eval=FALSE} res$v_hat res$m_hat res$trans_hat res$v_history res$m_histroy res$trans_history
Compare with true parameters
```{r}
trans_hat <- res$trans_hat
trans <- df[[1]]$trans
print((trans_hat-trans)/trans)
m <- df[[1]]$m; v <- df[[1]]$v
m_hat <- res$m_hat; v_hat <- res$v_hat
cbind(c(m[1:10],NA), c(m_hat[1:10],NA), c(m_hat[1:10],NA)-c(m[1:10],NA),
m[11:21], m_hat[11:21], m_hat[11:21]-m[11:21])
The function optimalPath
uses the estimated parameters to find the optimal hidden states
```{r, eval=FALSE}
optimalPath(x, RNA, E, v_hat, m_hat, trans_hat)
We summarise the accuracy of estimation in a matrix `res`. For each transcript,
* 1st column: percentage of correct prediction
* 2nd column: number of correct prediction
* 3rd column: number of incorrect prediction
```{r}
num <- k*5 # number of transcripts used
res <- matrix(0, num ,3)
colnames(res) <- c("rate", "TRUE", "FASLE")
for (i in 1:num){
n <- length(df[[i]]$x)
optPth <- optimalPath(df[[i]]$x, df[[i]]$RNA, E[i],
v_hat, m_hat, trans_hat)$optPth
res[i,2] <- sum(optPth == df[[i]]$z)
res[i,3] <- n - res[i,2]
res[i,1] <- res[i,2]/n
}
print(res)
{width=360px}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.