knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)

Introduction

rseAnalysis (RNA structure and Expression analysis) package includes series of utility function including stander file reader, structural prediction, RNA distance calculation, and analysis package. rseAnalysis provides an all-in-one solution for gene expression and secondary structure mutation correlation analysis by automating the data processing and analysis of gene expression and mutation data from the well-acknowledged database such as mirBase and TCGA.

To download MPLNClust, use the following commands:

# install.packages("devtools")
devtools::install_github("JackXu2333/rseAnalysis")

Data Input

The function vcf2df, fasta2df, and bed2df provides measures to read file with vcf, fasta, and bed extensions. The output file structures is as followed:

fasta object \itemize{ \item NAME - Corresponding of the RNA sequence \item SEQ - The original RNA sequence }

bed object \itemize{ \item CHROM - Located in chromosome CHROM \item STAPOS - The start location of the sequence \item ENDPOS - The end location of the sequence \item DIR - The direction of the sequence \item TYPE - The type of the sequence \item ID - ID of the RNA (if avilable) \item ALIAS - The alias of the sequence \item NAME - The name of the sequence }

vcf object \itemize{ \item CHROM - Located in chromosome CHROM \item POS - The start position of the mutation \item ID - ID of the mutation (if available) \item REF - The reference base(s) on the mutation point \item ALT - The alternative base(s) appeared in during mutation \item CONSEQUENCE - Record the effect of mutation under biological studies \item OCCURRENCE - Record the associated cancer and its distribution \item affected_donors - Record the total number of donor affected \item project_count - Number of project associated with current studies }

#Source library
library(rseAnalysis)
library(ggplot2)

#Load sample data file
vcf <- rseAnalysis::vcf2df(system.file("extdata", "hsa_GRCh37.vcf", package = "rseAnalysis"))
fasta <- rseAnalysis::fasta2df(system.file("extdata", "hsa_GRCh37.fasta", package = "rseAnalysis"))
bed <- rseAnalysis::bed2df(system.file("extdata", "hsa_GRCh37.bed", package = "rseAnalysis"))

#Inspect the imported file
head(vcf)
head(fasta)
head(bed)

Calculate RNA distance for mutated sequence

Find mutated RNA sequence based on the fasta, bed and vcf files

``` {r mutate, warning=FALSE}

Mutate RNA using mutation from vcf files

RNA.mutated <- RNA.validate(fasta = fasta, vcf = vcf, bed = bed)

Noted that message stated that the mutation has matching rate 0.714, this represents that only 71.4% of the mutation reference matches the gene sequence on sequence provided (fasta), and the remaining 28.6% of the sequence failed to match due to misalignment, differences in genome assemble in bed and fasta file, or potentially different representation of wildtype among vcf file and fasta file. For tutorial purposes, we will skip the mismatches and predict the secondary structure for the remaining. 


``` {r structure}

# ================== Sample code for RNA secondary structure prediction ==========================
#
#  struct.ori <- suppressMessages(predictStructure(executable.path = "../inst/extdata/exe"
#                                   , rna.name = RNA.mutated$NAME, rna.seq = RNA.mutated$SEQ))
#  struct.alt <- suppressMessages(predictStructure(executable.path = "../inst/extdata/exe"
#                                   , rna.name = RNA.mutated$NAME, rna.seq = RNA.mutated$MUT.SEQ))

# Read prerun result from the predictStructure
RNA.mutated <- subset(RNA.mutated, MATCH)[1:200,]
struct.ori <- read.csv(system.file("extdata", "vignetteSampleORI.csv", package = "rseAnalysis"))
struct.alt <- read.csv(system.file("extdata", "vignetteSampleALT.csv", package = "rseAnalysis"))

head(struct.ori)
head(struct.alt)

Calculation RNADistance based on the result from structural prediction is straightforward, the result help determines how much of a structural difference there are among the original and mutated RNA sequence. Executable.path is omitted for mac or Unix user who has RNAStructure installed see more at ?predict.distance

``` {r distance, message=FALSE, warning=FALSE}

Run prediction

RNA.distance <- predictDistance(name = RNA.mutated$NAME , struct.ori = struct.ori$struct.ori , struct.alt = struct.alt$struct.alt , method = "gsc")

## RNA distance and gene expression analysis 

``` {r analysis,  warning=FALSE}

#Load expression data

expression <- read.csv(system.file("extdata", "test.csv", package = "rseAnalysis"), header = TRUE)

#Use only standardize read
expression <- subset(expression, Read.Type == "reads_per_million_miRNA_mapped")[1:200, ]

result <- Analysis.DISEXP(dis.name = RNA.mutated$NAME, dis.distance = RNA.distance, 
                exp.tumor = expression$Sample, exp.sample = expression$Normal, method = "linear", showPlot = FALSE)

#Display statistical result
result$stats

#Display images from result
#result$plots

Differential normalize gene expression vs RNA Gene Distance Linear Model Boxplot of Differential normalize gene expression read by RNA Boxplot of RNA gene distance read by RNA Density plot of RNA gene distance read by RNA Density plot of RNA gene distance read by RNA

Analysis.DISEXP uses the absolute difference in expression in modeling change in expression between the tumour and normal samples from BRCA patients. Here we use "reads_per_million_miRNA_mapped" as the input because it is standardized and can use to compare between cases. !Beware that for the analysis to work, one has to make sure the gene expression data and distance data are collected from the same type of mutation, or from the same sample (usually a sample will be insufficient in retrieving both expression and sequencing information), or it affects the outcome of the analysis dramatically.

Gere the Analysis.DISEXP generate both text and graphical output, with text output indicating the beta and p_value of the resulting regression model. From this example, the correlation between RNA distance is -0.11 with p-value of 0.12. The graphical output shows the prediction model and confidence interval on the scatter plot, RNA distance distribution by RNA type and boxplots showing the potential outliers from gene expression and RNA distance data set.

Package References

Sijie Xu (NA). rseAnalysis: Correlation analysis of RNA secondary structure mutation and differential expression. R package version 0.1.0.


Other References

Kozomara, A., & Griffiths-Jones, S. (2011). miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic acids research, 39(Database issue), D152–D157. https://doi.org/10.1093/nar/gkq1027

Wickham, H. and Bryan, J. (2019). R Packages (2nd edition). Newton, Massachusetts: O'Reilly Media. https://r-pkgs.org/

TCGA Research Network: https://www.cancer.gov/tcga.

Zhiwen. T, Sijie Xu (2020) miRNA Motif Analysis https://github.com/Deemolotus/BCB330Y-and-BCB430Y/tree/master/Main

[END]


sessionInfo()


JackXu2333/dseAnalysis documentation built on Dec. 31, 2020, 1:09 p.m.