Identification of alternative splicing from transcript sequences without a reference genome
AStrap implements a de novo approach to detect alternative splicing (AS) from transcript sequences without a reference genome. It identifies AS events by extensive pairwise alignment of transcript sequences from SMRT sequencing data and predicts AS types with a machine-learning model that integrates more than 500 assembled features. Four types of AS events are considered: intron retention (IR), exon skipping (ES), alternative donor sites (AltD), and alternative acceptor sites (AltA). AStrap consists of four main stages: data preprocessing, feature construction, classification model building, and identification of AS events together with prediction of AS types. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources.
install.packages("devtools")
library(devtools)
install_github("BMILAB/AStrap")
library(AStrap)
To help users get started, we use the provided example dataset to illustrate the standard analysis workflow of AStrap. Please refer to the User Guide for full details.
To identify AS events and predict AS types, the user should first load data into AStrap.
* Use function "readDNAStringSet" to read transcriptome sequences (FASTA format).
##Loading transcript sequences
trSequence.path <- system.file("extdata","example_TRsequence.fasta",package = "AStrap")
trSequence <- readDNAStringSet(trSequence.path,format = "fasta")
##Loading the file of a list of clusters generated by CD-HIT-EST
cdhit.path <- system.file("extdata","example_cdhitest.clstr",package = "AStrap")
raw.cluster <- readCDHIT(cdhit.path)
##Loading the alignment file in GFF3 format generated by GMAP
gmap.path <- system.file("extdata","example_gmap.gff3",package = "AStrap")
cluster.align <- readGMAP(gmap.path,raw.cluster, recluster = TRUE, recluster.identity = 0.7,recluster.coverage = 0.7)
#Pairwise alignment of isoforms in the same cluster
alignment <- cluster.align$alignment
#Adjusted clusters
rew.cluster <- cluster.align$cluster
##Plotting a network graph
gg1 <- plotCluster(raw.cluster,cluster.id=c("7"))
plot(gg1)
gg2 <- plotAlign(alignment,cluster.id=c("7"))
plot(gg2)
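For readers who have not seen a CD-HIT-EST cluster file before, the following base-R sketch parses a small inline example of the `.clstr` text layout into a transcript-to-cluster table. It is an illustration only and is not the implementation of readCDHIT; the helper name `parse_clstr` and the transcript IDs are hypothetical.

```r
# Hypothetical helper (not part of AStrap): parse CD-HIT .clstr text lines
parse_clstr <- function(lines) {
  lines <- lines[nzchar(lines)]
  is_header <- grepl("^>Cluster", lines)
  # Each line inherits the ID of the most recent ">Cluster N" header
  cluster_id <- sub("^>Cluster\\s+", "", lines[is_header])[cumsum(is_header)]
  members <- lines[!is_header]
  data.frame(
    cluster = cluster_id[!is_header],
    # Transcript name sits between ">" and the trailing "..."
    transcript = sub("^.*>(.+?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    # Sequence length is the number directly before "nt,"
    length_nt = as.integer(sub("^.*?(\\d+)nt,.*$", "\\1", members, perl = TRUE)),
    # The representative sequence of a cluster is marked with "*"
    representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}

# Toy input mimicking the .clstr layout (made-up transcripts)
example_lines <- c(
  ">Cluster 0",
  "0\t2799nt, >tr_A... *",
  "1\t2214nt, >tr_B... at +/95.21%",
  ">Cluster 1",
  "0\t1500nt, >tr_C... *"
)
clusters <- parse_clstr(example_lines)
print(clusters)
```

Clusters with two or more members are the ones that can yield pairwise alignments of isoforms in the later steps.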
In AStrap, we have compiled a compendium of 511 unique features that cover major factors known to shape introns and/or exons. Feature construction is embedded in the function AStrap (see below), so users do not need to carry out this step themselves.
* Use function "extract_IsoSeq_tr" to extract sequences around splice sites based on the transcript sequences.
##Loading example data
load(system.file("data","sample_Aligndata.Rdata",package = "AStrap"))
##Extracting sequence around splice sites based on the transcript sequences
Aligndata <- extract_IsoSeq_tr(Aligndata,trSequence)
##Loading the consensus matrix of sequences of the [-2,+3] region of acceptor sites.
load(system.file("data","example_PWM_acceptor.Rdata",package = "AStrap"))
##Loading the consensus matrix of the sequences of the [-2,+3] region of donor sites
load(system.file("data","example_PWM_donor.Rdata",package = "AStrap"))
##Constructing the feature space
feature <- getFeature(Aligndata)
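To give a feel for how a consensus matrix of a splice-site window can be turned into a numeric feature, here is a minimal base-R sketch that builds a position probability matrix from a few aligned 5-base windows and scores a candidate window by its log-likelihood. The sequences, the pseudocount, and the helper `score_pwm` are assumptions made for illustration; whether getFeature scores windows in exactly this way is not stated here.

```r
# Toy aligned splice-site windows (hypothetical sequences, 5 positions each)
sites <- c("AGGTA", "AGGTG", "CGGTA", "AGGTA")
bases <- c("A", "C", "G", "T")

# Character matrix: one row per sequence, one column per position
chars <- do.call(rbind, strsplit(sites, ""))

# Per-position base counts with a pseudocount of 1 to avoid log(0)
counts <- sapply(seq_len(ncol(chars)), function(j) {
  table(factor(chars[, j], levels = bases)) + 1
})

# Convert counts to per-position probabilities (each column sums to 1)
pwm <- sweep(counts, 2, colSums(counts), "/")
rownames(pwm) <- bases

# Hypothetical helper: log-likelihood of a window under the matrix
score_pwm <- function(seq, pwm) {
  b <- strsplit(seq, "")[[1]]
  stopifnot(length(b) == ncol(pwm), all(b %in% rownames(pwm)))
  sum(log(pwm[cbind(match(b, rownames(pwm)), seq_along(b))]))
}

consensus_score <- score_pwm("AGGTA", pwm)
offtarget_score <- score_pwm("TTTTT", pwm)
print(c(consensus = consensus_score, offtarget = offtarget_score))
```

A window that resembles the consensus receives a higher (less negative) score than an unrelated one, which is what makes such a score usable as a classification feature.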
Two classification models trained on collected AS data from rice and human are integrated in AStrap and can be applied directly to distinguish among AS types in other species. For classification of AS types, we applied and compared three widely used machine-learning techniques: support vector machine (SVM), random forests (RF), and adaptive boosting (AdaBoost). According to our analysis (see our paper), the RF-based model performed the best, followed by the AdaBoost-based model, and the SVM-based model performed the worst. Therefore, it is recommended that users adopt the RF-based model for prediction of AS types.
* Use the rice or human classification model, including SVM, RF, and AdaBoost.
##load() restores the saved model objects into the workspace and returns their names
rice_model <- load(system.file("data","rice_model.Rdata",package = "AStrap"))
human_model <- load(system.file("data","human_model.Rdata",package = "AStrap"))
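One detail worth keeping in mind: base R's `load()` restores the saved objects into the calling environment and returns a character vector of their names, so `rice_model` and `human_model` above hold object names rather than the fitted models themselves. The sketch below demonstrates this behavior with a temporary file; the object names `toy_RFmodel` and `toy_SVMmodel` are placeholders and do not describe the actual contents of the AStrap model files.

```r
# Placeholder objects standing in for fitted models (hypothetical names)
toy_RFmodel <- list(type = "rf")
toy_SVMmodel <- list(type = "svm")

# Save both objects to a temporary .Rdata file, then remove them
tmp <- tempfile(fileext = ".Rdata")
save(toy_RFmodel, toy_SVMmodel, file = tmp)
rm(toy_RFmodel, toy_SVMmodel)

# load() recreates the objects and returns their names as a character vector
loaded_names <- load(tmp)
print(loaded_names)

# The restored objects can be used directly, or collected by name with mget()
models <- mget(loaded_names)
print(models$toy_RFmodel$type)
```

Printing the value returned by `load()` is therefore a quick way to see which model objects a given `.Rdata` file provides.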
Meanwhile, users can also train a specific classification model on their own data sets.
* Use function "extract_IsoSeq_ge" to extract sequences around splice sites based on the genome.
##Loading example alternative splicing data
path <- system.file("extdata","sample_riceAS.txt",package = "AStrap")
rice_ASdata <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
##Loading genome using the package BSgenome
library("BSgenome.Osativa.MSU.MSU7")
##Extracting sequence around splice sites based on the genome
rice_ASdata<- extract_IsoSeq_ge(rice_ASdata,Osativa)
##Training an RF-based classification model on the rice AS data
library(randomForest)
library(ROCR)
library(ggplot2)
model <- buildTrainModel(rice_ASdata, chooseNum = 100,
                         proTrain = 2/3, proTest = 1/3, ASlength = 0,
                         classifier = "rf", use.all = FALSE)
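The `proTrain` and `proTest` arguments suggest a proportional split of the labelled events into training and test sets. As a rough illustration of that idea, and not of how buildTrainModel is implemented internally, the base-R sketch below performs a stratified 2/3 to 1/3 split on toy data so that each AS type keeps the same proportion in both sets. The data frame, its column names, and the variable `pro_train` are hypothetical.

```r
set.seed(1)

# Toy labelled events (hypothetical): 30 per AS type, one numeric feature
toy <- data.frame(
  type = rep(c("IR", "ES", "AltD", "AltA"), each = 30),
  feature1 = rnorm(120),
  stringsAsFactors = FALSE
)

# Stratified split: sample 2/3 of the row indices within each AS type
pro_train <- 2 / 3
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$type), function(rows) {
  rows[sample.int(length(rows), size = round(pro_train * length(rows)))]
}), use.names = FALSE)

# Rows not drawn for training form the test set
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

print(table(train_set$type))
print(table(test_set$type))
```

Sampling within each type keeps rare AS types represented in both sets, which a plain random split over all rows does not guarantee.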
This section describes the identification of AS events based on pairwise alignment of isoforms of the same cluster and the prediction of AS types based on the fitted model.
* Use function "AStrap" to identify AS events and predict AS types.
##Loading rice model
rice_model<- load(system.file("data","rice_model.Rdata",package = "AStrap"))
##Identification and prediction based on RF-based model of rice
result <- AStrap(alignment,trSequence,rice_RFmodel)
library(Gviz)
plotAS(result$ASevent, id = 1)
plotAS(result$ASevent, id = 7)
plotAS(result$ASevent, id = 13)
plotAS(result$ASevent, id = 21)
If you are using AStrap, please cite: Ji G, Ye W, Su Y, Chen M, Huang G and Wu X* (2019) AStrap: identification of alternative splicing from transcript sequences without a reference genome, Bioinformatics, 35, 2654-2656.