README.md

AStrap R package

Identification of alternative splicing from transcript sequences without a reference genome

About

AStrap implements a de novo approach to detecting alternative splicing (AS) from transcript sequences without a reference genome. It identifies AS events through extensive pairwise alignments of transcript sequences from SMRT sequencing data and predicts AS types with a machine-learning model that integrates more than 500 assembled features. Four types of AS events are considered: intron retention (IR), exon skipping (ES), alternative donor sites (AltD), and alternative acceptor sites (AltA). AStrap consists of four main stages: data preprocessing, feature construction, classification model building, and identification of AS events together with prediction of AS types. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources.

Installing AStrap

Mandatory

Required R Packages

Suggested R Packages

Installation

install.packages("devtools")
library(devtools)
install_github("BMILAB/AStrap")
library(AStrap)

Using AStrap

The provided example dataset is used below to illustrate the standard analysis workflow of AStrap. Please refer to the User Guide for full details.

Section 1 Data loading

To identify AS events and predict AS types, first load the data into AStrap.

* Use function "readDNAStringSet" to read transcriptome sequences (FASTA format).

##Loading transcript sequences (readDNAStringSet is provided by the Biostrings package)
library(Biostrings)
trSequence.path <- system.file("extdata","example_TRsequence.fasta",package = "AStrap")
trSequence <- readDNAStringSet(trSequence.path,format = "fasta")
##Loading the file of a list of clusters generated by CD-HIT-EST
cdhit.path <- system.file("extdata","example_cdhitest.clstr",package = "AStrap")
raw.cluster <- readCDHIT(cdhit.path)
##Loading the alignment file in GFF3 format generated by GMAP
gmap.path <- system.file("extdata","example_gmap.gff3",package = "AStrap")
cluster.align <- readGMAP(gmap.path,raw.cluster, recluster = TRUE, recluster.identity = 0.7,recluster.coverage = 0.7)
##Pairwise alignment of isoforms in the same cluster
alignment <- cluster.align$alignment
##Adjusted clusters
rew.cluster <- cluster.align$cluster
##Plotting a network graph
gg1 <- plotCluster(raw.cluster,cluster.id=c("7"))
plot(gg1)
gg2 <- plotAlign(alignment,cluster.id=c("7"))
plot(gg2)
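
To make the cluster input more concrete, the sketch below parses a few lines in the CD-HIT ".clstr" layout using base R only. It is an illustration of the file format, not the implementation of "readCDHIT"; the function name parse_clstr, the transcript names, and the returned columns are assumptions made for this example.

```r
# Hypothetical .clstr content; transcript names and lengths are made up.
clstr_lines <- c(
  ">Cluster 0",
  "0\t2799nt, >tr1... *",
  "1\t2214nt, >tr2... at +/98.51%",
  ">Cluster 1",
  "0\t1500nt, >tr3... *"
)

# parse_clstr is a hypothetical helper, not part of AStrap.
parse_clstr <- function(lines) {
  is_header <- grepl("^>Cluster", lines)
  # Each member line belongs to the most recent ">Cluster" header above it.
  ids <- sub("^>Cluster +", "", lines[is_header])[cumsum(is_header)]
  members <- lines[!is_header]
  data.frame(
    cluster = ids[!is_header],
    transcript = sub("^.*>(.*?)\\.\\.\\..*$", "\\1", members, perl = TRUE),
    length = as.integer(sub("^.*\\s(\\d+)nt,.*$", "\\1", members, perl = TRUE)),
    # CD-HIT marks the representative sequence of a cluster with a trailing "*".
    representative = grepl("\\*\\s*$", members),
    stringsAsFactors = FALSE
  )
}

clusters <- parse_clstr(clstr_lines)
print(clusters)
```

Each row of the resulting data frame is one transcript with the cluster it was assigned to, which is the grouping that the pairwise alignment step then works within.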

Section 2 Feature construction

In AStrap, we have compiled a compendium of 511 unique features that cover major factors known to shape introns and/or exons. Feature construction is embedded in the function AStrap (see below), so users do not need to carry out this step themselves.

* Use function "extract_IsoSeq_tr" to extract sequence around splice sites based on the transcript sequences.

##Loading example data
load(system.file("data","sample_Aligndata.Rdata",package = "AStrap"))
##Extracting sequence around splice sites based on the transcript sequences
Aligndata <- extract_IsoSeq_tr(Aligndata,trSequence)
##Loading the consensus matrix of sequences of the [-2,+3] region of acceptor sites.
load(system.file("data","example_PWM_acceptor.Rdata",package = "AStrap"))
##Loading the consensus matrix of the sequences of the [-2,+3] region of donor sites
load(system.file("data","example_PWM_donor.Rdata",package = "AStrap"))
##Constructing the feature space
feature <- getFeature(Aligndata)
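
As a rough illustration of how a consensus matrix around splice sites can be turned into a numeric feature, the sketch below builds a small position weight matrix in base R and scores a sequence against it. This is not how "getFeature" is implemented; the 5-nt windows, the pseudocount, the uniform background, and the helper names build_pwm and score_pwm are all assumptions made for this example.

```r
# Made-up 5-nt windows around donor sites, for illustration only.
donor_windows <- c("AGGTA", "AGGTG", "CGGTA", "AGGTA", "TGGTA")
bases <- c("A", "C", "G", "T")

# build_pwm is a hypothetical helper: log-odds matrix with a pseudocount.
build_pwm <- function(seqs, pseudocount = 1) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # rows: sequences, columns: positions
  counts <- sapply(seq_len(ncol(chars)), function(j) {
    table(factor(chars[, j], levels = bases))
  })
  rownames(counts) <- bases
  smoothed <- counts + pseudocount
  freq <- sweep(smoothed, 2, colSums(smoothed), "/")
  log2(freq / 0.25)  # log-odds against a uniform background (an assumption)
}

# score_pwm is a hypothetical helper: sum of per-position log-odds.
score_pwm <- function(pwm, seq) {
  chars <- strsplit(seq, "")[[1]]
  sum(pwm[cbind(match(chars, bases), seq_along(chars))])
}

pwm <- build_pwm(donor_windows)
score_consensus <- score_pwm(pwm, "AGGTA")
score_unrelated <- score_pwm(pwm, "CCCCC")
print(c(consensus = score_consensus, unrelated = score_unrelated))
```

A window that resembles the training sites gets a higher score than an unrelated one, so the score can serve as one numeric column in a feature table.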

Section 3 Model building and performance evaluation

Two classification models, trained on collected AS data from rice and human, are integrated in AStrap and can be applied directly to distinguish AS types in other species. For classification of AS types, we applied and compared three widely used machine-learning techniques: support vector machine (SVM), random forests (RF), and adaptive boosting (AdaBoost). According to our analysis (see our paper), the RF-based model performed best, followed by the AdaBoost-based model, and the SVM-based model performed worst. We therefore recommend the RF-based model for prediction of AS types.

* Use rice classification model, including SVM, RF, AdaBoost.

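##Loading the rice models; load() returns the names of the loaded objects
rice_model <- load(system.file("data","rice_model.Rdata",package = "AStrap"))

##Loading the human models
human_model <- load(system.file("data","human_model.Rdata",package = "AStrap"))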

Users can also train a specific classification model on their own data sets.

* Use function "extract_IsoSeq_ge" to extract sequence around splice sites based on the genome.

##Loading example alternative splicing data
path <- system.file("extdata","sample_riceAS.txt",package = "AStrap")
rice_ASdata <- read.table(path,sep = "\t",header = TRUE,stringsAsFactors = FALSE)
##Loading genome using the package BSgenome
library("BSgenome.Osativa.MSU.MSU7")
##Extracting sequence around splice sites based on the genome
rice_ASdata <- extract_IsoSeq_ge(rice_ASdata,Osativa)
##Loading the packages used for model building and evaluation
library(randomForest)
library(ROCR)
library(ggplot2)
##Building an RF-based model (classifier = "rf")
model <- buildTrainModel(rice_ASdata, chooseNum = 100,
                          proTrain = 2/3, proTest = 1/3, ASlength = 0,
                          classifier = "rf", use.all = FALSE)
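
The proTrain = 2/3 and proTest = 1/3 arguments suggest a two-thirds/one-third split of the labelled events. The base R sketch below shows one way such a split can be drawn separately within each AS type. Whether "buildTrainModel" stratifies by type is not stated here, so the stratification, the helper name split_by_type, and the simulated data are assumptions made for this example.

```r
set.seed(1)

# Hypothetical labelled events: 30 per AS type with one made-up numeric feature.
events <- data.frame(
  type = rep(c("IR", "ES", "AltD", "AltA"), each = 30),
  feature1 = rnorm(120),
  stringsAsFactors = FALSE
)

# split_by_type is a hypothetical helper, not part of AStrap.
split_by_type <- function(df, label, pro_train = 2/3) {
  idx_by_type <- split(seq_len(nrow(df)), df[[label]])
  # Sample the training share within each type so class proportions are kept.
  train_idx <- unlist(lapply(idx_by_type, function(idx) {
    idx[sample.int(length(idx), size = round(length(idx) * pro_train))]
  }), use.names = FALSE)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

parts <- split_by_type(events, "type")
print(table(parts$train$type))
print(table(parts$test$type))
```

Sampling within each type keeps the four AS types equally represented in both parts, which makes per-type performance on the held-out third easier to compare.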

Section 4 Identification of AS events and prediction of AS types

This section describes the identification of AS events based on pairwise alignment of isoforms in the same cluster and the prediction of AS types based on the fitted model.

* Use function "AStrap" to identify AS events and predict AS types.

##Loading rice model
rice_model <- load(system.file("data","rice_model.Rdata",package = "AStrap"))
##Identification and prediction based on RF-based model of rice
result <- AStrap(alignment,trSequence,rice_RFmodel)

##Plotting selected AS events
library(Gviz)
plotAS(result$ASevent, id = 1)
plotAS(result$ASevent, id = 7)
plotAS(result$ASevent, id = 13)
plotAS(result$ASevent, id = 21)
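
One way to picture identification from pairwise alignments is to look for stretches of one isoform that are flanked by aligned blocks but are themselves unaligned to the other isoform. The base R sketch below finds such stretches from a table of aligned blocks. It is only an illustration under that assumption and does not reproduce the internal logic of the function AStrap; the helper name find_gaps, the min_gap parameter, and the coordinates are made up.

```r
# Hypothetical aligned blocks of a shorter isoform on a longer one
# (coordinates are positions on the longer isoform).
blocks <- data.frame(start = c(1, 301, 801), end = c(200, 650, 1200))

# find_gaps is a hypothetical helper, not part of AStrap.
find_gaps <- function(blocks, min_gap = 1) {
  blocks <- blocks[order(blocks$start), ]
  # A gap runs from just after one block's end to just before the next block's start.
  gap_start <- head(blocks$end, -1) + 1
  gap_end <- tail(blocks$start, -1) - 1
  gaps <- data.frame(start = gap_start, end = gap_end,
                     length = gap_end - gap_start + 1)
  gaps[gaps$length >= min_gap, , drop = FALSE]
}

gaps <- find_gaps(blocks)
print(gaps)
```

Each row is a candidate region whose flanking sequence could then be described by features and passed to a classifier to assign an AS type.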

Citation

If you are using AStrap, please cite: Ji G, Ye W, Su Y, Chen M, Huang G and Wu X* (2019) AStrap: identification of alternative splicing from transcript sequences without a reference genome, Bioinformatics, 35, 2654-2656.



BMILAB/AStrap documentation built on Nov. 20, 2020, 4:03 p.m.