README.md
In BANDITS: BANDITS: Bayesian ANalysis of DIfferenTial Splicing

BANDITS: Bayesian ANalysis of DIfferenTial Splicing

BANDITS is a Bayesian hierarchical model for detecting differential splicing of genes and transcripts, via differential transcript usage (DTU), between two or more conditions. The method uses a Bayesian hierarchical framework, which allows for sample specific proportions in a Dirichlet-multinomial model, and samples the allocation of fragments to the transcripts. Parameters are inferred via Markov chain Monte Carlo (MCMC) techniques and a DTU test is performed via a multivariate Wald test on the posterior densities for the average relative abundance of transcripts.

Simone Tiberi and Mark D Robinson (2020). BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty.

Genome Biology 21 (69). doi: 10.1186/s13059-020-01967-8

BANDITS is available on Bioconductor and can be installed with the command:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BANDITS")

To install the latest development version of the package from github, use devtools (available here):

devtools::install_github("SimoneTiberi/BANDITS")

To install the package jointly with its vignette remove --no-build-vignettes from build_opts:

devtools::install_github("SimoneTiberi/BANDITS", build_opts = c("--no-resave-data", "--no-manual"))

The vignette illustrating how to use the package can be accessed on the Bioconductor website or from R via:

vignette("BANDITS")

browseVignettes("BANDITS")

The package inputs the equivalence classes and respective counts, representing what transcripts each read is compatible with. These can be obtained by aligning reads either directly to a reference transcriptome with pseudo-alignmers, via salmon or kallisto, or to a reference genome with splice-aware genome alignment algorithms, via STAR, and checking the transcripts compatible with each genome alignment with salmon

NOTE: when using salmon, use the option --dumpEq to obtain the equivalence classes, when using STAR, use the option --quantMode TranscriptomeSAM to obtain alignments translated into transcript coordinates, and when using kallisto, run both the quant and pseudo modes to obtain the transcript estimated counts and equivalence classes, respectively.

Below we show three pipelines for aligning reads with salmon, kallisto and STAR.

To obtain the example raw data, download or clone the ARMOR github repository:

git clone https://github.com/csoneson/ARMOR.git

# set a base_dir variable to the downloaded repo
base_dir="~/ARMOR"

# input reads:
fastq_files=$base_dir/example_data/FASTQ

The example data consits of four paired-end RNA-seq reads of 63 base pairs.

Create a variable for the fasta format reference transcriptome (cDNA):

fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz

Make the directory for the salmon output and the genome index

# create a directory for salmon
mkdir $base_dir/salmon

# create a directory for the genome index
mkdir $base_dir/salmon/Salmon_index
idx=$base_dir/salmon/Salmon_index

# create a directory for the output of the alignment
mkdir $base_dir/salmon/alignment
out_Salmon=$base_dir/salmon/alignment

Build salmon index

salmon index -i $idx -t $fasta_tr -p 4 --type quasi -k 31

Align reads and quantify transcript abundance with salmon:

salmon quant -i $idx -l A -1 $fastq_files/SRR1039508_R1.fastq.gz -2 $fastq_files/SRR1039508_R2.fastq.gz \
-p 4 -o $out_Salmon/sample1 --seqBias --gcBias --dumpEq

The option --dumpEq is essential to obtain the equivalence classes from salmon.

In the output folder ($out_Salmon/sample1), the file quant.sf contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in aux_info/eq_classes.txt.

Create a variable for the fasta format reference transcriptome (cDNA):

fasta_tr=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz

Make the directory for the kallisto output and the genome index

# create a directory for kallisto
mkdir $base_dir/kallisto

# create a directory for the genome index
mkdir $base_dir/kallisto/kallisto_index
idx=$base_dir/kallisto/kallisto_index

# create a directory for the output of the alignment
mkdir $base_dir/kallisto/alignment
out_kallisto_alignment=$base_dir/kallisto/alignment

# create a directory for the output of the equivalence classes
mkdir $base_dir/kallisto/pseudo
out_kallisto_pseudo=$base_dir/kallisto/pseudo

Build kallisto index

kallisto index -i $idx/transcripts.idx $fasta_tr

Align reads and quantify transcript abundance with kallisto:

kallisto quant -i $idx/transcripts.idx -o $out_kallisto_alignment/sample1 --bias --threads 4 \
$fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz

In the output folder ($out_kallisto_alignment/sample1), the file abundance.tsv contains the estimated transcripts abundances.

Compute the equivalence classes and respective counts with kallisto:

kallisto pseudo -i $idx/transcripts.idx -o $out_kallisto_pseudo/sample1  \
$fastq_files/SRR1039508_R1.fastq.gz $fastq_files/SRR1039508_R2.fastq.gz

In the output folder ($out_kallisto_pseudo/sample1), the file pseudoalignments.ec contains the transcripts forming each equivalence class, while pseudoalignments.tsv contains the equivalence classes counts.

Create a variable for the fasta format reference genome (DNA) and gtf file:

fasta=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.dna.chromosome.1.1.10M.fa
gtf=$base_dir/example_data/reference/Ensembl.GRCh38.93/Homo_sapiens.GRCh38.93.1.1.10M.gtf

Make the directory for the STAR output and the genome index

# create a directory for STAR
mkdir $base_dir/STAR

# create a directory for the genome index
mkdir $base_dir/STAR/genome_index
GDIR=$base_dir/STAR/genome_index

# create a directory for the output of the alignment
mkdir $base_dir/STAR/alignment
outDir=$base_dir/STAR/alignment

Generate the Genome index:

STAR --runMode genomeGenerate --runThreadN 4 --genomeDir $GDIR  \
       --genomeFastaFiles $fasta --sjdbGTFfile $gtf --sjdbOverhang 62

Note that sjdbOverhang ideally should be set to the lenght of the reads -1 (our reads are 63 bps).

Align reads with STAR:

cd $outDir
STAR --runMode alignReads --runThreadN 4 --genomeDir $GDIR \
--readFilesIn <(zcat $fastq_files/SRR1039508_R1.fastq.gz) <(zcat $fastq_files/SRR1039508_R2.fastq.gz) \
--outFileNamePrefix sample1 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM

When running STAR --quantMode TranscriptomeSAM is essential to obtain the transcript alignments.

Use gffread to build a reference transcriptome (fasta format) compatible with the DNA fasta and gtf files used for STAR:

gffread -w cDNA.fa -g $fasta $gtf
cdna=$outDir/cDNA.fa

Use salmon on the transcript alignments to compute the equivalence classes:

salmon quant -t $cdna -l A -a sample1Aligned.toTranscriptome.out.bam -o sample1 -p 4 --dumpEq

The option --dumpEq is essential to obtain the equivalence classes from salmon.

In the output folder ($outDir/sample1), the file quant.sf contains the estimated transcripts abundances, while the equivalence classes (and respective counts) are stored in aux_info/eq_classes.txt.