In Oshlack/Baller: Classification of RNA-seq data from ALL samples

Hello and welcome to the Acute Lymphoblastic Leukemia (ALL) Classifier. The aim of this tool is simple: to help classify different sub-types of ALL by gene expression from RNA seqeunced data. The tool will successfully classify a sample into one of the following four categories:

Phlike: Phildelphia like, a gene expression similar to that of the famous Philidelphia fuison (BCR-ABL)

ERG: A fusion resulting in an expression

ETV: A fusion resulting in an expression profile similar or same as that of ETV-RUNX1.

Other: A miscelleneous class containing a mixture of other sub-classes (e.g. MLL, High Hyperdiploidy, T-ALL e.t.c)

NOTE: By definition the classifier can ONLY designate a sample to one of the above four classes it was built on. However, if the probability for classification is less than threshold for all four categories the classifier returns as "Unclassified". This indicates that the gene expression profile was not similar enough to any of the classes with any confidence, however that does not mean it could not in reality be one of those types.

Right, so lets get stuck in to the functionality. This might be what a standard workflow could look like:

Read in the data and get in format:

cf <- system.file("data","test_data.txt",package="AllSorts") #Get path to raw text file (a tsv)
counts <- read.table(file=cf,sep=' ',stringsAsFactors = FALSE,header=TRUE)
head(counts)

Note that the row-names of counts must either be Gene Symbols or Ensembl Gene ID's.

Now use the streamline function to produce the log fpkm matrix with the genes required for classification:

library(AllSorts)
sfpkm <- streamline(counts[,c(1:6)],counts$Gene_Length)
head(sfpkm)

Once the FPKM has been constructed and subsetted purely on the genes required by the classifier one can simply input the dataset into the classifier:

threshes <- c(0.25,0.25,0.75,0.75)
classed <- classify(sfpkm,threshes)
classed

So as we can see, in this toy dataset, we classify two samples as ETV, one as ERG, and three others.

We may want to visualise our samples to see how different they are from themselves and whether the samples with same classification actually cluster together nicely:

visualise(sfpkm,classed)

Here the separation is clear and gives us confidence that we are distinguishing the classes well!

But if we have plenty of samples way may wish to get an overview of the probability landscape of which samples are more likely to be classed as one particular class as well as compare samples which have been classified as the same class. AllSorts provides a nice visaulisation for this: