SUMMA is an R package for unsupervised ensemble learning which combines predictions of multiple predictors. Ensemble learning is an elegant solution for the over-fitting problem in machine learning which is an important challenge in computational biology. The biggest advantage of SUMMA is that it is unsupervised so it does not require gold standard labels associated with each sample which is not available in most biomedical problems. Therefore, SUMMA can be applied to a wide range of biomedical problems including Network Inference, Differential Expression, Cancer Diagnosis or any other two class classification problem for which there is more than a single algorithm is available.
SUMMA needs as an input a prediction matrix, denoted as P, which represents the predictions of individual methods. The matrix P is of size m by n, where m is the number of samples (observations) and n is the number of methods. SUMMA supports two modes for the prediction matrix it can either have binary numbers {-1,1} in which case SUMMA will run the SML algorithm of Parisi et.al., or P may consists of real numbers that allows the algorithm to rank samples, in which case the SUMMA algorithm will be run on the data. In the rank prediction case by convention we assume the positive samples are more likely to have higher scores than samples belonging to the negative class. With the prediction matrix as an input, SUMMA will do the following:
SUMMA will rank methods based on their performance:
SUMMA will calculate weights for each methods proportional to its performance which also corresponds to the maximum likelihood estimator. Using the weights SUMMA will calculate an unsupervised ensemble classifier.
For more details about the theory of the SUMMA algorithm please check our paper [Ahsen et. al.] (https://arxiv.org/abs/1802.04684). In the SUMMA package, we also include the so-called WOC (Wisdom of Crowds) ensemble which assigns equal weight to each method. The WOC ensemble has shown robust performance throughout various crowdsourcing challenges such as the DREAM5 Network Inference Challenge [Marbach et. al.] (https://www.nature.com/nmeth/journal/v9/n8/abs/nmeth.2016.html). Next two sections contains some example codes for the binary and rank prediction cases.
For the convenience of the user, we have added create_predictions(m,n,p,type)
, where m is the number samples, n is the number of methods, p is the prevalence and type is type of output which either
the "binary"
or "rank"
. We can create a binary classification problem with m=3000,n=30
and
p=0.3
.
data_binary=create_predictions(3000,30,0.3,"binary")
For consistency issues, we will load the data included with the package
load(system.file("extdata", "data_binary.Rdata", package="summa"))
prediction_matrix=data_binary$predictions
actual_labels=data_binary$actual_labels
The function create_predictions
outputs a list where the first element of the list is the prediction matrix and second element is the gold standard labels associated with each sample. Note that the actual labels is not necessary to run SUMMA, however if you have it SUMMA has a convenience function to evaluate the results for the user. Next let us run summa:
summa=summa(prediction_matrix,"binary")
which gives an S4 class summa object.
The most relevant parts of the summa object is [email protected]_performance
which is the estimated balanced accuracy, and [email protected]_label
which is the estimated labels for each sample. If we also have the gold standard labels we can use the calculate_performance
function to calculate the actual performances as well as the performance of the unsupervised ensemble classifier. To do this we run
summa=calculate_performance(summa,actual_labels)
which returns again a summa object in which [email protected]_performance
performance is the actual performance of each method and [email protected]_performance
is the performance of the unsupervised ensemble. Moreover, [email protected]_performance
corresponds to the performance of the WOC (wisdom of the crowds) where one assign equal weights to each classifier. Finally, we have also added a convenience plot function which plots the actual v.s. estimated performances of each method as well as the performance of unsupervised ensemble and WOC
summa_plot(summa)
When the prediction matrix have continuous values that allows the samples to be ranked, the package runs the summa algorithm. The intuition and outputs in this case
are exactly the same as in the binary case but in this case the performance is the AUCROC instead of balanced accuracy. Again we create predictions using create_predictions
functions but this time we set
type=''rank''
and also we can create a prediction
matrix with m=2000
samples and n=40
methods with
prevalence p=0.7
data_rank=create_predictions(2000,40,0.7,"rank")
We will load the data included with the package
load(system.file("extdata", "data_rank.Rdata", package="summa"))
prediction_matrix=data_rank$predictions
actual_labels=data_rank$actual_labels
Next we run SUMMA using the simulated data
summa=summa(prediction_matrix,"rank")
where the only change from the binary case is that we set the second argument as "rank"
.
If we have the actual labels, we can again use the
calculate_performance
function to calculate the performances
summa=calculate_performance(summa,actual_labels)
where again [email protected]_performance
is the estimated AUCROC, and [email protected]_label
is the estimated labels for each sample. Different from the binary case, [email protected]_rank
gives a ranking of the samples predicted by the summa algorithm. This might be useful for comparing different samples. Finally, we can use the summa_plot
function to plot the performances
summa_plot(summa)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.