The predomics package offers access to an original machine learning framework implementing several heuristics that discover sparse and interpretable models in large datasets. These models are efficient and suited to classification and regression tasks in metagenomics and in other datasets with commensurable variables. We introduce the custom BTR (BIN, TER, RATIO) languages, which describe different types of associations between variables. In the same framework we have also implemented several state-of-the-art (SOTA) methods, including random forest (RF), elastic net (ENET) and support vector machines (SVM). The predomics package started in 2015 and has evolved quickly since, with a major improvement in 2023. The package also comes with predomicsapp, an R Shiny application for easy training and exploration of results.
We introduce here the predomics package, which is designed to search for simple and interpretable predictive models in omics data, and more specifically metagenomics data. These models, called BTR (for Bin/Ter/Ratio), are based on a novel family of languages designed to represent the interactions in microbial ecosystems. Moreover, in this package we have proposed four different optimization heuristics that make it possible to discover some of the best predictive models. A model in predomics is a set of indexes from the dataset (i.e. variables) along with their respective coefficients, belonging to the ternary set {-1, 0, 1}, and an intercept, giving a rule of the form (A + B + C - K - L - M < intercept). The number of variables in a model, also known as model size, sparsity or parsimony, can vary within a range provided as a parameter to the experiment.
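As a minimal sketch of the definition above, a model can be pictured as a vector of variable indexes, a matching vector of coefficients in {-1, 0, 1} and an intercept. The base R code below does not call the predomics API: the names toy_model and score_model, the field names and the toy data are hypothetical and only illustrate how a rule of this form could be evaluated.

```r
# Toy data (made-up values): 4 samples (rows) x 6 variables (columns).
X <- matrix(
  c(5, 1, 0, 2, 0, 1,
    0, 0, 1, 4, 3, 2,
    2, 2, 2, 1, 1, 1,
    0, 1, 0, 0, 5, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("A", "B", "C", "K", "L", "M"))
)

# Hypothetical model representation: indexes, ternary coefficients, intercept.
toy_model <- list(
  indexes      = 1:6,
  coefficients = c(1, 1, 1, -1, -1, -1),
  intercept    = 0
)

# Score of each sample = sum of coefficient * value over the selected variables.
score_model <- function(model, X) {
  as.vector(X[, model$indexes, drop = FALSE] %*% model$coefficients)
}

scores <- score_model(toy_model, X)

# Rule of the form (A + B + C - K - L - M < intercept).
predicted <- scores < toy_model$intercept

# Model size = number of variables with a non-zero coefficient.
model_size <- sum(toy_model$coefficients != 0)
```

Here the model size is 6, and the rule holds only for the samples in which the K, L, M group outweighs the A, B, C group.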
In predomics we have implemented the following types of objects:
- model, which can be tested with isModel(), is a list that contains information on the features, the language, the fitting scores, etc.
- population of models, which can be tested with isPopulation(), is a list of model objects.
- model collection, which can be tested with isModelCollection(), is a list of population of models objects, grouped by model size.
- classifier, which can be tested with isClf(), is a set of parameters that defines a learner ready to be run. This object also contains all the information used during training, and it is passed to most of the functions for easy handling of parameters.
- experiment, which can be tested with isExperiment(), is a top-level object that contains a classifier object along with the learned models, organized as a model collection object.
All these objects can be quickly viewed with the printy() function. Other functions convert one object type into another, for instance modelCollectionToPopulation(). An experiment can be explored using the digest() routine, along with many other functions implemented in the more than 18K lines of code that compose this package.
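The nesting of these objects can be pictured with plain R lists. The sketch below is only an analogy in base R: make_toy_model and the field names (indexes, coefficients, size, fit) are hypothetical, the fit values are invented, and none of this reproduces the internal structure of the package's objects. The last step mimics the idea behind a collection-to-population conversion without calling modelCollectionToPopulation() itself.

```r
# Hypothetical stand-in for a model object: a plain list with a few fields.
make_toy_model <- function(indexes, coefficients, fit) {
  list(indexes = indexes, coefficients = coefficients,
       size = length(indexes), fit = fit)
}

# A "population" is pictured here as a plain list of models.
population <- list(
  make_toy_model(c(1, 4),    c(1, -1),    fit = 0.71),
  make_toy_model(c(2, 5),    c(1, -1),    fit = 0.68),
  make_toy_model(c(1, 2, 6), c(1, 1, -1), fit = 0.80)
)

# A "model collection" groups the models into one population per model size.
sizes <- vapply(population, function(m) m$size, integer(1))
collection <- split(population, sizes)

# Flattening the collection by one level gives back a single population.
flattened <- unlist(collection, recursive = FALSE, use.names = FALSE)
```

In this toy case the collection has two entries, named "2" and "3", and flattening it returns the three original models.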
In this package we have proposed four different heuristics to search for the best predictive models.
- [...] (a, b) for two consecutive model sizes (k, k+1). The best models of model size k are used to generate combinations of model size k+1, and the best ones are kept for the next round. The size of the windows is fixed in the parameters at the beginning of the experiment. The results can also be considered as a population of models. This method shows the best performance so far for a given time budget.
- [...] {-1, 0, 1}.
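The window-based heuristic described above, in which the best models of size k seed the candidates of size k+1, can be sketched as a small beam search. This is a hedged illustration in base R, not the package's implementation: the toy data, the fitness function and the names keep_best, expand and beam_search are hypothetical, and a single width parameter stands in for the (a, b) window, whose exact meaning is not given here.

```r
# Toy data (made up): 8 samples x 5 variables, two classes.
X <- matrix((1:40 * 7) %% 5, nrow = 8)
y <- rep(c(0, 1), each = 4)

# Hypothetical fitness: separation of the class means of the summed variables.
fitness <- function(idx) {
  s <- rowSums(X[, idx, drop = FALSE])
  abs(mean(s[y == 1]) - mean(s[y == 0]))
}

# Keep the `width` best models of a list (each model is a vector of indexes).
keep_best <- function(models, width) {
  fit <- vapply(models, fitness, numeric(1))
  models[order(fit, decreasing = TRUE)][seq_len(min(width, length(models)))]
}

# Build size k+1 candidates by adding one unused variable to each size k model.
expand <- function(models, p) {
  out <- list()
  for (m in models) {
    for (j in setdiff(seq_len(p), m)) {
      out[[length(out) + 1]] <- sort(c(m, j))
    }
  }
  unique(out)  # drop duplicate variable sets reached from different parents
}

beam_search <- function(p, max_size, width) {
  level <- keep_best(as.list(seq_len(p)), width)
  results <- list(level)
  for (k in seq_len(max_size - 1)) {
    level <- keep_best(expand(level, p), width)
    results[[k + 1]] <- level
  }
  results  # results[[k]] holds the kept models of size k
}

res <- beam_search(p = ncol(X), max_size = 3, width = 4)
```

Each element of res can itself be read as a small population of models of one size, which matches the remark that the results can be considered as a population of models.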
A predomics model is coded in R as an S3 object, which contains a number of attributes, among them the learner (algorithm) that generated it and the language that is used. The languages we have proposed in the current version are the following.
- [...] (A + B + C < intercept). In a bin/bininter language we have only two coefficients, from the binary set {0, 1}. Features that do not appear in the model have the coefficient 0, while the others have the coefficient 1. The difference between bin and bininter is that the intercept for the former is set to zero.
- [...] (A + B + C - K - L - M < intercept) we can see that in this model of size k=6, we have coefficients from the ternary set {-1, 0, 1}. Features that do not appear in the model have the coefficient 0, while the others have 1 or -1. Models that have only positive or only negative coefficients are not considered, as they would be bin models. The difference between ter and terinter is that the intercept for the former is set to zero.
- [...] (A + B + C) / (K + L + M) < intercept.
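To make the three forms concrete, the sketch below computes each score for a single toy sample in base R. The values, the helper names linear_score and ratio_score, and the intercept are invented for illustration. Reading the ratio form as "variables with coefficient 1 over variables with coefficient -1" is an assumption, not something stated by the package.

```r
# Toy abundances for one sample (made-up values).
x <- c(A = 0.2, B = 0.1, C = 0.3, K = 0.05, L = 0.1, M = 0.15)

# Coefficients over the same variables: 1, -1, or 0 (variable not in the model).
coef_bin <- c(A = 1, B = 1, C = 1, K = 0,  L = 0,  M = 0)
coef_ter <- c(A = 1, B = 1, C = 1, K = -1, L = -1, M = -1)

# bin / ter forms: weighted sum of the variables.
linear_score <- function(x, coef) sum(coef * x)

# ratio form (assumed reading): positive group divided by negative group.
ratio_score <- function(x, coef) sum(x[coef == 1]) / sum(x[coef == -1])

bin_score   <- linear_score(x, coef_bin)  # A + B + C
ter_score   <- linear_score(x, coef_ter)  # A + B + C - K - L - M
ratio_value <- ratio_score(x, coef_ter)   # (A + B + C) / (K + L + M)

# Each form then compares its score with an intercept (hypothetical value).
intercept <- 0.5
decision_ter <- ter_score < intercept
```

The ter model here has size 6, while the bin model uses only the three variables with coefficient 1.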
If you have any questions or feedback, please contact us at: