```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(phyloseq2ML)
library(phyloseq)
```
You can also go directly to the Let's start section.
This part of the vignette links the different sections together and lays out some prerequisites.
First, look at how to prepare your data coming from a phyloseq object for machine learning.
We then go into the Random Forest-specific part of the workflow, running a ranger classification.
Complementary to that, I will explain how to run a classification with keras. Please note that re-used functions explained in one of the vignettes above will not be explained again in detail.
I have furthermore added some code examples showing how to run both multiclass classification and regression with ranger and keras. As this requires only minimal adjustments, comments are kept to a minimum.
If you are completely new to Machine Learning (ML), or even to R at all, you might want to do some reading first (you probably already have), such as 10 tips for machine learning with biological data. You should know what supervised machine learning is and at least roughly how Random Forests work. Check out this YouTube channel for the best statistics and machine learning videos I've seen (there are a lot of bad ones online!). Double Bam!
And for neural networks: my keras code is mostly based on the book Deep Learning with R. I also included a lot of (very shallow) explanations along the way to give you the key words for easier further research.
If you know your way around R, you should have a look at packages like caret, parsnip, etc., which try to provide a standardized interface to various machine learning and other regression methods.
In short: be able to use or test machine learning (currently Random Forest and Neural Networks) with microbiome data and corresponding environmental data.
How: by providing some functions that make use of phyloseq's standard data format. Phyloseq is the starting point, as you can get basically all kinds of data into phyloseq format.
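As a reminder of what that starting point looks like, here is a minimal sketch of building a phyloseq object from a count matrix and sample data. All object and variable names (`counts`, `samples`, `ps`) are made up for illustration; only `phyloseq()`, `otu_table()` and `sample_data()` are actual phyloseq functions.

```r
library(phyloseq)

# hypothetical count matrix: taxa in rows, samples in columns
counts <- matrix(
  sample(0:100, 12), nrow = 4,
  dimnames = list(paste0("OTU", 1:4), paste0("Sample", 1:3))
)
# hypothetical sample metadata, row names matching the sample names
samples <- data.frame(
  Location = c("A", "B", "A"),
  row.names = paste0("Sample", 1:3)
)
# combine both into one phyloseq object
ps <- phyloseq(
  otu_table(counts, taxa_are_rows = TRUE),
  sample_data(samples)
)
ps
```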
Additional: providing wrappers to actually run ML using this package and get results with metrics, and maybe some more functions to look into the results. This is especially for non-hardcore users; those who prefer to specify their own ML approaches will find the data prepared by this package ready to go for that.
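To illustrate the "ready to go" part, here is a sketch of running a plain ranger classification yourself on a prepared data frame. The data frame `train_df` with a factor column `class` is an assumption for illustration, not the output of a specific phyloseq2ML function; `ranger()` and its arguments are the real API.

```r
library(ranger)

# hypothetical training data: two feature columns and a factor response
set.seed(42)
train_df <- data.frame(
  OTU1  = runif(20),
  OTU2  = runif(20),
  class = factor(rep(c("high", "low"), each = 10))
)
# train a Random Forest classification on the prepared data frame
rf <- ranger(
  dependent.variable.name = "class",
  data      = train_df,
  num.trees = 500,
  seed      = 42
)
rf$prediction.error  # out-of-bag misclassification rate
```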
For historical reasons: I used Random Forest and Artificial Neural Networks in my first research paper, which was based on a lab experiment. We were now curious to see how the same methods perform in a real-world setting. As this package is mainly a way of distributing the code that I use for my current work, these are the supported methods at the moment.
- phyloseq, ranger and keras with tensorflow as backend
- data.table for the oversampling function
- futile.logger, where you can set the information urgency threshold
- purrr and tidyr are used for the ML wrapper functions
- ggplot2 for plotting
- fastDummies is used to turn factor columns into integer dummy columns
- speedyseq, available on GitHub; you can install it using `devtools::install_github("mikemc/speedyseq")`
It replaces a couple of phyloseq-included functions with faster versions. For this package, we are only interested in a faster version of tax_glom, but as a phyloseq user, having a quicker psmelt function might be even more interesting.
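A small sketch of how the optional dependency could be used: agglomerate at a given rank with speedyseq's tax_glom if it is installed, and fall back to the phyloseq version otherwise. The helper name `glom_fast` and the argument names are made up; `speedyseq::tax_glom()` and `phyloseq::tax_glom()` share the same interface, and `ps` stands for any phyloseq object that carries a taxonomy table.

```r
# hypothetical helper: prefer speedyseq's faster tax_glom when available
glom_fast <- function(ps, rank = "Genus") {
  if (requireNamespace("speedyseq", quietly = TRUE)) {
    speedyseq::tax_glom(ps, taxrank = rank)
  } else {
    phyloseq::tax_glom(ps, taxrank = rank)
  }
}
```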
You can adapt the purrr::pmap-called ranger_* and keras_* functions to your needs. If I find some time, I would like to test and include gradient boosting.
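For orientation, this is the generic purrr::pmap pattern the wrapper functions rely on: each row of the parameter list becomes one call to the worker function. The parameter names (`num.trees`, `mtry`) and the worker `run_once` are illustrative, not the package's actual internals.

```r
library(purrr)

# one entry per parameter, one element per planned run
params <- list(
  num.trees = c(100, 500),
  mtry      = c(2, 4)
)
# worker function; in the package this would be a ranger_* or keras_* call
run_once <- function(num.trees, mtry) {
  paste("trees:", num.trees, "mtry:", mtry)
}
# calls run_once(100, 2), then run_once(500, 4)
pmap_chr(params, run_once)
```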
The example phyloseq object is based upon a real data set, which is not published yet. As this package deals with the FORMAT rather than the CONTENT, I massively reduced and randomized the values presented here. This way, we have an interesting use case to demonstrate in the vignette. Sample and microbial data were partly generated in the UDEMM project. I will explain below what the sample data involves:
The following elements were measured in each sediment. They are named according to this scheme: first the periodic table abbreviation is given (and optionally the atomic number), then, after the underscore, the measurement unit.
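As a quick illustration of the scheme, here are some made-up column names (the actual element columns in the example data may differ) and how to split off the element part at the first underscore:

```r
# hypothetical column names: element abbreviation (optionally with
# atomic number), then the measurement unit after the underscore
element_cols <- c("P_mug_g", "Fe26_mug_g")

# strip everything from the first underscore to recover the element part
sub("_.*", "", element_cols)  # "P" "Fe26"
```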