knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(phyloseq2ML)
library(phyloseq)

Before starting

You can also skip ahead to "Let's start".

Vignette organization

This part of the vignette links the different sections and lays out some prerequisites.

First, look at how to prepare data from a phyloseq object for machine learning.

We then go into the Random Forest-specific part of the workflow by running a ranger classification.

Complementary to that, I will explain how to run a classification with keras here. Please note that functions already explained in one of the vignettes above are not explained again in detail.

I also added some code examples on how to run both multiclass classification and regression with ranger and keras. As this requires only minimal adjustments, comments are kept to a minimum.

Background

If you are completely new to Machine Learning (ML), or even to R, you might want to do some reading first (you probably already have), such as 10 tips for machine learning with biological data. You should know what supervised machine learning is and at least roughly how Random Forests work. Check out this YouTube channel for the best statistics and machine learning videos I've seen (there are a lot of bad ones online!). Double Bam!

And for neural networks: my keras code is mostly based on the book Deep Learning with R. I also included a lot of (very shallow) explanations along the way to give you the keywords for easier searching.

If you know your way around R, you should have a look at packages like caret, parsnip, etc., which try to provide a standardized interface to various machine learning and regression methods.

Purpose of this package

In short: be able to use or test machine learning (currently Random Forest and Neural Networks) with microbiome data and corresponding environmental data.

How: By providing some functions that make use of phyloseq's standard data format. phyloseq is the starting point, as you can get basically all kinds of data into phyloseq format.
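As a rough illustration of what "phyloseq's standard data format" means in practice, the sketch below pulls the usual component tables out of a phyloseq object. It assumes the `GlobalPatterns` example data shipped with phyloseq as a stand-in for your own object; it does not use any phyloseq2ML function.

```r
library(phyloseq)

# GlobalPatterns ships with phyloseq; it stands in for your own data here
data(GlobalPatterns)

# The tables a phyloseq object typically bundles
counts   <- otu_table(GlobalPatterns)    # abundance of each taxon per sample
samples  <- sample_data(GlobalPatterns)  # sample / environmental metadata
taxonomy <- tax_table(GlobalPatterns)    # taxonomic ranks for each taxon

# The orientation of the count table matters when building an ML input table
taxa_are_rows(GlobalPatterns)

# Convert to plain R objects for inspection
count_matrix <- as(counts, "matrix")
sample_df    <- as(samples, "data.frame")
dim(count_matrix)
```

Any data set that can be brought into this shape (counts, sample metadata, taxonomy) can serve as a starting point.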

Additional: Providing wrappers to actually run ML using this package and get results with metrics, and maybe some more functions to look into the results. The wrappers are mainly meant for less experienced users; advanced users will probably want to specify their own ML approaches, and with this package the data is ready to go for that.

What does the package contain?

Why are these specific methods implemented?

For historical reasons. I used Random Forest and Artificial Neural Networks in my first research paper, which was based on a lab experiment. We were then curious to see how the same methods perform in a real-world setting. As this package is mainly a way of distributing the code that I use for my current work, these are the supported methods at the moment.

What does this package require?

devtools::install_github("mikemc/speedyseq")

speedyseq replaces a couple of functions included in phyloseq with faster versions. For this package, we are only interested in the faster version of tax_glom, but as a phyloseq user, you might find the quicker psmelt function even more interesting.
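A minimal sketch of how the faster `tax_glom` could be used, assuming the `GlobalPatterns` example data from phyloseq. The `glom` helper name is only for illustration; it falls back to phyloseq's own `tax_glom` when speedyseq is not installed, so the example runs either way.

```r
library(phyloseq)

data(GlobalPatterns)

# Use speedyseq's tax_glom if it is installed, otherwise phyloseq's version.
# `glom` is a hypothetical helper name used only in this sketch.
glom <- if (requireNamespace("speedyseq", quietly = TRUE)) {
  speedyseq::tax_glom
} else {
  phyloseq::tax_glom
}

# A small subset keeps the example fast
ps_small <- prune_taxa(taxa_names(GlobalPatterns)[1:200], GlobalPatterns)

# Agglomerate all taxa that share the same phylum
ps_phylum <- glom(ps_small, taxrank = "Phylum")

ntaxa(ps_small)
ntaxa(ps_phylum)
```

Both functions take the same `taxrank` argument, so swapping one for the other should not require changes elsewhere in your code.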

Future vignette plans

Future package plans

If I find some time, I would like to test and include gradient boosting.

The sample data explained

The example phyloseq object is based on a real data set that is not published yet. As this package deals with the FORMAT rather than the CONTENT, I massively reduced and randomized the values presented here. This way, we still have an interesting use case to demonstrate in the vignette. Sample and microbial data were partly generated in the UDEMM project. Below I explain what the sample data involves:

Data describing the sediment sampling:

Sequencing related information:

The measured concentrations of unexploded ordnance (UXO):

Element concentrations and ratios:

The following elements were measured in each sediment. They are named after this scheme: first the periodic table abbreviation (optionally followed by the atomic number), then, after the underscore, the measurement unit.
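The naming scheme can be taken apart with a regular expression, which may be handy when selecting or relabelling these variables. This is only a sketch: the column names below are hypothetical examples that follow the described scheme, not necessarily the names used in the sample data.

```r
# Hypothetical column names following the scheme described above
element_cols <- c("Fe_mg_g", "Ca20_mg_g", "P_percent")

# <element symbol><optional number>_<measurement unit>
pattern <- "^([A-Z][a-z]?)([0-9]*)_(.+)$"
parts <- regmatches(element_cols, regexec(pattern, element_cols))

# Each list entry holds the full match followed by the three capture groups
parsed <- data.frame(
  column  = element_cols,
  element = vapply(parts, `[`, character(1), 2),
  number  = vapply(parts, `[`, character(1), 3),  # empty string if absent
  unit    = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
parsed
```

Because only the first underscore separates element and unit, units that themselves contain underscores (such as `mg_g`) stay intact.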

Sum parameters:

Grain size distribution:



RJ333/phyloseq2ML documentation built on June 2, 2020, 9:25 p.m.