In RJ333/phyloseq2ML: Connects Phyloseq To Ranger And Keras

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(phyloseq2ML)
library(phyloseq)

Before starting

You can also go directly to Let's start

Vignette organization

This vignette part serves as link between the different sections and lays out some prerequisites.

First look how to prepare your data coming from a phyloseq object for Machine Learning.

We then go into the Random Forest specific part of the workflow running a Ranger classification.

Complementary to that, I will explain how to run a classification with keras here. Please note that re-used functions explained in one of the vignettes above will not be explained again in detail.

I furthermore added some code examples on how to run both multiclass classification and regression with ranger and keras. As this only requires minimal adjustments, comments are kept to a minimum.

Background

If you are completely new to Machine Learning (ML) or even to R at all, you might want to do some reading first (you probably already have), such as 10 tips for machine learning with biological data. You should know what supervised machine learning is and at least roughly how Random Forests work. Check out this youtube channel for the best statistics and machine learning videos I've seen (there a lot of bad ones online!). Double Bam!

And for neural networks: my keras code is mostly based on this book Deep Learning with R. But I also included a lot of (very shallow) explanations along the way to give you the key words for easier recherche.

If you know your way around R, you should have a look at packages like caret, parsnip etc which try to provide a standardized interface to various machine learning and other regression methods.

Purpose of this package

In short: be able to use or test machine learning (currently Random Forest and Neural Networks) with microbiome data and corresponding environmental data.

How: By providing some functions that make use of phyloseqs standard data format. Phyloseq is the starting point as you can basically get all kinds of data into phyloseq format.

Additional: Providing wrappers to actual run ML using this package and get results with metrics. Maybe some more functions to look into the results. This is especially for not-hardcore users, who probably want to specify their own ML approaches. Using this package, the data is ready to go for that.

What does the package contain

Phyloseq object modification
Data preparation for binary and multi-class classification and regression
Generating a list of input data frames for ML
Running the full list and getting results with metrics for Random Forest and Artificial Neural Networks

Why are these specific methods implemented?

For historical reasons. I used Random Forest and Artificial Neurals Networks in my first research paper based on a lab experiment. We were now curious to see how the same methods perform in a real world setting. As this package mainly is a way of distributing the code that I use for my current work, these are the supported methods at the moment.

What does this package require?

obviously phyloseq, ranger and keras with tensorflow as backend
also using data.table for the oversampling function
logging (printing messages) via futile.logger, you can set the information urgency threshold
purrr and tidyr are used for the ML wrapper functions.
ggplot2 for plotting
fastDummies is used to turn factor columns into integer dummy columns
There is a package called speedyseq available on github, you can install it using:

devtools::install_github("mikemc/speedyseq")

It replaces a couple of phyloseq included functions with a faster version. For this package, we are only interested in a faster version of tax_glom, but as a phyloseq user, having a quicker psmelt function might be even more interesting.

Future vignette plans

Add a data augmentation step and evaluate results using unsupervised Random Forests
Use set.seed to run Random Forest and Neural networks with identically splitted input data and directly compare the results
Use some phyloseq included data sets for demonstration
Show how to modify the purrr:pmap-called ranger_* and keras_* functions to your needs
Include variable importance analysis for ranger results
Calculate p values for your models

Future package plans

If I find some time I would like to test and include gradient boosting

The sample data explained

The example phyloseq object is based upon a real data set, which is not published yet. As this package deals with the FORMAT rather then CONTENT I massively reduced and randomized the values here presented. This way, we have an interesting use case to be demonstrated in the vignette. Sample and microbial data was partly generated in the UDEMM project. I will explain below what the sample data involves:

Data describing the sediment sampling:

Sample_Type: Sediments from the surface or from a multicorer?
Collection: In what way was the sample retrieved?
Munition_near: Is this sediment known to be “near” ammunition (“near” refers to up to several meters).
Biological_replicate: Describes multiple sediments from the same station.
Depth_cm: Used for describing the depth of the multicorer core slices.
Metagenome: Does metagenomic data exist for this DNA extract?
Weight_loss: The difference in g mass before and after sediment freeze drying.
Notes: Additional remarks on collected sediments.
Direction: Cardinal directions, only applies to specific experiments, were sediments were taken in orientation to a mine.
Distance_from_mine: The distance in meters between a mine and the sediment. A very high value such as 99999999 stands for “not close to the sediment, but not exactly known”.
Distance_rel: If applicable, how much relative distance between two samples, e.g between between the two replicates for each multicorer core or along a transect.

Sequencing related information:

Extraction_ID: A unique identifier for each sequencing library.
Library: The name of each library.
Primerset: The name of the primer sets used: V4 stands for 5' GTGCCAGCMGCCGCGGTAA 3' reverse 5' GGACTACHVGGGTWTCTAAT 3' and V34 for forward: 5' CCTACGGGNGGCWGCAG 3', reverse: 5' GACTACHVGGGTATCTAATCC 3'.
Run: The Miseq sequencing run on which this sample’s data was generated.
Library_purpose: Negative/Positive/Blank control or actual sediment sample.
Nucleic_acid: 16S rRNA (RNA) or 16S rRNA gene (DNA).
Technical_replicate: Describes multiple libraries from the same sediment.
Kit: Which kit was used for extraction?
Kit_batch: Describes what number kit was used for extraction.
Extract_concentration_ng_microL: How much nucleic acid was extracted in ng/µL.

The measured concentrations o unexploded ordnance (UXO):

UXO_sum: The sum of all UXOs measured in pmol/g wet sediment.

Element concentrations and ratios:

Hg_microg_kg : Mercury in µg/kg dried sediment
Pb206_207_ratio: The ratio of lead isotopes 206 to 207 as indicator of anthropogenic lead contamination.

The following elements were measured in each sediment. They are named after this scheme: First the periodic table abbreviations are given (and optionally the atomic number), after the underscore the measurement unit.

P_percent
Ca_percent
Fe_percent
Sc45_ppm
V51_ppm
Cr52_ppm
Mn55_ppm
Co59_ppm
Ni60_ppm
Cu63_ppm
Zn66_ppm
As75_ppm
Sr86_ppm
Zr90_ppm
Mo95_ppm
Ag107_ppm
Cd111_ppm
Sn118_ppm
Sb121_ppm
Cs133_ppm
Ba137_ppm
W186_ppm
Tl205_ppm
Pb207_ppm
Bi209_ppm
Th232_ppm
U238_ppm
Sum_ppm_ICP_MS: The sum of all element concentrations in ppm.

Sum parameters:

TIC: Total inorganic carbon
TN: Total nitrogen
TC: Total carbon
TS: Total sulfur
TOC: Total organic carbon, calculated by TC – TIC.

Grain size distribution:

Microm_001_mean: fraction of sediment grain size above a mean of 0.01 µm.
Microm_63_mean: fraction of sediment grain size above a mean 63 µm.
Microm_125_mean fraction of sediment grain size above a mean 125 µm.
Microm_250_mean fraction of sediment grain size above a mean 250 µm.
Microm_500_mean fraction of sediment grain size above a mean 500 µm.
Microm_1000_mean fraction of sediment grain size above a mean 1000 µm.

RJ333/phyloseq2ML documentation built on June 2, 2020, 9:25 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

RJ333/phyloseq2ML
Connects Phyloseq To Ranger And Keras

In RJ333/phyloseq2ML: Connects Phyloseq To Ranger And Keras

Before starting

Vignette organization

Background

Purpose of this package

What does the package contain

Why are these specific methods implemented?

What does this package require?

Future vignette plans

Future package plans

The sample data explained

Data describing the sediment sampling:

Sequencing related information:

The measured concentrations o unexploded ordnance (UXO):

Element concentrations and ratios:

Sum parameters:

Grain size distribution:

R Package Documentation

Browse R Packages

We want your feedback!

RJ333/phyloseq2ML Connects Phyloseq To Ranger And Keras

In RJ333/phyloseq2ML: Connects Phyloseq To Ranger And Keras

Before starting

Vignette organization

Background

Purpose of this package

What does the package contain

Why are these specific methods implemented?

What does this package require?

Future vignette plans

Future package plans

The sample data explained

Data describing the sediment sampling:

Sequencing related information:

The measured concentrations o unexploded ordnance (UXO):

Element concentrations and ratios:

Sum parameters:

Grain size distribution:

R Package Documentation

Browse R Packages

We want your feedback!

RJ333/phyloseq2ML
Connects Phyloseq To Ranger And Keras