
Let's start

First we load the required libraries. I have speedyseq installed, but you can also use phyloseq. I'm also using a logging package named futile.logger (flog). It offers different notification thresholds, so you can choose whether you want to see only warnings or also informational messages. My package is quite verbose to make you aware of potential problems early on, so you might want to reduce the amount of printed output.
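If you prefer less output, you can raise the logging threshold before you start; for example:

# show only warnings and errors, suppress informational messages
futile.logger::flog.threshold("WARN")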

The roadmap to prepare phyloseq data for machine learning

  1. Define the phyloseq objects, meaning we use phyloseq functions for subsetting by sample_data entries.
  2. Apply abundance-based adjustments and select a taxonomic rank for each list item.
  3. Add predictor variables from the sample_data slot to our input.
  4. Add the variables from our sample_data which should be predicted by ML.
  5. Split the data set into training and test sets: each named list item becomes a list itself, containing two data frames, one for training/validation and one for testing.
  6. For neural networks, add one-hot encoding, scaling and keras formatting to the list.

Modification of the phyloseq object

The example set uses ASVs generated by dada2, but if you have OTUs (mothur), genera (SILVAngs) or any other count data, just substitute those in your head. This package is not limited to ASVs.

library(phyloseq)  # or library(speedyseq)
library(phyloseq2ML)
futile.logger::flog.threshold("INFO")

We can load the included phyloseq object and have a look at its dimensions.

data(TNT_communities)
TNT_communities

As you can see, it is a very small data set which would not make any sense for ML, but it serves for demonstration purposes. The data set is about microbial communities in sediments where ammunition was sea-dumped after WWII, in the Kolberger Heide near Kiel, Germany. We will now add unique lineages for each ASV, independent of its taxonomic annotation. This way, we don't lose taxa that are not annotated at e.g. Genus rank. Check the difference here:

# modify phyloseq object
TNT_communities2 <- add_unique_lineages(TNT_communities)
colnames(tax_table(TNT_communities))
colnames(tax_table(TNT_communities2))

Depending on your phyloseq object, it could be useful to re-order your lowest-rank taxa. In this case, we already have ASVs, but after I subsetted the data to generate this example, the numbering is off. If you import data from dada2, your taxa names might even still be the actual sequences. In that case, you can set use_sequences = TRUE, so the sequences get moved to the refseq() slot of the phyloseq object (the example data contained the refseq() data, but it makes the phyloseq object quite large, so I removed it). You can specify the name prefix, such as "OTU" or "ASV", and the function adds the appropriate number of leading zeros to the running number. Your taxa_prefix must not contain an underscore (_). An example of a dada2 workflow creating actual sequences as taxa names is here.

taxa_names(TNT_communities)[1:10]
testps <- standardize_phyloseq_headers(
  phyloseq_object = TNT_communities2, taxa_prefix = "ASV", use_sequences = FALSE)
taxa_names(testps)[1:10]
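If your taxa names are still the raw dada2 sequences, the call could look like this; a minimal sketch, assuming a hypothetical object ps_from_dada2 whose taxa_names() are DNA sequences:

# ps_from_dada2 is a placeholder for your own dada2-derived phyloseq object
ps_renamed <- standardize_phyloseq_headers(
  phyloseq_object = ps_from_dada2, taxa_prefix = "ASV", use_sequences = TRUE)
# the original sequences are then stored in refseq(ps_renamed)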

Another useful function creates a taxonomy dictionary, which allows you to translate your ASV IDs to the corresponding taxonomic annotation, one tax rank at a time.

# translate ASVs to genus
levels_tax_dictionary <- c("Order", "Family", "Genus")
taxa_vector_list <- create_taxonomy_lookup(testps, levels_tax_dictionary)
translate_ID(ID = c("ASV01", "ASV03", "ASV17"), tax_rank = "Genus", taxa_vector_list)

This is weird: we only get two return values for the 3 ASVs. If we look into the function's documentation using ?translate_ID, we can see that the default value of na.rm is TRUE, which means that unannotated ranks are removed. Let's see which ASV was not annotated at genus level:

translate_ID(ID = c("ASV01", "ASV03", "ASV17"), tax_rank = "Genus", taxa_vector_list, na.rm = FALSE)

Create a list of subsetted phyloseq objects

Let's have an (incomplete) look what our sample_data slot contains:

head(names(sample_data(testps)), 10)

So there is a lot of additional context data. If you want to know what each column name refers to, have a look at The sample data explained. The example data set was generated by subsetting to Primerset == "V4" and Sample_Type == "Surface", which is why I would call this phyloseq object "vignette_V4_surface".

We will use the object names to store information about the parameters used along the workflow. The pieces of information are separated by "_". This allows us to retrieve specific values by position at the end.

This happens automatically if you adhere to this naming pattern: your initial name for each phyloseq object needs to have three sections separated by two underscores, e.g. vignette_V4_surface. The first part, "vignette", will be extracted as Subset_1, "V4" as Subset_2 and "surface" as Subset_3. The functions in this package will then add more information to the name, each piece separated by a new "_".
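As a minimal sketch of this retrieval (the name suffix here is made up for illustration; the real suffixes depend on the functions you run):

# hypothetical list item name after later steps appended their parameters
example_name <- "vignette_V4_surface_ASV_0.01_1"
unlist(strsplit(example_name, split = "_"))
# positions 1 to 3 hold Subset_1 ("vignette"), Subset_2 ("V4"), Subset_3 ("surface")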

Let's create a second phyloseq object and then put both as named list items in a list. (We will use named lists a lot, so you should know a little bit about them!)

# only replicate #2 of sediment samples
second_phyloseq_object <- subset_samples(testps, Technical_replicate == 2)
subset_list <- list(
  vignette_V4_all = testps,
  vignette_V4_replicate2 = second_phyloseq_object
)
str(subset_list, max.level = 2)  # max.level defines how many list levels we want to print

A quick note here: if you want to predict several variables based on different conditions, do this in separate analyses. In particular, don't mix e.g. classification and regression tasks.

Also: for classification, we will use a function to categorize continuous values into classes. The limits need to be the same for all response variables. So if variable 1 and variable 2 both contain nutrient concentrations, and variable 1 should be split at a concentration of 5 into the two classes "below_5" and "above_5", whereas variable 2 should be split at 3 into "below_3" and "above_3", these should be separate runs of your workflow, as shown in the sketch below.
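In code, such separate runs could look like the following sketch, using the categorization function introduced further below; variable_1_data and variable_2_data are hypothetical response tables:

# run 1: split variable 1 at a concentration of 5
responses_var1 <- categorize_response_variable(
  ML_mode = "classification", response_data = variable_1_data,
  my_breaks = c(-Inf, 5, Inf), class_labels = c("below_5", "above_5"))
# run 2 (a separate workflow run): split variable 2 at 3
responses_var2 <- categorize_response_variable(
  ML_mode = "classification", response_data = variable_2_data,
  my_breaks = c(-Inf, 3, Inf), class_labels = c("below_3", "above_3"))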

Filter the phyloseq objects and choose taxonomic rank

After we have defined the phyloseq objects, we can now manipulate their community tables (the content of the otu_table() slots). For one, we can select how abundant ASVs have to be to be kept. The counts are converted to relative abundances in percent. The filter function needs to know how abundant an ASV has to be, and in how many samples, for the ASV to be kept. In code this looks like sum(x > threshold) >= num_samples, with x being the current ASV's abundances. As an example, a threshold of 0.01 and num_samples of 2 would mean that each ASV has to have more than 0.01 % relative abundance in at least 2 samples.
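To make the rule concrete, here is a tiny standalone illustration of the criterion (plain R, not the package function itself):

# one ASV's relative abundance (%) across four samples
asv_abundance <- c(0.005, 0.02, 0.3, 0)
threshold <- 0.01
num_samples <- 2
sum(asv_abundance > threshold) >= num_samples  # TRUE, so this ASV would be kept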

If this requirement is fulfilled, the ASV is kept in all samples, including those where it is less abundant than the threshold. This way we keep the full range of the ASV's abundance across samples, which is valuable information for ML. We can also cumulate the abundances of ASVs at a higher taxonomic rank. For e.g. the genus rank this means that all ASVs of the same genus are summed up and we continue at genus-rank taxonomic resolution. Here we can make use of our added "To_Genus" columns in the tax_table(), which allow us to keep taxa such as ..._Pseudomonadales_Pseudomonadaceae_NA_ASV0007 and ..._Pseudomonadales_Moraxellaceae_NA_ASV0013 separate, although both are annotated as NA at "Genus". We should also specify how the list items' names should be complemented. These names should be the original tax rank names, here "Genus" and "Family". The original rank, such as ASV, is already included.

# define subsetting parameters
selected_taxa_1 <- setNames(c("To_Genus", "To_Family"), c("Genus", "Family"))
subset_list_rel <- to_relative_abundance(subset_list)
subset_list_tax <- create_community_table_subsets(
  subset_list = subset_list_rel, 
  thresholds = c(5, 1.5),
  taxa_prefix = "ASV",
  num_samples = 1,
  tax_ranks = selected_taxa_1)

You can see how the applied filter threshold and tax rank values are stored in the name, ready for later retrieval.

names(subset_list_tax)

Note: First the ASVs are filtered, then the remaining entries are summed on the specified tax ranks.

After the filtering and accumulation steps, we extract the otu_table() slots from the phyloseq objects and turn them into regular data frames.

subset_list_df <- otu_table_to_df(subset_list = subset_list_tax)
head(subset_list_df[[1]], 2)[1:10] # look into the first list item

Add sample data columns to our community data

So, well done, we have a list of community table data frames. However, you may want to use e.g. environmental variables you measured to support your ML with additional information. As phyloseq already organizes this data for us, we can make use of it and select which sample_data() columns we want to include. You can use names(sample_data(testps)) again to see what is available and str(sample_data(testps)) to see whether a column contains numerical or categorical data.
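For example:

names(sample_data(testps))  # which columns are available
str(sample_data(testps))    # numerical vs. categorical columns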

Note: This is still the setup for the "independent" or "predictor" variables. The "dependent", "response" or "target" variable will be specified afterwards.

Let's pick two numerical columns, the amount of phosphorus measured after extracting the sediment with 0.5 M HCl and the total organic carbon. There is also a factor column telling us on which cruise the samples were taken and another specifying in which sequencing run they were sequenced. The function automatically merges these columns to each data frame in our list of community tables. We can check this using sapply:

desired_sample_data <- c("TOC", "P_percent", "Munition_near", "Run")
subset_list_extra <- add_sample_data(phyloseq_object = testps, 
  community_tables = subset_list_df, sample_data_names = desired_sample_data)
sapply(subset_list_extra[1:2], names)

Note: A very important issue to consider are underlying patterns in your provided data that you are not aware of. Just assume all samples that contained the explosive were (maybe by accident) sampled in winter, and in summer we only collected samples free of explosives. The ML algorithm would then just need to figure out whether it looks at a summer or winter community (or, as a more drastic example, communities from oxic or anoxic sediments) to correctly predict the presence of explosives. There are still explosives in winter, though, and what the ML used has nothing to do with the influence of explosives on microorganisms. Just remember: ML looks for correlations; no statement about causality can be made from ML.

If you e.g. add from which campaign a sample stems, the ML could learn which cruises correlate with explosive-positive samples (or with winter or summer season), which again is not bad for the prediction quality, but it is not related to the sample properties such as geological parameters or microbial composition at all. The same is true for the sequencing run information. Variable importance in Random Forest allows you to check which variable was important for a given classification or regression.
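As a hedged sketch of how such a check could look with ranger directly, outside this package's wrappers (train_df is a hypothetical merged training table with the factor response Munition_near):

library(ranger)
# train_df: hypothetical data frame of predictors plus the response column
rf <- ranger(Munition_near ~ ., data = train_df, importance = "impurity")
head(sort(rf$variable.importance, decreasing = TRUE), 10)  # top 10 predictors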

From Forbes: "Putting this all together, the ease with which modern machine learning pipelines can transform a pile of data into a predictive model without requiring an understanding of statistics or even programming has been a key driving force in its rapid expansion into industry. At the same time, it has eroded the distinction between correlation and causation as the new generation of data scientists building and deploying these models conflate their predictive prowess with explanatory power."

Define response variables

Now it is getting interesting. Let us choose the variables from the sample_data() slot we want to predict. As mentioned above, only pick those where you want to apply the same prediction conditions. Here we start with a binary classification, with the explosives sum categorized into the classes "below_20" and "above_20" (practically absent vs. present), referring to the concentration in sediment expressed in pmol/g (yes, that is not much!). We use one function to get the column from the phyloseq object we used in the beginning and then another function to cut the values into intervals. The labels become the factor levels. -Inf and Inf make sure that we include all values, and obviously you need as many labels as intervals are generated.

# get response variables
desired_response_vars <- c("UXO_sum")  # sum of explosives measured
response_variables <- extract_response_variable(
  response_variables = desired_response_vars, phyloseq_object = testps)
# cut response numeric values into 2 classes
responses_binary <- categorize_response_variable(
  ML_mode = "classification", 
  response_data = response_variables, 
  my_breaks = c(-Inf, 20, Inf), 
  class_labels = c("below_20", "above_20"))
head(responses_binary, 10)

Note: For regression, you either skip this step and use "response_variables" as input to the next function below, or, in a fixed workflow script, you choose "regression" as mentioned above and the function passes the list through (see the sketch below). This step does not add anything to the list item names, so you can safely skip it. The other functions detect on their own whether classification or regression is the goal, based on the type of the response column (factor or numeric).
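For a fixed script, the regression call could look like this minimal sketch, based on the description above (the exact behaviour for regression is best checked in ?categorize_response_variable):

# regression: values stay numeric, nothing is cut into classes
responses_regression <- categorize_response_variable(
  ML_mode = "regression",
  response_data = response_variables)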

Merge independent and dependent variables

Now we can add these response columns to our predictor tables. As we will only predict one response at a time, each response variable will be merged with each input table.

merged_input_binary <- merge_input_response(subset_list_extra, responses_binary)
# check if number of list items fits combination of input list and response vars
length(merged_input_binary) == length(subset_list_extra) * ncol(responses_binary)

You can verify the addition of the response variables using sapply(merged_input_binary, names).

These were the steps to be performed for both Random Forest and Neural Networks, in this order. The next steps are covered in the follow-up vignette on running a ranger binary classification, which also addresses some preparation steps and theory relevant for neural networks.


