First we load the required libraries. I have speedyseq installed, but you can also use phyloseq. I am also using a logging package called futile.logger (flog). It has different notification thresholds, so you can choose whether you only want to see warnings, or also informational messages, etc. My package is quite verbose to make you aware of potential problems early on, so you might want to have less output printed.
The example data set uses ASVs generated by dada2, but if you have OTUs (mothur), genera (SILVAngs) or any other kind of count data, just substitute that in your head. This package is not limited to ASVs.
```r
library(phyloseq)  # or library(speedyseq)
library(phyloseq2ML)
futile.logger::flog.threshold("INFO")
```
We can load the included phyloseq object and have a look at its dimensions.
```r
data(TNT_communities)
TNT_communities
```
As you can see, it is a very small data set which would not make any sense for ML, but it should serve for demonstration purposes. The data set is about microbial communities in sediments where ammunition was sea-dumped after World War II, in the Kolberger Heide near Kiel, Germany. We will now add unique lineages for each ASV, independent of its taxonomic annotation. This way, we don't lose taxa that are not annotated at e.g. Genus rank. Check the difference here:
```r
# modify phyloseq object
TNT_communities2 <- add_unique_lineages(TNT_communities)
colnames(tax_table(TNT_communities))
colnames(tax_table(TNT_communities2))
```
Depending on your phyloseq object, it can be useful to rename and renumber your lowest-rank taxa. In this case we already have ASVs, but after I subsetted the data to generate this example, the numbering is off. If you import data from dada2, your taxa names might even still be the actual sequences. In that case, you can set `use_sequences = TRUE`, so the sequences get moved to the `refseq()` slot of the phyloseq object (the example data contained the `refseq()` data, but it makes the phyloseq object quite large, so I removed it). You can specify a name prefix such as "OTU" or "ASV", and the function adds the appropriate amount of leading 0s to the running number. Your `taxa_prefix` should not contain "_". An example of a dada2 workflow creating actual sequences as taxa names is here.
```r
taxa_names(TNT_communities)[1:10]
testps <- standardize_phyloseq_headers(
  phyloseq_object = TNT_communities2,
  taxa_prefix = "ASV",
  use_sequences = FALSE)
taxa_names(testps)[1:10]
```
Another useful function builds a taxonomy dictionary, which allows you to translate your ASVs to the corresponding taxonomic ranks, one rank at a time.
```r
# translate ASVs to genus
levels_tax_dictionary <- c("Order", "Family", "Genus")
taxa_vector_list <- create_taxonomy_lookup(testps, levels_tax_dictionary)
translate_ID(ID = c("ASV01", "ASV03", "ASV17"), tax_rank = "Genus", taxa_vector_list)
```
This is odd: we only get two return values for the 3 ASVs. If we look into the function's documentation using `?translate_ID`, we can see that the default value is `na.rm = TRUE`, which means that unannotated ranks are removed. Let's see which ASV was not annotated at genus level:
translate_ID(ID = c("ASV01", "ASV03", "ASV17"), tax_rank = "Genus", taxa_vector_list, na.rm = FALSE)
Let's have an (incomplete) look at what our sample_data slot contains:
```r
head(names(sample_data(testps)), 10)
```
So there is a lot of additional context data. If you want to know what each column name refers to, have a look at "The sample data explained". The example data set was generated by subsetting to `Primerset == "V4"` and `Sample_Type == "Surface"`, which is why I would call this phyloseq object "vignette_V4_surface".
We will use the object names to store information about the parameters we used along the workflow. The pieces of information are separated by "_". This allows us to retrieve specific values by position later on.
This happens automatically if you adhere to this naming pattern: your initial name for each phyloseq object needs to have three sections separated by two underscores, e.g. `vignette_V4_surface`. The first part, "vignette", will be extracted as `Subset_1`, "V4" as `Subset_2` and "surface" as `Subset_3`. The functions used in this package will then add more information to the name, each piece separated by a new "_".
Let's create a second phyloseq object and then put both as named list items in a list. (We will use named lists a lot, so you should know a little bit about them!)
```r
# only replicate #2 of sediment samples
second_phyloseq_object <- subset_samples(testps, Technical_replicate == 2)
subset_list <- list(
  vignette_V4_all = testps,
  vignette_V4_replicate2 = second_phyloseq_object
)
str(subset_list, max = 2)  # max defines how many list levels we want to print
```
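If named lists are new to you, the main thing to know is that list items can be accessed by name as well as by position. A quick base R illustration with the list we just built:

```r
names(subset_list)  # the names we just assigned
# accessing by name and by position returns the same phyloseq object
identical(subset_list[["vignette_V4_all"]], subset_list[[1]])
```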
A quick note here: if you want to predict several variables based on different conditions, do this in separate analyses. So don't mix e.g. classification and regression tasks.
Also: for classification, we will use a function to categorize continuous values into classes. The limits need to be the same for all response variables. So if variable 1 and variable 2 both contain nutrient concentrations, and variable 1 should be split at a concentration of 5 into two classes "below_5" and "above_5" whereas variable 2 should be split at 3 into "below_3" and "above_3", these should be separate runs of your workflow.
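To illustrate what such a categorization does, here is a minimal base R sketch with made-up nutrient concentrations (the package's own categorization function is shown further below):

```r
# hypothetical concentrations for two nutrient variables
variable_1 <- c(1.2, 4.8, 6.3, 9.0)
variable_2 <- c(0.5, 2.9, 3.1, 7.4)

# variable 1 split at 5 ...
cut(variable_1, breaks = c(-Inf, 5, Inf), labels = c("below_5", "above_5"))
# ... and variable 2 split at 3: different breaks, therefore separate workflow runs
cut(variable_2, breaks = c(-Inf, 3, Inf), labels = c("below_3", "above_3"))
```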
After we have defined the phyloseq objects, we can now manipulate their community tables (the content of the `otu_table()` slots). For one, we can select how abundant ASVs have to be to be kept. The counts are converted to relative abundances in percent. The filter function wants to know how often an ASV has to be how abundant to be kept. In code this looks like `sum(x > threshold) >= num_samples`, with `x` being the current ASV's abundance. As an example, a threshold of 0.01 and `num_samples` of 2 would mean that each ASV has to have more than 0.01 % abundance in at least 2 samples. If this requirement is fulfilled, the ASV is kept in all samples, including those where it is less abundant than the threshold. This way we keep the full range of the ASV's abundance across samples, which is valuable information for ML.
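As a minimal base R sketch of that criterion (not the package's internal code), assuming a small matrix of relative abundances with samples as rows and ASVs as columns:

```r
# toy relative abundance table (values in %), samples in rows, ASVs in columns
toy_abund <- matrix(c(0.005, 0.02,  0.03,
                      0.5,   0.008, 0.2,
                      0.001, 0.015, 0.0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("Sample", 1:3), paste0("ASV0", 1:3)))
threshold <- 0.01
num_samples <- 2
# an ASV is kept if it exceeds the threshold in at least `num_samples` samples
keep <- apply(toy_abund, 2, function(x) sum(x > threshold) >= num_samples)
toy_abund[, keep, drop = FALSE]  # kept ASVs retain their full abundance range
```

Here ASV01 exceeds 0.01 % in only one sample and is dropped, while ASV02 and ASV03 are kept in all samples.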
We can also cumulate the abundance of ASVs at a higher taxonomic rank. For e.g. the genus rank, this means that all ASVs of the same genus are summed up and we continue at genus-level taxonomic resolution. Here we can make use of the added "To_Genus" column in the `tax_table()`, which allows us to keep taxa such as `..._Pseudomonadales_Pseudomonadaceae_NA_ASV0007` and `..._Pseudomonadales_Moraxellaceae_NA_ASV0013` separate, although they are both annotated as `NA` at "Genus". We should also specify how the list items' names should be complemented. These names should be the original tax rank names, here "Genus" and "Family". The original rank, such as ASV, is already included.
```r
# define subsetting parameters
selected_taxa_1 <- setNames(c("To_Genus", "To_Family"), c("Genus", "Family"))
subset_list_rel <- to_relative_abundance(subset_list)
subset_list_tax <- create_community_table_subsets(
  subset_list = subset_list_rel,
  thresholds = c(5, 1.5),
  taxa_prefix = "ASV",
  num_samples = 1,
  tax_ranks = selected_taxa_1)
```
You can see how the applied filter threshold and tax rank values are stored in the name, ready for later retrieval.
```r
names(subset_list_tax)
```
Note: First the ASVs are filtered, then the remaining entries are summed on the specified tax ranks.
After the filter and accumulation step we extract the `otu_table()` slots from the phyloseq objects and turn them into regular data frames.
```r
subset_list_df <- otu_table_to_df(subset_list = subset_list_tax)
head(subset_list_df[[1]], 2)[1:10]  # look into the first list item
```
So, well done, we have a list of community table data frames. However, you may want to use e.g. environmental variables you measured to support your ML with additional information. As phyloseq already organizes this data for us, we can make use of it and select which `sample_data()` columns we want to include. You can use `names(sample_data(testps))` again to see what is available and `str(sample_data(testps))` to see whether a column contains numerical or categorical data.
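For example, to inspect the available columns and their types before choosing:

```r
names(sample_data(testps))  # all available sample data columns
str(sample_data(testps))    # numerical vs. categorical columns
```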
Note: This is still the setup for the "independent" or "predictor" variables. The "dependent", "response" or "target" variable will be specified afterwards.
Let's pick two numerical columns, the amount of phosphorus measured after extracting the sediment with 0.5 M HCl, and the total organic carbon. There is also a factor column telling us which samples were taken on which cruise and another specifying in which sequencing run a sample was sequenced. The function automatically merges these columns to each data frame in our list of community tables. We can check this using `sapply`:
desired_sample_data <- c("TOC", "P_percent", "Munition_near", "Run") subset_list_extra <- add_sample_data(phyloseq_object = testps, community_tables = subset_list_df, sample_data_names = desired_sample_data) sapply(subset_list_extra[1:2], names)
Note: A very important issue to consider are underlying patterns in your provided data that you are not aware of. Just assume all samples that contained the explosive were (maybe by accident) collected in winter, and in summer only samples free of explosives were collected. The ML algorithm would then just need to figure out whether it is looking at a summer or a winter community (or, as a more drastic example, a community from oxic or anoxic sediment) to correctly predict the presence of explosives. The explosives are still present in winter, though, and what the ML used has nothing to do with the influence of explosives on microorganisms. Just remember: ML looks for correlations; nothing about causality can be concluded from ML.
If you e.g. add from which campaign a sample stems, the ML could learn which cruises correlate with explosives-positive samples (or with winter or summer season), which again is not bad for the prediction quality, but it is not related to sample properties such as geological parameters or microbial composition at all. The same is true for the sequencing run information. Variable importance in Random Forest allows you to check which variables were important for a given classification or regression.
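As a generic illustration outside of this package (a minimal sketch using the ranger package on the built-in iris data, not our phyloseq-derived tables), variable importance can be requested during training and inspected afterwards:

```r
library(ranger)
# minimal sketch: train a Random Forest with impurity-based variable importance
rf <- ranger(Species ~ ., data = iris, importance = "impurity", seed = 42)
# which predictors mattered most for this classification
sort(ranger::importance(rf), decreasing = TRUE)
```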
From Forbes: "Putting this all together, the ease with which modern machine learning pipelines can transform a pile of data into a predictive model without requiring an understanding of statistics or even programming has been a key driving force in its rapid expansion into industry. At the same time, it has eroded the distinction between correlation and causation as the new generation of data scientists building and deploying these models conflate their predictive prowess with explanatory power."
Now it is getting interesting. Let us choose the variables from the `sample_data()` slot that we want to predict. As mentioned above, only pick those where you want to apply the same prediction conditions. Here we want to start with a binary classification, with the explosives sum categorized into two classes describing whether explosives are essentially absent or present, based on their concentration in sediment expressed in pmol/g (yes, that is not much!). We use a function to get the column from the phyloseq object we used in the beginning and then another function to cut the values into intervals. The labels become the factor levels. `-Inf` and `Inf` make sure that we include all values, and obviously you need as many labels as intervals are generated.
```r
# get response variables
desired_response_vars <- c("UXO_sum")  # sum of explosives measured
response_variables <- extract_response_variable(
  response_variables = desired_response_vars,
  phyloseq_object = testps)
# cut response numeric values into 2 classes
responses_binary <- categorize_response_variable(
  ML_mode = "classification",
  response_data = response_variables,
  my_breaks = c(-Inf, 20, Inf),
  class_labels = c("below_20", "above_20"))
head(responses_binary, 10)
```
Note: For regression, you either skip this step and use "response_variables" as input to the next function below, or, in a fixed workflow script, you choose "regression" as mentioned above and the function returns the list accordingly. This step does not add anything to the list item names, so you can safely skip it. The other functions detect on their own whether classification or regression is the goal, based on the type of the response column (factor or numeric).
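As a sketch of the regression alternative described above (merge_input_response is the merging function introduced in the next step; see ?categorize_response_variable for the arguments accepted in regression mode):

```r
# regression alternative (sketch): skip the categorization and merge the
# numeric response values directly with the predictor tables
merged_input_regression <- merge_input_response(subset_list_extra, response_variables)
```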
Now we can add these response columns to our predictor tables. As we will only predict one response at a time, each response variable will be merged with each input table.
```r
merged_input_binary <- merge_input_response(subset_list_extra, responses_binary)
# check if number of list items fits combination of input list and response vars
length(merged_input_binary) == length(subset_list_extra) * ncol(responses_binary)
```
You can verify the addition of the response variables using:

```r
sapply(merged_input_binary, names)
```
These were the steps to be performed for both Random Forest and Neural Networks, in the same order. The next steps are covered in the follow-up vignette on running a ranger binary classification, where some preparation steps and theory relevant for neural networks are also addressed.