knitr::opts_chunk$set(collapse = TRUE, comment = "#>", tidy.opts = list(width.cutoff = 70))
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(peppuR)
library(ggplot2)
library(dplyr)
library(kableExtra)
set.seed(1014)
Predictive analysis of biological data generally requires a processing pipeline that takes raw data through ingestion and organization, preprocessing, and learning phases. The choices made in each of these phases depend heavily on the analyst or analysts handling the data. Packages like peppuR provide tools to create well-documented pipelines so that analyses are easily understood and completely reproducible. We will introduce three simple peppuR pipelines in the following sections for biological use cases of:
We will use the MASS::birthwt dataset for a lightweight example. The data consist of 189 rows, one per individual. The 10 columns hold information about each subject's maternal factors and birth conditions. The first column in the dataset, low, is a binary indicator of a birth weight below 2.5 kg and is the outcome we would like to predict using the remaining 9 columns. For more information about the dataset, use ?MASS::birthwt.
library(MASS)
dim(birthwt)
head(birthwt)
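Before modeling, it can help to check how balanced the outcome is and how each column is stored. The sketch below uses only base R and MASS. It is an exploratory aside, not a peppuR step, and the object name outcome_counts is chosen here for illustration.

```r
library(MASS)

# Count the two outcome classes (0 = not low, 1 = low birth weight)
outcome_counts <- table(birthwt$low)
outcome_counts

# Share of each class among all subjects
prop.table(outcome_counts)

# Storage type of each column, useful for spotting categorical
# variables that are coded as integers
sapply(birthwt, class)
```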
When we talk about a single data source, we mean the data naturally divide into a single outcome and a block of covariates related to the outcome.
The ingestion step of peppuR uses the as.MLinput() function to organize the data into outcome and covariates. A few important notes about putting data into peppuR:
- The sample identifier column is named by sample_cname. If there are multiple sources, this column must have the same name across sources.
- The outcome is either supplied as Y or specified by name in meta_colnames.

# Add subject names to the data
birthweight_data <- birthwt
birthweight_data$ID <- paste("ID", seq_len(nrow(birthweight_data)), sep = "_")
birthweight_data$low <- as.factor(birthweight_data$low)
# Make categorical columns factors
categorical_cols <- c("race", "smoke", "ht", "ui")
birthweight_data[categorical_cols] <- lapply(birthweight_data[categorical_cols], as.factor)
# Create an organized data object
single_source_peppuRobj <- as.MLinput(X = birthweight_data, Y = NULL, meta_colnames = c("ID", "low"), categorical_features = TRUE, sample_cname = "ID", outcome_cname = "low")
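To make the idea of "outcome plus a block of covariates" concrete, here is a minimal base R sketch of the same split done by hand. It is only an illustration of the concept, not how as.MLinput() is implemented, and the names bw, meta_cols, X, and y are assumptions made for this example.

```r
library(MASS)

# Work on a copy and add a unique identifier per subject
bw <- birthwt
bw$ID <- paste("ID", seq_len(nrow(bw)), sep = "_")

# Columns that describe the sample rather than act as covariates
meta_cols <- c("ID", "low")

# Outcome vector and covariate block
y <- factor(bw$low)
X <- bw[, !(colnames(bw) %in% meta_cols)]

# Keep the samples identifiable in the covariate block
rownames(X) <- bw$ID

dim(X)
length(y)
```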
The result of as.MLinput() is both a list and an MLinput object with special properties, such as the following attributes:
attributes(single_source_peppuRobj)
Notice that we're slightly lucky here in that there are no missing data. We'll look later at the peppuR functionality that handles missing data for us.
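The absence of missing values can be confirmed directly on the raw data. This is a small base R check, independent of peppuR; na_per_column is a name introduced here for illustration.

```r
library(MASS)

# Number of missing values in each column
na_per_column <- colSums(is.na(birthwt))
na_per_column

# TRUE when every row is fully observed
all(complete.cases(birthwt))
```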
Since no preprocessing is needed, the next logical step in a machine learning pipeline is to create cross-validation folds. Again utilizing the caret package, the dataPartitioning() function adds an attribute to the data object that separates the data into cross-validation folds. When there are not enough data to hold out a true test set, repeated cross-validation is available through the repeats argument.
single_source_peppuRobj <- dataPartitioning(single_source_peppuRobj, folds = 5, repeats = 10)
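For intuition about what repeated cross-validation produces, the following base R sketch assigns every observation to one of 5 folds, stratified by outcome class, and repeats the assignment 10 times. It is an assumption-laden illustration rather than the dataPartitioning() implementation; make_folds and fold_matrix are hypothetical names. The caret function createMultiFolds() is a comparable ready-made helper.

```r
library(MASS)
set.seed(1014)

# Assign each observation to one of k folds, stratified by class
make_folds <- function(y, k) {
  fold_id <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Build balanced fold labels for this class, then shuffle them
    fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  fold_id
}

y <- birthwt$low

# One column of fold assignments per repeat (5 folds, 10 repeats)
fold_matrix <- replicate(10, make_folds(y, k = 5))

dim(fold_matrix)

# Class counts per fold for the first repeat
table(fold = fold_matrix[, 1], low = y)
```

Because the folds are reshuffled on every repeat, each observation is held out once per repeat, each time alongside a different set of samples.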
After the data are partitioned, a single machine learning algorithm can be trained and evaluated on the partitions with MLwrapper(). Below we show how to train a random forest using the 5-fold cross-validation we created in the previous step, repeated 10 times.
rf_results <- MLwrapper(data_object = single_source_peppuRobj, methods = "rf")
#----Take a look at the results ------#
output_probabilities <- attr(rf_results, "ML_results")
MLwrapper() adds attributes to the data object. The most important of these is ML_results, which contains sample-wise class probabilities for each observation in a validation fold.
temp <- data.frame(output_probabilities)[1:10, 1:5]
colnames(temp) <- gsub(pattern = '1\\.rf', replacement = '1\nrf', x = colnames(temp))
knitr::kable(temp, row.names = FALSE, align = 'l', booktabs = TRUE) %>%
  kableExtra::column_spec(column = c(1:5), width = '7cm')
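The notion of a sample-wise class probability from a validation fold can be sketched without peppuR. The example below uses logistic regression from base R as a stand-in for the random forest, and a handful of covariates picked purely for illustration. Each observation's probability comes from a model that was fit without that observation. The names dat, fold_id, and oof_prob are assumptions of this sketch.

```r
library(MASS)
set.seed(1014)

# Outcome plus a few covariates chosen for illustration
dat <- birthwt[, c("low", "age", "lwt", "smoke", "ht", "ui")]

# Random assignment of observations to 5 folds
k <- 5
fold_id <- sample(rep_len(seq_len(k), nrow(dat)))

# Out-of-fold probability of the "low" class for every observation
oof_prob <- numeric(nrow(dat))
for (f in seq_len(k)) {
  held_out <- fold_id == f
  # Fit on the other folds only
  fit <- glm(low ~ ., data = dat[!held_out, ], family = binomial)
  # Predict the held-out fold
  oof_prob[held_out] <- predict(fit, newdata = dat[held_out, ], type = "response")
}

head(data.frame(truth = dat$low, prob_low = oof_prob))
```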
We can look at the receiver operating characteristic (ROC) curve with the plotting method for any MLwrapper() object. These methods are built using ggplot2 and can accept styling layers such as the title addition shown below.
library(ggplot2)
plot(rf_results)[[1]] + ggtitle("Random Forest ROC")
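For readers unfamiliar with how an ROC curve is built from class probabilities, here is a minimal base R sketch on synthetic scores. It is not the peppuR plotting method. The labels, scores, and variable names are all made up for illustration.

```r
# Toy labels and scores (synthetic, for illustration only)
set.seed(1014)
truth <- rep(c(0, 1), times = c(60, 40))
score <- c(rnorm(60, mean = 0), rnorm(40, mean = 1))

# Sweep thresholds from high to low and record the two rates
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(score[truth == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[truth == 0] >= t))

# Area under the curve via the trapezoidal rule, starting from (0, 0)
x <- c(0, fpr)
y <- c(0, tpr)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc

# Step plot of the curve
plot(x, y, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
```

Each threshold contributes one point on the curve, so the curve runs from (0, 0) at the strictest threshold to (1, 1) at the most permissive one.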
peppuR uses the Repeated Optimization for Feature Interpretation (ROFI) method described in Webb-Robertson et al. 2016 to identify feature importance from single- or multi-data source machine learning. The rofi() function takes a data object created with as.MLinput() and a named vector of machine learning algorithms, each one of "rf", "lda", "svm", "nb", or "knn". Supply more than one algorithm only when there are multiple sources. The name of each element is the name of the corresponding data source, as shown below. For details on the other parameters, use ?rofi.
#-----ROFI------#
source_alg_pairs <- "rf"
names(source_alg_pairs) <- "DS_1"
rofi_results <- rofi(single_source_peppuRobj, source_alg_pairs, nn = 5, f_prob = 0.4, nu = 1/100, max_iter = 35)
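The named vector can also be written in one step with base R. The multi-source variant below is a sketch of the naming convention only: DS_2 and the choice of "svm" are hypothetical, and nothing here calls rofi().

```r
# Single source: one algorithm, named after its data source
source_alg_pairs <- c(DS_1 = "rf")

# Multiple sources (hypothetical second source DS_2):
# one algorithm per source, each named after its source
multi_source_alg_pairs <- c(DS_1 = "rf", DS_2 = "svm")

names(multi_source_alg_pairs)
```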
The output of rofi() is a list of 1) feature importance metrics by iteration and 2) machine learning performance by iteration. This output also has a plotting method that produces an order plot, as shown below.
plot(rofi_results)