
Overview

MLPipe provides an interface for creating machine learning pipelines in the style of Python's scikit-learn library.

Installation

Install from GitHub:

devtools::install_github('kota7/MLPipe')
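
If devtools is not already installed, it can be obtained from CRAN first:

install.packages('devtools')  # one-time setup, only needed if devtools is missing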

Quick Start

Let's use the Sonar data from the mlbench package as an example. The Sonar data contain 60 numeric variables and 1 label (a factor variable). The label represents the type of sonar target: "R" for rock and "M" for metal.
Our goal is to predict the label from the other variables. See ?mlbench::Sonar for more details about the data.

set.seed(123)
data(Sonar, package='mlbench')
print(dim(Sonar))
head(Sonar[c(1:4, 57:61)])
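
For a quick look at the class balance (the label column in mlbench::Sonar is named Class):

table(Sonar$Class)  # counts of the "M" and "R" samples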

We pick 80% of the samples as the training data and use the rest as the test set for evaluating performance.

X <- Sonar[, -ncol(Sonar)]  # 60 numeric features
y <- Sonar[, ncol(Sonar)]   # class label ("R" or "M")
tr <- c(sample(1:111, 111*0.8), sample(112:208, 97*0.8))  # indices of training samples
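
As a quick sanity check of the split sizes:

length(tr)                 # number of training samples (roughly 80%)
nrow(Sonar) - length(tr)   # number of held-out test samples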

Suppose that we would like to first reduce the dimensionality of the features and then fit a classification model. For example, the first step can be the extraction of principal components, and the second step can be a neural network model. We can construct a pipeline for this modelling procedure with the code below:

library(MLPipe)
p <- pipeline(pc=pca_extractor(ncomp=30),
              ml=mlp_classifier(hidden_sizes=c(5, 5), num_epoch=1000))

The pipeline function takes an arbitrary number of pipe component objects. Each pipe component represents a modelling step, and the components are applied in the order in which they are given.
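
The component names used above (pc and ml) appear to be user-chosen labels rather than reserved names, so the same constructors can be combined with different labels and hyperparameters; the values below are purely illustrative:

p2 <- pipeline(dimred=pca_extractor(ncomp=10),                          # illustrative: fewer components
               net=mlp_classifier(hidden_sizes=c(10, 10), num_epoch=500))  # illustrative settings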

We can now fit the model to the training data with the pipeline's fit method. The fit method takes exactly two inputs, x and y.

p$fit(X[tr,], y[tr])

With this single call to fit, each component of the pipeline is fitted in turn.
In this example, principal component analysis is first applied to x, and the first 30 principal components (as specified by ncomp) are extracted as the new features. The neural network model is then estimated with these new features and y.
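
For intuition, the two steps are roughly equivalent to the following manual procedure, sketched here with stats::prcomp and the nnet package purely for illustration (nnet supports only a single hidden layer, so this is a simplified analogue, not MLPipe's actual implementation):

library(nnet)
pr <- prcomp(X[tr, ], center = TRUE, scale. = TRUE)  # step 1: PCA on the training features
                                                     # (centering/scaling is an assumption here)
Z  <- pr$x[, 1:30]                                   # keep the first 30 principal components
fit_manual <- nnet(Z, class.ind(y[tr]), size = 5,    # step 2: a one-hidden-layer neural network
                   softmax = TRUE, maxit = 1000, trace = FALSE)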

mlp_classifier has a predict method, which returns the predicted labels for new data, and it can be called through the pipeline as p$predict. We can use the predictions to build a confusion matrix:

table(y[-tr], p$predict(X[-tr,]))

We can also use mlp_classifier's accuracy metric, through the pipeline's evaluate method, to see the fraction of correct classifications.

p$evaluate('accuracy', X[-tr,], y[-tr])
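
Assuming accuracy here means the plain fraction of correct predictions, the same number can be computed directly from the predicted labels:

mean(p$predict(X[-tr, ]) == y[-tr])  # fraction of correctly classified test samples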

