Introduction

Scientific computing in Python is well established. This package takes advantage of recent work at RStudio, the reticulate package, that fosters Python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation, and this package takes an initial step in this direction.

A key motivation is experimenting with an incremental PCA implementation for very large out-of-memory data.

Basic concepts

Module references

The package includes a list of references to Python modules.

library(BiocSklearn)
SklearnEls()

Python documentation

We can acquire Python documentation of included modules with reticulate's py_help:

Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the algorithms of
    this module can be regarded as dimensionality reduction techniques.

PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
...

Importing data for direct handling by Python functions

The reticulate package is designed to limit the amount of effort required to convert data from R to Python for natural use in each language.

irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = SklearnEls()$np$genfromtxt(irloc, delimiter=',')
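genfromtxt here is numpy's CSV reader. A minimal Python sketch of what it does, using an in-memory CSV fragment (the values are illustrative, not the packaged file):

```python
import io
import numpy as np

# A small CSV standing in for csv/iris.csv; genfromtxt parses it
# into a numeric matrix (a 2-D float array)
csv = io.StringIO("5.1,3.5,1.4,0.2\n4.9,3.0,1.4,0.2\n")
mat = np.genfromtxt(csv, delimiter=",")
print(mat.shape)  # (2, 4)
```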

To examine a submatrix, we use the take method from numpy. The bracketed display format indicates that we are not looking at data native to R.

SklearnEls()$np$take(irismat, 0:2, 0L )
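The same operation written directly in Python, with a small hypothetical matrix standing in for irismat:

```python
import numpy as np

# Hypothetical 4 x 3 matrix in place of the iris data
mat = np.arange(12.0).reshape(4, 3)

# np.take with axis=0 extracts rows; this mirrors
# SklearnEls()$np$take(irismat, 0:2, 0L) in the R code above
sub = np.take(mat, [0, 1, 2], axis=0)
print(sub.shape)  # (3, 3)
```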

Dimension reduction with sklearn: illustration with iris dataset

We'll use R's prcomp as a baseline to check the results of the sklearn modules with the iris data.

fullpc = prcomp(data.matrix(iris[,1:4]))$x

PCA

We have a Python representation of the iris data. We compute the PCA as follows:

ppca = skPCA(irismat)
ppca

This returns an object that can be reused through Python methods. The numerical transformation is accessed via getTransformed.

tx = getTransformed(ppca)
dim(tx)
head(tx)

The native sklearn methods can be applied to the pyobj output.

pyobj(ppca)$fit_transform(irismat)[1:3,]

Concordance with the R computation can be checked:

round(cor(tx, fullpc),3)
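skPCA delegates to sklearn's PCA class; the fit_transform call seen above via pyobj is the underlying operation. A minimal Python sketch, with simulated data standing in for irismat:

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated stand-in for the 150 x 4 iris matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# fit_transform centers the data and returns the principal-component
# scores, the analogue of getTransformed(ppca) in R
pca = PCA()
scores = pca.fit_transform(X)
print(scores.shape)  # (150, 4)
```

Because PCA centers the input, each column of the score matrix has mean zero, which is what makes the column-wise correlation check against prcomp meaningful.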

Incremental PCA

A computation supporting a priori bounding of memory consumption is available, via the batch_size argument. In this procedure one can also select the number of principal components to compute.

ippca = skIncrPCA(irismat)
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
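The R call above delegates to sklearn's IncrementalPCA class. A minimal Python sketch of the underlying computation, again with simulated data in place of the iris matrix:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Simulated stand-in for the 150 x 4 iris matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# batch_size bounds how many rows are processed (and held in memory)
# per step; skIncrPCA(irismat, batch_size=25L) corresponds to
# batch_size=25 here
ipca = IncrementalPCA(n_components=4, batch_size=25)
scores = ipca.fit_transform(X)
print(scores.shape)  # (150, 4)
```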

Manual incremental PCA with explicit chunking

This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object; at present there is no formal container for this procedure. Again, the number of components computed can be specified.

ta = SklearnEls()$np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
fullpc[1:5,]
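The three skPartialPCA_step calls above correspond to sklearn's partial_fit method on IncrementalPCA, which updates the model one chunk at a time. A hedged Python sketch with simulated data in place of the iris matrix:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Simulated stand-in for the 150 x 4 iris matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# Feed the data in three chunks of 50 rows, mirroring the three
# skPartialPCA_step calls in the R code above
ipca = IncrementalPCA(n_components=4)
for start in range(0, 150, 50):
    ipca.partial_fit(X[start:start + 50])

# Once all chunks have been seen, transform projects rows onto
# the fitted components
proj = ipca.transform(X[:6])
print(proj.shape)  # (6, 4)
```

Only one chunk need be in memory at a time, which is what makes this pattern suitable for streamed or out-of-memory data.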

Conclusions

We need more applications and profiling.



vjcitn/BiocSklearn documentation built on Aug. 30, 2024, 2:13 p.m.