Scientific computing in Python is well-established. This package takes advantage of new work at RStudio that fosters Python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation, and this package takes an initial step in this direction.

A key motivation is experimenting with an incremental PCA implementation for very large out-of-memory data.

The package includes a list of references to Python modules:

```r
library(BiocSklearn)
SklearnEls()
```

We can acquire Python documentation of included modules with reticulate's `py_help`:

```
Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the
    algorithms of this module can be regarded as dimensionality reduction
    techniques.

PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
    ...
```

The reticulate package is designed to limit the amount of effort required to convert data from R to Python for natural use in each language.

```r
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = SklearnEls()$np$genfromtxt(irloc, delimiter=',')
```

To examine a submatrix, we use the `take` method from numpy. The bracket format notifies us that we are not looking at data native to R.

```r
SklearnEls()$np$take(irismat, 0:2, 0L)
```
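For orientation, the R call above delegates to numpy's `take`, which selects elements along a given axis. A minimal standalone Python sketch (synthetic array, not the iris data):

```python
import numpy as np

a = np.arange(12).reshape(4, 3)  # a small 4 x 3 matrix

# Select the first three rows (axis=0), as in the R call above.
sub = np.take(a, [0, 1, 2], axis=0)
print(sub)
```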

We'll use R's `prcomp` as a first test to demonstrate performance of the sklearn modules with the iris data.

```r
fullpc = prcomp(data.matrix(iris[,1:4]))$x
```

We have a python representation of the iris data. We compute the PCA as follows:

```r
ppca = skPCA(irismat)
ppca
```

This returns an object that can be reused through python methods.
The numerical transformation is accessed via `getTransformed`.

```r
tx = getTransformed(ppca)
dim(tx)
head(tx)
```

The native methods can be applied to the `pyobj` output.

```r
pyobj(ppca)$fit_transform(irismat)[1:3,]
```

Concordance with the R computation can be checked:

```r
round(cor(tx, fullpc), 3)
```
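Principal component scores are only determined up to sign, so concordance checks look for correlations near ±1. A minimal Python sketch of the same idea, comparing scikit-learn's PCA scores against a direct SVD of the centered data (synthetic matrix standing in for iris; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # stand-in for the 150 x 4 iris matrix

# Scores from scikit-learn.
scores = PCA(n_components=4).fit_transform(X)

# Reference: center the data, then project onto the right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
ref = Xc @ Vt.T

# Columns agree up to sign, so absolute correlations are ~1.
corr = np.abs([np.corrcoef(scores[:, j], ref[:, j])[0, 1] for j in range(4)])
print(np.round(corr, 3))
```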

A computation supporting *a priori* bounding of memory
consumption is available. In this procedure one can
also select the number of principal components to compute.

```r
ippca = skIncrPCA(irismat)
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc), 3)
```
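The underlying scikit-learn facility is `IncrementalPCA`, whose `batch_size` caps how many rows are processed per update and whose `n_components` limits the number of components computed. A minimal Python sketch on synthetic data (not the vignette's iris matrix; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))  # stand-in for the 150 x 4 iris matrix

# batch_size bounds memory consumption a priori; n_components selects
# how many principal components to compute.
ipca = IncrementalPCA(n_components=2, batch_size=25)
scores = ipca.fit_transform(X)
print(scores.shape)  # (150, 2)
```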

This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object, for which there is no container at present. Again the number of components computed can be specified.

```r
ta = SklearnEls()$np$take  # provide slicer utility
ipc = skPartialPCA_step(ta(irismat, 0:49, 0L))
ipc = skPartialPCA_step(ta(irismat, 50:99, 0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat, 100:149, 0L), obj=ipc)
ipc$transform(ta(irismat, 0:5, 0L))
fullpc[1:5,]
```
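The stepwise updates above correspond to scikit-learn's `partial_fit`, which refines the model one chunk at a time. A minimal Python sketch of the same three-chunk pattern (synthetic data standing in for iris; assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))  # stand-in for the 150 x 4 iris matrix

ipca = IncrementalPCA(n_components=4)

# Feed the data in three chunks of 50 rows, mirroring the
# three skPartialPCA_step calls above.
for chunk in (X[0:50], X[50:100], X[100:150]):
    ipca.partial_fit(chunk)

# Project a few rows with the accumulated fit.
print(ipca.transform(X[:5]).shape)  # (5, 4)
```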

We need more applications and profiling.
