knitr::opts_chunk$set(python.reticulate = TRUE)
if (identical(Sys.info()[['sysname']], "Windows")) {
    knitr::opts_chunk$set(eval = FALSE)                                      
    msg <- paste("Note: Some examples in this vignette require Python", 
                 "but you are running this vignette on Windows where Python",
                 "is much less likely to be present, or even known to be",
                 "missing (i.e. win-builder) so examples will not be evaluated.")
    msg <- paste(strwrap(msg), collapse="\n")
    message(msg) 
}

Motivation

\proglang{Python}\footnote{\url{http://www.python.org}} is a widely-used and popular programming language. It is deployed in use cases ranging from simple scripting to larger-scale application development. \proglang{Python} is also popular for quantitative and scientific application due to the existence of extension modules such as \pkg{NumPy}\footnote{\url{http://numpy.scipy.org/}} (which is shorthand for Numeric Python) and many other packages for data analysis.

\pkg{NumPy} is used to efficiently represent $N$-dimensional arrays, and provides an efficient binary storage model for these files. In practice, $N$ is often equal to two, and matrices processed or generated in \proglang{Python} can be stored in this form. As \pkg{NumPy} is popular, many project utilize this file format.

\R has no dedicated reading or writing functionality for these type of files. However, Carl Rogers has provided a small \proglang{Cpp} library called \pkg{cnpy}\footnote{\url{https://github.com/rogersce/cnpy}} which is released under the MIT license. Using the `Rcpp modules' feature in \pkg{Rcpp} \shortcites{CRAN:Rcpp} \citep{JSS:Rcpp,Eddelbuettel:2013:Rcpp,CRAN:Rcpp}, we provide (some) features of this library to \Rns.

Examples

Data creation in \proglang{Python}

The first code example simply creates two files in \proglang{Python}: a two-dimensional rectangular array as well as a vector.

import numpy as np

mat = np.arange(12).reshape(3,4) * 1.1
np.save("fmat.npy", mat)
print(mat)

vec = np.arange(5) * 1.1
np.save("fvec.npy", vec)
print(vec)

As illustrated, \proglang{Python} uses the \proglang{Fortran} convention for storing matrices and higher-dimensional arrays: a matrix constructed from a single sequence has its first consecutive elements in its first row---whereas \Rns, following the \proglang{C} convention, has these first few values in its first column. This shows that to go back and forth we need to transpose these matrices (which represented internally as two-dimensional arrays).

Data reading in \R

We can read the same data in \R using the \code{npyLoad()} function provided by the \pkg{RcppCNPy} package:

library(RcppCNPy)
mat <- npyLoad("fmat.npy")
mat
vec <- npyLoad("fvec.npy")
vec

The \proglang{Fortran}-order of the matrix is preserved; we obtain the exact same data as we stored.

Reading compressed data in \R

A useful extension to the \pkg{cnpy} library is the support of \pkg{gzip}-compressed data.

mat2 <- npyLoad("fmat.npy.gz")
mat2

Support for writing compressed files has been added in version 0.2.0.

Data writing in \R

Matrices and vectors can be written to files using the \code{npySave()} function.

set.seed(42)
m <- matrix(sort(rnorm(6)), 3, 2)
m
npySave("randmat.npy", m)
v <- seq(10, 12)
v
npySave("simplevec.npy", v)

Data reading in \proglang{Python}

Reading the data back in \proglang{Python} is also straightforward as shown in the following example:

```{python pyex2} import numpy as np
m = np.load("randmat.npy") print(m) v = np.load("simplevec.npy") print(v)

## Integer support

Support for integer data types has been conditional on use of either
the \code{-std=c++0x} or the \code{-std=c++11} compiler extensions.
Only these standards support the \code{long long int} type needed to
represent \code{int64} data on a 32-bit OS.  Following the release of
\R 3.1.0, it has been enabled by default in \pkg{RcppCNPy} (whereas it
previously required a manual rebuild), and following the release of R
3.3.0 with its updated Windows toolchain, C++11 is now available on
all common R platforms. Consequently, support for large integers in
\pkg{RcppCNPy} is no longer just a compile-time option for some
platforms, but generally available on all (current) R installations.


## Performance

The \R script \code{timing} in the \code{demo/} directory of the package
\pkg{RcppCNPy} provides a simple benchmark.  Given two values $n$ and $k$, a
matrix of size $n \times k$ is created with $n$ rows and $k$ columns. It is
written to temporary files in
i) ascii format using \code{write.table()};
ii) \code{NumPy} format using \code{npySave()}; and
iii) \code{NumPy} format using \code{npySave()} with compression via
the \code{zlib} library (used also by \code{gzip}).

Table \ref{tab:benchmark} shows some timing comparisons for a matrix with
five million elements.  Reading the \code{npy} data is clearly fastest as it
required only parsing of the header, followed by a single large binary read
(and the transpose required to translate the representation used by \Rns). The
compressed file requires only one-fourth of the disk space, but takes
approximately 2.5 times as long to read as the binary stream has be
transformed.  Lastly, the default ascii reading mode is clearly by far the
slowest.

\begin{table}[bt]
  \begin{center}
    \begin{small}
      \begin{tabular}{rrr}
        \toprule
        {\bf Access method \phantom{X}} & {\bf Time in sec.} & {\bf Relative to best} \\
        \cmidrule(r){1-3}
     \code{npyLoad(pyfile)}   &    0.074 &  1.000  \\
   \code{npyLoad(pygzfile)}   &    0.190 &  2.568 \\
 \code{read.table(}txtfile)   &    4.189 & 56.608 \\
        \bottomrule
      \end{tabular}
    \end{small}
    \caption{Performance comparison of data reads using a matrix of size
      $10^5 \times 50$. File size are 39.7mb for ascii, 40.0mb for npy and
      10.8mb for npy.gz. Ten replications were performed, and total times
      are shown. R 3.3.1 was used on a laptop with an SSD disk. }
    \label{tab:benchmark}
  \end{center}
\end{table}


# Limitations

## Higher-dimensional arrays

\pkg{Rcpp} supports three-dimensional arrays, this could be support in
\pkg{RcppCNPy} as well.

## \code{npz} files

The \pkg{cnpy} library supports reading and writing of sets of arrays; this
feature could also be exported.

# Summary

The \pkg{RcppCNPy} package provides simple reading and writing of
\pkg{NumPy} files, using the \pkg{cnpy} library. Reading of compressed
files is also supported as an extension, offering more compact storage
at the cost of slightly longer read times.

```r
unlink("fmat.npy")
unlink("fvec.npy")
unlink("randmat.npy")
unlink("simplevec.npy")


eddelbuettel/rcppcnpy documentation built on Jan. 4, 2024, 6:34 p.m.