knitr::opts_chunk$set(cache=TRUE) cwd <- getwd()
The process of building a new package with Rcpp can range from the very easy---a single simple C++ function---to the very complex. If, and how, external resources are utilised makes a big difference as this too can range from the very simple---making use of a header-only library, or directly including a few C++ source files without further dependencies---to the very complex.
Yet a lot of the important action happens in the middle ground. Packages may bring their own source code, but also depend on just one or two external libraries. This paper describes one such approach in detail: how we turned the Corels application \citep{arxiv:corels,github:corels} (provided as a standalone C++-based executable) into an R-callable package \pkg{RcppCorels} \citep{github:rcppcorels} via \pkg{Rcpp} \citep{CRAN:Rcpp,JSS:Rcpp}.
Before embarking on such a journey, it is best to ensure that the licensing framework is suitable. Many different open-source licenses exists, yet a few key ones dominate and can generally be used with each other. There is however a fair amount of possible legalese involved, so it is useful to check inter-license compatibility, as well as general usability of the license in question. Several sites can help via license recommendations, and checks for interoperability. One example is the site at choosealicense.com (which is backed by GitHub) can help, as can tldrlegal.com. License choice is a complex topic, and general recommendations are difficult to make besides the key point of sticking to already-established and known licenses.
In order to see how hard it may to combine an external entity, either a program a library, with R, it helps to ensure that the external entity actually still builds and runs.
This may seem like a small and obvious steps, but experience suggests that it worth asserting the ability to build with current tools, and possibly also with more than one compiler or build-system. Consideration to other platforms used by R also matter a great deal as one of the strengths of the R package system is its ability to cover the three key operating system families.
This may seem like a variation on the previous point, but besides the ability to build we also need to ensure the ability to run the software. If the external entity has tests and demo, it is highly recommended to run them. If there are reference results, we should ensure that they are still obtained, and also that the run-time performance it still (at a minimum) reasonable.
This is of course a very basic litmus test: is the new software relevant? Is is helpful? Would others benefit from having it packaged and maintained?
The first step in getting a new package combing R and C++ is often the creation of a new
Rcpp package. There are several helper functions to choose from. A natural first choice
is Rcpp.package.skeleton()
from the \pkg{Rcpp} package \citep{CRAN:Rcpp}. It can be
improved by having the optional helper package \pkg{pkgKitten} \citep{CRAN:pkgKitten}
around as its kitten()
function smoothes some rougher edges left by the underlying Base
R function package.skeleton()
. This step is shown below in then appendix, and
corresponds to the first commit, followed by a first edit of file DESCRIPTION
.
Any code added by the helper functions, often just a simple helloWorld()
variant, can be
run to ensure that the package is indeed functional. More importantly, at this stage, we
can also start building the package as a compressed tar archive and run the R checker on
it.
Given a basic package with C++ support, we can now turn to integrating the external
package. This complexity of this step can, as alluded to earlier, vary from very easy to
very complex. Simple cases include just dependending on library headers which can either
be copied to the package, or be provided by another package such as \pkg{BH}
\citep{CRAN:BH}. It may also be a dependency on a fairly standard library available on
most if not all systems. The graphics formats bmp, jpeg or png may be example; text
formats like JSON or XML are another. One difficulty, though, may be that run-time
support does not always guarantee compile-time support. In these cases, a -dev
or
-devel
package may need to be installed.
In the concrete case of Corels, we
src/
directory;*.hh
to *.h
to comply with an R preference;src/Makevars
file, initially with link instructions for GMP later relaxed to
conditional use of GMP (see below);main.cc
to a subdirectory as we cannot build with another main()
function (and R will not include files from subdirectories);logger
instance.Here, the last step was needed as the file main.cc
provided a global instance referred
to from other files. Hence, a minimal R-callable wrapper is being added at this stage
(shown in the appendix as well). Actual functionality will be added later.
We will come back to the step concerning the link instructions.
As this point we have a package for R also containing the library we want to add.
R has fairly strict guidelines, defined both in the CRAN Repository Policy document at
the CRAN website, and in the manual Writing R Extension. Certain standard C and C++
functions are not permitted as their use could interfere with running code from R. This
includes somewhat obvious recommendations ("do not call abort
" as it would terminate the
R sessions) but extends to not using native print methods in order to cooperate better
with the input and output facilities of R. So here, and reflecting that last aspect, we
changed all calls to printf()
to calls to Rprintf()
. Similarly, R prefers its own
(well-tested) random-number generators so we replaced one (scaled) call to random() /
RAND_MAX
with the equivalent call to R's unif_rand()
. We also avoided one use of
stdout
in rulelib.h
.
The requirement for such changes may seem excessive at first, but the value added stemming from consistent application of the CRAN Policies is appreciated by most R users.
In order to further test the package, and of course also for actual use, we need to expose the key parameters and arguments. Corels parsed command-line arguments; we can translate this directly into suitable arguments for the main function. At a first pass, we created the following interface:
// [[Rcpp::export]] bool corels(std::string rules_file, std::string labels_file, std::string log_dir, std::string meta_file = "", bool run_bfs = false, bool calculate_size = false, bool run_curiosity = false, int curiosity_policy = 0, bool latex_out = false, int map_type = 0, int verbosity = 0, int max_num_nodes = 100000, double regularization = 0.01, int logging_frequency = 1000, int ablation = 0) { // actual function body omitted }
Rcpp facilities the integration by adding another wrapper exposing all the function
arguments, and setting up required arguments without default (the first three) along with
optional arguments given a default. The user can now call corels()
from R with three
required arguments (the two input files plus the log directory) as well as number of
optional arguments.
R package can access data files that are shipped with them. That is very useful feature,
and we therefore also copy in the files include in the Corels repository and its data/
directory.
fs::dir_tree("../../../rcppcorels/inst/sample_data")
Combining the two preceding steps, we can now offer an illustrative example. It is
included in the helpd page for function corels()
and can be run from R via
example("corels")
.
library(RcppCorels) .sysfile <- function(f) # helper function system.file("sample_data",f,package="RcppCorels") rules_file <- .sysfile("compas_train.out") label_file <- .sysfile("compas_train.label") meta_file <- .sysfile("compas_train.minor") logdir <- tempdir() stopifnot(file.exists(rules_file), file.exists(labels_file), file.exists(meta_file), dir.exists(logdir)) corels(rules_file, labels_file, logdir, meta_file, verbosity = 100, regularization = 0.015, curiosity_policy = 2, # by lower bound map_type = 1) # permutation map cat("See ", logdir, " for result file.")
In the example, we pass the two required arguments for rules and labels files, the optional argument for the 'meta' file as well as an added required argument for the output directory. R policy prohibits writing in user-directories, we default to using the temporary directory of the current session, and report its value at the end. For other arguments default values are used.
One common difficulty when bringing an extermal library to R via a package consists in dealing with an external dependency. In the case of 'Corels', the GNU GMP library for multi-precision arithmetic is an optional extension which, if available, improves and accelerates internal processing.
The simplest approach is to declare a compile-time variable in the src/Makevars
file. Using
-DGMP
defines the GMP
variable at the level of the C and C++ code. One can then condition on
the variable. A very standard approach, also used here is #if defined(GMP) ... #else ... #endif
where one of the two code branches is in effect depending on whether the GMP
variable is defined
or not.
In order to detect presence of a required (or optional) library, tools like 'autoconf' or 'cmake' are often used. For example, one can rely of an existing 'autoconf' macro provided by the GMP documentation to detect presence of the the GNU GMP header and library. We are making use of this facility here to deploy GMP when it is available. As 'Corels' can be built with and without GMP, the build and installation succeeds either way---but deployment of the more-featureful variant with use GMP is automated.
It is good (and common) practice to clearly attribute authorship. Here, credit is given to
the 'Corels' team and authors as well as to the authors of the underlying 'rulelib' code
used by 'Corels' via the file inst/AUTHORS
(which will be installed as AUTHORS
with
the package. In addition, the file inst/LICENSE
clarifies the GNU GPL-3 license for 'RcppCorels' and 'Corels', and the MIT license for 'rulelib'.
Several files help to improve the package. For example, .Rbuildignore
allows to exclude
listed files from the resulting R package keeping it well-defined. Similarly, .gitignore
can exclude files from being added to the git
repository. We also like .editorconfig
for consistent editing default across a range of modern editors.
We describe s series of steps to turn the standalone library 'Corels' describes by \citet{arxiv:corels} into a R package \pkg{RcppCorels} using the facilities offered by \pkg{Rcpp} \citep{CRAN:Rcpp}. Along the way, we illustrate key aspects of the R package standards and CRAN Repository Policy proving a template for other research software wishing to provide their implementations in a form that is accessibly by R users.
\bibliography{Rcpp} \bibliographystyle{jss}
\newpage \onecolumn
edd@rob:~/git$ r --packages Rcpp --eval 'Rcpp.package.skeleton("RcppCorels")' Attaching package: ‘utils’ The following objects are masked from ‘package:Rcpp’: .DollarNames, prompt Creating directories ... Creating DESCRIPTION ... Creating NAMESPACE ... Creating Read-and-delete-me ... Saving functions and data ... Making help files ... Done. Further steps are described in './RcppCorels/Read-and-delete-me'. Adding Rcpp settings >> added Imports: Rcpp >> added LinkingTo: Rcpp >> added useDynLib directive to NAMESPACE >> added importFrom(Rcpp, evalCpp) directive to NAMESPACE >> added example src file using Rcpp attributes >> added Rd file for rcpp_hello_world >> compiled Rcpp attributes edd@rob:~/git$ edd@rob:~/git$ mv RcppCorels/ rcppcorels # prefer lowercase directories edd@rob:~/git$
In the file shown here, use of GMP is unconditional: we define GMP
as a compiler flag, and
instruct the linker to link with the GMP library.
CXX_STD = CXX11 PKG_CXXFLAGS = -I. -DGMP PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) -lgmp
#include "queue.h" #include <Rcpp.h> /* * Logs statistics about the execution of the algorithm and dumps it to a file. * To turn off, pass verbosity <= 1 */ NullLogger* logger; // [[Rcpp::export]] bool corels() { return true; // more to fill in, naturally }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.