Supporting arbitrary matrix classes

require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Background

Note: this document refers to version 2 of the beachmat API, which is still supported but no longer under active development. Developers writing new code are encouraged to use version 3, which is much more streamlined.

Ordinarily, direct support for a matrix representation would require the appropriate methods to be defined in beachmat at compile time. This is the case for the most widely used matrix classes but is somewhat restrictive for other community-contributed matrix representations. Fortunately, R provides a mechanism to link across shared libraries from different packages. This means that package developers who define an R matrix representation can also define C++ methods for native read/write support in beachmat-dependent code. Doing so improves the efficiency of access to these new classes by avoiding the need for block processing via R.

A functioning demonstration of this approach is available in the extensions test package. This vignette will provide an explanation of the code in extensions, and we suggest examining the source code at the same time:

system.file("extensions", package="beachmat")

External linkage for input

Setting up in R

Assume that we have already defined a new matrix-like S4 class (here, AaronMatrix). To notify the beachmat API that direct input support is available, we need, at minimum, to define a supportCppAccess method for the class that returns TRUE for objects with direct support.

It is possible to provide direct support for only some data types of a given matrix representation. The example in extensions only directly supports integer and character AaronMatrix objects^[Because I was too lazy to add all of them.] and will only return TRUE for such types.

Setting up in C++

We will use integer matrices for demonstration, though it is simple to generalize this to all types by replacing _integer with, e.g., _character^[Some understanding of C++ templates will greatly simplify the definition of the same methods for different types.]. First, we define a create() function that takes a SEXP object and returns a void pointer. This should presumably point to some C++ class that can contain intermediate data structures for efficient access.

void * ptr = AaronMatrix_integer_input_create(in /* SEXP */);

We define a clone() function that performs a deep copy of the aforementioned pointer.

void * ptr_copy = AaronMatrix_integer_input_clone(ptr /* void* */);

We define a destroy() function that frees the memory pointed to by ptr.

AaronMatrix_integer_input_destroy(ptr /* void* */);

We define a dim() function that records the number of rows and columns in the object pointed to by ptr. Note that nrow and ncol are pointers, so that the function can write the dimensions into them.

AaronMatrix_integer_input_dim(
    ptr, /* void* */
    nrow, /* size_t* */
    ncol /* size_t* */
);
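The four lifecycle functions above can be sketched as follows. This is only an illustration under stated assumptions: IntStore and the sketch_ prefix are hypothetical, and create() here takes explicit dimensions instead of a SEXP so that the example builds without R headers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backing store; a real implementation would unpack the SEXP.
struct IntStore {
    std::size_t nrow, ncol;
    std::vector<int> values; // column-major layout (an assumption)
};

extern "C" {

// Stand-in for create(): dimensions replace the SEXP argument.
void* sketch_integer_input_create(std::size_t nr, std::size_t nc) {
    return new IntStore{nr, nc, std::vector<int>(nr * nc, 0)};
}

// Deep copy: the new object owns its own vector.
void* sketch_integer_input_clone(void* ptr) {
    assert(ptr != nullptr);
    return new IntStore(*static_cast<IntStore*>(ptr));
}

void sketch_integer_input_destroy(void* ptr) {
    delete static_cast<IntStore*>(ptr);
}

// Dimensions are written through the supplied pointers.
void sketch_integer_input_dim(void* ptr, std::size_t* nrow, std::size_t* ncol) {
    const IntStore* store = static_cast<const IntStore*>(ptr);
    *nrow = store->nrow;
    *ncol = store->ncol;
}

}

// Runs the lifecycle once; returns true if the clone is independent.
bool lifecycle_roundtrip() {
    void* ptr = sketch_integer_input_create(3, 2);
    void* copy = sketch_integer_input_clone(ptr);
    static_cast<IntStore*>(ptr)->values[0] = 42; // must not affect the copy
    std::size_t nr = 0, nc = 0;
    sketch_integer_input_dim(copy, &nr, &nc);
    bool ok = nr == 3 && nc == 2 && static_cast<IntStore*>(copy)->values[0] == 0;
    sketch_integer_input_destroy(copy);
    sketch_integer_input_destroy(ptr);
    return ok;
}
```

Hiding the class behind a void pointer means that the caller never needs the class definition, which is what allows the functions to be shared across libraries.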

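A systematic naming scheme is used for all functions, consisting of the class name (AaronMatrix), the data type (e.g., integer), the direction (input or output) and the name of the operation (e.g., create), joined by underscores.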
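Judging from the routine names used throughout this vignette (e.g., AaronMatrix_integer_input_create), a name can be assembled from four underscore-separated parts. The helper below, routine_name, is a hypothetical illustration of that pattern and not part of the beachmat API.

```cpp
#include <cassert>
#include <string>

// Joins class, data type, direction and operation with underscores.
std::string routine_name(const std::string& cls, const std::string& type,
                         const std::string& direction, const std::string& op) {
    assert(!cls.empty() && !op.empty()); // every part must be present
    return cls + "_" + type + "_" + direction + "_" + op;
}
```

For instance, routine_name("AaronMatrix", "character", "input", "getCol") yields the name of the single-column getter for character matrices.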

Defining getter methods

For all types

In general, the getter functions follow the same structure as that described for the beachmat input API (see the input.html vignette). We expect a get function to obtain a specified entry of the matrix:

AaronMatrix_integer_input_get(
    ptr, /* void* */
    r, /* size_t */
    c, /* size_t */
    val /* int* */
);

Note that val is a pointer to the matrix type. For example, val should be a Rcpp::String* for character matrices, a double* for numeric matrices, and an int* for logical matrices.

Developers can assume that r and c are valid, i.e., within [0, nrow) and [0, ncol) respectively. These checks are performed by beachmat and do not have to be repeated within developer-defined functions^[Obviously, the dimensions of the matrix pointed to by ptr should not change!].
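A single-entry getter along these lines might look like the sketch below. IntStore, its column-major layout and the sketch_ prefix are assumptions made for illustration; only the argument order follows the signature shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical store with values held in column-major order.
struct IntStore {
    std::size_t nrow, ncol;
    std::vector<int> values;
};

extern "C" void sketch_integer_input_get(void* ptr, std::size_t r,
                                         std::size_t c, int* val) {
    assert(val != nullptr);
    const IntStore* store = static_cast<const IntStore*>(ptr);
    // No bounds checks on r and c: the caller has already validated them.
    *val = store->values[r + c * store->nrow];
}

// Reads the entry in the second row and third column of a 2-by-3 matrix.
int read_entry_demo() {
    IntStore store{2, 3, {1, 2, 3, 4, 5, 6}}; // columns: (1,2), (3,4), (5,6)
    int out = 0;
    sketch_integer_input_get(&store, 1, 2, &out);
    return out;
}
```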

For non-numeric types

Here, we will use character matrices^[Character matrices tend to require some special attention, as character arrays need to be coerced to Rcpp::String objects to be returned via the in argument.] as an example. We expect a getCol function to obtain a column of the matrix:

AaronMatrix_character_input_getCol(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

... and another getRow function to obtain a row of the matrix:

AaronMatrix_character_input_getRow(
    ptr, /* void* */
    r, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

These are in camelcase to simplify parsing of the function names. We further expect a getCols function to obtain multiple columns:

AaronMatrix_character_input_getCols(
    ptr, /* void* */
    c, /* size_t */
    indices, /* Rcpp::IntegerVector::iterator* */
    n, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

... and a getRows function to obtain multiple rows:

AaronMatrix_character_input_getRows(
    ptr, /* void* */
    r, /* size_t */
    indices, /* Rcpp::IntegerVector::iterator* */
    n, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

In all cases, first and last can be assumed to be valid, i.e., first <= last and both in [0, nrow) or [0, ncol) (for column and row access, respectively). Indices in indices can also be assumed to be valid, i.e., within matrix dimensions and strictly increasing.

We stress that the various iterator arguments are pointers to iterators rather than the iterators themselves. This is to avoid potential issues with C++ classes when using C-style linkage via R's R_GetCCallable() framework.
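The pointer-to-iterator convention can be illustrated with a column getter. In this sketch, std::vector<std::string>::iterator stands in for Rcpp::StringVector::iterator so that it builds without Rcpp. StringStore is hypothetical, and last is treated as one past the final row, which is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column-major store of strings.
struct StringStore {
    std::size_t nrow, ncol;
    std::vector<std::string> values;
};

// Stand-in for Rcpp::StringVector::iterator.
typedef std::vector<std::string>::iterator OutIt;

extern "C" void sketch_character_input_getCol(void* ptr, std::size_t c, OutIt* in,
                                              std::size_t first, std::size_t last) {
    assert(in != nullptr && first <= last);
    const StringStore* store = static_cast<const StringStore*>(ptr);
    auto start = store->values.begin() + c * store->nrow;
    // Dereference the pointer to recover the iterator, then fill [first, last).
    std::copy(start + first, start + last, *in);
}

// Extracts the second column of a 2-by-2 matrix.
std::vector<std::string> column_demo() {
    StringStore store{2, 2, {"a", "b", "c", "d"}};
    std::vector<std::string> out(2);
    OutIt it = out.begin();
    sketch_character_input_getCol(&store, 1, &it, 0, 2);
    return out;
}
```

The caller passes the address of its iterator, so only a plain pointer crosses the library boundary.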

Numeric types

For integer, logical or numeric matrices, we need to account for type conversions. This is done by defining a separate version of each row/column getter for integer and numeric destinations (using integer matrices as an example).

Taking the single-column getter as an example:

AaronMatrix_integer_input_getCol_integer(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::IntegerVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

AaronMatrix_integer_input_getCol_numeric(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::NumericVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

The function name now has an additional suffix to denote the destination type. We explicitly define conversions here as the cross-library linking framework does not support templating or overloading of in.
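One way to avoid duplicating logic across the suffixed functions is to keep a template inside the library and expose thin non-template wrappers. The sketch below assumes a hypothetical IntStore, uses std::vector iterators as stand-ins for the Rcpp iterator types, and treats last as one past the final row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major integer store.
struct IntStore {
    std::size_t nrow, ncol;
    std::vector<int> values;
};

// Shared implementation; the template never crosses the library boundary.
template <typename Iter>
void copy_column(const IntStore& store, std::size_t c, Iter out,
                 std::size_t first, std::size_t last) {
    assert(first <= last);
    auto start = store.values.begin() + c * store.nrow;
    std::copy(start + first, start + last, out); // converts int to the destination type
}

typedef std::vector<int>::iterator IntIt;    // stand-in for Rcpp::IntegerVector::iterator
typedef std::vector<double>::iterator NumIt; // stand-in for Rcpp::NumericVector::iterator

extern "C" {

void sketch_integer_input_getCol_integer(void* ptr, std::size_t c, IntIt* in,
                                         std::size_t first, std::size_t last) {
    copy_column(*static_cast<const IntStore*>(ptr), c, *in, first, last);
}

void sketch_integer_input_getCol_numeric(void* ptr, std::size_t c, NumIt* in,
                                         std::size_t first, std::size_t last) {
    copy_column(*static_cast<const IntStore*>(ptr), c, *in, first, last);
}

}

// Reads the second column of an integer matrix into a vector of doubles.
std::vector<double> numeric_column_demo() {
    IntStore store{2, 2, {1, 2, 3, 4}};
    std::vector<double> out(2);
    NumIt it = out.begin();
    sketch_integer_input_getCol_numeric(&store, 1, &it, 0, 2);
    return out;
}
```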

External linkage for output

Setting up in R

To notify the beachmat API that direct output support is available, we need to define flags in our package's namespace.

beachmat_AaronMatrix_integer_output <- TRUE
beachmat_AaronMatrix_character_output <- TRUE

This indicates that support is available for AaronMatrix integer and character outputs. Missing or FALSE flags indicate that no support is available, in which case beachmat will write to an ordinary matrix by default.

Setting up in C++

Again, we will use integer matrices for demonstration. The required functions are mostly similar to the input case. For creation, we expect to have the number of rows nr and columns nc:

void * ptr = AaronMatrix_integer_output_create(
   nr /* size_t */,
   nc /* size_t */
);

We define a clone() function to perform a deep copy:

void * ptr_copy = AaronMatrix_integer_output_clone(ptr /* void* */);

We also define a destroy() function to free memory:

AaronMatrix_integer_output_destroy(ptr /* void* */);

In all cases, we use _output_ to indicate that we are dealing with an output matrix class.

Defining setter methods

For all types

In general, the setter functions follow the same structure as that described for the beachmat output API (see the output.html vignette). We expect a set function to set a specified entry of the matrix:

AaronMatrix_integer_output_set(
    ptr, /* void* */
    r, /* size_t */
    c, /* size_t */
    val /* int* */
);

Again, note that val is a pointer to the matrix type.
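As a rough sketch of the output side, the functions below create an output store from its dimensions and write one entry through a pointer. IntOutput, its column-major layout and the sketch_ prefix are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical output store, zero-initialised and column-major.
struct IntOutput {
    std::size_t nrow, ncol;
    std::vector<int> values;
};

extern "C" {

void* sketch_integer_output_create(std::size_t nr, std::size_t nc) {
    return new IntOutput{nr, nc, std::vector<int>(nr * nc, 0)};
}

// The value to store is passed by pointer, mirroring the getter.
void sketch_integer_output_set(void* ptr, std::size_t r, std::size_t c, int* val) {
    assert(val != nullptr);
    IntOutput* store = static_cast<IntOutput*>(ptr);
    store->values[r + c * store->nrow] = *val;
}

void sketch_integer_output_destroy(void* ptr) {
    delete static_cast<IntOutput*>(ptr);
}

}

// Writes 7 into row 1, column 0 of a 2-by-2 output and reads it back.
int set_demo() {
    void* ptr = sketch_integer_output_create(2, 2);
    int value = 7;
    sketch_integer_output_set(ptr, 1, 0, &value);
    int stored = static_cast<IntOutput*>(ptr)->values[1];
    sketch_integer_output_destroy(ptr);
    return stored;
}
```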

For non-numeric types

Here, we will use character matrices^[Character matrices tend to require some special attention, as character arrays need to be coerced to Rcpp::String objects to be passed via the in argument.] as an example. We expect a setCol function to set a column of the matrix:

AaronMatrix_character_output_setCol(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

... and another setRow function to set a row of the matrix:

AaronMatrix_character_output_setRow(
    ptr, /* void* */
    r, /* size_t */
    in, /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

These are in camelcase to simplify parsing of the function names. We further expect a setColIndexed function to set specific elements of a column:

AaronMatrix_character_output_setColIndexed(
    ptr, /* void* */
    c, /* size_t */
    n, /* size_t */
    idx, /* Rcpp::IntegerVector::iterator */
    in /* Rcpp::StringVector::iterator */
);

... where idx points to an array of n zero-indexed row indices and in points to an array of n values. The function should assign each value to the corresponding row at column c of the output matrix.

Similarly, we expect a setRowIndexed function to set specific elements of a row:

AaronMatrix_character_output_setRowIndexed(
    ptr, /* void* */
    r, /* size_t */
    n, /* size_t */
    idx, /* Rcpp::IntegerVector::iterator */
    in /* Rcpp::StringVector::iterator */
);

... where idx now contains column indices.
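The indexed setters can be pictured as a scatter operation. In the sketch below, plain pointers stand in for the Rcpp iterator types, and StringStore with its column-major layout is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column-major output store of strings.
struct StringStore {
    std::size_t nrow, ncol;
    std::vector<std::string> values;
};

// Assigns in[i] to row idx[i] of column c, for i in [0, n).
extern "C" void sketch_character_output_setColIndexed(void* ptr, std::size_t c,
                                                      std::size_t n, const int* idx,
                                                      const std::string* in) {
    assert(idx != nullptr && in != nullptr);
    StringStore* store = static_cast<StringStore*>(ptr);
    for (std::size_t i = 0; i < n; ++i) {
        store->values[static_cast<std::size_t>(idx[i]) + c * store->nrow] = in[i];
    }
}

// Sets rows 0 and 2 of the second column in a 3-by-2 matrix.
std::vector<std::string> indexed_demo() {
    StringStore store{3, 2, std::vector<std::string>(6)};
    const int rows[] = {0, 2};
    const std::string vals[] = {"x", "y"};
    sketch_character_output_setColIndexed(&store, 1, 2, rows, vals);
    return store.values;
}
```

A row-wise version would differ only in the offset calculation, using r + idx[i] * nrow under the same layout assumption.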

Numeric types

For integer, logical or numeric matrices, we need to account for type conversions. This is done by defining a separate version of each row/column setter for integer and numeric sources (using integer matrices as an example).

Taking the single-column setter as an example:

AaronMatrix_integer_output_setCol_integer(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::IntegerVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

AaronMatrix_integer_output_setCol_numeric(
    ptr, /* void* */
    c, /* size_t */
    in, /* Rcpp::NumericVector::iterator* */
    first, /* size_t */
    last /* size_t */
);

Defining getter methods

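Output matrices must also be readable, so all single-element and single-row/column getters (the counterparts of get, getCol and getRow above) should be supported for the output class.

For numeric types, convertible getters should also be supported, i.e., versions of the row/column getters that carry a destination type suffix as described for the input case.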

Ensuring discoverability

We use the R_RegisterCCallable() function from the R API to register the above functions (see the "Linking to native routines in other packages" section of Writing R Extensions for an explanation). This ensures that they can be found by beachmat when an AaronMatrix instance is encountered. Note that the functions must be defined with C-style linkage in order for this procedure to work properly, hence the use of extern "C" in the extensions test package.
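The idea behind registration and lookup can be mimicked with a toy table of function pointers keyed by package and routine name. Note that toy_register and toy_lookup are hypothetical stand-ins written for this sketch; they are not R's R_RegisterCCallable() and R_GetCCallable(), which require the R headers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

typedef void (*generic_fn)(); // type-erased function pointer

// Toy table keyed by "package::name"; R keeps its own table internally.
static std::map<std::string, generic_fn>& toy_table() {
    static std::map<std::string, generic_fn> table;
    return table;
}

void toy_register(const char* package, const char* name, generic_fn fun) {
    toy_table()[std::string(package) + "::" + name] = fun;
}

generic_fn toy_lookup(const char* package, const char* name) {
    auto it = toy_table().find(std::string(package) + "::" + name);
    return it == toy_table().end() ? nullptr : it->second;
}

// A routine with C linkage and fixed dimensions, purely for demonstration.
extern "C" void sketch_integer_input_dim(void*, std::size_t* nrow, std::size_t* ncol) {
    *nrow = 10;
    *ncol = 5;
}

// Registers the routine, then retrieves and calls it as a client would.
std::size_t registered_nrow() {
    toy_register("extensions", "sketch_integer_input_dim",
                 reinterpret_cast<generic_fn>(sketch_integer_input_dim));
    generic_fn raw = toy_lookup("extensions", "sketch_integer_input_dim");
    assert(raw != nullptr);
    // The caller must cast back to the exact signature that was registered.
    auto dim = reinterpret_cast<void (*)(void*, std::size_t*, std::size_t*)>(raw);
    std::size_t nr = 0, nc = 0;
    dim(nullptr, &nr, &nc);
    return nr;
}
```

Because the table stores type-erased pointers, nothing checks the signature at lookup time, which is why a systematic naming scheme and fixed argument lists matter.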

Needless to say, the NAMESPACE should contain an appropriate useDynLib command. This means that the shared library will be loaded along with the package, allowing beachmat to access the registered routines within. However, the supportCppAccess method and output flags do not need to be exported, as these will be directly recovered from the package's namespace.

Testing

We suggest using the beachtest package to test correct input and output via external linkage to a custom matrix representation. When using the testthat framework, this can be added to setup.R:

testpkg <- system.file("testpkg", package="beachmat")
devtools::install(testpkg, quick=TRUE)
library(beachtest)

It is simple to write test scripts using functions like check_read_all and check_write_all to quickly verify that linkage works correctly. Developers are again referred to the extensions test package for a working example.



beachmat documentation built on Dec. 22, 2020, 2 a.m.