---
title: "Supporting arbitrary matrix classes"
author: "Aaron Lun"
package: beachmat
output:
  BiocStyle::html_document:
    toc_float: yes
vignette: >
  %\VignetteIndexEntry{4. Supporting arbitrary matrix classes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Background

Ordinarily, direct support for a matrix representation would require the appropriate methods to be defined in `r Biocpkg("beachmat")` at compile time.
This is the case for the most widely used matrix classes but is somewhat restrictive for other community-contributed matrix representations.
Fortunately, R provides a mechanism to link across shared libraries from different packages.
This means that package developers who define an R matrix representation can also define C++ methods for native read/write support in `r Biocpkg("beachmat")`-dependent code.
By doing so, we can improve the efficiency of access to these new classes by avoiding the need for block processing via R.

A functioning demonstration of this approach is available in the `extensions` test package.
This vignette will provide an explanation of the code in `extensions`, and we suggest examining the source code at the same time:

```{r}
system.file("extensions", package="beachmat")
```

# External linkage for input

## Setting up in R

Assume that we have already defined a new matrix-like S4 class (here, `AaronMatrix`).
To notify the `r Biocpkg("beachmat")` API that direct input support is available, we need to:

- define a method for the `supportCppAccess()` generic (from `r Biocpkg("beachmat")`) for this class.
  This should return `TRUE` if direct support is available (obviously).
- define a method for the `type()` generic from the `r Biocpkg("DelayedArray")` package.
  This should return the type of the matrix, i.e., integer, logical, numeric or character.


It is possible to provide direct support for only some data types of the given matrix representation.
The example in `extensions` only directly supports integer and character `AaronMatrix` objects^[Because I was too lazy to add all of them.] and will only return `TRUE` for such types.

## Setting up in C++

We will use integer matrices for demonstration, though it is simple to generalize this to all types by replacing `_integer` with, e.g., `_character`^[Some understanding of C++ templates will greatly simplify the definition of the same methods for different types.].

First, we define a `create()` function that takes a `SEXP` object and returns a `void` pointer.
This should presumably point to some C++ class that can contain intermediate data structures for efficient access.

```cpp
void * ptr = AaronMatrix_integer_input_create(in /* SEXP */);
```

We define a `clone()` function that performs a deep copy of the aforementioned pointer.

```cpp
void * ptr_copy = AaronMatrix_integer_input_clone(ptr /* void* */);
```

We define a `destroy()` function that frees the memory pointed to by `ptr`.

```cpp
AaronMatrix_integer_input_destroy(ptr /* void* */);
```

We define a `dim()` function that records the number of rows and columns in the object pointed to by `ptr`.
Note that `nrow` and `ncol` are pointers.

```cpp
AaronMatrix_integer_input_dim(
    ptr,  /* void* */
    nrow, /* size_t* */
    ncol  /* size_t* */
);
```

A systematic naming scheme is used for all functions, consisting of:

- The name of the matrix representation, i.e., `AaronMatrix`.
- The data type, i.e., `integer`.
- Whether it is an `input` or `output` class.
- The purpose of the function, e.g., `destroy`.

## Defining getter methods

### For all types

In general, the getter functions follow the same structure as that described for the `r Biocpkg("beachmat", vignette="input.html", label="input API")`.
We expect a `get` function to obtain a specified entry of the matrix:

```cpp
AaronMatrix_integer_input_get(
    ptr, /* void* */
    r,   /* size_t */
    c,   /* size_t */
    val  /* int* */
);
```

Note that `val` is a **pointer** to the matrix type.
For example, `val` should be a `Rcpp::String*` for character matrices, a `double*` for numeric matrices, and an `int*` for logical matrices.

Developers can assume that `r` and `c` are valid, i.e., within `[0, nrow)` and `[0, ncol)` respectively.
These checks are performed by `r Biocpkg("beachmat")` and do not have to be repeated within developer-defined functions^[Obviously, the dimensions of the matrix pointed to by `ptr` should not change!].

### For non-numeric types

Here, we will use character matrices^[Character matrices tend to require some special attention, as character arrays need to be coerced to `Rcpp::String` objects to be returned in `in`.] as an example.
We expect a `getCol` function to obtain a column of the matrix:

```cpp
AaronMatrix_character_input_getCol(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

... and another `getRow` function to obtain a row of the matrix:

```cpp
AaronMatrix_character_input_getRow(
    ptr,   /* void* */
    r,     /* size_t */
    in,    /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

These are in camel case to simplify parsing of the function names.

We further expect a `getCols` function to obtain multiple columns:

```cpp
AaronMatrix_character_input_getCols(
    ptr,     /* void* */
    c,       /* size_t */
    indices, /* Rcpp::IntegerVector::iterator* */
    n,       /* size_t */
    in,      /* Rcpp::StringVector::iterator* */
    first,   /* size_t */
    last     /* size_t */
);
```

...
and a `getRows` function to obtain multiple rows:

```cpp
AaronMatrix_character_input_getRows(
    ptr,     /* void* */
    r,       /* size_t */
    indices, /* Rcpp::IntegerVector::iterator* */
    n,       /* size_t */
    in,      /* Rcpp::StringVector::iterator* */
    first,   /* size_t */
    last     /* size_t */
);
```

In all cases, `first` and `last` can be assumed to be valid, i.e., `first <= last` and both in `[0, nrow)` or `[0, ncol)` (for column and row access, respectively).
Indices in `indices` can also be assumed to be valid, i.e., within matrix dimensions and strictly increasing.

We stress that the various iterator arguments are _pointers to iterators_ rather than the iterators themselves.
This is to avoid potential issues with C++ classes when using C-style linkage via R's `R_GetCCallable()` framework.

### Numeric types

For integer, logical or numeric matrices, we need to account for type conversions.
This is done by defining the following functions (using integer matrices as an example):

- `AaronMatrix_integer_input_getCol_integer`, for getting a single column's values as integers.
- `AaronMatrix_integer_input_getCol_numeric`, for getting a single column's values as double-precision values.
- `AaronMatrix_integer_input_getRow_integer`, for getting a single row's values as integers.
- `AaronMatrix_integer_input_getRow_numeric`, for getting a single row's values as double-precision values.
- `AaronMatrix_integer_input_getCols_integer`, for getting multiple columns' values as integers.
- `AaronMatrix_integer_input_getCols_numeric`, for getting multiple columns' values as double-precision values.
- `AaronMatrix_integer_input_getRows_integer`, for getting multiple rows' values as integers.
- `AaronMatrix_integer_input_getRows_numeric`, for getting multiple rows' values as double-precision values.

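
To make the type conversion concrete, the following is a minimal sketch of how the two single-column getters for an integer matrix might be written.
It is not the code from the `extensions` package: the `AaronIntStore` struct is a hypothetical column-major backing store, plain `int*` and `double*` stand in for the Rcpp iterator types so that the sketch compiles without Rcpp, and `first` and `last` are assumed to describe a half-open range `[first, last)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backing store for an integer AaronMatrix (column-major).
// This is an assumption for illustration, not the class used in 'extensions'.
struct AaronIntStore {
    std::size_t nrow, ncol;
    std::vector<int> values;
};

extern "C" {

// Copies rows [first, last) of column 'c' into the destination as integers.
// 'in' is a pointer to the iterator, mirroring the pointer-to-iterator
// convention; int* stands in for Rcpp::IntegerVector::iterator (assumption).
void AaronMatrix_integer_input_getCol_integer(void* ptr, std::size_t c,
        int** in, std::size_t first, std::size_t last) {
    assert(first <= last); // debug-only sanity check
    const AaronIntStore* mat = static_cast<const AaronIntStore*>(ptr);
    const int* col = mat->values.data() + c * mat->nrow;
    std::copy(col + first, col + last, *in);
}

// Same access pattern, but each int is converted to double on copy.
// double* stands in for Rcpp::NumericVector::iterator (assumption).
void AaronMatrix_integer_input_getCol_numeric(void* ptr, std::size_t c,
        double** in, std::size_t first, std::size_t last) {
    assert(first <= last); // debug-only sanity check
    const AaronIntStore* mat = static_cast<const AaronIntStore*>(ptr);
    const int* col = mat->values.data() + c * mat->nrow;
    std::copy(col + first, col + last, *in); // implicit int -> double
}

}
```

The two functions differ only in the destination type, which is why a shared template helper behind the `extern "C"` wrappers is a natural design.
The sketch does not treat missing values specially, which a real implementation would probably need to consider when converting between integer and double-precision values.
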
Taking the single-column getter as an example:

```cpp
AaronMatrix_integer_input_getCol_integer(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::IntegerVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);

AaronMatrix_integer_input_getCol_numeric(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::NumericVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

The function name now has an additional suffix to denote the destination type.
We explicitly define conversions here as the cross-library linking framework does not support templating or overloading of `in`.

# External linkage for output

## Setting up in R

To notify the `r Biocpkg("beachmat")` API that direct output support is available, we need to define flags in our package's namespace.

```{r}
beachmat_AaronMatrix_integer_output <- TRUE
beachmat_AaronMatrix_character_output <- TRUE
```

This indicates that support is available for `AaronMatrix` integer and character outputs.
Missing or `FALSE` flags indicate that no support is available, in which case `r Biocpkg("beachmat")` will write to an ordinary matrix by default.

## Setting up in C++

Again, we will use integer matrices for demonstration.
The required functions are mostly similar to the input case.
For creation, we expect to have the number of rows `nr` and columns `nc`:

```cpp
void * ptr = AaronMatrix_integer_output_create(
    nr, /* size_t */
    nc  /* size_t */
);
```

We define a `clone()` function to perform a deep copy:

```cpp
void * ptr_copy = AaronMatrix_integer_output_clone(ptr /* void* */);
```

We also define a `destroy()` function to free memory:

```cpp
AaronMatrix_integer_output_destroy(ptr /* void* */);
```

In all cases, we use `_output_` to indicate that we are dealing with an output matrix class.

## Defining setter methods

### For all types

In general, the setter functions follow the same structure as that described for the `r Biocpkg("beachmat", vignette="output.html", label="output API")`.
We expect a `set` function to set a specified entry of the matrix:

```cpp
AaronMatrix_integer_output_set(
    ptr, /* void* */
    r,   /* size_t */
    c,   /* size_t */
    val  /* int* */
);
```

Again, note that `val` is a **pointer** to the matrix type.

### For non-numeric types

Here, we will use character matrices^[Character matrices tend to require some special attention, as character arrays need to be coerced to `Rcpp::String` objects to be returned in `in`.] as an example.
We expect a `setCol` function to set a column of the matrix:

```cpp
AaronMatrix_character_output_setCol(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

... and another `setRow` function to set a row of the matrix:

```cpp
AaronMatrix_character_output_setRow(
    ptr,   /* void* */
    r,     /* size_t */
    in,    /* Rcpp::StringVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

These are in camel case to simplify parsing of the function names.

We further expect a `setColIndexed` function to set specific elements of a column:

```cpp
AaronMatrix_character_output_setColIndexed(
    ptr, /* void* */
    c,   /* size_t */
    n,   /* size_t */
    idx, /* Rcpp::IntegerVector::iterator */
    in   /* Rcpp::StringVector::iterator */
);
```

... where `idx` points to an array of `n` zero-indexed row indices and `in` points to an array of values.
The function should assign each value to the corresponding row at column `c` of the output matrix.
Similarly, we expect a `setRowIndexed` function to set specific elements of a row:

```cpp
AaronMatrix_character_output_setRowIndexed(
    ptr, /* void* */
    r,   /* size_t */
    n,   /* size_t */
    idx, /* Rcpp::IntegerVector::iterator */
    in   /* Rcpp::StringVector::iterator */
);
```

... where `idx` now contains column indices.

### Numeric types

For integer, logical or numeric matrices, we need to account for type conversions.
This is done by defining the following functions (using integer matrices as an example):

- `AaronMatrix_integer_output_setCol_integer`, for setting a single column's values from integers.
- `AaronMatrix_integer_output_setCol_numeric`, for setting a single column's values from double-precision values.
- `AaronMatrix_integer_output_setRow_integer`, for setting a single row's values from integers.
- `AaronMatrix_integer_output_setRow_numeric`, for setting a single row's values from double-precision values.
- `AaronMatrix_integer_output_setColIndexed_integer`, for indexed setting of a single column's values from integers.
- `AaronMatrix_integer_output_setColIndexed_numeric`, for indexed setting of a single column's values from double-precision values.
- `AaronMatrix_integer_output_setRowIndexed_integer`, for indexed setting of a single row's values from integers.
- `AaronMatrix_integer_output_setRowIndexed_numeric`, for indexed setting of a single row's values from double-precision values.

Taking the single-column setter as an example:

```cpp
AaronMatrix_integer_output_setCol_integer(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::IntegerVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);

AaronMatrix_integer_output_setCol_numeric(
    ptr,   /* void* */
    c,     /* size_t */
    in,    /* Rcpp::NumericVector::iterator* */
    first, /* size_t */
    last   /* size_t */
);
```

## Defining getter methods

All single-element and single-row/column getters should be supported:

- `AaronMatrix_character_output_get`
- `AaronMatrix_character_output_getRow`
- `AaronMatrix_character_output_getCol`

For numeric types, convertible getters should also be supported:

- `AaronMatrix_integer_output_get`
- `AaronMatrix_integer_output_getRow_integer`
- `AaronMatrix_integer_output_getRow_numeric`
- `AaronMatrix_integer_output_getCol_integer`
- `AaronMatrix_integer_output_getCol_numeric`

# Ensuring discoverability

We use the `R_RegisterCCallable()` function from the R API to register the above functions
(see [here](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Linking-to-native-routines-in-other-packages) for an explanation).
This ensures that they can be found by `r Biocpkg("beachmat")` when an `AaronMatrix` instance is encountered.
Note that the functions must be defined with C-style linkage in order for this procedure to work properly, hence the use of `extern "C"` in the `extensions` test package.

Needless to say, the `NAMESPACE` should contain an appropriate `useDynLib` command.
This means that the shared library will be loaded along with the package, allowing `r Biocpkg("beachmat")` to access the registered routines within.
However, the `supportCppAccess` method and output flags do not need to be exported, as these will be directly recovered from the package's namespace.

# Testing

We suggest using the `r Rpackage("beachtest")` package to test correct input and output via external linkage to a custom matrix representation.
When using the `r CRANpkg("testthat")` framework, this can be added to `setup.R`:

```{r, eval=FALSE}
testpkg <- system.file("testpkg", package="beachmat")
devtools::install(testpkg, quick=TRUE)
library(beachtest)
```

It is simple to write test scripts using functions like `check_read_all` and `check_write_all` to quickly verify that linkage works correctly.
Developers are again referred to the `extensions` test package for a working example.