Package: RcppColMetric 0.1.0 Author: Xiurui Zhu Modified: 2025-03-08 18:17:18 Compiled: 2025-03-08 18:17:53
The goal of RcppColMetric
is to efficiently compute metrics between
various vectors and a common vector. This is common in data science,
such as computing performance metrics between each feature and a common
response. Rcpp
is used to
efficiently iterate over vectors through compiled code. You may extend
its utilities by providing custom metrics that fit into the framework.
You can install the released version of RcppColMetric
from
CRAN with:
install.packages("RcppColMetric")
Alternatively, you can install the developmental version of
RcppColMetric
from github with:
remotes::install_github("zhuxr11/RcppColMetric")
library(cbbinom)
We use cats
from MASS
to
illustrate the use of the package.
library(MASS)
data(cats)
print(head(cats))
#> Sex Bwt Hwt
#> 1 F 2.0 7.0
#> 2 F 2.0 7.4
#> 3 F 2.0 9.5
#> 4 F 2.1 7.2
#> 5 F 2.1 7.3
#> 6 F 2.1 7.6
In binary classification modelling, it is a common practice to compute
ROC-AUC of each feature (usually columns) against a common target.
RcppColMetric
provides a much faster version than its commonly used
counterparts, e.g. caTools::colAUC()
.
library(caTools)
(col_auc_bench <- microbenchmark::microbenchmark(
col_auc_r = caTools::colAUC(cats[, 2L:3L], cats[, 1L]),
col_auc_cpp = col_auc(cats[, 2L:3L], cats[, 1L]),
times = 100L,
check = "identical"
))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> col_auc_r 514.600 544.2005 623.3241 613.2515 666.1505 985.601 100
#> col_auc_cpp 200.401 222.2010 244.9679 236.0010 266.8010 391.501 100
As can be seen, the median speed of computation from RcppColMetric
is
2.599 times faster.
If there are multiple sets of features and responses, you may use the
vectorized version col_auc_vec()
, which uses compiled code to speed up
iterations and returns a list.
col_auc_vec(list(cats[, 2L:3L]), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> F vs. M 0.8338451 0.759048
In classification modelling, it is another common practice to assess
mutual information between features and a response if the features are
discrete. RcppColMetric
provides a much faster version than its
commonly used counterparts, e.g. infotheo::mutinformation()
.
library(infotheo)
(col_mut_info_bench <- microbenchmark::microbenchmark(
col_mut_info_r = sapply(round(cats[, 2L:3L]), infotheo::mutinformation, cats[, 1L]) %>%
{matrix(., nrow = 1L, dimnames = list(NULL, names(.)))},
col_mut_info_cpp = col_mut_info(round(cats[, 2L:3L]), cats[, 1L]),
times = 100L,
check = "identical"
))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> col_mut_info_r 1587.200 1685.351 1884.131 1766.001 1964.9510 4803.800 100
#> col_mut_info_cpp 618.401 641.551 691.325 659.651 721.6015 943.101 100
As can be seen, the median speed of computation from RcppColMetric
is
2.677 times faster.
If there are multiple sets of features and responses, you may use the
vectorized version col_mut_info_vec()
, which uses compiled code to
speed up iterations and returns a list.
col_mut_info_vec(list(round(cats[, 2L:3L])), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> [1,] 0.1346783 0.1620514
You may implement your own metric by inheriting from
RcppColMetric::Metric
class with template arguments as feature SEXP
(input, numeric here) and response SEXP
(input, factor as integer
here) types. For example, to compute range of each feature, define a
RangeMetric
class.
#include <RcppColMetric.h>
#include <Rcpp.h>
using namespace Rcpp;
// x: numeric (REALSXP), y: factor -> integer (INTSXP), output: numeric (REALSXP)
class
RangeMetric: public RcppColMetric::Metric<REALSXP, INTSXP, REALSXP>
{
public:
// Constructor
RangeMetric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
// This parameter is inherited from `Metric`, determining output dimension (number of rows)
// For RangeMetric, the output dimension is 2 (min & max)
output_dim = 2;
}
virtual Nullable<CharacterVector> row_names(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) const override {
// Determine the row names
// If not used, it may return R_NilValue
CharacterVector out = {"min", "max"};
return out;
}
virtual NumericVector calc_col(const NumericVector& x, const IntegerVector& y, const R_xlen_t& i, const Nullable<List>& args = R_NilValue) const override {
// Derive output value for each feature and the common response
// For RangeMetric, the output is min & max
NumericVector out = {min(x), max(x)};
return out;
}
};
Then, define the main function calling RcppColMetric::col_metric()
,
with corresponding feature SEXP
(input, numeric here), response SEXP
(input, factor as integer here) and output SEXP
(output, numeric here)
types.
NumericMatrix col_range(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
RangeMetric range_metric(x, y, args);
NumericMatrix out = RcppColMetric::col_metric<REALSXP, INTSXP, REALSXP>(x, y, range_metric, args);
return out;
}
Test this function with cats
:
col_range(cats[, 2L:3L], cats[, 1L])
#> Bwt Hwt
#> min 2.0 6.3
#> max 3.9 20.5
To define vectorized version of the function, a wrapper function is
defined to generate RangeMetric
object (taking only x
, y
and
args
), and then passed on to the workhorse
RcppColMetric::col_metric_vec()
.
RangeMetric gen_range_metric(const RObject& x, const IntegerVector& y, const Nullable<List>& args = R_NilValue) {
RangeMetric out(x, y, args);
return out;
}
// [[Rcpp::export]]
List col_range_vec(const List& x, const List& y, const Nullable<List>& args = R_NilValue) {
List out = RcppColMetric::col_metric_vec<REALSXP, INTSXP, REALSXP>(x, y, &gen_range_metric, args);
return out;
}
Test the vectorized function with cats
:
col_range_vec(list(cats[, 2L:3L]), list(cats[, 1L]))
#> [[1]]
#> Bwt Hwt
#> min 2.0 6.3
#> max 3.9 20.5
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.