
sparseprepr


The goal of sparseprepr is to enable common pre-processing actions for sparse matrices and provide a more memory-efficient workflow for modeling at scale.

Installation

Install from GitHub with:

# install.packages("devtools")
devtools::install_github("dmolitor/sparseprepr")

Scope

sparseprepr only supports sparse matrices stored in sorted, compressed column-oriented form, formally of class CsparseMatrix. The Matrix package also defines sorted compressed row-oriented (RsparseMatrix) and triplet (TsparseMatrix) forms, but it makes clear that "most operations with sparse matrices are performed using the compressed, column-oriented or CsparseMatrix representation," and that, even when a matrix is created in TsparseMatrix or RsparseMatrix form for convenience, "once it is created, however, the matrix is generally coerced to a CsparseMatrix for further operations."
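If a matrix starts out in triplet form, it can be coerced to the supported class before any pre-processing. The following is a minimal sketch using only the Matrix package; the object names t_mat and c_mat are illustrative and not part of sparseprepr.

```r
library(Matrix)

# Build a small 3 x 3 matrix in triplet form (inherits from TsparseMatrix).
# spMatrix() takes 1-based row (i) and column (j) indices.
t_mat <- spMatrix(
  nrow = 3, ncol = 3,
  i = c(1, 3, 2),
  j = c(1, 2, 3),
  x = c(1.5, 2, -1)
)
is(t_mat, "TsparseMatrix")

# Coerce to compressed column-oriented form, the representation
# that sparseprepr is documented to support.
c_mat <- as(t_mat, "CsparseMatrix")
is(c_mat, "CsparseMatrix")
```

The coercion changes only the storage layout; the values and dimensions of the matrix are unchanged.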

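# Start from a random 10 x 3 sparse matrix
x <- Matrix::rsparsematrix(10, 3, density = 0.9)
# Append a copy of column 3, then perturb one entry so the copy is
# highly correlated with, but not identical to, the original
x <- cbind(x, x[, 3, drop = FALSE])
x[10, 3] <- x[10, 3] + 0.001
# Append two constant (zero-variance) columns: all ones and all zeros
x <- cbind(x, 1, 0)
# Append an exact duplicate of column 2
x <- cbind(x, x[, 2, drop = FALSE])
# Append a column with a single non-zero entry
x <- cbind(x, c(rep(0, 9), 1))
colnames(x) <- c(
  paste0("x", 1:3),
  "cor_with_x3",
  "const_col",
  "const_col2",
  "dup_x2",
  "sparse_col"
)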

Core Functionality

The following toy example demonstrates several of the pre-processing features that sparseprepr provides.

library(sparseprepr)

x

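The matrix shown above has a number of contrived features: column cor_with_x3 is highly correlated with column x3, const_col and const_col2 are zero-variance columns, dup_x2 is identical to x2, and sparse_col is a highly sparse column. Common pre-processing steps provided by sparseprepr include removing constant columns, removing highly correlated columns, removing duplicate columns, removing highly sparse columns, and applying transformations to selected columns.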
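To give a sense of what detecting such columns involves, the following is a minimal sketch that flags zero-variance and duplicate columns of a CsparseMatrix without converting it to a dense matrix. It uses only the Matrix package and is not sparseprepr's implementation; the matrix m and the names is_constant, is_duplicate, and kept are hypothetical.

```r
library(Matrix)

# Small example: column 2 is all zero, column 4 duplicates column 1,
# and column 5 is constant (all ones).
m <- sparseMatrix(
  i = c(1, 3, 2, 1, 3, 1, 2, 3),
  j = c(1, 1, 3, 4, 4, 5, 5, 5),
  x = c(2, 5, 7, 2, 5, 1, 1, 1),
  dims = c(3, 5)
)

# Zero-variance columns: Var(col) = E[col^2] - E[col]^2.
# m^2 stays sparse, so no dense copy is created.
is_constant <- (colMeans(m^2) - colMeans(m)^2) < 1e-12

# Duplicate columns: build a string key from each column's stored
# row indices (slot i) and values (slot x), then look for repeats.
# Values are compared at paste() precision, so this is only a sketch.
col_id <- rep(seq_len(ncol(m)), times = diff(m@p))
keys <- vapply(seq_len(ncol(m)), function(j) {
  sel <- col_id == j
  paste(m@i[sel], m@x[sel], sep = ":", collapse = "|")
}, character(1))
is_duplicate <- duplicated(keys)

# Keep only columns that are neither constant nor a later duplicate.
kept <- m[, !(is_constant | is_duplicate), drop = FALSE]
```

Working from the slots of the compressed column representation keeps the cost proportional to the number of stored entries rather than to the full dimensions of the matrix.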

Pipe Workflow

These same pre-processing steps can be chained more readably with the magrittr pipe (%>%) or the base pipe (|>, available in R 4.1 or later).

x |>
  remove_constant() |>
  remove_correlated(threshold = 0.99) |>
  remove_duplicate() |>
  remove_sparse(threshold = 0.9) |>
  transform_cols(
    fns = list(function(i) i^2, function(i) i^3),
    which.cols = paste0("x", 2:3),
    name.sep = list("squared", "cubed")
  )


dmolitor/sparseprepr documentation built on Jan. 7, 2022, 9:58 p.m.