knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The goal of sparseprepr is to enable common pre-processing actions for sparse matrices and provide a more memory-efficient workflow for modeling at scale.
Install from Github with:
# install.packages(devtools) devtools::install_github("dmolitor/sparseprepr")
sparseprepr functionality only supports sparse matrices coded in sorted
compressed column-oriented form, formally of class CsparseMatrix
. Although the
Matrix
package also defines sorted compressed row-oriented form
(RsparseMatrix
) and triplet form (TsparseMatrix
) sparse matrices, it makes
clear that "most operations with sparse matrices are performed using the
compressed, column-oriented or CsparseMatrix representation," and that even when
matrices are created in the TsparseMatrix
or RsparseMatrix
forms for
convenience, "once it is created, however, the matrix is generally coerced to a
CsparseMatrix for further operations."
x <- Matrix::rsparsematrix(10, 3, density = 0.9) x <- cbind(x, x[, 3, drop = FALSE]) x[10, 3] <- x[10, 3] + 0.001 x <- cbind(x, 1, 0) x <- cbind(x, x[, 2, drop = FALSE]) x <- cbind(x, c(rep(0, 9), 1)) colnames(x) <- c( paste0("x", 1:3), "cor_with_x3", "const_col", "const_col2", "dup_x2", "sparse_col" )
The following toy example shows a number of the pre-processing features that sparseprepr provides.
library(sparseprepr)
x
The matrix shown above has a number of contrived features;
column cor_with_x3
is highly correlated with column x3
, const_col
and
const_col2
are zero-variance columns, dup_x2
is identical to x2
,
and sparse_col
is a highly sparse column. Common pre-processing steps provided
by sparseprepr include:
Dropping highly correlated columns
r
remove_correlated(x, threshold = 0.99)
Dropping empty columns
r
remove_empty(x)
Dropping constant columns
r
remove_constant(x)
Dropping duplicated columns
r
remove_duplicate(x)
Dropping highly sparse columns
r
remove_sparse(x, threshold = 0.9)
Transforming matrix columns:
r
# Append squared and cubed terms
transform_cols(
x[, 1:3],
fns = list(function(i) i^2, function(i) i^3),
which.cols = paste0("x", 2:3),
name.sep = list("squared", "cubed")
)
These same pre-processing steps can be utilized in a more user-friendly manner
via the magrittr pipe (%>%
) or the base pipe (|>
- R 4.1 or greater).
x |> remove_constant() |> remove_correlated(threshold = 0.99) |> remove_duplicate() |> remove_sparse(threshold = 0.9) |> transform_cols( fns = list(function(i) i^2, function(i) i^3), which.cols = paste0("x", 2:3), name.sep = list("squared", "cubed") )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.