title: "cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R" tags: - R - covariance matrix - cross-validation - high-dimensional statistics - loss-based estimation - multivariate analysis authors: - name: Philippe Boileau orcid: 0000-0002-4850-2507 affiliation: "1, 2" - name: Nima S. Hejazi orcid: 0000-0002-7127-2789 affiliation: "1, 2" - name: Brian Collica orcid: 0000-0003-1127-2557 affiliation: 3 - name: Mark J. van der Laan orcid: 0000-0003-1432-5511 affiliation: "2, 3, 4" - name: Sandrine Dudoit orcid: 0000-0002-6069-8629 affiliation: "2, 3, 4" affiliations: - name: Graduate Group in Biostatistics, University of California, Berkeley index: 1 - name: Center for Computational Biology, University of California, Berkeley index: 2 - name: Department of Statistics, University of California, Berkeley index: 3 - name: Division of Biostatistics, School of Public Health, University of California, Berkeley index: 4 date: "\today" bibliography: paper.bib

# Summary

Covariance matrices play fundamental roles in myriad statistical procedures. When the observations in a dataset far outnumber the features, asymptotic theory and empirical evidence have demonstrated the sample covariance matrix to be the optimal estimator of this parameter. This assertion does not hold when the number of observations is commensurate with or smaller than the number of features. Consequently, statisticians have derived many novel covariance matrix estimators for the high-dimensional regime, often relying on additional assumptions about the parameter's structural characteristics (e.g., sparsity). While these estimators have greatly improved the ability to estimate covariance matrices in high-dimensional settings, objectively selecting the best estimator from among the many possible candidates remains a largely unaddressed challenge. The cvCovEst package addresses this methodological gap through its implementation of a cross-validated framework for covariance matrix estimator selection. This data-adaptive procedure's selections are asymptotically optimal under minimal assumptions -- in fact, they are equivalent to the selections that would be made if given full knowledge of the true data-generating processes (i.e., an oracle selector) [@laan_dudoit:2003].

# Statement of Need

When the number of observations in a dataset far exceeds the number of features, the estimator of choice for the covariance matrix is the sample covariance matrix. It is efficient under minimal regularity assumptions on the data-generating distribution. In high-dimensional regimes, however, its performance is unsatisfactory: the sample covariance matrix is highly variable and produces estimates with diverging condition numbers and over-dispersed eigenvalues [@johnstone2001distribution]. Analyses relying on this demonstrably poor estimator may therefore be adversely affected.

As high-dimensional data have become widespread, researchers have derived many novel covariance matrix estimators to remedy the sample covariance matrix's shortcomings. These estimators come in many flavors, though most are constructed by regularizing the sample covariance matrix. Comprehensive reviews are provided by @fan2016 and @pourahmadi2013, and these estimators are implemented across a variety of R packages: CovTools [@lee2019], CVTuningCov [@wang2014], and nlshrink [@ramprasad2016], to name but a few.

This variety brings with it many challenges. Identifying an "optimal" estimator from among a collection of candidates can prove a daunting task, one whose objectivity is often compromised by the data analyst's decisions. Though data-driven approaches for selecting an optimal estimator from among estimators belonging to certain (limited) classes have been derived, the question of selecting from a diverse collection of candidate procedures remains unaddressed.

# cvCovEst Framework

The solution provided by cvCovEst is a general, cross-validation-based, estimator-agnostic framework for covariance matrix estimator selection. The asymptotic optimality of its selections is guaranteed under a few non-restrictive assumptions by extending the seminal work of @laan_dudoit:2003, @dudoit2005, and @vaart2006 on data-adaptive estimator selection to high-dimensional covariance matrix estimation [@boileau2021]. Here, optimality means that the selected estimator's risk difference is asymptotically equivalent to that of the estimator that would have been selected were the underlying data-generating distribution completely known.
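To convey the flavor of the selection criterion, the display below sketches a cross-validated risk under a Frobenius-type loss; the notation is illustrative and simplified relative to the formal development in @boileau2021. Given centered observations $X_1, \ldots, X_n \in \mathbb{R}^p$ partitioned into $V$ validation folds $\mathcal{V}_1, \ldots, \mathcal{V}_V$, each candidate estimator $\hat{\Psi}_k$ is assigned the cross-validated risk

$$
\hat{R}_V(\hat{\Psi}_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{\lvert \mathcal{V}_v \rvert}
\sum_{i \in \mathcal{V}_v}
\left\lVert X_i X_i^\top - \hat{\Psi}_k\big(\{X_j : j \notin \mathcal{V}_v\}\big) \right\rVert_F^2,
$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm and $\hat{\Psi}_k(\cdot)$ is the candidate fit on the corresponding training fold. The cross-validated selector then chooses the candidate with the smallest estimated risk.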

The cvCovEst software package implements this framework for the R language and environment for statistical computing [@R]. Included is a collection of covariance matrix estimators spanning the work of many researchers (Table 1); these estimators may also be employed independently of the cross-validation procedure. cvCovEst further provides a variety of plotting and summary functions. These diagnostic tools allow users to gauge the algorithm's performance, diagnose issues that might arise during estimation, and build intuition about the candidate estimators' behaviors. Additionally, users may speed up the cross-validation procedure via parallel computation, which relies on the suite of future packages [@future] by way of the origami package [@origami]. A brief sketch of standalone estimator use and parallelization follows Table 1.

Table 1: Covariance matrix estimators implemented as of version 1.0.0.

| Estimator | Implementation | Description |
|-----------|----------------|-------------|
| Sample covariance matrix | `sampleCovEst()` | The sample covariance matrix. |
| Hard thresholding [@Bickel2008_thresh] | `thresholdingEst()` | Applies a hard thresholding operator to the entries of the sample covariance matrix. |
| SCAD thresholding [@rothman2009; @fan2001] | `scadEst()` | Applies the SCAD thresholding operator to the entries of the sample covariance matrix. |
| Adaptive LASSO [@rothman2009] | `adaptiveLassoEst()` | Applies the adaptive LASSO thresholding operator to the entries of the sample covariance matrix. |
| Banding [@bickel2008_banding] | `bandingEst()` | Replaces the sample covariance matrix's off-diagonal bands by zeros. |
| Tapering [@cai2010] | `taperingEst()` | Tapers the sample covariance matrix's off-diagonal bands, eventually replacing them by zeros. |
| Optimal linear shrinkage [@Ledoit2004] | `linearShrinkLWEst()` | Asymptotically optimal shrinkage of the sample covariance matrix towards the identity. |
| Linear shrinkage [@Ledoit2004] | `linearShrinkEst()` | Shrinkage of the sample covariance matrix towards the identity, with the amount of shrinkage controlled by a hyperparameter. |
| Dense linear shrinkage [@shafer2005] | `denseLinearShrinkEst()` | Asymptotically optimal shrinkage of the sample covariance matrix towards a dense matrix whose diagonal elements are the mean of the sample covariance matrix's diagonal and whose off-diagonal elements are the mean of its off-diagonal elements. |
| Nonlinear shrinkage [@Ledoit2020] | `nlShrinkLWEst()` | Analytical estimator for the nonlinear shrinkage of the sample covariance matrix. |
| POET [@fan2013] | `poetEst()` | An estimator based on latent variable estimation and thresholding. |
| Robust POET [@fan2018] | `robustPoetEst()` | A robust (and more computationally taxing) take on the POET estimator. |
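As noted above, the estimators in Table 1 may be used on their own, and the cross-validated selector may be parallelized through the future framework. The sketch below illustrates both under a few assumptions: `dat` is a numeric matrix with observations in rows, the estimator functions accept the data via a `dat` argument as in the cross-validation calls, and the `parallel` argument of `cvCovEst()` is taken to toggle the future-based backend (consult the package documentation for the exact interface).

```r
library(cvCovEst)
library(future)

# toy data: 50 observations of 200 features (illustration only)
set.seed(123)
dat <- matrix(rnorm(50 * 200), nrow = 50, ncol = 200)

# an estimator from Table 1 applied directly, outside of cross-validation
lw_estimate <- linearShrinkLWEst(dat = dat)

# register a parallel backend; the future-based cross-validation loop
# provided by origami then distributes computations across the workers
plan(multisession, workers = 2)

cv_fit <- cvCovEst(
  dat = dat,
  estimators = c(linearShrinkLWEst, thresholdingEst, sampleCovEst),
  estimator_params = list(thresholdingEst = list(gamma = c(0.2, 0.4))),
  cv_scheme = "v_fold",
  v_folds = 5,
  parallel = TRUE  # assumed flag enabling the future-based backend
)
```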

# Examples

We briefly showcase cvCovEst's functionality through a toy example and an application to single-cell transcriptomic data.

## Toy Dataset Example

Multivariate normal data are simulated using a covariance matrix with a Toeplitz structure and then fed to the cvCovEst function. A summary of the cross-validated estimation procedure is provided via the plot method.

```r
library(MASS)
library(cvCovEst)
set.seed(1584)

# function to generate a Toeplitz covariance matrix
toep_sim <- function(p, rho, alpha) {
  times <- seq_len(p)
  H <- abs(outer(times, times, "-")) + diag(p)
  H <- H^-(1 + alpha) * rho
  covmat <- H + diag(p) * (1 - rho)

  sign_mat <- sapply(
    times,
    function(i) {
      sapply(
        times,
        function(j) {
          (-1)^(abs(i - j))
        }
      )
    }
  )
  return(covmat * sign_mat)
}

# generate a 100 x 100 covariance matrix
sim_covmat <- toep_sim(p = 100, rho = 0.6, alpha = 0.3)

# sample 75 observations from a multivariate normal with mean 0 and
# covariance sim_covmat
sim_dat <- MASS::mvrnorm(n = 75, mu = rep(0, 100), Sigma = sim_covmat)

# run the cross-validated selector
cv_cov_est_sim <- cvCovEst(
  dat = sim_dat,
  estimators = c(
    linearShrinkEst, thresholdingEst, bandingEst, adaptiveLassoEst,
    sampleCovEst, taperingEst
  ),
  estimator_params = list(
    linearShrinkEst = list(alpha = seq(0.25, 0.75, 0.05)),
    thresholdingEst = list(gamma = seq(0.25, 0.75, 0.05)),
    bandingEst = list(k = seq(2L, 10L, 2L)),
    adaptiveLassoEst = list(lambda = c(0.1, 0.25, 0.5, 0.75, 1), n = seq(1, 5)),
    taperingEst = list(k = seq(2L, 10L, 2L))
  ),
  cv_scheme = "v_fold",
  v_folds = 5
)

# plot a summary of the results
plot(cv_cov_est_sim, data_orig = sim_dat)
```

Figure 1: A summary of the cvCovEst procedure's results. In the top left corner, the selected estimator's risk is plotted against its considered hyperparameters. In the top right, the eigenvalues of the selected estimator's estimate are displayed. The bottom left plot presents the estimated covariance matrix, with entries colored based on their absolute values. Finally, the table in the bottom right summarizes the performance of the best estimators from each class.

## Single-cell Transcriptomic Data

Single-cell transcriptome sequencing (scRNA-seq) measures the gene expression profiles of individual cells within a given population, permitting the identification of rare cell types and the study of developmental trajectories. The datasets resulting from these experiments are typically high-dimensional: expression data for hundreds or thousands of cells are collected for tens of thousands of genes. A critical step in most analytic workflows is therefore that of dimension reduction. In addition to facilitating visualization, this reduction is thought to have a denoising effect. That is, the effects of uninteresting biological variation are typically mitigated in these lower-dimensional embeddings.

A popular approach to the dimensionality reduction of scRNA-seq data applies uniform manifold approximation and projection (UMAP) [@mcinnes2018], which is capable of capturing non-linear relationships between features, to the dataset's leading principal components. Since these principal components (PCs) are derived from the sample covariance matrix, however, they are likely to be poor estimates of the true PCs when the number of genes exceeds the number of cells. Instead, the covariance matrix estimate selected by cvCovEst can be used to compute the initial dimensionality reduction.
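As a sketch of this workflow, the code below selects a covariance matrix estimator with cvCovEst, computes the 20 leading PCs from the selected estimate, and passes them to UMAP. It assumes a centered expression matrix `expr_dat` with cells in rows and genes in columns, that the object returned by `cvCovEst()` stores the selected estimate in an `estimate` element, and that the uwot package supplies the UMAP implementation; these details are illustrative rather than prescriptive.

```r
library(cvCovEst)
library(uwot)

# cross-validated covariance matrix estimator selection on the expression data
cv_fit <- cvCovEst(
  dat = expr_dat,
  estimators = c(
    linearShrinkLWEst, nlShrinkLWEst, thresholdingEst, sampleCovEst
  ),
  estimator_params = list(
    thresholdingEst = list(gamma = seq(0.1, 0.5, 0.1))
  ),
  cv_scheme = "v_fold",
  v_folds = 5
)

# eigendecomposition of the selected estimate (assumed stored in `estimate`)
# provides the loadings; projecting the centered data onto the 20 leading
# eigenvectors yields the cvCovEst-based PCs
eig <- eigen(cv_fit$estimate, symmetric = TRUE)
pcs <- expr_dat %*% eig$vectors[, 1:20]

# two-dimensional UMAP embedding computed from the cvCovEst-based PCs
embedding <- umap(pcs, n_neighbors = 30, min_dist = 0.3)
```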

Indeed, we find that the two-dimensional UMAP embedding resulting from the cvCovEst-based approach improves upon that of the standard PCA-based approach when applied to a dataset comprising the 1,000 most variable genes of 285 mouse visual cortex cells [@tasic2016]. Fewer rare cells are misclustered, engendering a 47% improvement in average silhouette width. For further discussion, see @boileau2021.

Figure 2: A comparison of UMAP embeddings using the 20 leading PCs from traditional PCA and from cvCovEst-based PCA as initializations.

# Availability

A stable release of the cvCovEst package is freely available via the Comprehensive R Archive Network (CRAN). Its development version can be found on GitHub. Documentation and examples are provided in each version's manual pages, vignette, and pkgdown [@pkgdown] website at https://philboileau.github.io/cvCovEst.
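For reference, installation follows the usual R conventions; the GitHub installation sketched below assumes the remotes package is available.

```r
# stable release from CRAN
install.packages("cvCovEst")

# development version from GitHub (assumes the remotes package is installed)
remotes::install_github("PhilBoileau/cvCovEst")
```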

# Acknowledgments

Philippe Boileau's contribution to this work was supported by the Fonds de recherche du Québec - Nature et technologies (B1X) and by the National Institute of Environmental Health Sciences [P42ES004705] Superfund Research Program at UC Berkeley.

We thank Jamarcus Liu for his contributions to the software package.

# References


