Adaptive Shrinkage of correlations using *CorShrink*

library("knitr")
opts_chunk$set(tidy=FALSE,tidy.opts=list(width.cutoff=30),dev="png",fig.show="hide",
               fig.width=4,fig.height=7,
               message=FALSE, warning = FALSE)

Introduction

Estimation of covariance or correlation matrices has widespread usage in a broad spectrum of statistical applications. The most commonly used estimator, namely the sample covariance or correlation matrix, is rank deficient and hence unstable in cases where the dimensionality of the problem (p) is greater than the number of samples (n). This problem has driven statisticians to suggest various alternative estimators in \textit{small n, large p} settings. Several estimators of correlation matrix have been proposed in such settings and their theoretical properties and performance comparisons have been studied comprehensively [@touloumis2015nonparametric, @ledoit2003improved, @bickel2008, @rothman2009]. Some of these methods are already available as R packages - corpcor [@schafer2004empirical], glasso [@friedman2008sparse], PDSCE [@rothman2009] etc.

These approaches, however, are not well suited for handling large scale missingness in data. Also, some of these methods work well under some specific sets of assumptions about the underlying matrix, for e.g. - thresholding estimators assume a banded structure of the correlation matrix. In this package, ee introduce a method CorShrink that adapts to varying degree of missingness in observations corresponding to each pair of features. Also, CorShrink can be applied directly to data consisting of missing values, as well as to derived quantities like vectors and matrices of correlations between features, and allows for two formulations - an asymptotic approach and a resampling based approach. Even in examples with no missing data, CorShrink estimated correlations are visibly closer to the true correlations compared to the standard methods. CorShrink also can be applied to other correlation-like quantities such as partial correlations, rank correlations and cosine similarity values from word2vec models.


CorShrink Installation

CorShrink is a companion package to ashr R package [@stephens2016false]. Before installing CorShrink, please make sure you have the latest version of ashr.

install.packages("ashr")

The other dependencies of this package include SQUAREM, reshape2 and Matrix. Next we install CorShrink.

install.packages("CorShrink")

The development version can be installed from Github as well.

library(devtools)
install_github("kkdey/CorShrink")

Then load the package with:

library(CorShrink)

Methods

The main steps in CorShrink are as follows

$$ Z_{ij} = 0.5 \log \left (\frac{1 + R_{ij}}{1 - R_{ij}} \right ) $$

$$ Z^{\star}{ij} : = ash \; (Z{ij}, s_{ij}) $$

The matrix format shrinkage is performed by the CorShrinkMatrix function while the vector format shrinkage is performed by the CorShrinkVector function.

$$ R^{\star}{ij} = \frac{exp \; (2 Z^{\star}{ij}) - 1}{exp \; (2 Z^{\star}_{ij}) + 1} $$


Illustration

We load an example data matrix - the person (544) by tissue samples (53) gene expression data for the gene ENSG00000166819 collected from the Genotype Tissue Expression (GTEx) Project .

data("sample_by_feature_data")

Just by checking the first few rows and columns, we see that the data contains many missing values. The data is

sample_by_feature_data[1:5,1:5]

Standard version CorShrink

CorShrinkData

We estimate the adaptively shrunk correlation matrix for this data using CorShrink.

out <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "both",
                     image.control = list(tl.cex = 0.8))

The function outputs a list with two elements which are two versions of CorShrink estimated matrices - cor and cor_before_PD.cor is the nearest positive definite approximation ($R^{{\star}{\star}}$) to cor_before_PD version ($R^{{\star}}$) as described in the methods above. When image = "both", the function plots the images for both these versions.

To see whether the method works well, check if the these two versions are close to each other.

out$cor_before_PD[1:5,1:5]
out$cor[1:5, 1:5]

CorShrinkMatrix

CorShrink takes as input not just the samples by features data matrix but also a matrix of pairwise correlations with a matrix of number of samples for each pair contributing to the correlation.

data("pairwise_corr_matrix")
data("common_samples")
out <- CorShrinkMatrix(pairwise_corr_matrix, common_samples, image = "both",
                       image.control = list(tl.cex = 0.8))

CorShrinkVector

CorShrink can be applied to vectors of correlations as well.

cor_vec <- c(-0.56, -0.4, 0.02, 0.2, 0.9, 0.8, 0.3, 0.1, 0.4)
nsamp_vec <- c(10, 20, 30, 4, 50, 60, 20, 10, 3)
out <- CorShrinkVector(corvec = cor_vec, nsamp_vec = nsamp_vec)
out

Note that the correlations computed from adequate amount of data as for the 5th and 6th entries above, the amount of shrinkage is minimal, while it is substantial for the 4th and 9th entries which correspond to small number of samples.

Re-sampling version CorShrink

We have so far looked at CorShrinkData, CorShrinkMatrix and CorShrinkVector, three functions that provide adaptive shrinkage of correlations at the level of the data matrix, matrix of correlations and vector of correlations respectively. In the above examples, we have used the asymptotic version of our algorithm (see Methods). Next we show example usage of a resampling based version of CorShrink.

CorShrinkData - resampling

out <- CorShrinkData(sample_by_feature_data, sd_boot = TRUE, image = "both",
                     image.control = list(tl.cex = 0.8))

The algorithm works by first computing a Bootstrap estimate of the standard error of the Fisher z-scores for each pair and then using this estimate together with the correlations to shrink the latter.

CorShrinkMatrix - resampling

The breakdown can be formulated at the level of a correlation matrix as follows.

zscoreSDmat <- bootcorSE_calc(sample_by_feature_data, verbose = FALSE)
out <- CorShrinkMatrix(pairwise_corr_matrix, zscore_sd = zscoreSDmat, image = "both",
                       image.control = list(tl.cex = 0.8))

pCorShrink

We can use the pCorShrinkData function to adaptively shrink partial correlations. This approach is analogous to GLASSO, CLIME and other sparse graphical model methods, but pCorShrinkData shrinks the edge weights in the graph to 0 adaptively.

As per current implementation pCorShrinkData does not handle missing observations in the data matrix. So, we demonstrate the use of this method on a fully observed simulated data.

Simulated setting

library(Matrix)
n <- 500
P <- 100
block <- 10
mat <- 0.3*diag(1,block) + 0.7*rep(1,block) %*% t(rep(1, block))
Sigma <-   Matrix::bdiag(mat, mat, mat, mat, mat, mat, mat, mat, mat, mat)
corSigma <- cov2cor(Sigma)
pcorSigma <- corpcor::cor2pcor(corSigma)  ##  true partial correlation matrix

Generate Data

##################  Generate data   ################

data <- MASS::mvrnorm(n,rep(0,P),Sigma)

pCorShrinkData

out1 <- pCorShrinkData(data, reg_type = "lm")

Visualization

library(corrplot)
col2 <- c("blue", "white", "red")
par(mfrow=c(1,2))
corrplot::corrplot(pcorSigma, diag = FALSE,
         col = colorRampPalette(col2)(200),
         tl.pos = "td", tl.cex = 0.2, tl.col = "black",
         rect.col = "white",na.label.col = "white", mar=c(2,2,2,2),
         method = "color", type = "upper", title = "original")
corrplot::corrplot(out1, diag = FALSE,
         col = colorRampPalette(col2)(200),
         tl.pos = "td", tl.cex = 0.2, tl.col = "black",
         rect.col = "white",na.label.col = "white", mar=c(2,2,2,2),
         method = "color", type = "upper", title = "pCorShrink")

Extras

So far, in all our examples, we assumed that the estimated correlations between any pair of variables is shrunk towards 0. But CorShrink allows the user to choose a non-zero shrinkage target, estimated from the data, using the mode option in ash.control input.

One can choose a fixed non-zero target in mode as well.

par(mfrow=c(1,2))
out1 <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "corshrink",             image.control = list(title = "CorShrink (target = 0)", tl.cex = 0.8))
out2 <- CorShrinkData(sample_by_feature_data, sd_boot = FALSE, image = "corshrink",                      ash.control = list(mode = "estimate"),
    image.control = list(title = "CorShrink (target = estimated)", tl.cex = 0.8))

The image = "output" option just outputs the image for the shrunk matrix without plotting it.

In general, CorShrink assumes a normal prior for the population Fisher z-scores. But under specific settings, a non-symmetric distribution , such as uniform or half-uniform could be a better fit. This can be achieved using the mixcompdist in ash.control.

par(mfrow=c(2,2))
out1 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink", 
                     ash.control = list(mixcompdist = "normal"),
                     image.control = list(tl.cex = 0.6))
out2 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink", 
                     ash.control = list(mixcompdist = "uniform"),
                     image.control = list(tl.cex = 0.6))
out3 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink", 
                     ash.control = list(mixcompdist = "halfuniform"),
                     image.control = list(tl.cex = 0.6))
out4 <- CorShrinkData(sample_by_feature_data,sd_boot = FALSE, image ="corshrink",  
                     ash.control = list(mixcompdist = "+uniform"),
                     image.control = list(tl.cex = 0.6))

Acknowledgements

We would like to thank the GTEx Consortium, John Blischak, Sarah Urbut, Chiaowen Joyce Hsiao, Peter Carbonetto and all members of the Stephens Lab.


Session Info

sessionInfo()

References



Try the CorShrink package in your browser

Any scripts or data that you put into this service are public.

CorShrink documentation built on July 13, 2018, 1 a.m.