vruv4: Calibrated RUV4 where the control genes are used to estimate...


View source: R/ruv4.R

Description

This function will perform a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al., 2013), where the control genes are used not only to estimate the hidden confounders, but also to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004).

Usage

vruv4(
  Y,
  X,
  ctl,
  k = NULL,
  cov_of_interest = ncol(X),
  likelihood = c("t", "normal"),
  limmashrink = TRUE,
  degrees_freedom = NULL,
  include_intercept = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list(),
  adjust_bias = FALSE
)

Arguments

Y

A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.

X

A matrix of numerics. The covariates of interest.

ctl

A vector of logicals of length ncol(Y). If position i is TRUE then position i is considered a negative control.

k

A non-negative integer. The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the methods of Buja and Eyuboglu (1992). See the sketch at the end of this argument list for choosing k yourself.

cov_of_interest

A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS.

likelihood

Either "normal" or "t". If likelihood = "t", then the user may provide the degrees of freedom via degrees_freedom.

limmashrink

A logical. Should we apply hierarchical shrinkage to the variances (TRUE) or not (FALSE)? If degrees_freedom = NULL, limmashrink = TRUE, and likelihood = "t", then the degrees of freedom returned by limma will also be used.

degrees_freedom

If likelihood = "t", then this is the user-defined degrees of freedom for that distribution. If degrees_freedom is NULL, then the degrees of freedom will be the sample size minus the number of covariates minus k.

include_intercept

A logical. If TRUE, then it will check X to see if it has an intercept term. If not, then it will add an intercept term. If FALSE, then X will be unchanged.

gls

A logical. Should we use generalized least squares (TRUE) or ordinary least squares (FALSE) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.

fa_func

A factor analysis function. The function must have as inputs a numeric matrix Y and a rank (numeric scalar) r. It must output numeric matrices alpha and Z and a numeric vector sig_diag. alpha is the estimate of the coefficients of the unobserved confounders, so it must be an r by ncol(Y) matrix. Z must be an r by nrow(Y) matrix. sig_diag is the estimate of the column-wise variances so it must be of length ncol(Y). The default is the function pca_naive that just uses the first r singular vectors as the estimate of alpha. The estimated variances are just the column-wise mean square.

fa_args

A list. Additional arguments you want to pass to fa_func.

adjust_bias

A logical. Should we also use the control genes to adjust for bias (TRUE) or not (FALSE)?
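
If you would rather choose k yourself, the sketch below estimates it with sva's parallel analysis (method "be", i.e. Buja and Eyuboglu) and passes the result to vruv4. This is only an illustration: it assumes sva is installed and that Y, X, and ctl are as described above; num.sv expects features in rows, hence the transpose.

## Sketch: choosing k externally via sva's parallel analysis ("be" =
## Buja and Eyuboglu) and passing it to vruv4. num.sv() expects genes
## in rows, so Y is transposed.
num_sv <- sva::num.sv(dat = t(Y), mod = X, method = "be")
vout   <- vruv4(Y = Y, X = X, ctl = ctl, k = num_sv)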

Details

The model is

Y = XB + ZA + E,

where Y is a matrix of responses (e.g. log-transformed gene expression levels), X is a matrix of covariates, B is a matrix of coefficients, Z is a matrix of unobserved confounders, A is a matrix of unobserved coefficients of the unobserved confounders, and E is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of Y are the observations (e.g. individuals) and the columns of Y are the response variables (e.g. genes).

This model is fit using a two-step approach proposed in Gagnon-Bartsch et al (2013) and described in Wang et al (2015), modified to include estimating a variance inflation parameter. An additional modification is to use a t-likelihood in the second step of the procedure, improving robustness to model misspecification.
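
As a small illustration of that second step, the sketch below fits the model with a t-likelihood and a user-supplied degrees of freedom, with limma shrinkage turned off. It is only a sketch: it assumes the Y, X, ctl, and k objects constructed in the Examples section below.

## Sketch: t-likelihood with explicit degrees of freedom (here the
## default n - ncol(X) - k written out). Assumes Y, X, ctl, and k as in
## the Examples below.
tout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, cov_of_interest = 2,
              include_intercept = FALSE, likelihood = "t",
              limmashrink = FALSE,
              degrees_freedom = nrow(Y) - ncol(X) - k)
head(tout$pvalues)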

For instructions and examples on how to specify your own factor analysis, run the following code in R: utils::vignette("customFA", package = "vicar"). If it doesn't work, then you probably haven't built the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.
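
As a sketch of what such a function can look like, the wrapper below reuses the default factor analysis but floors the estimated variances. The names my_fa and var_floor are hypothetical, not part of vicar; the returned list keeps whatever structure pca_naive produces.

## Hypothetical wrapper around the default factor analysis: same
## interface as pca_naive, but with a floor on the estimated
## column-wise variances. my_fa and var_floor are illustrative names.
my_fa <- function(Y, r, var_floor = 1e-8) {
  out <- pca_naive(Y = Y, r = r)  # returns alpha, Z, and sig_diag
  out$sig_diag <- pmax(out$sig_diag, var_floor)
  out
}

## Passed to vruv4 via fa_func and fa_args, e.g.
## vruv4(Y = Y, X = X, ctl = ctl, k = k,
##       fa_func = my_fa, fa_args = list(var_floor = 1e-6))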

Value

A list whose elements are:

multiplier A numeric. The estimated variance inflation parameter.

betahat_ols A vector of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest, obtained without including the estimated confounding variables.

sebetahat_ols A vector of positive numerics. The pre-inflation standard errors of betahat (NOT of betahat_ols, despite the name).

betahat A matrix of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest WHEN YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.

sebetahat A matrix of positive numerics. This is equal to sebetahat_ols * sqrt(multiplier) and gives the post-inflation standard errors of betahat (see the check sketch after this list).

tstats A vector of numerics. The t-statistics for testing the null hypothesis that the coefficient of the covariate of interest is zero. These are computed after estimating the variance inflation parameter.

pvalues A vector of numerics. The p-values from the test above.

alphahat A matrix of numerics. The estimates of the coefficients of the hidden confounders. Only identified up to a rotation on the rowspace.

Zhat A matrix of numerics. The estimates of the confounding variables. Only identified up to a rotation on the columnspace.

sigma2 A vector of positive numerics. The estimates of the variances PRIOR to inflation.

sigma2_adjusted A vector of positive numerics. The estimates of the variances AFTER inflation. This is equal to sigma2 * multiplier.

mult_matrix A matrix of numerics. Equal to solve(t(R22) %*% R22). One multiplies sigma2 or sigma2_adjusted by the diagonal elements of mult_matrix to get the standard errors of betahat.

degrees_freedom The degrees of freedom. If likelihood = "t", then this was the degrees of freedom used.

R22 A matrix of numerics. This is the submatrix of the R in the QR decomposition of X that corresponds to the covariates of interest. This is mostly returned for debugging purposes and may be removed in the future.

Z2 A matrix of numerics. This is the estimated confounders (after a rotation). Not useful on its own and mostly returned for debugging purposes. It may be removed in the future.
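
The sketch below checks the two identities stated above on a fitted object (ruvout as in the Examples section); it is illustrative only, and the comparisons should be TRUE, or very nearly so, when those identities hold for your settings.

## Sketch: checking the stated identities on a fit (ruvout as in the
## Examples below).
all.equal(c(ruvout$sebetahat),
          c(ruvout$sebetahat_ols) * sqrt(ruvout$multiplier))
all.equal(ruvout$sigma2_adjusted, ruvout$sigma2 * ruvout$multiplier)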

Author(s)

David Gerard

References

See Also

ruv3 for a version of RUV4 that can also be considered a version of RUV2.

RUV4 For the version of RUV4 in the ruv package.

cate For the version of RUV4 in the cate package.

Examples

library(vicar)

## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 201
k <- 2
q <- 3
is_null       <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl           <- rep(FALSE, length = p)
ctl[1:73]     <- TRUE

X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
     matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E

## Fit RUV4 assuming only second covariate is of interest -----------------
ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, include_intercept = FALSE,
                likelihood = "normal", cov_of_interest = 2)
graphics::plot(ruvout$pvalue, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
                 pch = 1)
## should look uniform
graphics::hist(ruvout$pvalue[is_null], xlab = "pvalues", main = NULL)

## Compare to linear model ------------------------------------------------
lmout <- coefficients(summary(stats::lm(Y ~ X)))
lmp <- sapply(lmout, function(x) { x[3, 4] })
graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
                 pch = 1)
## should look uniform
graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)

## Other ways to fit RUV4 from various packages ------------------------------
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
                          X.nuis = X[, -2, drop = FALSE], r = k, fa.method = "pc",
                          adj.method = "nc", nc = ctl, calibrate = FALSE,
                          nc.var.correction = FALSE)
vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal")
vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal",
                  gls = FALSE)
ruv4ols  <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
                      Z = X[, -2, drop = FALSE], ctl = ctl, k = k)

ruv4p <- ruv4out$beta.p.value
vruv4p <- vruv4out$pvalues
ruv4olsp <- ruv4ols$p
vruv4olsp <- vruv4ols$pvalues

## These should be monotonically related
## They often aren't because CATE often returns p-values that are exactly 0.
graphics::plot(ruv4p, vruv4p)
cor(c(ruv4p), c(vruv4p), method = "kendall")

## These should be monotonically related
graphics::plot(ruv4olsp, vruv4olsp)
cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
