vruv4: Calibrated RUV4 where the control genes are used to estimate...


View source: R/ruv4.R

Description

This function will perform a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al., 2013), where the control genes are used not only to estimate the hidden confounders, but also to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004).

Usage

vruv4(
  Y,
  X,
  ctl,
  k = NULL,
  cov_of_interest = ncol(X),
  likelihood = c("t", "normal"),
  limmashrink = TRUE,
  degrees_freedom = NULL,
  include_intercept = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list(),
  adjust_bias = FALSE
)

Arguments

Y

A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.

X

A matrix of numerics. The covariates of interest.

ctl

A vector of logicals of length ncol(Y). If position i is TRUE then position i is considered a negative control.

k

A non-negative integer. The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the methods of Buja and Eyuboglu (1992). See the sketch at the end of this argument list for choosing k yourself.

cov_of_interest

A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS.

likelihood

Either "normal" or "t". If likelihood = "t", then the user may provide the degrees of freedom via degrees_freedom.

limmashrink

A logical. Should we apply hierarchical shrinkage to the variances (TRUE) or not (FALSE)? If degrees_freedom = NULL, limmashrink = TRUE, and likelihood = "t", then the degrees of freedom returned by limma will also be used.

degrees_freedom

If likelihood = "t", then this is the user-defined degrees of freedom for that distribution. If degrees_freedom is NULL, then the degrees of freedom will be the sample size minus the number of covariates minus k.

include_intercept

A logical. If TRUE, then it will check X to see if it has an intercept term. If not, then it will add an intercept term. If FALSE, then X will be unchanged.

gls

A logical. Should we use generalized least squares (TRUE) or ordinary least squares (FALSE) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.

fa_func

A factor analysis function. The function must have as inputs a numeric matrix Y and a rank (numeric scalar) r. It must output numeric matrices alpha and Z and a numeric vector sig_diag. alpha is the estimate of the coefficients of the unobserved confounders, so it must be an r by ncol(Y) matrix. Z must be an r by nrow(Y) matrix. sig_diag is the estimate of the column-wise variances so it must be of length ncol(Y). The default is the function pca_naive that just uses the first r singular vectors as the estimate of alpha. The estimated variances are just the column-wise mean square.

fa_args

A list. Additional arguments you want to pass to fa_func.

adjust_bias

A logical. Should we also use the control genes to adjust for bias (TRUE) or not (FALSE)?
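
If you would rather choose k yourself, the sketch below estimates it with sva's parallel analysis (method "be", i.e. Buja and Eyuboglu) and passes the result to vruv4. This is only an illustration: it assumes sva is installed and that Y, X, and ctl are as described above; num.sv expects features in rows, hence the transpose.

## Sketch: choosing k externally via sva's parallel analysis ("be" =
## Buja and Eyuboglu) and passing it to vruv4. num.sv() expects genes
## in rows, so Y is transposed.
num_sv <- sva::num.sv(dat = t(Y), mod = X, method = "be")
vout   <- vruv4(Y = Y, X = X, ctl = ctl, k = num_sv)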

Details

The model is

Y = XB + ZA + E,

where Y is a matrix of responses (e.g. log-transformed gene expression levels), X is a matrix of covariates, B is a matrix of coefficients, Z is a matrix of unobserved confounders, A is a matrix of unobserved coefficients of the unobserved confounders, and E is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of Y are the observations (e.g. individuals) and the columns of Y are the response variables (e.g. genes).

This model is fit using a two-step approach proposed in Gagnon-Bartsch et al (2013) and described in Wang et al (2015), modified to include estimating a variance inflation parameter. An additional modification is to use a t-likelihood in the second step of the procedure, improving robustness to model misspecification.
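
As a small illustration of that second step, the sketch below fits the model with a t-likelihood and a user-supplied degrees of freedom, with limma shrinkage turned off. It is only a sketch: it assumes the Y, X, ctl, and k objects constructed in the Examples section below.

## Sketch: t-likelihood with explicit degrees of freedom (here the
## default n - ncol(X) - k written out). Assumes Y, X, ctl, and k as in
## the Examples below.
tout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, cov_of_interest = 2,
              include_intercept = FALSE, likelihood = "t",
              limmashrink = FALSE,
              degrees_freedom = nrow(Y) - ncol(X) - k)
head(tout$pvalues)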

For instructions and examples on how to specify your own factor analysis, run the following code in R: utils::vignette("customFA", package = "vicar"). If it doesn't work, then you probably haven't built the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.
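
As a sketch of what such a function can look like, the wrapper below reuses the default factor analysis but floors the estimated variances. The names my_fa and var_floor are hypothetical, not part of vicar; the returned list keeps whatever structure pca_naive produces.

## Hypothetical wrapper around the default factor analysis: same
## interface as pca_naive, but with a floor on the estimated
## column-wise variances. my_fa and var_floor are illustrative names.
my_fa <- function(Y, r, var_floor = 1e-8) {
  out <- pca_naive(Y = Y, r = r)  # returns alpha, Z, and sig_diag
  out$sig_diag <- pmax(out$sig_diag, var_floor)
  out
}

## Passed to vruv4 via fa_func and fa_args, e.g.
## vruv4(Y = Y, X = X, ctl = ctl, k = k,
##       fa_func = my_fa, fa_args = list(var_floor = 1e-6))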

Value

A list whose elements are:

multiplier A numeric. The estimated variance inflation parameter.

betahat_ols A vector of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest, obtained without including the estimated confounding variables.

sebetahat_ols A vector of positive numerics. The pre-inflation standard errors of betahat (NOT of betahat_ols, despite the name).

betahat A matrix of numerics. The ordinary least squares estimates of the coefficients of the covariate of interest WHEN YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.

sebetahat A matrix of positive numerics. This is equal to sebetahat_ols * sqrt(multiplier) and gives the post-inflation standard errors of betahat (see the check sketch after this list).

tstats A vector of numerics. The t-statistics for testing the null hypothesis that the coefficient of the covariate of interest is zero. These are computed after estimating the variance inflation parameter.

pvalues A vector of numerics. The p-values from the test above.

alphahat A matrix of numerics. The estimates of the coefficients of the hidden confounders. Only identified up to a rotation on the rowspace.

Zhat A matrix of numerics. The estimates of the confounding variables. Only identified up to a rotation on the columnspace.

sigma2 A vector of positive numerics. The estimates of the variances PRIOR to inflation.

sigma2_adjusted A vector of positive numerics. The estimates of the variances AFTER inflation. This is equal to sigma2 * multiplier.

mult_matrix A matrix of numerics. Equal to solve(t(R22) %*% R22). One multiplies sigma2 or sigma2_adjusted by the diagonal elements of mult_matrix to get the standard errors of betahat.

degrees_freedom The degrees of freedom. If likelihood = "t", then this was the degrees of freedom used.

R22 A matrix of numerics. This is the submatrix of the R in the QR decomposition of X that corresponds to the covariates of interest. This is mostly returned for debugging purposes and may be removed in the future.

Z2 A matrix of numerics. This is the estimated confounders (after a rotation). Not useful on its own and mostly returned for debugging purposes. It may be removed in the future.
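
The sketch below checks the two identities stated above on a fitted object (ruvout as in the Examples section); it is illustrative only, and the comparisons should be TRUE, or very nearly so, when those identities hold for your settings.

## Sketch: checking the stated identities on a fit (ruvout as in the
## Examples below).
all.equal(c(ruvout$sebetahat),
          c(ruvout$sebetahat_ols) * sqrt(ruvout$multiplier))
all.equal(ruvout$sigma2_adjusted, ruvout$sigma2 * ruvout$multiplier)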

Author(s)

David Gerard

References

See Also

ruv3 for a version of RUV4 that can also be considered a version of RUV2.

RUV4 For the version of RUV4 in the ruv package.

cate For the version of RUV4 in the cate package.

Examples

library(vicar)

## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 201
k <- 2
q <- 3
is_null       <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl           <- rep(FALSE, length = p)
ctl[1:73]     <- TRUE

X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
     matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E

## Fit RUV4 assuming only second covariate is of interest -----------------
ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, include_intercept = FALSE,
                likelihood = "normal", cov_of_interest = 2)
graphics::plot(ruvout$pvalue, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
                 pch = 1)
## should look uniform
graphics::hist(ruvout$pvalue[is_null], xlab = "pvalues", main = NULL)

## Compare to linear model ------------------------------------------------
lmout <- coefficients(summary(stats::lm(Y ~ X)))
lmp <- sapply(lmout, function(x) { x[3, 4] })
graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
                 pch = 1)
## should look uniform
graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)

## Other ways to fit RUV4 from various packages ------------------------------
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
                          X.nuis = X[, -2, drop = FALSE], r = k, fa.method = "pc",
                          adj.method = "nc", nc = ctl, calibrate = FALSE,
                          nc.var.correction = FALSE)
vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal")
vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
                  include_intercept = FALSE, ctl = ctl,
                  limmashrink = FALSE, likelihood = "normal",
                  gls = FALSE)
ruv4ols  <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
                      Z = X[, -2, drop = FALSE], ctl = ctl, k = k)

ruv4p <- ruv4out$beta.p.value
vruv4p <- vruv4out$pvalues
ruv4olsp <- ruv4ols$p
vruv4olsp <- vruv4ols$pvalues

## These should be monotonically related
## They often aren't because CATE often returns p-values that are exactly 0.
graphics::plot(ruv4p, vruv4p)
cor(c(ruv4p), c(vruv4p), method = "kendall")

## These should be monotonically related
graphics::plot(ruv4olsp, vruv4olsp)
cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
