Description Usage Arguments Details Value Author(s) References See Also Examples
This function will perform a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al, 2013) where the control genes are used not only to estimate the hidden confounders, but to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Y |
A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes. |
X |
A matrix of numerics. The covariates of interest. |
ctl |
A vector of logicals of length |
k |
A non-negative integer.The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the methods of Buja and Eyuboglu (1992). |
cov_of_interest |
A vector of positive integers. The column numbers of the covariates in X whose coefficients you are interested in. The rest are considered nuisance parameters and are regressed out by OLS. |
likelihood |
Either |
limmashrink |
A logical. Should we apply hierarchical
shrinkage to the variances ( |
degrees_freedom |
if |
include_intercept |
A logical. If |
gls |
A logical. Should we use generalized least squares
( |
fa_func |
A factor analysis function. The function must have
as inputs a numeric matrix |
fa_args |
A list. Additional arguments you want to pass to fa_func. |
adjust_bias |
A logical. Should we also use the control genes
to adjust for bias ( |
The model is
Y = XB + ZA + E,
where Y is a matrix of responses (e.g. log-transformed gene expression levels), X is a matrix of covariates, B is a matrix of coefficients, Z is a matrix of unobserved confounders, A is a matrix of unobserved coefficients of the unobserved confounders, and E is the noise matrix where the elements are independent Gaussian and each column shares a common variance. The rows of Y are the observations (e.g. individuals) and the columns of Y are the response variables (e.g. genes).
This model is fit using a two-step approach proposed in Gagnon-Bartsch et al (2013) and described in Wang et al (2015), modified to include estimating a variance inflation parameter. An additional modification is to use a t-likelihood in the second step of the procedure, improving robustness to model misspecification.
For instructions and examples on how to specify your own factor analysis, run the following code in R:
utils::vignette("customFA", package = "vicar")
. If it doesn't work, then you probably haven't built
the vignettes. To do so, see https://github.com/dcgerard/vicar#vignettes.
A list whose elements are:
multiplier
A numeric. The estimated variance inflation
parameter.
betahat_ols
A vector of numerics. The ordinary least
squares estimates of the coefficients of the covariate of
interest. This is when not including the estimated confounding
variables.
sebetahat_ols
A vector of positive numerics. The
pre-inflation standard errors of ruv$betahat
(NOT
ruv$betahat_ols
).
betahat
A matrix of numerics. The ordinary least squares
estimates of the coefficients of the covariate of interest WHEN
YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.
sebetahat
A matrix of positive numerics. This is equal
to sebethat_ols * sqrt(multiplier)
. This is the
post-inflation adjusted standard errors for ruv$betahat
.
tstats
A vector of numerics. The t-statistics for
testing against the null hypothesis of the coefficient of the
covariate of interest being zero. This is after estimating the
variance inflation parameter.
pvalues
A vector of numerics. The p-values of said test
above.
alphahat
A matrix of numerics. The estimates of the
coefficients of the hidden confounders. Only identified up to a
rotation on the rowspace.
Zhat
A matrix of numerics. The estimates of the
confounding variables. Only identified up to a rotation on the
columnspace.
sigma2
A vector of positive numerics. The estimates of
the variances PRIOR to inflation.
sigma2_adjusted
A vector of positive numerics. The
estimates of the variances AFTER to inflation. This is equal to
sigma2 * multiplier
.
mult_matrix
A matrix of numerics. Equal to
solve(t(R22) %*% R22)
. One multiplies sigma2
or
simga2_adjused
by the diagonal elements of
mult_matrix
to get the standard errors of
betahat
.
degrees_freedom
The degrees of freedom. If
likelihood = "t"
, then this was the degrees of freedom
used.
R22
A matrix of numerics numeric. This is the submatrix
of the R in the QR decomposition of X that corresponds to the
covariates of interest. This is mostly returned for debugging
purposes and may be removed in the future.
Z2
A matrix of numerics of length 1. This is the
estimated confounders (after a rotation). Not useful on it's
own and is mostly returned for debugging purposes. It may be
removed in the future.
David Gerard
Buja, A. and Eyuboglu, N., 1992. "Remarks on parallel analysis." Multivariate behavioral research, 27(4), pp.509-540. doi: 10.1207/s15327906mbr2704_2
Efron, B., 2004. "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis." Journal of the American Statistical Association, 99(465), pp.96-104. doi: 10.1198/016214504000000089
Gagnon-Bartsch, J., Laurent Jacob, and Terence P. Speed, 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics. University of California. https://statistics.berkeley.edu/tech-reports/820
Gerard, David, and Matthew Stephens. 2021. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." Statistica Sinica, 31(3), 1145-1166. doi: 10.5705/ss.202018.0345
Wang, Jingshu, Qingyuan Zhao, Trevor Hastie, and Art B. Owen. 2017. "Confounder adjustment in multiple hypothesis testing." The Annals of Statistics 45, no. 5: 1863-1894. doi: 10.1214/16-AOS1511
ruv3
for a version of RUV4 that can also be considered a version of RUV2.
RUV4
For the version of RUV4 in the ruv package.
cate
For the version of RUV4 in the cate package.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | library(vicar)
## Generate data and controls ---------------------------------------------
set.seed(1327)
n <- 13
p <- 201
k <- 2
q <- 3
is_null <- rep(FALSE, length = p)
is_null[1:101] <- TRUE
ctl <- rep(FALSE, length = p)
ctl[1:73] <- TRUE
X <- matrix(stats::rnorm(n * q), nrow = n)
B <- matrix(stats::rnorm(q * p), nrow = q)
B[2, is_null] <- 0
Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
matrix(rnorm(n * k), nrow = n)
A <- matrix(stats::rnorm(k * p), nrow = k)
E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
Y <- X %*% B + Z %*% A + E
## Fit RUV4 assuming only second covariate is of interest -----------------
ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, include_intercept = FALSE,
likelihood = "normal", cov_of_interest = 2)
graphics::plot(ruvout$pvalue, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
pch = 1)
## should look uniform
graphics::hist(ruvout$pvalue[is_null], xlab = "pvalues", main = NULL)
## Compare to linear model ------------------------------------------------
lmout <- coefficients(summary(stats::lm(Y ~ X)))
lmp <- sapply(lmout, function(x) { x[3, 4] })
graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
pch = 1)
## should look uniform
graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)
## Other ways to fit RUV4 from various packages ------------------------------
ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
X.nuis = X[, -2, drop = FALSE], r = k, fa.method = "pc",
adj.method = "nc", nc = ctl, calibrate = FALSE,
nc.var.correction = FALSE)
vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
include_intercept = FALSE, ctl = ctl,
limmashrink = FALSE, likelihood = "normal")
vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
include_intercept = FALSE, ctl = ctl,
limmashrink = FALSE, likelihood = "normal",
gls = FALSE)
ruv4ols <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
Z = X[, -2, drop = FALSE], ctl = ctl, k = k)
ruv4p <- ruv4out$beta.p.value
vruv4p <- vruv4out$pvalues
ruv4olsp <- ruv4ols$p
vruv4olsp <- vruv4ols$pvalues
## These should be monotonically related
## They often aren't because CATE often returns p-values that are exactly 0.
graphics::plot(ruv4p, vruv4p)
cor(c(ruv4p), c(vruv4p), method = "kendall")
## These should be monotonically related
graphics::plot(ruv4olsp, vruv4olsp)
cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.