#' Calibrated RUV4 where the control genes are used to estimate hidden
#' confounders and a variance inflation parameter.
#'
#' This function will perform a variant of Removing Unwanted Variation
#' 4-step (RUV4) (Gagnon-Bartsch et al, 2013) where the control genes
#' are used not only to estimate the hidden confounders, but to
#' estimate a variance inflation parameter. This variance inflation
#' step is akin to the "empirical null" approach of Efron (2004).
#'
#' The model is \deqn{Y = XB + ZA + E,} where \eqn{Y} is a matrix of
#' responses (e.g. log-transformed gene expression levels), \eqn{X} is
#' a matrix of covariates, \eqn{B} is a matrix of coefficients,
#' \eqn{Z} is a matrix of unobserved confounders, \eqn{A} is a matrix
#' of unobserved coefficients of the unobserved confounders, and
#' \eqn{E} is the noise matrix where the elements are independent
#' Gaussian and each column shares a common variance. The rows of
#' \eqn{Y} are the observations (e.g. individuals) and the columns of
#' \eqn{Y} are the response variables (e.g. genes).
#'
#' This model is fit using a two-step approach proposed in
#' Gagnon-Bartsch et al (2013) and described in Wang et al (2017),
#' modified to include estimating a variance inflation parameter. An
#' additional modification is to use a t-likelihood in the second step
#' of the procedure, improving robustness to model misspecification.
#'
#' For instructions and examples on how to specify your own factor analysis, run the following code in R:
#' \code{utils::vignette("customFA", package = "vicar")}. If it doesn't work, then you probably haven't built
#' the vignettes. To do so, see \url{https://github.com/dcgerard/vicar#vignettes}.
#'
#' @param Y A matrix of numerics. These are the response variables
#' where each column has its own variance. In a gene expression
#' study, the rows are the individuals and the columns are the
#' genes.
#' @param X A matrix of numerics. The covariates of interest.
#' @param k A non-negative integer. The number of unobserved
#' confounders. If not specified and the R package sva is
#' installed, then this function will estimate the number of
#' hidden confounders using the methods of Buja and Eyuboglu
#' (1992).
#' @param cov_of_interest A vector of positive integers. The column
#' numbers of the covariates in X whose coefficients you are
#' interested in. The rest are considered nuisance parameters and
#' are regressed out by OLS.
#' @param ctl A vector of logicals of length \code{ncol(Y)}. If
#' position i is \code{TRUE} then position i is considered a
#' negative control.
#' @param include_intercept A logical. If \code{TRUE}, then it will
#' check \code{X} to see if it has an intercept term. If not, then
#' it will add an intercept term. If \code{FALSE}, then \code{X}
#' will be unchanged.
#' @param gls A logical. Should we use generalized least squares
#' (\code{TRUE}) or ordinary least squares (\code{FALSE}) for
#' estimating the confounders? The OLS version is equivalent to
#' using RUV4 to estimate the confounders.
#' @param likelihood Either \code{"normal"} or \code{"t"}. If
#' \code{likelihood = "t"}, then the user may provide the degrees
#' of freedom via \code{degrees_freedom}.
#' @param degrees_freedom If \code{likelihood = "t"}, then this is the
#' user-defined degrees of freedom for that distribution. If
#' \code{degrees_freedom} is \code{NULL} then the degrees of
#' freedom will be the sample size minus the number of covariates
#' minus \code{k}.
#' @param limmashrink A logical. Should we apply hierarchical
#' shrinkage to the variances (\code{TRUE}) or not (\code{FALSE})?
#' If \code{degrees_freedom = NULL} and \code{limmashrink = TRUE}
#' and \code{likelihood = "t"}, then we'll also use the limma
#' returned degrees of freedom.
#' @param fa_func A factor analysis function. The function must have
#' as inputs a numeric matrix \code{Y} and a rank (numeric scalar)
#' \code{r}. It must output numeric matrices \code{alpha} and
#' \code{Z} and a numeric vector \code{sig_diag}. \code{alpha} is
#' the estimate of the coefficients of the unobserved confounders,
#' so it must be an \code{ncol(Y)} by \code{r} matrix. \code{Z}
#' must be an \code{nrow(Y)} by \code{r} matrix. \code{sig_diag}
#' is the estimate of the column-wise variances so it must be of
#' length \code{ncol(Y)}. The default is the function
#' \code{pca_naive} that just uses the first \code{r} singular
#' vectors as the estimate of \code{alpha}. The estimated
#' variances are just the column-wise mean squares of the
#' residuals. A sketch of a conforming custom function appears
#' after \code{pca_naive} below.
#' @param fa_args A list. Additional arguments you want to pass to
#' fa_func.
#' @param adjust_bias A logical. Should we also use the control genes
#' to adjust for bias (\code{TRUE}) or not (\code{FALSE})?
#'
#'
#' @return A list whose elements are:
#'
#' \code{multiplier} A numeric. The estimated variance inflation
#' parameter.
#'
#' \code{betahat_ols} A vector of numerics. The ordinary least
#' squares estimates of the coefficients of the covariate of
#' interest. This is when not including the estimated confounding
#' variables.
#'
#' \code{sebetahat_ols} A vector of positive numerics. The
#' pre-inflation standard errors of \code{ruv$betahat} (NOT
#' \code{ruv$betahat_ols}).
#'
#' \code{betahat} A matrix of numerics. The ordinary least squares
#' estimates of the coefficients of the covariate of interest WHEN
#' YOU ALSO INCLUDE THE ESTIMATES OF THE UNOBSERVED CONFOUNDERS.
#'
#' \code{sebetahat} A matrix of positive numerics. This is equal
#' to \code{sebetahat_ols * sqrt(multiplier)} (verified in a
#' sketch after the function definition below). These are the
#' post-inflation adjusted standard errors for \code{ruv$betahat}.
#'
#' \code{tstats} A vector of numerics. The t-statistics for
#' testing against the null hypothesis of the coefficient of the
#' covariate of interest being zero. This is after estimating the
#' variance inflation parameter.
#'
#' \code{pvalues} A vector of numerics. The p-values for the test
#' described above.
#'
#' \code{alphahat} A matrix of numerics. The estimates of the
#' coefficients of the hidden confounders. Only identified up to a
#' rotation on the rowspace.
#'
#' \code{Zhat} A matrix of numerics. The estimates of the
#' confounding variables. Only identified up to a rotation on the
#' columnspace.
#'
#' \code{sigma2} A vector of positive numerics. The estimates of
#' the variances PRIOR to inflation.
#'
#' \code{sigma2_adjusted} A vector of positive numerics. The
#' estimates of the variances AFTER inflation. This is equal to
#' \code{sigma2 * multiplier}.
#'
#' \code{mult_matrix} A matrix of numerics. The submatrix of
#' \code{solve(t(cbind(X, Zhat)) \%*\% cbind(X, Zhat))}
#' corresponding to the covariates of interest. One multiplies
#' \code{sigma2} or \code{sigma2_adjusted} by the diagonal
#' elements of \code{mult_matrix} to get the squared standard
#' errors of \code{betahat}.
#'
#' \code{degrees_freedom} The degrees of freedom. If
#' \code{likelihood = "t"}, then this was the degrees of freedom
#' used.
#'
#' \code{R22} A matrix of numerics. This is the submatrix
#' of the R in the QR decomposition of X that corresponds to the
#' covariates of interest. This is mostly returned for debugging
#' purposes and may be removed in the future.
#'
#' \code{Z2} A matrix of numerics. The estimated confounders
#' (after a rotation) corresponding to the covariates of
#' interest. Not useful on its own and is mostly returned for
#' debugging purposes. It may be removed in the future.
#'
#' \code{additivor} A matrix of numerics. The estimated additive
#' bias term (the coefficient on the appended column of
#' ones). Only returned if \code{adjust_bias = TRUE}.
#'
#' @export
#'
#' @seealso \code{\link{ruv3}} for a version of RUV4 that can also be considered a version of RUV2.
#'
#' \code{\link[ruv]{RUV4}} For the version of RUV4 in the ruv package.
#'
#' \code{\link[cate]{cate}} For the version of RUV4 in the cate package.
#'
#' @author David Gerard
#'
#' @references
#' \itemize{
#' \item{Buja, A. and Eyuboglu, N., 1992. "Remarks on parallel analysis." \emph{Multivariate behavioral research}, 27(4), pp.509-540. \doi{10.1207/s15327906mbr2704_2}}
#' \item{Efron, B., 2004. "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis." \emph{Journal of the American Statistical Association}, 99(465), pp.96-104. \doi{10.1198/016214504000000089}}
#' \item{Gagnon-Bartsch, J.A., Jacob, L. and Speed, T.P., 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics, University of California. \url{https://statistics.berkeley.edu/tech-reports/820}}
#' \item{Gerard, D. and Stephens, M., 2021. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls." \emph{Statistica Sinica}, 31(3), pp.1145-1166. \doi{10.5705/ss.202018.0345}}
#' \item{Wang, J., Zhao, Q., Hastie, T. and Owen, A.B., 2017. "Confounder adjustment in multiple hypothesis testing." \emph{The Annals of Statistics}, 45(5), pp.1863-1894. \doi{10.1214/16-AOS1511}}
#' }
#'
#' @examples
#' library(vicar)
#'
#' ## Generate data and controls ---------------------------------------------
#' set.seed(1327)
#' n <- 13
#' p <- 201
#' k <- 2
#' q <- 3
#' is_null <- rep(FALSE, length = p)
#' is_null[1:101] <- TRUE
#' ctl <- rep(FALSE, length = p)
#' ctl[1:73] <- TRUE
#'
#' X <- matrix(stats::rnorm(n * q), nrow = n)
#' B <- matrix(stats::rnorm(q * p), nrow = q)
#' B[2, is_null] <- 0
#' Z <- X %*% matrix(stats::rnorm(q * k), nrow = q) +
#' matrix(stats::rnorm(n * k), nrow = n)
#' A <- matrix(stats::rnorm(k * p), nrow = k)
#' E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
#' Y <- X %*% B + Z %*% A + E
#'
#' ## Fit RUV4 assuming only second covariate is of interest -----------------
#' ruvout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, include_intercept = FALSE,
#' likelihood = "normal", cov_of_interest = 2)
#' graphics::plot(ruvout$pvalues, col = is_null + 3, ylab = "pvalues")
#' graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
#' pch = 1)
#' ## should look uniform
#' graphics::hist(ruvout$pvalues[is_null], xlab = "pvalues", main = NULL)
#'
#' ## Compare to linear model ------------------------------------------------
#' lmout <- summary(stats::lm(Y ~ X))
#' lmp <- sapply(lmout, function(x) { coefficients(x)[3, 4] })
#' graphics::plot(lmp, col = is_null + 3, ylab = "pvalues")
#' graphics::legend("topleft", col = 3:4, legend = c("non-null", "null"),
#' pch = 1)
#' ## should look uniform
#' graphics::hist(lmp[is_null], xlab = "pvalues", main = NULL)
#'
#' ## Other ways to fit RUV4 from various packages ------------------------------
#' ruv4out <- cate::cate.fit(Y = Y, X.primary = X[, 2, drop = FALSE],
#' X.nuis = X[, -2, drop = FALSE], r = k, fa.method = "pc",
#' adj.method = "nc", nc = ctl, calibrate = FALSE,
#' nc.var.correction = FALSE)
#' vruv4out <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
#' include_intercept = FALSE, ctl = ctl,
#' limmashrink = FALSE, likelihood = "normal")
#' vruv4ols <- vruv4(Y = Y, X = X, k = k, cov_of_interest = 2,
#' include_intercept = FALSE, ctl = ctl,
#' limmashrink = FALSE, likelihood = "normal",
#' gls = FALSE)
#' ruv4ols <- ruv::RUV4(Y = Y, X = X[, 2, drop = FALSE],
#' Z = X[, -2, drop = FALSE], ctl = ctl, k = k)
#'
#' ruv4p <- ruv4out$beta.p.value
#' vruv4p <- vruv4out$pvalues
#' ruv4olsp <- ruv4ols$p
#' vruv4olsp <- vruv4ols$pvalues
#'
#' ## These should be monotonically related
#' ## They often aren't because CATE often returns p-values that are exactly 0.
#' graphics::plot(ruv4p, vruv4p)
#' cor(c(ruv4p), c(vruv4p), method = "kendall")
#'
#' ## These should be monotonically related
#' graphics::plot(ruv4olsp, vruv4olsp)
#' cor(c(ruv4olsp), c(vruv4olsp), method = "kendall")
#'
vruv4 <- function(Y, X, ctl, k = NULL,
cov_of_interest = ncol(X),
likelihood = c("t", "normal"),
limmashrink = TRUE, degrees_freedom = NULL,
include_intercept = TRUE, gls = TRUE,
fa_func = pca_naive, fa_args = list(),
adjust_bias = FALSE) {
assertthat::assert_that(is.matrix(Y))
assertthat::assert_that(is.matrix(X))
assertthat::are_equal(nrow(Y), nrow(X))
assertthat::are_equal(ncol(Y), length(ctl))
assertthat::assert_that(is.logical(ctl))
assertthat::assert_that(all(abs(cov_of_interest - round(cov_of_interest)) < 10 ^ -14))
assertthat::assert_that(all(cov_of_interest >= 1 & cov_of_interest <= ncol(X)))
assertthat::assert_that(is.logical(gls))
assertthat::assert_that(is.logical(include_intercept))
assertthat::assert_that(is.logical(limmashrink))
assertthat::assert_that(is.list(fa_args))
assertthat::assert_that(is.null(fa_args$Y))
assertthat::assert_that(is.null(fa_args$r))
assertthat::assert_that(is.function(fa_func))
likelihood <- match.arg(likelihood)
## RUN THE ROTATED MODEL HERE -------------------------------------------
rotate_out <- rotate_model(Y = Y, X = X, k = k,
cov_of_interest = cov_of_interest,
include_intercept = include_intercept,
limmashrink = limmashrink, fa_func = fa_func,
fa_args = fa_args, do_factor = TRUE)
betahat_ols <- rotate_out$betahat_ols
k <- rotate_out$k
## Deal with degrees of freedom
if (likelihood == "normal") {
if (!is.null(degrees_freedom)) {
message("likelihood = \"normal\" but degrees_freedom not NULL. Setting degrees_freedom to Inf")
}
degrees_freedom <- Inf
}
if (!is.null(rotate_out$prior_df) & is.null(degrees_freedom) & likelihood == "t") {
degrees_freedom <- rotate_out$prior_df + nrow(X) - ncol(X) - k
if (degrees_freedom == Inf) {
message("limma estimated df = Inf . Changing likelihood to \"normal\".")
likelihood <- "normal"
}
} else if (is.null(degrees_freedom) & likelihood == "t") {
degrees_freedom <- nrow(X) - ncol(X) - k
}
assertthat::assert_that(length(degrees_freedom) == 1 | length(degrees_freedom) == ncol(Y))
assertthat::assert_that(all(degrees_freedom > 0))
## RUN RUV4 HERE ----------------------------------------------------------
Y2 <- rotate_out$Y2
alpha <- rotate_out$alpha
if (adjust_bias) {
alpha <- cbind(rep(1, nrow(alpha)), alpha)
}
sig_diag <- rotate_out$sig_diag
R22 <- rotate_out$R22
ruv4_out <- cruv4_multicov(Y2 = t(Y2), alpha = alpha,
sig_diag = sig_diag, ctl = ctl,
R22 = R22,
degrees_freedom = degrees_freedom,
gls = gls, likelihood = likelihood)
## Estimate rest of the hidden confounders.
Y1 <- rotate_out$Y1
if (adjust_bias) {
Z2 <- ruv4_out$Z2[-1, , drop = FALSE]
additivor <- ruv4_out$Z2[1, , drop = FALSE]
alpha <- alpha[, -1, drop = FALSE]
} else {
Z2 <- ruv4_out$Z2
}
Z3 <- rotate_out$Z3
if (!is.null(Y1)) {
R12 <- rotate_out$R12
R11 <- rotate_out$R11
Q <- rotate_out$Q
beta1_ols <- solve(R11) %*% (Y1 - R12 %*% t(ruv4_out$betahat_ols))
resid_top <- Y1 - R12 %*% t(ruv4_out$betahat) - R11 %*% beta1_ols
if (gls) {
Z1 <- solve(t(alpha) %*% diag(1 / sig_diag) %*% alpha) %*%
t(alpha) %*% diag(1 / sig_diag) %*% t(resid_top)
} else {
Z1 <- solve(t(alpha) %*% alpha) %*% t(alpha) %*% t(resid_top)
}
Zhat <- Q %*% rbind(t(Z1), t(Z2), Z3)
} else {
Q <- rotate_out$Q
Zhat <- Q %*% rbind(t(Z2), Z3)
}
## sebetahats --- used a new mult-matrix. Hopefully this works better.
## mult_matrix_old <- solve(t(R22) %*% R22)
XZ <- cbind(X, Zhat)
mult_matrix <- solve(t(XZ) %*% XZ)[cov_of_interest, cov_of_interest, drop = FALSE]
sebetahat <- sqrt(outer(ruv4_out$sigma2_adjusted, diag(mult_matrix), FUN = "*"))
sebetahat_ols <- sqrt(outer(sig_diag, diag(mult_matrix), FUN = "*"))
tstats <- ruv4_out$betahat / sebetahat
pvalues <- 2 * (stats::pt(q = -abs(tstats), df = degrees_freedom))
## output list
ruv4_out$Zhat <- Zhat
ruv4_out$sebetahat <- sebetahat
ruv4_out$sebetahat_ols <- sebetahat_ols
ruv4_out$tstats <- tstats
ruv4_out$pvalues <- pvalues
ruv4_out$mult_matrix <- mult_matrix
ruv4_out$degrees_freedom <- degrees_freedom
if (adjust_bias) {
ruv4_out$additivor <- additivor
}
return(ruv4_out)
}
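## A minimal sketch (not part of the package API) verifying the
## relationship documented in the return values above: the
## post-inflation standard errors equal the pre-inflation ones scaled
## by sqrt(multiplier). The simulation scheme mirrors the examples;
## the helper name demo_vruv4_inflation is hypothetical.
demo_vruv4_inflation <- function() {
  set.seed(9)
  n <- 13
  p <- 101
  q <- 3
  k <- 2
  ctl <- rep(FALSE, p)
  ctl[1:31] <- TRUE
  X <- matrix(stats::rnorm(n * q), nrow = n)
  B <- matrix(stats::rnorm(q * p), nrow = q)
  B[, ctl] <- 0 ## negative controls are unassociated with X
  Z <- matrix(stats::rnorm(n * k), nrow = n)
  A <- matrix(stats::rnorm(k * p), nrow = k)
  E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
  Y <- X %*% B + Z %*% A + E
  vout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, cov_of_interest = 2,
                include_intercept = FALSE, limmashrink = FALSE,
                likelihood = "normal")
  stopifnot(max(abs(vout$sebetahat -
                    vout$sebetahat_ols * sqrt(vout$multiplier))) < 1e-12)
  vout$multiplier
}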
#' RUV4's second step.
#'
#' @param Y2 A matrix of numerics with p rows and q columns, where q
#' is the number of covariates of interest and p is the number of
#' genes.
#' @param alpha A matrix of numerics of dimension p by k, where k is
#' the number of hidden confounders. The estimated coefficients of
#' the unobserved confounders.
#' @param sig_diag A vector of positive numerics. The estimates of the
#' variances.
#' @param degrees_freedom A positive numeric. The degrees of freedom
#' of the t-likelihood if using it.
#' @param R22 A matrix of numerics. This is the submatrix
#' of the R in the QR decomposition of X that corresponds to the
#' covariates of interest. May be removed in the future.
#' @inheritParams vruv4
#'
#' @author David Gerard
#'
cruv4_multicov <- function(Y2, alpha, sig_diag, ctl, R22, degrees_freedom,
gls = TRUE, likelihood = "t") {
assertthat::assert_that(is.matrix(Y2))
assertthat::assert_that(is.matrix(alpha))
assertthat::assert_that(is.vector(sig_diag))
assertthat::assert_that(is.matrix(R22))
assertthat::assert_that(all(sig_diag > 0))
assertthat::assert_that(is.logical(ctl))
assertthat::assert_that(length(degrees_freedom) == 1 | length(degrees_freedom) == ncol(Y2))
assertthat::assert_that(all(degrees_freedom > 0))
assertthat::are_equal(nrow(Y2), nrow(alpha))
k <- ncol(alpha)
## Use control genes to jointly estimate Z2 and variance scaling parameter.
Yc <- Y2[ctl, , drop = FALSE]
if (k != 0) {
alphac <- alpha[ctl, , drop = FALSE]
sig_diag_inv <- 1 / sig_diag[ctl]
if (gls) {
Z2 <- crossprod(solve(crossprod(alphac, sig_diag_inv * alphac)),
crossprod(alphac, sig_diag_inv * Yc))
} else {
Z2 <- crossprod(solve(crossprod(alphac, alphac)),
crossprod(alphac, Yc))
}
resid_mat <- Yc - alphac %*% Z2
betahat <- (Y2 - alpha %*% Z2) %*% solve(t(R22))
} else {
Z2 <- NULL
resid_mat <- Yc
betahat <- Y2 %*% solve(t(R22))
}
## Gaussian MLE of variance inflation parameter.
assertthat::are_equal(nrow(resid_mat), sum(ctl))
multiplier <- mean(resid_mat ^ 2 / sig_diag[ctl])
## multiplier should be the exact same as below
## sum(diag(t(resid_mat) %*% diag(1 / sig_diag[ctl]) %*% resid_mat)) / prod(dim(resid_mat))
## T-likelihood regression and estimate variance inflation
## parameter using control genes.
if (likelihood == "t") {
if (k != 0) {
if (!gls) {
message("gls = FALSE not supported for t-likelihood, using gls = TRUE.")
}
alphac <- alpha[ctl, , drop = FALSE]
tout <- tregress_em(Y = Yc, alpha = alphac,
sig_diag = sig_diag[ctl],
nu = degrees_freedom, lambda_init = multiplier,
Z_init = Z2)
Z2 <- tout$Z
betahat <- (Y2 - alpha %*% Z2) %*% solve(t(R22))
multiplier <- tout$lambda
} else if (k == 0) {
tout <- stats::optim(par = multiplier, fn = tregress_obj_wrapper,
Z = matrix(0, nrow = 1, ncol = ncol(Yc)), Y = Yc,
alpha = matrix(0, nrow = nrow(Yc), ncol = 1),
sig_diag = sig_diag[ctl],
nu = degrees_freedom, method = "Brent",
lower = 0, upper = 10,
control = list(fnscale = -1))
multiplier <- tout$par
}
}
## Output values
ruv4_out <- list()
sig_diag_adjusted <- sig_diag * multiplier
betahat_ols <- Y2 %*% solve(t(R22))
ruv4_out$sigma2 <- sig_diag
ruv4_out$multiplier <- multiplier
ruv4_out$sigma2_adjusted <- sig_diag_adjusted
ruv4_out$betahat_ols <- betahat_ols
ruv4_out$betahat <- betahat
ruv4_out$Z2 <- Z2
ruv4_out$alphahat <- alpha
ruv4_out$R22 <- R22
return(ruv4_out)
}
#' QR rotation to independent models.
#'
#' @inheritParams vruv4
#' @param do_factor A logical. Should we do the factor analysis or just rotation?
#'
#' @return A list that contains some of the following elements.
#' \itemize{
#' \item{\code{betahat_ols}: }{The OLS estimates of the coefficients of interest.}
#' \item{\code{Q}: }{The Q from the QR decomposition of \code{X}}
#' \item{\code{R11}: }{The top left element of the R matrix from the QR decomposition of \code{X}}
#' \item{\code{R12}: }{The top right element of the R matrix from the QR decomposition of \code{X}}
#' \item{\code{R22}: }{The bottom right element of the R matrix from the QR decomposition of \code{X}}
#' \item{\code{Y1}: }{The rows of the rotated Y matrix that correspond to the uninteresting covariates in \code{X}}
#' \item{\code{Y2}: }{The rows of the rotated Y matrix that correspond to the interesting covariates in \code{X}}
#' \item{\code{Y3}: }{The rows of the rotated Y matrix that correspond to the nullspace of \code{X}}
#' \item{\code{k}: }{The number of latent factors. Estimated with \code{\link[sva]{num.sv}} if not provided in the function arguments.}
#' \item{\code{X}: }{The provided \code{X} matrix. This might differ from the original \code{X} matrix up to a permutation of the columns. The interesting covariates are placed at the end of \code{X} and the uninteresting covariates are placed at the start of \code{X}.}
#' \item{\code{Y_tilde}: }{The rotated Y matrix.}
#' \item{\code{alpha}: }{The estimates of the coefficients of the unobserved confounders. Only returned if \code{do_factor = TRUE}}
#' \item{\code{Z3}: }{The estimates of the unobserved confounders (third part). Only returned if \code{do_factor = TRUE}}
#' \item{\code{sig_diag}: }{The estimates of the column-specific variances (NOT standard deviations). A vector. Only returned if \code{do_factor = TRUE}}
#' \item{\code{prior_df}: }{The estimated prior degrees of freedom. Returns NULL if \code{limmashrink = FALSE}. Only returned if \code{do_factor = TRUE}}
#' }
#'
#'
#' @author David Gerard
#'
rotate_model <- function(Y, X, k, cov_of_interest = ncol(X),
include_intercept = TRUE,
limmashrink = FALSE, fa_func = pca_naive,
fa_args = list(), do_factor = TRUE) {
assertthat::assert_that(is.logical(do_factor))
if (include_intercept) {
X_scaled <- apply(X, 2, function(x) {
x / sqrt(sum(x ^ 2))
})
int_term <- rep(1, length = nrow(X)) / sqrt(nrow(X))
any_int <- any(colSums( (int_term - X_scaled) ^ 2) < 10 ^ -14)
if (!any_int) {
X <- cbind(X, rep(1, length = nrow(X)))
}
}
if (is.null(k)) {
if (requireNamespace("sva", quietly = TRUE)) {
message("Number of confounders not provided so being estimated with package sva.")
k <- sva::num.sv(dat = t(Y), mod = X)
} else {
stop("If sva is not installed, then k needs to be provided. To install sva, run in R\n source(\"https://bioconductor.org/biocLite.R\")\n biocLite(\"sva\")")
}
}
assertthat::assert_that(k + ncol(X) < nrow(X))
## Place desired covariate as last covariate
X <- X[, c( (1:ncol(X))[-cov_of_interest], cov_of_interest), drop = FALSE]
cov_of_interest <- (ncol(X) - length(cov_of_interest) + 1):ncol(X)
qr_x <- qr_ident(X)
## multiply by sign so that it matches with beta_hat_ols
Q <- qr_x$Q
R <- qr_x$R
## Rotate the response by Q and partition it below.
Y_tilde <- crossprod(Q, Y)
y2start_index <- ncol(X) - length(cov_of_interest) + 1
y3start_index <- ncol(X) + 1
if (y2start_index > 1) {
Y1 <- Y_tilde[1:(y2start_index - 1), , drop = FALSE]
} else {
Y1 <- NULL
}
Y2 <- Y_tilde[y2start_index:(y3start_index - 1), , drop = FALSE]
Y3 <- Y_tilde[y3start_index:nrow(Y_tilde), , drop = FALSE]
if (do_factor) {
## Factor analysis on Y3, the rows of Y_tilde orthogonal to X.
fa_args$Y <- Y3
fa_args$r <- k
fa_out <- do.call(what = fa_func, args = fa_args)
alpha <- fa_out$alpha
sig_diag <- fa_out$sig_diag
Z3 <- fa_out$Z
## make sure the user didn't screw up the factor analysis.
assertthat::assert_that(is.vector(sig_diag))
assertthat::are_equal(length(sig_diag), ncol(Y))
assertthat::assert_that(all(sig_diag > 0))
if (k != 0) {
assertthat::assert_that(is.matrix(alpha))
assertthat::are_equal(ncol(alpha), k)
assertthat::are_equal(nrow(alpha), ncol(Y))
assertthat::assert_that(is.matrix(Z3))
assertthat::are_equal(ncol(Z3), k)
assertthat::are_equal(nrow(Z3), nrow(Y))
} else {
assertthat::assert_that(is.null(alpha))
assertthat::assert_that(is.null(Z3))
}
## Shrink variances if desired.
if (requireNamespace("limma", quietly = TRUE) & limmashrink) {
limma_out <- limma::squeezeVar(var = sig_diag,
df = nrow(X) - ncol(X) - k)
sig_diag <- limma_out$var.post
prior_df <- limma_out$df.prior
} else if (!requireNamespace("limma", quietly = TRUE) & limmashrink) {
stop("limmashrink = TRUE but limma not installed. To install limma, run in R:\n source(\"https://bioconductor.org/biocLite.R\")\n biocLite(\"limma\")")
} else {
prior_df <- NULL
}
}
R22 <- R[cov_of_interest, cov_of_interest, drop = FALSE]
if (y2start_index > 1) {
R11 <- R[(1:ncol(X))[-cov_of_interest], (1:ncol(X))[-cov_of_interest], drop = FALSE]
R12 <- R[(1:ncol(X))[-cov_of_interest], cov_of_interest, drop = FALSE]
} else {
R11 <- NULL
R12 <- NULL
}
## this is betahat from OLS, called Y1 in CATE.
betahat_ols <- solve(R22) %*% Y2
## create list for returns
return_list <- list()
return_list$betahat_ols <- betahat_ols
if (do_factor) {
return_list$alpha <- alpha
return_list$Z3 <- Z3
return_list$sig_diag <- sig_diag
return_list$prior_df <- prior_df
}
return_list$Q <- Q
return_list$R11 <- R11
return_list$R12 <- R12
return_list$R22 <- R22
return_list$Y1 <- Y1
return_list$Y2 <- Y2
return_list$Y3 <- Y3
return_list$k <- k
return_list$X <- X
return_list$Ytilde <- Y_tilde ## the rotated response
return(return_list)
}
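## A minimal sketch (not part of the package API) of what
## rotate_model() returns: rotating Y by the Q of the QR decomposition
## of X partitions the response into rows carrying the nuisance
## covariates (Y1), the covariates of interest (Y2), and the part
## orthogonal to X (Y3), which is what the factor analysis sees. The
## helper name demo_rotate_model is hypothetical.
demo_rotate_model <- function() {
  set.seed(5)
  n <- 11
  p <- 7
  q <- 3
  X <- matrix(stats::rnorm(n * q), nrow = n)
  Y <- matrix(stats::rnorm(n * p), nrow = n)
  rout <- rotate_model(Y = Y, X = X, k = 1, cov_of_interest = 2,
                       include_intercept = FALSE, do_factor = FALSE)
  ## The three blocks stack back into the full rotated response.
  stopifnot(max(abs(rbind(rout$Y1, rout$Y2, rout$Y3) - rout$Ytilde)) < 1e-12)
  dim(rout$R22) ## 1 x 1 block for the single covariate of interest
}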
#' An identified QR decomposition.
#'
#' This just re-jiggers the output from \code{\link[base]{qr}} so that
#' the upper triangular matrix has positive diagonal elements. This
#' one will only work if the column dimension is less than or equal to
#' the row dimension.
#'
#' @param X A matrix.
qr_ident <- function(X) {
assertthat::assert_that(ncol(X) <= nrow(X))
assertthat::assert_that(is.matrix(X))
qr_X <- qr(X)
R <- qr.R(qr_X, complete = TRUE)
Q <- qr.Q(qr_X, complete = TRUE)
sign_vec <- sign(diag(R))
Q[, 1:ncol(X)] <- t(t(Q)[1:ncol(X), , drop = FALSE] * sign_vec)
R[1:ncol(X), ] <- sign_vec * R[1:ncol(X), , drop = FALSE]
return(list(R = R, Q = Q))
}
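## A minimal sketch (not part of the package API) of the qr_ident()
## contract: Q %*% R recovers X and the leading diagonal of R is
## non-negative, which is what identifies the decomposition. The
## helper name demo_qr_ident is hypothetical.
demo_qr_ident <- function() {
  set.seed(1)
  X <- matrix(stats::rnorm(20), nrow = 5, ncol = 4)
  qr_out <- qr_ident(X)
  stopifnot(max(abs(qr_out$Q %*% qr_out$R - X)) < 1e-12,
            all(diag(qr_out$R) >= 0))
  qr_out
}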
#' Basic PCA.
#'
#' Glorified truncated SVD. The variance estimates are just the
#' column-wise mean-squares, except I divide by \code{nrow(Y) - r}
#' rather than by \code{nrow(Y)}. I allow \code{r = 0}.
#'
#'
#' @param Y A matrix of numerics. The data.
#' @param r the rank.
#'
#' @export
#'
#' @author David Gerard
pca_naive <- function (Y, r) {
assertthat::assert_that(is.matrix(Y))
assertthat::are_equal(length(r), 1)
assertthat::assert_that(r >= 0 & r < min(dim(Y)))
if (r == 0) {
Gamma <- NULL
Z <- NULL
Sigma <- apply(Y, 2, function(x) mean(x ^ 2))
} else {
svd_Y <- svd(Y)
Gamma <- svd_Y$v[, 1:r, drop = FALSE] %*% diag(svd_Y$d[1:r], r, r) /
sqrt(nrow(Y))
Z <- sqrt(nrow(Y)) * svd_Y$u[, 1:r, drop = FALSE]
Sigma <- apply(Y - Z %*% t(Gamma), 2, function(x) sum(x ^ 2)) / (nrow(Y) - r)
}
return(list(alpha = Gamma, Z = Z, sig_diag = Sigma))
}
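## A minimal sketch (not part of the package API) of a user-supplied
## factor analysis conforming to the fa_func contract of vruv4(): it
## takes a numeric matrix Y and a rank r and returns alpha (ncol(Y) by
## r), Z (nrow(Y) by r), and sig_diag (length ncol(Y), all positive).
## This hypothetical variant keeps the truncated SVD of pca_naive but
## swaps in a (rescaled) column-wise median absolute deviation for the
## residual variances as a robust alternative to the mean squares.
fa_robust_var <- function(Y, r) {
  stopifnot(is.matrix(Y), length(r) == 1, r >= 0, r < min(dim(Y)))
  if (r == 0) {
    return(list(alpha = NULL, Z = NULL,
                sig_diag = apply(Y, 2, function(x) mean(x ^ 2))))
  }
  svd_Y <- svd(Y)
  alpha <- svd_Y$v[, 1:r, drop = FALSE] %*% diag(svd_Y$d[1:r], r, r) /
      sqrt(nrow(Y))
  Z <- sqrt(nrow(Y)) * svd_Y$u[, 1:r, drop = FALSE]
  resid_mat <- Y - Z %*% t(alpha)
  ## Guard against zero MADs, which would violate the sig_diag > 0
  ## requirement of the contract.
  sig_diag <- pmax(apply(resid_mat, 2, stats::mad) ^ 2, .Machine$double.eps)
  return(list(alpha = alpha, Z = Z, sig_diag = sig_diag))
}
## It would then be passed along as, e.g.,
## vruv4(Y = Y, X = X, ctl = ctl, k = k, fa_func = fa_robust_var).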
#' Wrapper for cate's fa.em function.
#'
#' @inheritParams pca_naive
#'
#' @author David Gerard
#'
#' @export
#'
#' @references
#' \itemize{
#' \item{Jingshu Wang and Qingyuan Zhao (2015). cate: High
#' Dimensional Factor Analysis and Confounder Adjusted Testing and
#' Estimation. R package version
#' 1.0.4. \url{https://CRAN.R-project.org/package=cate}}
#' }
fa_ml <- function(Y, r) {
assertthat::assert_that(is.matrix(Y))
assertthat::are_equal(length(r), 1)
assertthat::assert_that(r >= 0 & r < min(dim(Y)))
if (r == 0) {
alpha <- NULL
Z <- NULL
sig_diag <- apply(Y, 2, function(x) mean(x ^ 2))
} else {
cout <- cate::fa.em(Y = Y, r = r)
alpha <- cout$Gamma
Z <- cout$Z
sig_diag <- cout$Sigma
}
return(list(alpha = alpha, Z = Z, sig_diag = sig_diag))
}
#' EM algorithm to find regression coefficients using t-likelihood
#' when variances are known up to scale.
#'
#' When using a non-standard t-distribution for your regression
#' likelihood, there is a simple latent variable representation that
#' allows us to develop an EM algorithm. This is a very similar
#' procedure to Lange et al (1989).
#'
#'
#' @param Y A matrix of numerics. The response variables, with one
#' column per covariate of interest.
#' @param alpha A matrix of numerics, the covariates. It must be that
#' \code{nrow(Y)} is equal to \code{nrow(alpha)}.
#' @param sig_diag A vector of numerics. The variances of the elements
#' in \code{Y}, but only assumed to be known up to a scaling
#' factor.
#' @param nu A positive numeric scalar. The known degrees of freedom
#' of the t-distribution.
#' @param lambda_init A positive numeric scalar. The initial value of
#' the variance inflation parameter. Defaults to the estimate
#' under Gaussian errors.
#' @param Z_init A matrix of numerics. The initial value of the
#' coefficients of \code{alpha}. Defaults to the estimate under
#' Gaussian errors.
#' @param control_args A list of control arguments for the EM
#' algorithm that is passed to SQUAREM.
#'
#' @author David Gerard
#'
#' @export
#'
#' @references
#' \itemize{
#' \item{Lange, K.L., Little, R.J. and Taylor, J.M., 1989. "Robust statistical modeling using the t distribution." \emph{Journal of the American Statistical Association}, 84(408), pp.881-896. \doi{10.1080/01621459.1989.10478852}}
#' }
#'
#' @seealso \code{\link{tregress_obj}} for the objective function that
#' this function maximizes, \code{\link{tregress_fix}} for the
#' fixed point iteration that this function calls.
tregress_em <- function(Y, alpha, sig_diag, nu, lambda_init = NULL,
Z_init = NULL, control_args = list()) {
assertthat::assert_that(is.matrix(Y))
assertthat::assert_that(is.matrix(alpha))
assertthat::assert_that(is.vector(sig_diag))
assertthat::assert_that(all(sig_diag > 0))
assertthat::assert_that(length(nu) == 1 | length(nu) == nrow(Y))
assertthat::assert_that(all(nu > 0))
assertthat::assert_that(is.list(control_args))
assertthat::are_equal(nrow(Y), nrow(alpha))
assertthat::are_equal(nrow(Y), length(sig_diag))
sig_diag_inv <- 1 / sig_diag
if (is.null(Z_init)) {
Z_init <- crossprod(solve(crossprod(alpha, sig_diag_inv * alpha)),
crossprod(alpha, sig_diag_inv * Y))
}
resid_init <- Y - alpha %*% Z_init
if (is.null(lambda_init)) {
lambda_init <- mean(resid_init ^ 2 * sig_diag_inv)
}
assertthat::assert_that(is.matrix(Z_init))
assertthat::are_equal(ncol(alpha), nrow(Z_init))
assertthat::are_equal(length(lambda_init), 1)
assertthat::assert_that(lambda_init > 0)
zlambda <- c(c(Z_init), lambda_init)
sqout <- SQUAREM::squarem(par = zlambda, fixptfn = tregress_fix,
objfn = tregress_obj, Y = Y, alpha = alpha,
sig_diag = sig_diag, nu = nu,
control = control_args)
Z <- matrix(sqout$par[-length(sqout$par)], nrow = ncol(alpha))
lambda <- sqout$par[length(sqout$par)]
return(list(Z = Z, lambda = lambda))
}
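## A minimal sketch (not part of the package API) of tregress_em() on
## simulated data: heavy-tailed noise is drawn with a known scale
## inflation lambda, and the EM estimate should land near the truth.
## The helper name demo_tregress_em is hypothetical.
demo_tregress_em <- function() {
  set.seed(2)
  p <- 500
  k <- 2
  nu <- 5
  lambda <- 2
  alpha <- matrix(stats::rnorm(p * k), nrow = p)
  sig_diag <- stats::rchisq(p, df = 4) / 4
  Z <- matrix(stats::rnorm(k), ncol = 1)
  ## (Y - alpha %*% Z) / sqrt(lambda * sig_diag) is t with nu df.
  Y <- alpha %*% Z + sqrt(lambda * sig_diag) * stats::rt(p, df = nu)
  tout <- tregress_em(Y = Y, alpha = alpha, sig_diag = sig_diag, nu = nu)
  tout$lambda ## should be roughly 2
}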
#' Fixed point iteration for EM algorithm for regression with
#' t-errors where the variance is known up to scale.
#'
#' @inheritParams tregress_em
#' @param zlambda A vector containing the current estimates of the
#' coefficients (Z) and the variance inflation parameter
#' (lambda). The last element is lambda and all other elements are
#' Z.
#'
#' @author David Gerard
#'
#' @seealso \code{\link{tregress_em}} where this function is called,
#' \code{\link{tregress_obj}} for the objective function that this
#' fixed-point iteration increases.
#'
tregress_fix <- function(zlambda, Y, alpha, sig_diag, nu) {
assertthat::assert_that(is.vector(zlambda))
assertthat::assert_that(is.matrix(Y))
assertthat::assert_that(is.matrix(alpha))
assertthat::assert_that(is.vector(sig_diag))
assertthat::are_equal(nrow(alpha), nrow(Y))
assertthat::assert_that(length(nu) == 1 | length(nu) == nrow(Y))
assertthat::assert_that(all(nu > 0))
if (length(nu) == nrow(Y)) {
nu <- matrix(rep(nu, times = ncol(Y)), nrow = nrow(Y), byrow = FALSE)
}
Z <- matrix(zlambda[-length(zlambda)], nrow = ncol(alpha))
lambda <- zlambda[length(zlambda)]
resid_vec <- Y - alpha %*% Z
wvec <- (nu + 1) / (resid_vec ^ 2 / (lambda * sig_diag) + nu)
wsig <- wvec / sig_diag
Znew <- matrix(NA, nrow = nrow(Z), ncol = ncol(Z))
for (zindex in 1:ncol(wsig)) {
Znew[, zindex] <- crossprod(solve(crossprod(alpha, wsig[, zindex] * alpha)),
crossprod(alpha, wsig[, zindex] * Y[, zindex]))
}
resid_new <- Y - alpha %*% Znew
lambda_new <- mean(resid_new ^ 2 * wsig)
zlambda_new <- c(c(Znew), lambda_new)
return(zlambda_new)
}
#' The likelihood for regression with t-errors where the variance is
#' known up to scale.
#'
#' @inheritParams tregress_fix
#'
#' @author David Gerard
#'
#' @seealso \code{\link{tregress_em}} where this function is called,
#' \code{\link{tregress_fix}} for the fixed point iteration that
#' increases this objective function.
tregress_obj <- function(zlambda, Y, alpha, sig_diag, nu) {
assertthat::assert_that(is.vector(zlambda))
assertthat::assert_that(is.matrix(Y))
assertthat::assert_that(is.matrix(alpha))
assertthat::assert_that(is.vector(sig_diag))
assertthat::are_equal(nrow(alpha), nrow(Y))
assertthat::assert_that(length(nu) == 1 | length(nu) == nrow(Y))
assertthat::assert_that(all(nu > 0))
Z <- matrix(zlambda[-length(zlambda)], nrow = ncol(alpha))
lambda <- zlambda[length(zlambda)]
standard_t <- (Y - alpha %*% Z) / sqrt(lambda * sig_diag)
if (length(nu) == nrow(Y)) {
nu <- matrix(rep(nu, times = ncol(Y)), nrow = nrow(Y), byrow = FALSE)
}
llike <- sum(stats::dt(x = standard_t, df = nu, log = TRUE)) -
prod(dim(Y)) * log(lambda) / 2 - ncol(Y) * sum(log(sig_diag)) / 2
return(llike)
}
#' Wrapper for \code{\link{tregress_obj}}.
#'
#' This is mostly so I can run \code{stats::optim} with just
#' \code{lambda}.
#'
#' @author David Gerard
#'
#' @inheritParams tregress_obj
#' @param lambda A positive numeric. The variance inflation parameter.
#' @param Z A matrix of numerics with one column. The coefficients.
tregress_obj_wrapper <- function(lambda, Z, Y, alpha, sig_diag, nu) {
zlambda <- c(c(Z), lambda)
llike <- tregress_obj(zlambda = zlambda, Y = Y, alpha = alpha,
sig_diag = sig_diag, nu = nu)
return(llike)
}
#' RUV4's second step.
#'
#' This is the old version that only works for one covariate of
#' interest. Use \code{\link{cruv4_multicov}} for the more recent
#' version.
#'
#' @param betahat_ols A matrix of numerics with one column. The ols
#' estimates of beta.
#' @param alpha_scaled A matrix of numerics. The estimated
#' coefficients of the unobserved confounders.
#' @param sig_diag_scaled A vector of positive numerics. The estimated
#' variances (squared standard errors) of \code{betahat_ols}.
#' @param degrees_freedom A positive numeric. The degrees of freedom
#' of the t-likelihood if using it.
#' @inheritParams vruv4
#'
#' @author David Gerard
#'
cruv4 <- function(betahat_ols, alpha_scaled, sig_diag_scaled, ctl, degrees_freedom,
gls = TRUE, likelihood = "t") {
assertthat::assert_that(is.matrix(betahat_ols))
assertthat::assert_that(is.matrix(alpha_scaled))
assertthat::assert_that(is.vector(sig_diag_scaled))
assertthat::assert_that(all(sig_diag_scaled > 0))
assertthat::assert_that(is.logical(ctl))
assertthat::assert_that(length(degrees_freedom) == 1 | length(degrees_freedom) == ncol(betahat_ols))
assertthat::assert_that(all(degrees_freedom > 0))
assertthat::are_equal(nrow(betahat_ols), nrow(alpha_scaled))
k <- ncol(alpha_scaled)
## Use control genes to jointly estimate Z1 and variance scaling parameter.
Yc <- betahat_ols[ctl, , drop = FALSE]
if (k != 0) {
alphac <- alpha_scaled[ctl, , drop = FALSE]
sig_diag_inv <- 1 / sig_diag_scaled[ctl]
if (gls) {
Z1 <- crossprod(solve(crossprod(alphac, sig_diag_inv * alphac)),
crossprod(alphac, sig_diag_inv * Yc))
} else {
Z1 <- crossprod(solve(crossprod(alphac, alphac)),
crossprod(alphac, Yc))
}
resid_mat <- Yc - alphac %*% Z1
betahat <- betahat_ols - alpha_scaled %*% Z1
} else {
Z1 <- NULL
resid_mat <- Yc
betahat <- betahat_ols
}
## Gaussian MLE of variance inflation parameter.
multiplier <- mean(resid_mat ^ 2 / sig_diag_scaled[ctl])
## T-likelihood regression and estimate variance inflation
## parameter using control genes.
if (likelihood == "t") {
if (k != 0) {
if (!gls) {
message("gls = FALSE not supported for t-likelihood, using gls = TRUE.")
}
alphac <- alpha_scaled[ctl, , drop = FALSE]
tout <- tregress_em(Y = Yc, alpha = alphac,
sig_diag = sig_diag_scaled[ctl],
nu = degrees_freedom, lambda_init = multiplier,
Z_init = Z1)
Z1 <- tout$Z
betahat <- betahat_ols - alpha_scaled %*% Z1
multiplier <- tout$lambda
} else if (k == 0) {
tout <- stats::optim(par = multiplier, fn = tregress_obj_wrapper,
Z = matrix(0, nrow = 1, ncol = 1), Y = Yc,
alpha = matrix(0, nrow = nrow(Yc), ncol = 1),
sig_diag = sig_diag_scaled[ctl],
nu = degrees_freedom, method = "Brent",
lower = 0, upper = 10,
control = list(fnscale = -1))
multiplier <- tout$par
}
}
## Output values
sebetahat <- sqrt(sig_diag_scaled * multiplier)
ruv4_out <- list()
ruv4_out$multiplier <- multiplier
ruv4_out$betahat_ols <- betahat_ols
ruv4_out$sebetahat_ols <- sqrt(sig_diag_scaled)
ruv4_out$betahat <- betahat
ruv4_out$sebetahat <- sebetahat
ruv4_out$tstats <- betahat / sebetahat
ruv4_out$pvalues <- 2 * (stats::pt(q = -abs(ruv4_out$tstats),
df = degrees_freedom))
ruv4_out$Z1 <- Z1
return(ruv4_out)
}
#' Variance inflated RUV-inverse.
#'
#' RUV-inverse as described in Gagnon-Bartsch et al (2013) is just
#' RUV4 using the maximum number of confounders allowed by the model,
#' followed by estimating the variances using a method-of-moments type
#' approach. \code{vruvinv} is similar in spirit to this by using the
#' maximum number of confounders allowed by the model, but we still
#' estimate the variance inflation using the confounders. We force
#' \code{limmashrink = TRUE}, otherwise there is only one degree of
#' freedom to estimate the variance.
#'
#' Hence, this is just a wrapper for \code{\link{vruv4}}, but with a
#' special choice for the number of confounders and forcing
#' \code{limmashrink = TRUE}.
#'
#' @inheritParams vruv4
#'
#' @return See \code{\link{vruv4}} for the elements returned.
#'
#' @author David Gerard
#'
#' @export
#'
#' @references
#' \itemize{
#' \item{Gagnon-Bartsch, J.A., Jacob, L. and Speed, T.P., 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics, University of California. \url{https://statistics.berkeley.edu/tech-reports/820}}
#' }
vruvinv <- function(Y, X, ctl, cov_of_interest = ncol(X),
likelihood = c("t", "normal"),
include_intercept = TRUE, fa_func = pca_naive,
fa_args = list(), adjust_bias = FALSE) {
k1 <- sum(ctl) - ncol(X) - include_intercept - adjust_bias - 1
k2 <- nrow(X) - ncol(X) - include_intercept - adjust_bias - 1
k <- min(k1, k2)
assertthat::assert_that(k > 0)
likelihood <- match.arg(likelihood)
vout <- vruv4(Y = Y, X = X, ctl = ctl, k = k, likelihood = likelihood, limmashrink = TRUE,
include_intercept = include_intercept, gls = TRUE, fa_func = fa_func,
fa_args = fa_args, adjust_bias = adjust_bias)
return(vout)
}
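## A minimal usage sketch (not part of the package API) for vruvinv(),
## reusing the simulation scheme from the vruv4() examples above. Note
## that vruvinv() forces limmashrink = TRUE, so the limma package must
## be installed. The helper name demo_vruvinv is hypothetical.
demo_vruvinv <- function() {
  set.seed(71)
  n <- 13
  p <- 101
  q <- 3
  k <- 2
  ctl <- rep(FALSE, p)
  ctl[1:31] <- TRUE
  X <- matrix(stats::rnorm(n * q), nrow = n)
  B <- matrix(stats::rnorm(q * p), nrow = q)
  B[, ctl] <- 0 ## negative controls are unassociated with X
  Z <- matrix(stats::rnorm(n * k), nrow = n)
  A <- matrix(stats::rnorm(k * p), nrow = k)
  E <- matrix(stats::rnorm(n * p, sd = 1 / 2), nrow = n)
  Y <- X %*% B + Z %*% A + E
  vout <- vruvinv(Y = Y, X = X, ctl = ctl, cov_of_interest = 2)
  vout$multiplier ## estimated variance inflation
}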