ash_ruv4: Use control genes to estimate hidden confounders and variance...
In dcgerard/vicar: Various Ideas for Confounder Adjustment in Regression


This function performs a variant of Removing Unwanted Variation 4-step (RUV4) (Gagnon-Bartsch et al., 2013) in which the control genes are used not only to estimate the hidden confounders, but also to estimate a variance inflation parameter. This variance inflation step is akin to the "empirical null" approach of Efron (2004). After this procedure, Adaptive SHrinkage (ASH) (Stephens, 2016) is performed on the coefficient estimates and the inflated standard errors.
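One way to picture the variance inflation step is the following base-R sketch. It is only an illustration of the general idea, under the assumption that control-gene estimates divided by their standard errors should have unit variance when the standard errors are well calibrated; it is not the estimator used inside the package, and all variable names are hypothetical.

```r
set.seed(3)

# Hypothetical control-gene results: the true effects are zero, but the
# estimates vary more than their reported standard errors suggest.
m        <- 1000
se       <- rep(0.2, m)                         # reported standard errors
beta_ctl <- rnorm(m, mean = 0, sd = 1.5 * se)   # true sd is 1.5 times larger

# z-statistics of the controls; their mean square estimates the
# multiplicative variance inflation (about 1.5^2 in this simulation).
z      <- beta_ctl / se
xi_hat <- mean(z^2)

# Inflate the standard errors before any downstream shrinkage step.
se_inflated <- sqrt(xi_hat) * se
```

In this toy setting the inflated standard errors are the quantities that would be handed to the shrinkage step in place of the original ones.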

```r
ash_ruv4(
  Y,
  X,
  ctl = NULL,
  k = NULL,
  cov_of_interest = ncol(X),
  likelihood = c("t", "normal"),
  ash_args = list(),
  limmashrink = TRUE,
  degrees_freedom = NULL,
  include_intercept = TRUE,
  gls = TRUE,
  fa_func = pca_naive,
  fa_args = list(),
  scale_var = TRUE
)
```

`Y`	A matrix of numerics. These are the response variables where each column has its own variance. In a gene expression study, the rows are the individuals and the columns are the genes.
`X`	A matrix of numerics. The covariates of interest.
`ctl`	A vector of logicals of length `ncol(Y)`. If position i is `TRUE` then position i is considered a negative control.
`k`	A non-negative integer. The number of unobserved confounders. If not specified and the R package sva is installed, then this function will estimate the number of hidden confounders using the methods of Buja and Eyuboglu (1992).
`cov_of_interest`	A positive integer. The column number of the covariate in `X` whose coefficients you are interested in. The remaining covariates are treated as nuisance parameters and are regressed out by OLS. `ash_ruv4` currently supports only one covariate of interest.
`likelihood`	Either `"normal"` or `"t"`. If `likelihood = "t"`, then the user may provide the degrees of freedom via `degrees_freedom`.
`ash_args`	A list of arguments to pass to ash. See `ash.workhorse` for details.
`limmashrink`	A logical. Should we apply hierarchical shrinkage to the variances (`TRUE`) or not (`FALSE`)? If `degrees_freedom = NULL` and `limmashrink = TRUE` and `likelihood = "t"`, then we'll also use the limma returned degrees of freedom.
`degrees_freedom`	If `likelihood = "t"`, then this is the user-defined degrees of freedom for that distribution. If `degrees_freedom` is `NULL`, then the degrees of freedom will be the sample size minus the number of covariates minus `k`.
`include_intercept`	A logical. If `TRUE`, then it will check `X` to see if it has an intercept term. If not, then it will add an intercept term. If `FALSE`, then `X` will be unchanged.
`gls`	A logical. Should we use generalized least squares (`TRUE`) or ordinary least squares (`FALSE`) for estimating the confounders? The OLS version is equivalent to using RUV4 to estimate the confounders.
`fa_func`	A factor analysis function. The function must have as inputs a numeric matrix `Y` and a rank (numeric scalar) `r`. It must output numeric matrices `alpha` and `Z` and a numeric vector `sig_diag`. `alpha` is the estimate of the coefficients of the unobserved confounders, so it must be an `r` by `ncol(Y)` matrix. `Z` must be an `r` by `nrow(Y)` matrix. `sig_diag` is the estimate of the column-wise variances so it must be of length `ncol(Y)`. The default is the function `pca_naive` that just uses the first `r` singular vectors as the estimate of `alpha`. The estimated variances are just the column-wise mean square.
`fa_args`	A list. Additional arguments you want to pass to fa_func.
`scale_var`	A logical. Should we use the variance inflation parameter in the estimated standard errors when inserting into `ash.workhorse` (`TRUE`) or not (`FALSE`)?
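As an illustration of the `fa_func` contract described above, here is a minimal sketch of a user-supplied factor analysis function based on a truncated SVD. The name `fa_svd_sketch` is hypothetical and not part of vicar, and the sketch assumes `r >= 1`. One labeled assumption: `Z` is returned with one row per row of `Y` (observations in rows). The description above states `r` by `nrow(Y)`, so compare against the output of `pca_naive` in your installed version and use `t(Z)` if that orientation is required.

```r
# Hypothetical factor analysis function following the fa_func contract:
# inputs are a numeric matrix Y and a rank r; the output is a list with
# elements alpha, Z, and sig_diag.
fa_svd_sketch <- function(Y, r) {
  stopifnot(is.matrix(Y), r >= 1, r < min(dim(Y)))
  n  <- nrow(Y)
  sv <- svd(Y, nu = r, nv = r)

  # Scores with one row per observation (assumed orientation, see above).
  Z <- sqrt(n) * sv$u

  # Loadings: an r by ncol(Y) matrix, so that Z %*% alpha is the rank-r fit.
  alpha <- diag(sv$d[1:r], r, r) %*% t(sv$v) / sqrt(n)

  # Column-wise residual variances, one per column of Y.
  resid    <- Y - Z %*% alpha
  sig_diag <- colSums(resid^2) / (n - r)

  list(alpha = alpha, Z = Z, sig_diag = sig_diag)
}

# Quick shape check on random data.
set.seed(4)
Y_demo <- matrix(rnorm(20 * 30), nrow = 20, ncol = 30)
out    <- fa_svd_sketch(Y_demo, r = 2)
dim(out$alpha)        # r by ncol(Y)
length(out$sig_diag)  # ncol(Y)
```

A function like this would be supplied as `fa_func = fa_svd_sketch`, with any additional arguments passed through `fa_args`.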

The model is

Y = XB + ZA + E,

where Y is a matrix of responses (e.g. log-transformed gene expression levels), X is a matrix of covariates, B is a matrix of coefficients, Z is a matrix of unobserved confounders, A is a matrix of unobserved coefficients of the unobserved confounders, and E is the noise matrix, whose elements are independent Gaussian with a common variance within each column (the variance may differ between columns). The rows of Y are the observations (e.g. individuals) and the columns of Y are the response variables (e.g. genes).
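To make the dimensions in this model concrete, the sketch below simulates data from Y = XB + ZA + E in base R and then, only if the vicar and ashr packages happen to be installed, calls `ash_ruv4` on it. The simulation settings (sample size, number of genes, which genes are controls) are arbitrary assumptions chosen for illustration, and `limmashrink = FALSE` is set only to avoid needing limma.

```r
set.seed(1)
n <- 40    # observations (rows of Y)
p <- 200   # response variables, e.g. genes (columns of Y)
k <- 2     # unobserved confounders

# X: intercept plus one covariate of interest (a two-group indicator).
X <- cbind(1, rep(c(0, 1), each = n / 2))

# B: 2 by p coefficients; the first 150 genes have no effect of interest.
B <- rbind(rnorm(p), c(rep(0, 150), rnorm(50)))

# Z: n by k confounders, correlated with the covariate of interest.
Z <- matrix(rnorm(n * k), n, k) + 0.5 * X[, 2]

# A: k by p coefficients of the confounders.
A <- matrix(rnorm(k * p), k, p)

# E: independent Gaussian noise with one standard deviation per column.
sd_col <- sqrt(rchisq(p, df = 5) / 5)
E <- sweep(matrix(rnorm(n * p), n, p), 2, sd_col, `*`)

Y <- X %*% B + Z %*% A + E

# Negative controls: a logical vector of length ncol(Y); here the first
# 50 genes, all of which have a zero coefficient of interest.
ctl <- rep(c(TRUE, FALSE), times = c(50, p - 50))

if (requireNamespace("vicar", quietly = TRUE) &&
    requireNamespace("ashr", quietly = TRUE)) {
  fit <- vicar::ash_ruv4(Y = Y, X = X, ctl = ctl, k = k,
                         cov_of_interest = 2, limmashrink = FALSE)
  str(fit$ruv4, max.level = 1)
}
```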

This model is fit using a two-step approach proposed in Gagnon-Bartsch et al. (2013) and described in Wang et al. (2017), modified to include estimating a variance inflation parameter. Rather than use OLS in the second step of this two-step procedure, we estimate the coefficients using Adaptive SHrinkage (ASH) (Stephens, 2016). In the current implementation, only the coefficients of one covariate can be estimated using ASH. The rest are regressed out using OLS.
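The two-step idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the vicar implementation: it uses plain PCA for the factor analysis, OLS rather than GLS in the control-gene regression, and omits both the variance inflation and the ASH steps. All variable names and simulation settings are hypothetical.

```r
set.seed(2)
n <- 100; p <- 500; k <- 1

# Simulated data with one confounder that is correlated with x.
x    <- rep(c(0, 1), each = n / 2)
X    <- cbind(1, x)
beta <- c(rep(0, 400), rnorm(100))   # effects of interest; first 400 are null
ctl  <- seq_len(p) <= 100            # first 100 null genes serve as controls
Z    <- matrix(2 * x + rnorm(n), n, k)
A    <- matrix(rnorm(k * p), k, p)
Y    <- X %*% rbind(0, beta) + Z %*% A + matrix(rnorm(n * p), n, p)

# Unadjusted OLS estimates for the covariate of interest (row 2).
fit_ols  <- lm.fit(X, Y)
beta_ols <- fit_ols$coefficients[2, ]

# Step 1: factor analysis (here plain PCA) on the OLS residuals gives
# an estimate of the confounder coefficients, a k by p matrix.
A_hat <- t(svd(fit_ols$residuals, nu = 0, nv = k)$v)

# Step 2: for control genes the true effect is zero, so their OLS
# estimates reflect only confounding plus noise. Regressing them on the
# estimated loadings recovers the confounding direction, which is then
# subtracted from every gene's estimate.
z_hat    <- lm.fit(t(A_hat[, ctl, drop = FALSE]), beta_ols[ctl])$coefficients
beta_adj <- beta_ols - drop(z_hat %*% A_hat)

# Mean squared error against the simulated truth, before and after.
mse <- c(ols      = mean((beta_ols - beta)^2),
         adjusted = mean((beta_adj - beta)^2))
mse
```

In `ash_ruv4`, the analogous adjusted estimates, together with their (optionally inflated) standard errors, are what get passed on to ASH instead of being reported directly.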

Except for the list `ruv4`, the returned values are exactly the same as those of `ash.workhorse`. See that function for more details. The elements of `ruv4` are exactly the same as those returned by `vruv4`.

David Gerard

Buja, A. and Eyuboglu, N., 1992. "Remarks on parallel analysis." Multivariate Behavioral Research, 27(4), pp. 509-540. doi: 10.1207/s15327906mbr2704_2
Efron, B., 2004. "Large-scale simultaneous hypothesis testing: the choice of a null hypothesis." Journal of the American Statistical Association, 99(465), pp. 96-104. doi: 10.1198/016214504000000089
Gagnon-Bartsch, J., Jacob, L. and Speed, T.P., 2013. "Removing unwanted variation from high dimensional data with negative controls." Berkeley: Department of Statistics, University of California. https://statistics.berkeley.edu/tech-reports/820
Stephens, M., 2016. "False discovery rates: a new deal." Biostatistics, 18(2), pp. 275-294. doi: 10.1093/biostatistics/kxw041
Wang, J., Zhao, Q., Hastie, T. and Owen, A.B., 2017. "Confounder adjustment in multiple hypothesis testing." The Annals of Statistics, 45(5), pp. 1863-1894. doi: 10.1214/16-AOS1511