getK: Get K
In ruv: Detect and Remove Unwanted Variation using Negative Controls


Finds an often-suitable value of K for use in RUV-4.

getK(Y, X, ctl, Z = 1, eta = NULL, include.intercept = TRUE, fullW0 = NULL,
     cutoff = NULL, method = "select", l = 1, inputcheck = TRUE)

`Y`	The data. An m by n matrix, where m is the number of samples and n is the number of features.
`X`	The factor(s) of interest. An m by p matrix, where m is the number of samples and p is the number of factors of interest. Note that X should have only a single column, i.e. p = 1; if X has more than one column, only column `l` will be used (see below).
`ctl`	An index vector to specify the negative controls. Either a logical vector of length n or a vector of integers.
`Z`	Any additional covariates to include in the model, typically an m by q matrix. Factors and data frames are also permissible, and are converted to a matrix by `design.matrix`. Alternatively, may simply be 1 (the default) for an intercept term. May also be `NULL`.
`eta`	Gene-wise (as opposed to sample-wise) covariates. These covariates are adjusted for by RUV-1 before any further analysis proceeds. Can be either (1) a matrix with n columns, (2) a matrix with n rows, (3) a data frame with n rows, (4) a vector or factor of length n, or (5) simply 1, for an intercept term.
`include.intercept`	Applies to both `Z` and `eta`. When `Z` or `eta` (or both) is specified (not `NULL`) but does not already include an intercept term, this will automatically include one. If only one of `Z` or `eta` should include an intercept, this variable should be set to `FALSE`, and the intercept term should be included manually where desired.
`fullW0`	Can be included to speed up execution. Is returned by previous calls of `getK`, `RUV4`, `RUVinv`, or `RUVrinv` (see below).
`cutoff`	Specify an alternative cut-off. Default is the (approximate) 90th percentile of the distribution of the first singular value of an m by n Gaussian matrix.
`method`	Can be set to either `leave1out`, `fast`, or `select`. `leave1out` is the preferred method but may be slow; `fast` is an approximate method that is faster but may give poor results if n_c (the number of negative controls) is not much larger than m; and `select` (the default) tries to choose for you.
`l`	Which column of X to use in the getK algorithm.
`inputcheck`	Perform a basic sanity check on the inputs, and issue a warning if there is a problem.
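
As a small illustration of the two accepted forms of `ctl`, the base R sketch below builds a logical vector of length n and the matching integer index vector, and checks that both select the same columns of a toy matrix. The dimensions and the names `ctl_logical` and `ctl_integer` are hypothetical, chosen only for this example.

```r
# Toy data: 2 samples (rows) by n = 8 features (columns)
n <- 8
nc <- 3
Y <- matrix(seq_len(2 * n), nrow = 2, ncol = n)

# Form 1: logical vector of length n, TRUE for negative controls
ctl_logical <- rep(FALSE, n)
ctl_logical[1:nc] <- TRUE

# Form 2: integer vector holding the column indices of the controls
ctl_integer <- which(ctl_logical)

# Both forms pick out the same control columns
same_columns <- identical(Y[, ctl_logical], Y[, ctl_integer])
```

Either form should be usable wherever `ctl` is expected; the logical form is the one used in the example at the end of this page.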

A list containing

`k`	The estimated value of k.
`cutoff`	The cutoff value used.
`sizeratios`	A measure of the relative sizes of the rows of alpha.
`fullW0`	Can be used to speed up future calls of RUV4.
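
To give a feel for the default `cutoff` described in the arguments (the approximate 90th percentile of the first singular value of an m by n Gaussian matrix), the sketch below estimates that quantity by simulation in base R. This is only a Monte Carlo illustration under assumed toy dimensions; it is not claimed to be how the package computes its approximation, and the names `first_sv` and `q90` are hypothetical.

```r
set.seed(42)
m <- 10     # number of samples (toy value)
n <- 200    # number of features (toy value)
reps <- 200 # number of simulated Gaussian matrices

# Largest singular value of each simulated m by n standard Gaussian matrix
first_sv <- replicate(
  reps,
  svd(matrix(rnorm(m * n), m, n), nu = 0, nv = 0)$d[1]
)

# Empirical 90th percentile of the first singular value
q90 <- unname(quantile(first_sv, 0.9))
```

The estimate should land a little above sqrt(n) and close to sqrt(m) + sqrt(n), the usual rough scale for the largest singular value of such a matrix.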

This value of K will not be the best choice in all cases. Moreover, it will often be a poor choice of K for use with RUV2. See Gagnon-Bartsch and Speed (2012) for commentary on estimating k.

Johann Gagnon-Bartsch johanngb@umich.edu

Using control genes to correct for unwanted variation in microarray data. Gagnon-Bartsch and Speed, 2012. Available at: http://biostatistics.oxfordjournals.org/content/13/3/539.full.

Removing Unwanted Variation from High Dimensional Data with Negative Controls. Gagnon-Bartsch, Jacob, and Speed, 2013. Available at: http://statistics.berkeley.edu/tech-reports/820.

RUV4

## Create some simulated data
m = 50
n = 10000
nc = 1000
p = 1
k = 20
ctl = rep(FALSE, n)
ctl[1:nc] = TRUE
X = matrix(c(rep(0,floor(m/2)), rep(1,ceiling(m/2))), m, p)
beta = matrix(rnorm(p*n), p, n)
beta[,ctl] = 0
W = matrix(rnorm(m*k),m,k)
alpha = matrix(rnorm(k*n),k,n)
epsilon = matrix(rnorm(m*n),m,n)
Y = X%*%beta + W%*%alpha + epsilon

## Run getK (assumes the ruv package is installed)
library(ruv)
temp = getK(Y, X, ctl)
K = temp$k
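
A possible follow-up, sketched here on a smaller simulated data set so that it runs quickly, is to pass the estimated k and the returned `fullW0` on to `RUV4`, as the Value section suggests. The dimensions below are arbitrary assumptions, the calls are wrapped in a `requireNamespace` check so the block also runs where ruv is not installed, and the positional use of k as the fourth argument of `RUV4` should be checked against the `RUV4` help page.

```r
# Smaller simulation with the same structure as the example above
set.seed(1)
m <- 20; n <- 500; nc <- 100; p <- 1; k <- 3
ctl <- rep(FALSE, n)
ctl[1:nc] <- TRUE
X <- matrix(c(rep(0, floor(m / 2)), rep(1, ceiling(m / 2))), m, p)
beta <- matrix(rnorm(p * n), p, n)
beta[, ctl] <- 0                    # negative controls have no true effect
W <- matrix(rnorm(m * k), m, k)     # unwanted factors
alpha <- matrix(rnorm(k * n), k, n)
Y <- X %*% beta + W %*% alpha + matrix(rnorm(m * n), m, n)

if (requireNamespace("ruv", quietly = TRUE)) {
  # Estimate k, then reuse fullW0 to speed up the RUV4 call
  est <- ruv::getK(Y, X, ctl)
  fit <- ruv::RUV4(Y, X, ctl, est$k, fullW0 = est$fullW0)
}
```

As the warning above notes, the estimate is not guaranteed to be the best choice, so it can be worth comparing results across a few nearby values of k.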