RUVIII_C: RUV-III-C
In RUVIIIC: RUV-III-C

View source: R/RUVIII_C.R

Apply RUV-III-C, a variation of RUV-III that only uses non-missing values

RUVIII_C(
  k,
  Y,
  M,
  toCorrect,
  controls,
  withExtra = FALSE,
  withW = FALSE,
  withAlpha = FALSE,
  version = "CPP",
  progress = TRUE,
  ...
)

`k`	The number of factors of unwanted variation to remove
`Y`	The input data matrix. Must be a matrix, not a data.frame. It should contain missing (NA) values, rather than zeros. No additional transformation is applied to the input data.
`M`	The design matrix containing information about technical replicates. It should not contain an intercept term!
`toCorrect`	The names of the variables to correct using RUV-III-C
`controls`	The names of the control variables which are known to be constant across the observations
`withExtra`	Should we generate extra information?
`withW`	Should we generate the matrices W giving information about the unwanted factors, for every peptide?
`withAlpha`	Should we generate, per-peptide, the matrix alpha giving the effects of the unwanted factors?
`version`	The version of the underlying code to use. Must be either "CPP" or "R"
`progress`	Should a progress bar be displayed?
`...`	Other arguments for the prototype R code. Supported values are `filename` for a checkpoint file, and `batchSize` for the frequency with which the checkpoint file is written.
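
The requirements on `Y` and `M` above can be illustrated with a minimal sketch in base R. The data frame `runs` and its columns `grouping`, `pepA` and `pepB` are hypothetical names invented for this illustration; they are not part of the package or its bundled data.

```r
# Hypothetical toy data: four runs, two samples, each measured twice.
runs <- data.frame(
  grouping = factor(c("s1", "s1", "s2", "s2")),
  pepA = c(1200, 1100, 0, 950),
  pepB = c(300, 0, 280, 310)
)

# Y must be a numeric matrix, not a data.frame.
Y <- data.matrix(runs[, c("pepA", "pepB")])
# Record undetected peptides as NA rather than zero.
Y[Y == 0] <- NA
# No transformation is applied internally, so take logs here if desired.
Y <- log10(Y)

# One indicator column per replicate group; "- 1" drops the intercept.
M <- model.matrix(~ grouping - 1, data = runs)
```

Each row of `M` then has a single 1 marking the replicate group that the run belongs to, and there is no `(Intercept)` column.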

RUV-III is a sophisticated method for removing unwanted variation. The key difficulty in removing unwanted variation is distinguishing wanted from unwanted variation. RUV-III solves this by relying on technical replication, and a list of variables (known as negative control variables) which are known a priori to be constant across all observations. Any variation in the negative control variables across the dataset is (by assumption) unwanted. So we can distinguish wanted from unwanted variation, and therefore estimate the unwanted variation and remove it.

One problem with this approach is the presence of “missing” or zero values in certain application domains. For example, in proteomics it will sometimes be the case that a protein or peptide is not detected in a specific technical replicate of a sample, for purely technical reasons relating to data collection. These missing values are often not related to censoring or the limit of detection. Similar problems occur in metabolomics and single-cell transcriptomics. In all these cases, the metabolite, gene or peptide will be recorded as a zero in the data matrix. Where this type of variation occurs between technical replicates (e.g. one records a zero value and the other records a non-zero value), it is not correctable.

Regardless of the reason for these zeros, and whether they are accurate or not, zero values are not affected by technical variation, which breaks an assumption of the RUV-III model. In the case that a zero value is incorrect, more serious problems occur. The discrepancies between a pair of technical replicates due to zero values will appear to be much larger than the discrepancies due to other (correctable) technical factors. RUV-III will attempt to correct for the larger (uncorrectable) discrepancy, and ignore the correctable technical factors.
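
As a rough illustration of the scale of the problem, the sketch below compares one pair of technical replicates on the log10 scale. All values and peptide names are hypothetical and chosen only to show the effect.

```r
# Hypothetical log10 intensities for three peptides in two technical replicates.
# pepC was not detected in the first replicate and is recorded as zero.
rep1 <- c(pepA = 6.02, pepB = 5.48, pepC = 0)
rep2 <- c(pepA = 6.10, pepB = 5.55, pepC = 5.90)

# Absolute discrepancy between the replicates, per peptide.
discrepancy <- abs(rep1 - rep2)
print(discrepancy)
```

The discrepancy for `pepC` is far larger than the discrepancies for the two detected peptides, so it would dominate any estimate of unwanted variation based on replicate differences.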

RUV-III-C is a variation of RUV-III that attempts to solve this problem by applying RUV-III separately to every variable. If variable X is being corrected, we take the rows of the data matrix for which X is non-missing. RUV-III is then applied, and the corrected values of X are retained. The corrected values of all other variables are discarded. Note that when we take a subset of the rows of the data matrix, other columns besides X will still have missing values. These values are replaced with zero in order to apply RUV-III. No additional transformation is applied to the input data matrix. If normalization should be applied on the log-scale, then logged data must be input.
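
The row-subsetting and zero-filling step described above can be sketched in base R as follows. This is an illustration only, not the package's implementation: `subset_for_variable` is a hypothetical helper, the toy matrix is invented, and the RUV-III fit itself is not shown.

```r
# Hypothetical helper, not part of the RUVIIIC package.
# Keeps the rows where `variable` is observed, then zero-fills the
# missing values that remain in the other columns.
subset_for_variable <- function(Y, variable) {
  keep <- !is.na(Y[, variable])
  sub <- Y[keep, , drop = FALSE]
  sub[is.na(sub)] <- 0
  sub
}

# Toy log-scale data: four runs, two peptides, one missing value each.
Y <- matrix(
  c(5.1, NA, 5.3, 5.0,
    NA, 4.2, 4.4, 4.1),
  nrow = 4,
  dimnames = list(paste0("run", 1:4), c("pepA", "pepB"))
)

# Data that would be passed to RUV-III when correcting pepA.
subA <- subset_for_variable(Y, "pepA")
```

Here `run2` is dropped because `pepA` is missing in it, and the missing `pepB` value in `run1` becomes zero. After RUV-III is applied to `subA`, only the corrected `pepA` column would be kept.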

There are two implementations of this function: the preferred C++ version and the original prototype R code. Select the implementation using the version argument, which must be either "CPP" or "R".

If withExtra = FALSE, returns a matrix. If withExtra = TRUE, returns a list with entries named newY, residualDimensions and W.

data(crossLab)
#Design matrix containing information about which runs are technical replicates of each other. 
#In this case, random pairings of mass-spec runs analysing the same sample, at different sites.
#Note that we specify no intercept term!
M <- model.matrix(~ grouping - 1, data = peptideData)
#Get out the list of peptides, both HEK (control) and peptides of interest.
peptides <- setdiff(colnames(peptideData), c("filename", "site", "mixture", "Date", "grouping"))
#Reduce the data matrix to only the peptide data
onlyPeptideData <- data.matrix(peptideData[, peptides])
#All the human peptides are potential controls. That is, everything that's not an SIS peptide.
potentialControls <- setdiff(peptides, sisPeptides)
#But we want to use controls that are always found
potentialControlsAlwaysFound <- names(which(apply(onlyPeptideData[, potentialControls], 2, 
    function(x) sum(is.na(x))) == 0))
#Actually run correction
#Set number of threads for CRAN
try(RUVIIIC::omp_set_num_threads(2L), silent=TRUE)
results <- RUVIII_C(k = 11, Y = log10(onlyPeptideData), M = M, toCorrect = 
    colnames(onlyPeptideData), controls = potentialControlsAlwaysFound)