estVC: Estimate Contributions to Variance
In dmoskowitz/mavric: Measuring Association between Variance and Covariates

Description Usage Arguments Details Value See Also Examples

This the main end-user function of mavric. Given a data matrix and a set of covariates, this function layers clustering analysis atop PCA to attribute axes of variance to underlying sources.

estVC(data, annotation, contrastlist, ntop = Inf, sigcor = TRUE,
  alpha = 0.01, effsize = 0.05, discthresh = 1/3, ncores = 1,
  verbose = TRUE, prenorm = FALSE, autosel = FALSE,
  corMethod = c("pearson", "kendall", "spearman"), resampling = c("none",
  "permutation", "bootstrap"), pcaobj = NULL, uniqueonly = FALSE,
  scaleVar = TRUE, minVar = 1, wilcox = FALSE)

`data`	A p x n data matrix, with p measurements on n samples. Typically, this will be count data from an HTS assay.
`annotation`	An n x l annotation data frame, with l covariates annotated for each of the n samples. This can be a mix of categorical and continuous covariates; categorical covariates are subjected to clustering analysis and continuous covariates are used as the dependent variable in an L1-regularized principal components regression. For categorical covariates, there must be at least two levels, and each level must have at least two samples.
`contrastlist`	A list describing the contrasts to test. Each element is a list where the first element is the name of covariate, matching a column name in ‘annotation.’ For categorical variables, there should be two additional elements, each with the name of a category (i.e. a level of the factor), indicating that those two categories should be compared; for continuous covariates, the single element of the covariate's name suffices to indicate that testing should occur for high versus low.
`ntop`	A numeric giving the number of measurements to feed into PCA. For parameter n, the n measurements with the highest variance across all samples will be used.
`sigcor`	Whether to report those measurements significantly correlated with pairwise differences between categories, for each categorical covariates, and between high/low, for continuous covariates.
`alpha`	The p-value at which the observed silhouette is deemed sufficient for adding a PC to a subspace. See Details.
`effsize`	The change in silhouette above the expected value that is deemed sufficient for adding a PC to a subspace. See Details.
`discthresh`	The p-value threshold above which PCs are discarded from further consideration, during the first forward selection step. See Details.
`ncores`	The number of CPU cores to use. Note that the memory footprint for each core can be large for large data sets.
`verbose`	Whether to print out status messages during the run.
`prenorm`	Whether ‘data’ is already normalized and should not be normalized further.
`autosel`	Whether to automatically determine which peaks are high-variance, ignoring ‘ntop.’ See Details.
`corMethod`	Which correlation method to use for identifying significantly correlated measurements. See ‘method’ argument to `cor`.
`resampling`	Whether to use either permutation or bootstrap testing in lieu of the default asymptotic t approximation in the correlation-based statistical testing of the differential analysis.
`pcaobj`	If mavric has previously been run, the ‘pcaobj’ field of the returned object can be passed here to avoid recomputing it.
`uniqueonly`	When performing differential comparisons, whether to use all associated PCs, or only those unique to the covariate.
`scaleVar`	Whether to scale features to unit variance when performing PCA.
`minVar`	Minimum percent variance a PC must explain to be retained.
`wilcox`	Whether to use a Wilcoxon rank-sum test rather than MAVRIC's correlation testing.

estVC uses a forward selection technique to find a subspace of PCs within which there is separation between levels of categorical covariates.

Following PCA, the function iterates over all PCs and calculates both the silhouette statistic, and a null distribution for the silhouette by permuting the PC's values for each sample. Covariates' respective levels are used as cluster assignments when calculating the silhouette. The parameter alpha controls the p-value at which an association between the PC and the covariate is deemed significant. The parameter effsize further controls the attribution by setting a minimum difference between the observed silhouette and the expected value from the permutation distribution. If multiple PCs are more significant than alpha, the PC with the greatest effect size is chosen.

The process repeats until no PCs pass the p-value and effect size thresholds. In the first step only, PCs with p-value greater than discthresh are discarded from further steps. In subsequent steps only, only newly added PCs have their values permuted (i.e. not those already selected in previous steps), to test the null hypothesis that adding a given PC provides better separation than would adding a random dimension to the subspace.

Given the subspace, the algorithm then forms an axis of variance between pairs, given by the line formed by the pairs' respective centroids in the PC-subspace. Points are then projected onto that axis, and the original measurements are correlated with those projected values to find significant associations.

When ‘autosel’ is true, mavric uses a robust LOESS fit to identify measurements with more variance than expected (based on a 95% confidence interval), given the measurements' respective means.

A list with the following fields:

dat: The ‘data’ object, normalized if applicable.

pcaobj: The output of running FactoMineR's PCA on the ‘ntop’ highest-variance measurements of ‘data’ (post-normalization).

pcs: A list with a matrix element for each covariate, for the associated PCs and the covariate's mean.

pcasp: A matrix giving the coordinates of each sample in PC-space.

cor: A list with a list element for each covariate, with each sublist element a matrix giving the correlation between each measurement and the pairwise axis of variance for a given pair of levels in the covariate.

plotPCs and plotVars for plotting functionality; silhouette in the cluster package for the silhouette statistic; and rlog and varianceStabilizingTranformation in the DESeq2 package, and justvsn in the vsn package, for variance-stabilizing transformations.

## Using human and mouse RNA-seq data from ENCODE, evaluating the
##  amount of variance in gene expression attributable to
##  species-specific versus tissue-specific effects, and
##  quantifying the confounding between species-specific effects
##  and batch effects.
##
## Not run: 
evc <- estVC(data = rawCounts, annotation = datasets[, -1],
 ntop = 10000, sigcor = F, effsize = .01, discthresh = Inf,
 ncores = 4) # should complete in ~5 minutes
## Plot variance and confounding amongst experimental covariates
plotVars(evc, datasets[, -1])
## Plot subset of PCs best separating samples by species
plotPCs(evc, datasets[, -1], "species")

## End(Not run)