This is a main user interface to the EMDomics package, and
will usually be the only function needed when conducting an analysis using the CVM
algorithm. Analyses can also be conducted with the Kolmogorov-Smirnov test using
calculate_ks or the Earth Mover's Distance algorithm using calculate_emd.
The algorithm is used to compare genomics data between any number of groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation).
Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the two distributions. This approach tends to give non-significant results if the two distributions are highly heterogeneous, which can be the case in many biological circumstances (e.g. sensitive vs. resistant tumor samples).
The Cramer von Mises (CVM) algorithm generates a test statistic that is the sum of the squared values of the differences between two cumulative distribution functions (CDFs). As a result, the test statistic tends to overestimate the similarity between two distributions and cannot effectively handle partial matching like EMD does. However, it is one of the most commonly referenced nonparametric two-class distribution comparison tests in non-genomic contexts.
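The statistic described above can be illustrated with a minimal base-R sketch. The function name `cvm_sketch` is hypothetical, and evaluating the empirical CDFs at the pooled sample values is an assumption made for illustration. The package's own computation (see CramerVonMisesTwoSamples under See Also) may scale or evaluate the statistic differently.

```r
# Sketch only, not the EMDomics implementation: sum of squared differences
# between two empirical CDFs, evaluated at every value in the pooled sample.
cvm_sketch <- function(x, y) {
  pooled <- sort(c(x, y))
  Fx <- ecdf(x)(pooled)  # empirical CDF of x at the pooled points
  Fy <- ecdf(y)(pooled)  # empirical CDF of y at the pooled points
  sum((Fx - Fy)^2)
}

set.seed(1)
a <- rnorm(50)
b <- rnorm(50, mean = 2)

cvm_sketch(a, a)  # identical samples: the CDFs coincide, so the sum is 0
cvm_sketch(a, b)  # shifted sample: the CDFs differ, so the sum is positive
```

Because the score depends only on the vertical gap between the two CDFs, it does not account for how far apart the underlying values are, which is one way to read the limitation noted above.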
The CVM-based algorithm implemented in EMDomics has two main steps. First, a matrix (e.g. of expression data) is divided into data for each of the groups. Every possible pairwise CVM score is then computed and stored in a table. The CVM score for a single gene is calculated by averaging all of the pairwise CVM scores. Next, the labels for each of the groups are randomly permuted a specified number of times, and a CVM score for each permutation is calculated. The median of the permuted scores for each gene is used as the null distribution, and the False Discovery Rate (FDR) is computed for a range of permissive to restrictive significance thresholds. The threshold that minimizes the FDR is defined as the q-value, and is used to interpret the significance of the CVM score analogously to a p-value (e.g. q-value < 0.05 is significant).
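The first part of this procedure (pairwise averaging followed by label permutation) can be sketched for a single gene in base R. The helper names `cvm_pair` and `gene_score` are hypothetical, the pairwise score is a simplified stand-in for the package's own CVM computation, and the FDR and q-value step is not shown.

```r
# Sketch only, under the assumptions stated above; not the EMDomics code.

# Simplified pairwise score: sum of squared differences between two
# empirical CDFs over the pooled sample.
cvm_pair <- function(x, y) {
  pooled <- c(x, y)
  sum((ecdf(x)(pooled) - ecdf(y)(pooled))^2)
}

# Score for one gene: average the pairwise scores over every pair of groups.
gene_score <- function(values, labels) {
  pairs <- combn(unique(labels), 2)  # one column per pair of groups
  scores <- apply(pairs, 2, function(p) {
    cvm_pair(values[labels == p[1]], values[labels == p[2]])
  })
  mean(scores)
}

set.seed(42)
labels <- c(rep("A", 50), rep("B", 30), rep("C", 20))
values <- c(rnorm(50), rnorm(30, mean = 1), rnorm(20, mean = 2))

# Observed score with the true group labels.
observed <- gene_score(values, labels)

# Shuffle the labels nperm times and rescore each time.
nperm <- 10
permuted <- replicate(nperm, gene_score(values, sample(labels)))

# Median of the permuted scores, used as the null reference for this gene.
null_median <- median(permuted)
```

Comparing `observed` against `null_median` gives a rough sense of how far the true grouping departs from random groupings. The package performs this for every gene and then derives q-values from the FDR calculation described above.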
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector containing group labels for each of the samples provided
in the |
nperm |
An integer specifying the number of randomly permuted CVM scores to be computed. Defaults to 100. |
pairwise.p |
Boolean specifying whether the permutation-based q-values should
be computed for each pairwise comparison. Defaults to |
seq |
Boolean specifying if the given data is RNA Sequencing data and ought to be
normalized. Set to |
quantile.norm |
Boolean specifying whether data should be normalized by quantiles. If
|
verbose |
Boolean specifying whether to display progress messages. |
parallel |
Boolean specifying whether to use parallel processing via
the BiocParallel package. Defaults to |
The function returns a CVMomics object.
CVMomics
CramerVonMisesTwoSamples
# 100 genes, 100 samples
dat <- matrix(rnorm(10000), nrow=100, ncol=100)
rownames(dat) <- paste("gene", 1:100, sep="")
colnames(dat) <- paste("sample", 1:100, sep="")
# "A": first 50 samples; "B": next 30 samples; "C": final 20 samples
outcomes <- c(rep("A",50), rep("B",30), rep("C",20))
names(outcomes) <- colnames(dat)
results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE)
head(results$cvm)