calculate_emd: Earth Mover's Distance for differential analysis of genomics...
In EMDomics: Earth Mover's Distance for Differential Analysis of Genomics Data

Description Usage Arguments Value See Also Examples

This is the main user interface to the EMDomics package, and will usually the only function needed when conducting an analysis using the EMD algorithm. Analyses can also be conducted with the Komolgorov-Smirnov Test using calculate_ks or the Cramer Von Mises algorithm using calculate_cvm.

The algorithm is used to compare genomics data between any number of groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation).

Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the two distributions. This approach tends to give non-significant results if the two distributions are highly heterogeneous, which can be the case in many biological circumstances (e.g sensitive vs. resistant tumor samples).

The Earth Mover's Distance algorithm instead computes the "work" needed to transform one distribution into another, thus capturing possibly valuable information relating to the overall difference in shape between two heterogeneous distributions.

The EMD-based algorithm implemented in EMDomics has two main steps. First, a matrix (e.g. of expression data) is divided into data for each of the groups. Every possible pairwise EMD score is then computed and stored in a table. The EMD score for a single gene is calculated by averaging all of the pairwise EMD scores. Next, the labels for each of the groups are randomly permuted a specified number of times, and an EMD score for each permutation is calculated. The median of the permuted scores for each gene is used as the null distribution, and the False Discovery Rate (FDR) is computed for a range of permissive to restrictive significance thresholds. The threshold that minimizes the FDR is defined as the q-value, and is used to interpret the significance of the EMD score analogously to a p-value (e.g. q-value < 0.05 is significant.)

Because EMD is based on a histogram binning of the expression levels, data that cannot be binned will be discarded, and a message for that gene will be printed. The most likely reason for histogram binning failing is due to uniform values (e.g. all 0s).

1
2
3

calculate_emd(data, outcomes, binSize = 0.2, nperm = 100,
  pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE,
  verbose = TRUE, parallel = TRUE)

`data`	A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers.
`outcomes`	A vector containing group labels for each of the samples provided in the `data` matrix. The names should be the sample identifiers provided in `data`.
`binSize`	The bin size to be used when generating histograms of the data for each group. Defaults to 0.2.
`nperm`	An integer specifying the number of randomly permuted EMD scores to be computed. Defaults to 100.
`pairwise.p`	Boolean specifying whether the permutation-based q-values should be computed for each pairwise comparison. Defaults to `FALSE`.
`seq`	Boolean specifying if the given data is RNA Sequencing data and ought to be normalized. Set to `TRUE`, if passing transcripts per million (TPM) data or raw data that is not scaled. If `TRUE`, data will be normalized by first multiplying by 1E6, then adding 1, then taking the log base 2. If `FALSE`, the data will be handled as is (unless `quantile.norm` is `TRUE`). Note that as a distribution comparison function, EMD will compute faster with scaled data. Defaults to `FALSE`.
`quantile.norm`	Boolean specifying is data should be normalized by quantiles. If `TRUE`, then the `normalize.quantiles` function is used. Defaults to `FALSE`.
`verbose`	Boolean specifying whether to display progress messages.
`parallel`	Boolean specifying whether to use parallel processing via the BiocParallel package. Defaults to `TRUE`.

The function returns an EMDomics object.

EMDomics emd2d

# 100 genes, 100 samples
dat <- matrix(rnorm(10000), nrow=100, ncol=100)
rownames(dat) <- paste("gene", 1:100, sep="")
colnames(dat) <- paste("sample", 1:100, sep="")

# "A": first 50 samples; "B": next 30 samples; "C": final 20 samples
outcomes <- c(rep("A",50), rep("B",30), rep("C",20))
names(outcomes) <- colnames(dat)

results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE)
head(results$emd)

Calculating pairwise emd scores...done.
Calculating emd...done.
Calculating permuted emd #1 of 10...done.
Calculating permuted emd #2 of 10...done.
Calculating permuted emd #3 of 10...done.
Calculating permuted emd #4 of 10...done.
Calculating permuted emd #5 of 10...done.
Calculating permuted emd #6 of 10...done.
Calculating permuted emd #7 of 10...done.
Calculating permuted emd #8 of 10...done.
Calculating permuted emd #9 of 10...done.
Calculating permuted emd #10 of 10...done.
Calculating q-values...done.
           emd   q-value
gene1 2.590000 0.0000000
gene2 2.650000 0.0000000
gene3 1.516667 1.0000000
gene4 2.850000 0.0000000
gene5 2.233333 0.2941176
gene6 1.950000 0.7708333