Overview

Gene modules are sets of genes that are (or appear to be) co-regulated. The DEXICA package provides functions to infer gene modules from compendia of gene expression data (typically microarray data) using independent component analysis (ICA). The term DEXICA is an acronym for Deep EXtraction ICA, so named because it is designed to extract a large number of potentially weak signals from the input data.

Because data artifacts and/or noise can exert an out-sized influence on weak signal recovery, the DEXICA process seeks to optimize data preprocessing procedures and parameters of the ICA algorithm using biological criteria. These criteria take a variety of forms and are generally organism-dependent, but some examples include gene function annotations, such as GO terms or REACTOME pathways, gene expression information (other than that in the compendium to be analyzed), such as tissue expression patterns, and regulatory sequence annotations. The package DEXDATA.Celegans provides several such files for the model organism C. elegans; the DEXICA package itself contains several data files derived from the C. elegans package for use in examples.

Note: The examples in this vignette use truncated versions of several DEXDATA.Celegans data files (for example, a microarray compendium comprising 50 arrays rather than 1386) so that the examples will execute quickly on most machines.

Getting started

Installation

The DEXICA package (and supporting packages, such as DEXDATA.Celegans) can be installed from the MPCary code repository on GitHub.

install.packages("devtools") # Skip this line if you aleady have devtools installed
library(devtools)
install_github("MPCary/DEXICA", build_vignettes = TRUE)
install_github("MPCary/DEXDATA.Celegans") # C. elegans data package
# Note: It is advisable to restart your R session after running install_github
# due to a known issue in which documentation files are incorrectly reported
# to be corrupt.

Gene module prediction

The two main purposes of the DEXICA package are to: a) predict gene modules, and b) to evaluate the quality of those predictions using biological criteria.

Module prediction is carried out by the predictModules function. It takes as input a (suitably preprocessed) compendium of expression data, and returns a list containing two matrices named S and A. If the input matrix comprises probesets in rows and microarrays in columns, the S matrix (referred to as the "source matrix" in ICA parlance) provides a weight for each probeset in each gene module (or independent component), and the A matrix (or "mixing matrix") provides a weight for each module in each microarray.

library(DEXICA)

# Load and preprocess microarray compendium
GPL200.50.pp = preprocessMatrix(GPL200.50.mini)

# Extract 10 modules from compendium
set.seed(1) # Set random seed for reproducibility
result = predictModules(GPL200.50.pp, n.comp = 10)
dim(result$S) # Dimensions of S matrix
result$S[1:3,1:3] # First few rows and columns of S matrix

S matrix partitioning

The S matrix can be thought of assigning a weight for each probeset in each module. While this matrix is useful for many purposes, it is sometimes desirable to derive discrete sets of genes from it for each module. This process is referred to as partitioning, and several approaches to partitioning have been described in the literature.

Perhaps the most common approach is to apply a fixed threshold procedure. In ICA, the S matrix is standardized (such that each column has zero mean and unit variance), and the fixed threshold partitioning method assigns probesets with extreme values (e.g. > 3 or < -3) as being "in" the module. N.b., because modules comprise both positive and negative values, partitioning gives rise to two disjoint sets of genes per S matrix column. We refer to the two opposite (in sign) halves of a module as "hemi-modules".

An alternative approach to partitioning, called variable threshold partitioning, is to apply a function that attempts to provide an optimal partitioning for each S matrix column independently based on some property (or properties) of the numbers comprising the column. The function partion allows for both fixed and variable threshold partitioning; the variable partitioning method uses an artificial neural network (ANN) that has been trained on simulated data to predict partition thresholds based on the 3rd and 4th moments of each column's distribution.

# Partition S matrix
S.fp = partition(result$S, method = "fixed", t = 3) # Fixed partitioning
S.vp = partition(result$S, method = "ann") # Variable partitioning

# Separate positive sets from negative sets
S.fp.split = splitMatrix(S.fp)
S.vp.split = splitMatrix(S.vp)

# Count the number of probesets in each set
fp.counts = apply(S.fp.split, 2, sum)
vp.counts = apply(S.vp.split, 2, sum)

# Plot counts
count.matrix = rbind(fp.counts, vp.counts)
count.matrix = count.matrix[, order(apply(count.matrix, 2, sum), decreasing = TRUE)]
barplot(count.matrix, beside = TRUE, xlab = "Hemi-modules", ylab = "Probesets",
        names.arg = rep("", ncol(count.matrix)), col = 1:2)
legend("topright", legend = c("Fixed", "ANN"), fill = 1:2, cex = 0.75)

Gene module validation

To check whether a set of predicted gene modules make sense, and to compare the relative quality of different predictions, one can compare module predictions with gene annotations that are believed or suspected to align (even loosely) with gene modularity. For example, there is a tendency of the genes involved in some (but certainly not all) biological processes to be co-regulated; a comprehensive set of biological process annotations may therefore be used to gauge module quality. It is not necessary to know a priori which processes are actually co-regulated - the non-co-regulated processes will simply be uninformative for our purpose and should neither falsely inflate nor deflate our confidence in a set of predicted modules provided we correct for multiple hypothesis testing.

The function checkModules compares a (partitioned and split, as above) set of gene modules with a set of gene (or probeset) annotations. It calculates the enrichment of each annotation in each module using hypergeometric statistics, then adjusts all p-values using the FDR correction. It returns a list containing, among other items, the number of annotations found to be significant in at least one module (anns.signif).

# Evaluate module predictions using GO annotations
fp.check.GO = checkModules(S.fp.split, GPL200.GO.mini)
vp.check.GO = checkModules(S.vp.split, GPL200.GO.mini)

data.frame(rbind(fp.check.GO, vp.check.GO))

The checkModules function also returns the number of modules with at least one significant annotation (mods.signif), but this value should generally not be used to gauge the quality of a set of module predictions. In general, the true number of latent gene modules in a data set is unknown, and predicting additional significant modules may merely indicate that a once highly accurate module prediction becomes split into somewhat less accurate subsets when additional modules are extracted. Because the accuracy of ICA at predicting latent signal sources is maximal when the number of extracted components matches the number of latent signals, the total number of significant annotations across all modules should generally reach a maximum value when the true number of sources (n.comp) is extracted.

Batch mode

The main idea behind DEXICA is to optimize gene module prediction through thorough sampling of the input parameter space; a DEXICA analysis may require hundreds or even thousands of separate ICA runs. The DexBatch class was created to make this process easier.

The DexBatch class

A DexBatch object can be thought of as a foreman that directs the execution of (potentially) a large number of jobs. Each job consists of three main tasks:

When it is created, the DexBatch object must be supplied with a named list of one or more gene expression matrices, a named list of one or more gene annotation matrices (for use in module evaluation), and a set of parameters to test. The first two of these are relatively straight-forward - the named lists should be passed to the DexBatch constructor method (dexBatch) using the compen and annmats arguments. Specifying a parameter set is slightly more involved - an object of class ParameterSet must be created and passed to dexBatch in the params argument.

# Create a simple DexBatch object
db = dexBatch(compen = list(compen1 = GPL200.50.mini),
              annmats = list(annmat1 = GPL200.GO.mini),
              params = parameterSet(n.comp = c(10,20)))
countJobs(db)

The code above creates a DexBatch that has two jobs. In both jobs, modules will be extracted from the GPL200.50.mini compendium and evaluated with the GPL200.GO.mini annotations. In job 1, 10 modules will be extracted, while in job 2, 20 modules will be extracted. To execute the jobs, use the runJob method.

# Run DexBatch jobs
for(i in 1:countJobs(db)) {
  runJob(db, i)
}

The ParameterSet class

A ParameterSet object defines the range of parameters to test in a DexBatch. The ParameterSet constructor method is parameterSet; during object creation any missing arguments are given their default values. The default argument values match those of the fastICA::fastICA function with n.comp = 1.

While all of the parameters in a ParameterSet may affect the quality of the resulting module predictions (save for verbose, which controls only how many status messages are displayed during execution), of particular note is the n.comp parameter, which controls how many independent components are extracted from the data matrix. Since the number of true latent signals (gene modules) in the data are not usually known, testing a range of values for n.comp is often advisable.

The following code creates a ParameterSet that tests a range of n.comp values. The getParamSetCount method returns the size of the ParameterSet.

# Test a range of extracted components
n.comp.range = seq(from = 5, to = 100, by = 5)
p = parameterSet(n.comp = n.comp.range)
getParamSetCount(p)

If we also wished to test the effect of centering and/or scaling the columns of the data matrix, we would add those to the parameterSet call. N.b., adding two Boolean options to the ParameterSet increases its size by a factor of four.

# Test a range of extracted components and two preprocessing options
n.comp.range = seq(from = 5, to = 100, by = 5)
p = parameterSet(n.comp = n.comp.range, center.cols = c(T,F), scale.cols = c(T,F))
getParamSetCount(p)

Replicate runs

Another special parameter that may be of particular interest to many users is w.init. This parameter specifies a number (called a "seed value") that will be used to reset the random number generator prior to a fastICA run (n.b., this is slightly different from its definition in the fastICA package.)

If w.init is not specified when a ParameterSet is created, it is given a random integer value (this value is recorded in the results generated by a DexBatch). If the user supplies a single value to w.init, that value will be used as the random seed for all runs (this is useful for replication of results.) If the user supplies multiple values to w.init, however, all of the other parameter ranges will be tested with each of the supplied values. This allows the user to test a particular combination of parameters from multiple different starting points; since fastICA is a stochastic algorithm, this can give users a better idea of whether a particular combination of parameters is truly better than another, and not merely preforming better due to random chance.

# Test two preprocessing options 5 times each
my.seeds = c(1:5)
p = parameterSet(n.comp = 50, center.cols = c(T,F), scale.cols = c(T,F), w.init = my.seeds)
getParamSetCount(p)

Parallel execution

If a DexBatch contains a large number of jobs, serial execution of those jobs may not be practical. For this reason, the DexBatch class was designed to handle parallel job execution. An optional argument in dexBatch is output. Supplying a file name to this argument results in job output being written to that file (if output is NULL, results are written to stdout.) By executing DexBatch jobs in parallel and writing all results to the same file, evaluating the effect of different parameter options on module quality becomes much easier.

The code below shows an example of parallel execution on a typical Linux cluster. The first chunk of code creates and saves a DexBatch object. The second, which is meant to be executed in a parallel environment, loads the object and then runs one of its jobs. Not shown is a shell script that would direct the distribution and exectution of the second code chunk to various nodes in the cluster (shell script command syntax varies by cluster management software, consult your cluster's user guide for more information).

# Create and save a DexBatch
my.seeds = c(1:3)
n.comp.range = seq(from = 5, to = 100, by = 5)
p = parameterSet(n.comp = n.comp.range, center.cols = c(T,F), scale.cols = c(T,F), w.init = my.seeds)
db = dexBatch(compen = list(compen1 = GPL200.50.mini),
              annmats = list(annmat1 = GPL200.GO.mini),
              params = p)
countJobs(db) # 240 jobs

save(db, file = "db.Rdata")
# Call this script in parallel using cluster management software
load(file = "db.Rdata") # Load DexBatch object
j = getBatchJobID() # Get ID of current job
runJob(db, j)

Interpreting results

The output file generated by running the jobs defined in a DexBatch can be used as input to subsequent analyses, such as ANOVA, to determine the optimal parameter set. A typical workflow might be to first use ANOVA to determine all optimal parameters other than n.comp, then to plot the results for that parameter set over the range of n.comp tested, using a loess regression line to determine the optimal n.comp.

Outlook

Additional features are planned for subsequent versions of this package; these include:

Citing DEXICA

Insert paper citation here.

Reporting problems or bugs

To report a bug in the DEXICA package, please contact the current package maintainer, who can be found using this function:

maintainer("DEXICA")

Session information

sessionInfo()


MPCary/DEXICA documentation built on May 4, 2019, 2:35 p.m.