Gene modules are sets of genes that are (or appear to be) co-regulated. The DEXICA package provides functions to infer gene modules from compendia of gene expression data (typically microarray data) using independent component analysis (ICA). The term DEXICA is an acronym for Deep EXtraction ICA, so named because it is designed to extract a large number of potentially weak signals from the input data.
Because data artifacts and/or noise can exert an out-sized influence on weak signal recovery, the DEXICA process seeks to optimize data preprocessing procedures and parameters of the ICA algorithm using biological criteria. These criteria take a variety of forms and are generally organism-dependent, but some examples include gene function annotations, such as GO terms or REACTOME pathways, gene expression information (other than that in the compendium to be analyzed), such as tissue expression patterns, and regulatory sequence annotations. The package DEXDATA.Celegans provides several such files for the model organism C. elegans; the DEXICA package itself contains several data files derived from the C. elegans package for use in examples.
Note: The examples in this vignette use truncated versions of several DEXDATA.Celegans data files (for example, a microarray compendium comprising 50 arrays rather than 1386) so that the examples will execute quickly on most machines.
The DEXICA package (and supporting packages, such as DEXDATA.Celegans) can be installed from the MPCary code repository on GitHub.
install.packages("devtools") # Skip this line if you aleady have devtools installed library(devtools) install_github("MPCary/DEXICA", build_vignettes = TRUE) install_github("MPCary/DEXDATA.Celegans") # C. elegans data package # Note: It is advisable to restart your R session after running install_github # due to a known issue in which documentation files are incorrectly reported # to be corrupt.
The two main purposes of the DEXICA package are to: a) predict gene modules, and b) to evaluate the quality of those predictions using biological criteria.
Module prediction is carried out by the predictModules
function. It takes as input a (suitably preprocessed) compendium of expression data, and returns a list containing two matrices named S
and A
. If the input matrix comprises probesets in rows and microarrays in columns, the S
matrix (referred to as the "source matrix" in ICA parlance) provides a weight for each probeset in each gene module (or independent component), and the A
matrix (or "mixing matrix") provides a weight for each module in each microarray.
library(DEXICA) # Load and preprocess microarray compendium GPL200.50.pp = preprocessMatrix(GPL200.50.mini) # Extract 10 modules from compendium set.seed(1) # Set random seed for reproducibility result = predictModules(GPL200.50.pp, n.comp = 10) dim(result$S) # Dimensions of S matrix result$S[1:3,1:3] # First few rows and columns of S matrix
The S matrix can be thought of assigning a weight for each probeset in each module. While this matrix is useful for many purposes, it is sometimes desirable to derive discrete sets of genes from it for each module. This process is referred to as partitioning, and several approaches to partitioning have been described in the literature.
Perhaps the most common approach is to apply a fixed threshold procedure. In ICA, the S matrix is standardized (such that each column has zero mean and unit variance), and the fixed threshold partitioning method assigns probesets with extreme values (e.g. > 3 or < -3) as being "in" the module. N.b., because modules comprise both positive and negative values, partitioning gives rise to two disjoint sets of genes per S matrix column. We refer to the two opposite (in sign) halves of a module as "hemi-modules".
An alternative approach to partitioning, called variable threshold partitioning, is to apply a function that attempts to provide an optimal partitioning for each S matrix column independently based on some property (or properties) of the numbers comprising the column. The function partion
allows for both fixed and variable threshold partitioning; the variable partitioning method uses an artificial neural network (ANN) that has been trained on simulated data to predict partition thresholds based on the 3rd and 4th moments of each column's distribution.
# Partition S matrix S.fp = partition(result$S, method = "fixed", t = 3) # Fixed partitioning S.vp = partition(result$S, method = "ann") # Variable partitioning # Separate positive sets from negative sets S.fp.split = splitMatrix(S.fp) S.vp.split = splitMatrix(S.vp) # Count the number of probesets in each set fp.counts = apply(S.fp.split, 2, sum) vp.counts = apply(S.vp.split, 2, sum) # Plot counts count.matrix = rbind(fp.counts, vp.counts) count.matrix = count.matrix[, order(apply(count.matrix, 2, sum), decreasing = TRUE)] barplot(count.matrix, beside = TRUE, xlab = "Hemi-modules", ylab = "Probesets", names.arg = rep("", ncol(count.matrix)), col = 1:2) legend("topright", legend = c("Fixed", "ANN"), fill = 1:2, cex = 0.75)
To check whether a set of predicted gene modules make sense, and to compare the relative quality of different predictions, one can compare module predictions with gene annotations that are believed or suspected to align (even loosely) with gene modularity. For example, there is a tendency of the genes involved in some (but certainly not all) biological processes to be co-regulated; a comprehensive set of biological process annotations may therefore be used to gauge module quality. It is not necessary to know a priori which processes are actually co-regulated - the non-co-regulated processes will simply be uninformative for our purpose and should neither falsely inflate nor deflate our confidence in a set of predicted modules provided we correct for multiple hypothesis testing.
The function checkModules
compares a (partitioned and split, as above) set of gene modules with a set of gene (or probeset) annotations. It calculates the enrichment of each annotation in each module using hypergeometric statistics, then adjusts all p-values using the FDR correction. It returns a list containing, among other items, the number of annotations found to be significant in at least one module (anns.signif
).
# Evaluate module predictions using GO annotations fp.check.GO = checkModules(S.fp.split, GPL200.GO.mini) vp.check.GO = checkModules(S.vp.split, GPL200.GO.mini) data.frame(rbind(fp.check.GO, vp.check.GO))
The checkModules
function also returns the number of modules with at least one significant annotation (mods.signif
), but this value should generally not be used to gauge the quality of a set of module predictions. In general, the true number of latent gene modules in a data set is unknown, and predicting additional significant modules may merely indicate that a once highly accurate module prediction becomes split into somewhat less accurate subsets when additional modules are extracted. Because the accuracy of ICA at predicting latent signal sources is maximal when the number of extracted components matches the number of latent signals, the total number of significant annotations across all modules should generally reach a maximum value when the true number of sources (n.comp
) is extracted.
The main idea behind DEXICA is to optimize gene module prediction through thorough sampling of the input parameter space; a DEXICA analysis may require hundreds or even thousands of separate ICA runs. The DexBatch class was created to make this process easier.
A DexBatch object can be thought of as a foreman that directs the execution of (potentially) a large number of jobs. Each job consists of three main tasks:
When it is created, the DexBatch object must be supplied with a named list of one or more gene expression matrices, a named list of one or more gene annotation matrices (for use in module evaluation), and a set of parameters to test. The first two of these are relatively straight-forward - the named lists should be passed to the DexBatch constructor method (dexBatch
) using the compen
and annmats
arguments. Specifying a parameter set is slightly more involved - an object of class ParameterSet must be created and passed to dexBatch
in the params
argument.
# Create a simple DexBatch object db = dexBatch(compen = list(compen1 = GPL200.50.mini), annmats = list(annmat1 = GPL200.GO.mini), params = parameterSet(n.comp = c(10,20))) countJobs(db)
The code above creates a DexBatch that has two jobs. In both jobs, modules will be extracted from the GPL200.50.mini
compendium and evaluated with the GPL200.GO.mini
annotations. In job 1, 10 modules will be extracted, while in job 2, 20 modules will be extracted. To execute the jobs, use the runJob
method.
# Run DexBatch jobs for(i in 1:countJobs(db)) { runJob(db, i) }
A ParameterSet object defines the range of parameters to test in a DexBatch. The ParameterSet constructor method is parameterSet
; during object creation any missing arguments are given their default values. The default argument values match those of the fastICA::fastICA
function with n.comp = 1
.
While all of the parameters in a ParameterSet may affect the quality of the resulting module predictions (save for verbose
, which controls only how many status messages are displayed during execution), of particular note is the n.comp
parameter, which controls how many independent components are extracted from the data matrix. Since the number of true latent signals (gene modules) in the data are not usually known, testing a range of values for n.comp
is often advisable.
The following code creates a ParameterSet that tests a range of n.comp
values. The getParamSetCount
method returns the size of the ParameterSet.
# Test a range of extracted components n.comp.range = seq(from = 5, to = 100, by = 5) p = parameterSet(n.comp = n.comp.range) getParamSetCount(p)
If we also wished to test the effect of centering and/or scaling the columns of the data matrix, we would add those to the parameterSet
call. N.b., adding two Boolean options to the ParameterSet increases its size by a factor of four.
# Test a range of extracted components and two preprocessing options n.comp.range = seq(from = 5, to = 100, by = 5) p = parameterSet(n.comp = n.comp.range, center.cols = c(T,F), scale.cols = c(T,F)) getParamSetCount(p)
Another special parameter that may be of particular interest to many users is w.init
. This parameter specifies a number (called a "seed value") that will be used to reset the random number generator prior to a fastICA run (n.b., this is slightly different from its definition in the fastICA
package.)
If w.init
is not specified when a ParameterSet is created, it is given a random integer value (this value is recorded in the results generated by a DexBatch). If the user supplies a single value to w.init
, that value will be used as the random seed for all runs (this is useful for replication of results.) If the user supplies multiple values to w.init
, however, all of the other parameter ranges will be tested with each of the supplied values. This allows the user to test a particular combination of parameters from multiple different starting points; since fastICA is a stochastic algorithm, this can give users a better idea of whether a particular combination of parameters is truly better than another, and not merely preforming better due to random chance.
# Test two preprocessing options 5 times each my.seeds = c(1:5) p = parameterSet(n.comp = 50, center.cols = c(T,F), scale.cols = c(T,F), w.init = my.seeds) getParamSetCount(p)
If a DexBatch contains a large number of jobs, serial execution of those jobs may not be practical. For this reason, the DexBatch class was designed to handle parallel job execution. An optional argument in dexBatch
is output
. Supplying a file name to this argument results in job output being written to that file (if output
is NULL
, results are written to stdout
.) By executing DexBatch jobs in parallel and writing all results to the same file, evaluating the effect of different parameter options on module quality becomes much easier.
The code below shows an example of parallel execution on a typical Linux cluster. The first chunk of code creates and saves a DexBatch object. The second, which is meant to be executed in a parallel environment, loads the object and then runs one of its jobs. Not shown is a shell script that would direct the distribution and exectution of the second code chunk to various nodes in the cluster (shell script command syntax varies by cluster management software, consult your cluster's user guide for more information).
# Create and save a DexBatch my.seeds = c(1:3) n.comp.range = seq(from = 5, to = 100, by = 5) p = parameterSet(n.comp = n.comp.range, center.cols = c(T,F), scale.cols = c(T,F), w.init = my.seeds) db = dexBatch(compen = list(compen1 = GPL200.50.mini), annmats = list(annmat1 = GPL200.GO.mini), params = p) countJobs(db) # 240 jobs save(db, file = "db.Rdata")
# Call this script in parallel using cluster management software load(file = "db.Rdata") # Load DexBatch object j = getBatchJobID() # Get ID of current job runJob(db, j)
The output file generated by running the jobs defined in a DexBatch can be used as input to subsequent analyses, such as ANOVA, to determine the optimal parameter set. A typical workflow might be to first use ANOVA to determine all optimal parameters other than n.comp, then to plot the results for that parameter set over the range of n.comp tested, using a loess regression line to determine the optimal n.comp.
Additional features are planned for subsequent versions of this package; these include:
KS-based evaluation of module quality - Instead of partitioning modules into discrete sets and using hypergeometric statistics to test for annotation enrichment, an option to use KS (Kolmogorov-Smirnov) tests to check module quality will be added to the checkModules
function.
Additional flavors of the ICA algorithm will be supported.
Subset testing - This feature will allow checkModules
to run on a subset of module genes. By optimizing module prediction with a subset of genes (training set), then evaluating modules with a left-out set (validation set), users can test whether predicted modules were over-fit to the training set.
Solution space mapping - This feature will help users generate a map of the ICA solution space and to map module quality indicators onto that map in order to better determine the final solution (i.e., a set of predicted modules) in a DEXICA analysis.
Insert paper citation here.
To report a bug in the DEXICA package, please contact the current package maintainer, who can be found using this function:
maintainer("DEXICA")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.