blockwiseConsensusModules  R Documentation 
Perform network construction and consensus module detection across several datasets.
blockwiseConsensusModules(
     multiExpr,

     # Data checking options
     checkMissingData = TRUE,

     # Blocking options
     blocks = NULL,
     maxBlockSize = 5000,
     blockSizePenaltyPower = 5,
     nPreclusteringCenters = NULL,
     randomSeed = 54321,

     # TOM precalculation arguments, if available
     individualTOMInfo = NULL,
     useIndivTOMSubset = NULL,

     # Network construction arguments: correlation options
     corType = "pearson",
     maxPOutliers = 1,
     quickCor = 0,
     pearsonFallback = "individual",
     cosineCorrelation = FALSE,

     # Adjacency function options
     power = 6,
     networkType = "unsigned",
     checkPower = TRUE,
     replaceMissingAdjacencies = FALSE,

     # Topological overlap options
     TOMType = "unsigned",
     TOMDenom = "min",
     suppressNegativeTOM = FALSE,

     # Save individual TOMs?
     saveIndividualTOMs = TRUE,
     individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",

     # Consensus calculation options: network calibration
     networkCalibration = c("single quantile", "full quantile", "none"),

     # Simple quantile calibration options
     calibrationQuantile = 0.95,
     sampleForCalibration = TRUE,
     sampleForCalibrationFactor = 1000,
     getNetworkCalibrationSamples = FALSE,

     # Consensus definition
     consensusQuantile = 0,
     useMean = FALSE,
     setWeights = NULL,

     # Saving the consensus TOM
     saveConsensusTOMs = FALSE,
     consensusTOMFilePattern = "consensusTOM-block.%b.RData",

     # Internal handling of TOMs
     useDiskCache = TRUE,
     chunkSize = NULL,
     cacheBase = ".blockConsModsCache",
     cacheDir = ".",

     # Alternative consensus TOM input from a previous calculation
     consensusTOMInfo = NULL,

     # Basic tree cut options
     deepSplit = 2,
     detectCutHeight = 0.995,
     minModuleSize = 20,
     checkMinModuleSize = TRUE,

     # Advanced tree cut options
     maxCoreScatter = NULL,
     minGap = NULL,
     maxAbsCoreScatter = NULL,
     minAbsGap = NULL,
     minSplitHeight = NULL,
     minAbsSplitHeight = NULL,
     useBranchEigennodeDissim = FALSE,
     minBranchEigennodeDissim = mergeCutHeight,
     stabilityLabels = NULL,
     minStabilityDissim = NULL,
     pamStage = TRUE,
     pamRespectsDendro = TRUE,

     # Gene reassignment and trimming from a module, and module "significance" criteria
     reassignThresholdPS = 1e-4,
     trimmingConsensusQuantile = consensusQuantile,
     minCoreKME = 0.5,
     minCoreKMESize = minModuleSize/3,
     minKMEtoStay = 0.2,

     # Module eigengene calculation options
     impute = TRUE,
     trapErrors = FALSE,

     # Module merging options
     equalizeQuantilesForModuleMerging = FALSE,
     quantileSummaryForModuleMerging = "mean",
     mergeCutHeight = 0.15,
     mergeConsensusQuantile = consensusQuantile,

     # Output options
     numericLabels = FALSE,

     # General options
     nThreads = 0,
     verbose = 2,
     indent = 0,
     ...)
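As a minimal sketch of the expected input, the multiset format is a list with one component per data set, each of which is itself a list holding the expression data in a component named data (samples in rows, genes in columns). The simulated data, set names, sizes, and the helper makeSet below are illustrative assumptions. The call itself is shown commented out because it requires the WGCNA package.

```r
# Simulated expression data: 2 sets with 30 and 40 samples and the same 100 genes.
# All sizes and names here are illustrative assumptions.
set.seed(1)
nGenes <- 100
geneNames <- paste0("gene", seq_len(nGenes))

# Hypothetical helper that simulates one data set (samples x genes).
makeSet <- function(nSamples) {
  x <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
  colnames(x) <- geneNames
  as.data.frame(x)
}

multiExpr <- list(
  set1 = list(data = makeSet(30)),
  set2 = list(data = makeSet(40))
)

# Every set must contain the same genes (columns) in the same order.
sameGenes <- all(sapply(multiExpr,
                        function(s) identical(colnames(s$data), geneNames)))

# With WGCNA installed, a call could look like this (not run here):
# library(WGCNA)
# net <- blockwiseConsensusModules(multiExpr, power = 6, maxBlockSize = 5000,
#                                  saveIndividualTOMs = FALSE, verbose = 0)
```

The sets may have different numbers of samples, but the genes must match across sets.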
multiExpr 
expression data in the multiset format (see 
checkMissingData 
logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details. 
blocks 
optional specification of blocks in which hierarchical clustering and module detection
should be performed. If given, must be a numeric vector with one entry per gene
of 
maxBlockSize 
integer giving maximum block size for module detection. Ignored if 
blockSizePenaltyPower 
number specifying how strongly blocks should be penalized for exceeding the
maximum size. Set to a large number or 
nPreclusteringCenters 
number of centers to be used in the preclustering. Defaults to the smaller of

randomSeed 
integer to be used as seed for the random number generator before the function
starts. If a current seed exists, it is saved and restored upon exit. If 
individualTOMInfo 
Optional data for TOM matrices in individual data sets. This object is returned by
the function 
useIndivTOMSubset 
If 
corType 
character string specifying the correlation to be used. Allowed values are (unique
abbreviations of) 
maxPOutliers 
only used for 
quickCor 
real number between 0 and 1 that controls the handling of missing data in the calculation of correlations. See details. 
pearsonFallback 
Specifies whether the bicor calculation, if used, should revert to Pearson when
the median absolute deviation (mad) is zero. Recognized values are (abbreviations of)

cosineCorrelation 
logical: should the cosine version of the correlation calculation be used? The cosine calculation differs from the standard one in that it does not subtract the mean. 
power 
soft-thresholding power for network construction. Either a single number or a vector of the same length as the number of sets, with one power for each set. 
networkType 
network type. Allowed values are (unique abbreviations of) 
checkPower 
logical: should a basic sanity check be performed on the supplied 
replaceMissingAdjacencies 
logical: should missing values in the calculation of adjacency be replaced by 0? 
TOMType 
one of 
TOMDenom 
a character string specifying the TOM variant to be used. Recognized values are

suppressNegativeTOM 
Logical: should the result be set to zero when negative? Negative TOM values can occur when

saveIndividualTOMs 
logical: should individual TOMs be saved to disk for later use? 
individualTOMFileNames 
character string giving the file names to save individual TOMs into. The
following tags should be used to make the file names unique for each set and block: 
networkCalibration 
network calibration method. One of "single quantile", "full quantile", "none" (or a unique abbreviation of one of them). 
calibrationQuantile 
if 
sampleForCalibration 
if 
sampleForCalibrationFactor 
determines the number of samples for calibration: the number is

getNetworkCalibrationSamples 
logical: should samples used for TOM calibration be saved for future analysis?
This option is only available when 
consensusQuantile 
quantile at which consensus is to be defined. See details. 
useMean 
logical: should the consensus be determined from a (possibly weighted) mean across the data sets rather than a quantile? 
setWeights 
Optional vector (one component per input set) of weights to be used for weighted mean
consensus. Only used when 
saveConsensusTOMs 
logical: should the consensus topological overlap matrices for each block be saved and returned? 
consensusTOMFilePattern 
character string giving the names of the files that will contain the
consensus topological overlaps. The tag 
useDiskCache 
should calculated network similarities in individual sets be temporarily saved to disk? Saving to disk is somewhat slower than keeping all data in memory, but for large blocks and/or many sets the memory footprint may be too big. 
chunkSize 
network similarities are saved in smaller chunks of size 
cacheBase 
character string containing the desired name for the cache files. The actual file
names will consist of 
cacheDir 
character string containing the desired path for the cache files. 
consensusTOMInfo 
optional list summarizing consensus TOM, output of 
deepSplit 
integer value between 0 and 4. Provides a simplified control over how sensitive
module detection should be to module splitting, with 0 least and 4 most sensitive. See

detectCutHeight 
dendrogram cut height for module detection. See

minModuleSize 
minimum module size for module detection. See

checkMinModuleSize 
logical: should sanity checks be performed on 
maxCoreScatter 
maximum scatter of the core for a branch to be a cluster, given as the fraction
of 
minGap 
minimum cluster gap given as the fraction of the difference between 
maxAbsCoreScatter 
maximum scatter of the core for a branch to be a cluster given as absolute
heights. If given, overrides 
minAbsGap 
minimum cluster gap given as absolute height difference. If given, overrides

minSplitHeight 
Minimum split height given as the fraction of the difference between

minAbsSplitHeight 
Minimum split height given as an absolute height.
Branches merging below this height will automatically be merged. If not given (default), will be determined
from 
useBranchEigennodeDissim 
Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut? 
minBranchEigennodeDissim 
Minimum consensus branch eigennode (eigengene) dissimilarity for
branches to be considered separate. The branch eigennode dissimilarity in individual sets
is simply 1 - correlation of the
eigennodes; the consensus is defined as the quantile with probability 
stabilityLabels 
Optional matrix of cluster labels that are to be used for calculating branch
dissimilarity based on split stability. The number of rows must equal the number of genes in

minStabilityDissim 
Minimum stability dissimilarity criterion for two branches to be considered
separate. Should be a number between 0 (essentially no dissimilarity required) and 1 (perfect dissimilarity
or distinguishability based on 
pamStage 
logical. If TRUE, the second (PAM-like) stage of module detection will be performed.
See 
pamRespectsDendro 
Logical, only used when 
reassignThresholdPS 
per-set p-value ratio threshold for reassigning genes between modules. See Details. 
trimmingConsensusQuantile 
a number between 0 and 1 specifying the consensus quantile used for kME calculation that determines module trimming according to the arguments below. 
minCoreKME 
a number between 0 and 1. If a detected module does not have at least

minCoreKMESize 
see 
minKMEtoStay 
genes whose eigengene connectivity to their module eigengene is lower than

impute 
logical: should imputation be used for module eigengene calculation? See

trapErrors 
logical: should errors in calculations be trapped? 
equalizeQuantilesForModuleMerging 
Logical: equalize quantiles of the module eigengene networks
before module merging? If 
quantileSummaryForModuleMerging 
One of 
mergeCutHeight 
dendrogram cut height for module merging. 
mergeConsensusQuantile 
consensus quantile for module merging. See 
numericLabels 
logical: should the returned modules be labeled by colors ( 
nThreads 
nonnegative integer specifying the number of parallel threads to be used by certain parts of correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OSX, but excludes Windows). If zero, the number of online processors will be used if it can be determined dynamically, otherwise correlation calculations will use 2 threads. 
verbose 
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose. 
indent 
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces. 
... 
Other arguments. At present these can include 
The function starts by optionally filtering out samples that have too many missing entries and genes that have either too many missing entries or zero variance in at least one set. Genes that are filtered out are left unassigned by the module detection. Returned eigengenes will contain NA in entries corresponding to filtered-out samples.
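The gene filtering idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the threshold maxMissingFraction and the helper goodInSet are hypothetical, and the package's own quality control (see goodSamplesGenesMS) uses its own criteria and also filters samples.

```r
# Two toy sets (samples in rows, genes in columns); values are made up.
set1 <- cbind(g1 = c(1, 2, 3, 4), g2 = c(5, 5, 5, 5), g3 = c(1, NA, 3, 2))
set2 <- cbind(g1 = c(2, 1, 4, 3), g2 = c(1, 2, 3, 4), g3 = c(NA, NA, NA, 1))
sets <- list(set1, set2)

maxMissingFraction <- 0.5  # hypothetical threshold, not the package default

# A gene is good in one set if it has few missing values and nonzero variance.
goodInSet <- function(x) {
  fewMissing <- colMeans(is.na(x)) <= maxMissingFraction
  variance <- apply(x, 2, var, na.rm = TRUE)
  fewMissing & !is.na(variance) & variance > 0
}

# Keep only genes that pass in every set.
goodGenes <- Reduce(`&`, lapply(sets, goodInSet))
```

Here g2 fails because it is constant in the first set, and g3 fails because it is mostly missing in the second set, so only g1 would enter module detection.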
If blocks is not given and the number of genes exceeds maxBlockSize, genes are pre-clustered into blocks using the function consensusProjectiveKMeans; otherwise all genes are treated in a single block.
For each block of genes, the network is constructed and (if requested) topological overlap is calculated in each set. To minimize memory usage, calculated topological overlaps are optionally saved to disk in chunks until they are needed again for the calculation of the consensus network topological overlap.
Before calculation of the consensus Topological Overlap, individual TOMs are optionally calibrated. Calibration methods include single quantile scaling and full quantile normalization.
Single quantile scaling raises the individual TOMs in sets 2, 3, ... to a power such that the quantiles given by calibrationQuantile agree with the quantile in set 1. Since the high TOM values are usually the most important for module identification, the value of calibrationQuantile is typically chosen close to (but not equal to) 1. To speed up quantile calculation, the quantiles can be determined on a randomly chosen subset of the components of the TOM matrices.
Full quantile normalization, implemented in normalize.quantiles, adjusts the TOM matrices such that all quantiles equal each other (and equal the quantiles of the component-wise average of the individual TOM matrices).
Note that network calibration is performed separately in each block, i.e., the normalizing transformation may differ between blocks. This is necessary to avoid manipulating a full TOM in memory.
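Single quantile scaling can be illustrated with a short base R sketch. The simulated values stand in for the entries of two TOMs, and the variable names are assumptions made for this example; the sketch shows the idea of matching one quantile through a power transformation, not the package's internal code.

```r
set.seed(2)
# Toy "TOM" values in (0, 1) for two sets; real TOMs are symmetric matrices.
tom1 <- runif(1000)^2
tom2 <- runif(1000)^3   # systematically lower values than in set 1

calibrationQuantile <- 0.95
q1 <- quantile(tom1, calibrationQuantile, names = FALSE)
q2 <- quantile(tom2, calibrationQuantile, names = FALSE)

# Choose the power p such that q2^p equals q1.
scalePower <- log(q1) / log(q2)
tom2Calibrated <- tom2^scalePower

# After calibration the chosen quantile of set 2 agrees with that of set 1
# (up to the small effect of quantile interpolation).
q2After <- quantile(tom2Calibrated, calibrationQuantile, names = FALSE)
```

Because the transformation is a monotone power, the ordering of the TOM values within the set is preserved; only their scale changes.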
The consensus TOM is calculated as the component-wise consensusQuantile quantile of the individual (set) TOMs; that is, for each gene pair (TOM entry), the consensusQuantile quantile across all input sets. Alternatively, one can also use the (weighted) component-wise mean across all input data sets.
If requested, the consensus topological overlaps are saved to disk for later use.
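The component-wise consensus can be sketched in base R as below. The three small matrices and the helper consensusTOM are invented for illustration; with consensusQuantile = 0 the quantile reduces to the component-wise minimum.

```r
# Three toy 3 x 3 TOMs (one per set); values are invented for illustration.
tom1 <- matrix(c(1, .8, .2, .8, 1, .5, .2, .5, 1), 3, 3)
tom2 <- matrix(c(1, .6, .3, .6, 1, .4, .3, .4, 1), 3, 3)
tom3 <- matrix(c(1, .7, .1, .7, 1, .9, .1, .9, 1), 3, 3)
toms <- array(c(tom1, tom2, tom3), dim = c(3, 3, 3))  # genes x genes x sets

# Hypothetical helper: quantile across sets for every gene pair.
consensusTOM <- function(toms, consensusQuantile = 0) {
  apply(toms, c(1, 2), quantile, probs = consensusQuantile, names = FALSE)
}

consMin <- consensusTOM(toms, 0)         # quantile 0 is the minimum
consMedian <- consensusTOM(toms, 0.5)    # median across sets
consMean <- apply(toms, c(1, 2), mean)   # unweighted mean alternative
```

A low quantile is conservative: a gene pair receives a high consensus overlap only if its overlap is high in (nearly) all sets.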
Genes are then clustered using average linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut. Found modules are trimmed of genes whose consensus module membership kME (that is, correlation with module eigengene) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have consensus KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.
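The clustering step can be sketched with base R on a toy consensus TOM. Note the substitution: the Dynamic Hybrid tree cut lives in a separate package, so this sketch uses a plain fixed-height cutree() as a stand-in, and the matrix values and cut height are made up.

```r
# Toy consensus TOM for 6 genes forming two obvious groups (made-up values).
tom <- matrix(0.05, 6, 6)
tom[1:3, 1:3] <- 0.8
tom[4:6, 4:6] <- 0.7
diag(tom) <- 1
dimnames(tom) <- list(paste0("g", 1:6), paste0("g", 1:6))

# Average linkage hierarchical clustering on the TOM-based dissimilarity.
tree <- hclust(as.dist(1 - tom), method = "average")

# Stand-in for the Dynamic Hybrid tree cut: a fixed-height cut with cutree().
labels <- cutree(tree, h = 0.5)
```

In this toy example the two groups merge internally at heights 0.2 and 0.3 and join each other only at 0.95, so any cut between 0.3 and 0.95 recovers them.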
After all blocks have been processed, the function checks whether there are genes whose KME in the module they are assigned to is lower than their KME to another module. If the p-values of the higher correlations are smaller than those of the native module by the factor reassignThresholdPS (in every set), the gene is reassigned to the closer module.
In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.
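One pass of the merging step can be sketched in base R for a single data set. The simulated eigengenes are assumptions made for illustration; in the consensus setting the eigengene dissimilarities would additionally be combined across sets, and the pass would be repeated until nothing merges.

```r
set.seed(3)
nSamples <- 50
base1 <- rnorm(nSamples)
base2 <- rnorm(nSamples)
# Hypothetical module eigengenes: ME1 and ME2 are nearly identical, ME3 differs.
MEs <- data.frame(
  ME1 = base1,
  ME2 = base1 + rnorm(nSamples, sd = 0.1),
  ME3 = base2
)

mergeCutHeight <- 0.15
MEDiss <- 1 - cor(MEs)                                   # eigengene dissimilarity
METree <- hclust(as.dist(MEDiss), method = "average")

# Modules that fall on the same branch below the cut height are merged.
mergedLabel <- cutree(METree, h = mergeCutHeight)
```

With mergeCutHeight = 0.15, modules are merged when their eigengenes correlate at roughly 0.85 or higher.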
The argument quickCor specifies the precision of handling of missing data in the correlation calculations. Zero will cause all calculations to be executed precisely, which may be significantly slower than calculations without missing data. Progressively higher values will speed up the calculations but introduce progressively larger errors. Without missing data, all column means and variances can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column means and variances to be calculated for each covariance. The approximate calculation uses the pre-calculated mean and variance and simply ignores missing data in the covariance calculation. If the number of missing data is high, the pre-calculated means and variances may be very different from the actual ones, thus potentially introducing large errors.
The quickCor value times the number of rows specifies the maximum difference in the number of missing entries for mean and variance calculations on the one hand and covariance on the other hand that will be tolerated before a recalculation is triggered. The hope is that if only a few missing data are treated approximately, the error introduced will be small but the potential speedup can be significant.
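The difference between the exact and the approximate treatment of missing data can be sketched for a single pair of columns. This is an illustration only: the helper standardize, the simulated data, and the way the tolerance check is written are one possible reading of the description above, not the package's compiled implementation.

```r
set.seed(4)
n <- 200
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)
x[1:3] <- NA      # a few missing entries in x
y[4:5] <- NA      # and in y

# Exact: means and variances recomputed on the pairwise-complete observations.
exact <- cor(x, y, use = "pairwise.complete.obs")

# Approximate: standardize each column once using its own non-missing entries,
# then treat missing entries as zero in the cross-product.
standardize <- function(v) {
  z <- v - mean(v, na.rm = TRUE)
  z <- z / sqrt(sum(z^2, na.rm = TRUE))
  z[is.na(z)] <- 0
  z
}
approximate <- sum(standardize(x) * standardize(y))

# Assumed form of the tolerance check: compare the number of additional
# entries lost in the covariance with quickCor times the number of rows.
quickCor <- 0.05
missingInCov <- sum(is.na(x) | is.na(y))
extraMissing <- missingInCov - min(sum(is.na(x)), sum(is.na(y)))
useApproximation <- extraMissing <= quickCor * n
```

With only a handful of missing entries the two values are close; as the fraction of missing data grows, the pre-calculated means and variances drift away from the pairwise ones and the approximation degrades.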
A list with the following components:
colors 
module assignment of all input genes. A vector containing either character strings with
module colors (if input 
unmergedColors 
module colors or numeric labels before the module merging step. 
multiMEs 
module eigengenes corresponding to the modules returned in 
goodSamples 
a list, with one component per input set. Each component is a logical vector with one entry per sample from the corresponding set. The entry indicates whether the sample in the set passed basic quality control criteria. 
goodGenes 
a logical vector with one entry per input gene indicating whether the gene passed basic quality control criteria in all sets. 
dendrograms 
a list with one component for each block of genes. Each component is the hierarchical clustering dendrogram obtained by clustering the consensus gene dissimilarity in the corresponding block. 
TOMFiles 
if 
blockGenes 
a list with one component for each block of genes. Each component is a vector giving
the indices (relative to the input 
blocks 
if input 
blockOrder 
a vector giving the order in which blocks were processed and in which

originCount 
A vector of length 
networkCalibrationSamples 
if the input 
If the input datasets have large numbers of genes, consider carefully the maxBlockSize as it significantly affects the memory footprint (and whether the function will fail with a memory allocation error). From a theoretical point of view it is advantageous to use blocks as large as possible; on the other hand, using smaller blocks is substantially faster and often the only way to work with large numbers of genes. As a rough guide, it is unlikely a standard desktop computer with 4 GB memory or less will be able to work with blocks larger than 7000 genes.
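A back-of-the-envelope estimate shows why block size matters. The helper blockMemoryGB is hypothetical, and the number of matrices held in memory at once is an assumption; actual usage depends on the number of sets, the disk cache settings, and temporary copies.

```r
# Rough memory needed for dense double-precision n x n matrices (8 bytes per entry).
blockMemoryGB <- function(nGenes, nMatrices = 1) {
  nMatrices * nGenes^2 * 8 / 1024^3
}

oneMatrix <- blockMemoryGB(7000)                         # a single 7000 x 7000 matrix
threeSetsPlusConsensus <- blockMemoryGB(7000, nMatrices = 4)
doubledBlock <- blockMemoryGB(14000)                     # doubling the block size
```

A single 7000 x 7000 matrix takes about 0.37 GB, and memory grows with the square of the block size, so doubling maxBlockSize quadruples the footprint of every matrix.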
Peter Langfelder
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 1:54
goodSamplesGenesMS for basic quality control and filtering; adjacency, TOMsimilarity for network construction; hclust for hierarchical clustering; cutreeDynamic for adaptive branch cutting in hierarchical clustering dendrograms; mergeCloseModules for merging of close modules.