MatSAM: Correlation network construction, seriation and...

View source: R/MatSAM.R

MatSAM R Documentation

Correlation network construction, seriation and modularization from a matrix

Description

The MatSAM function first uses the MatNet function to construct the correlation network and then uses the NetSAM function to identify modules and optimize the one-dimensional ordering of the nodes in each module.

Usage

MatSAM(inputMat, sampleAnn=NULL, outputFileName, outputFormat="msm", organism="hsapiens", map_to_symbol=FALSE, idType="auto", collapse_mode="maxSD", naPer=0.7, meanPer=0.8, varPer=0.8, corrType="spearman", matNetMethod="rank", valueThr=0.5, rankBest=0.003, networkType="signed", netFDRMethod="BH", netFDRThr=0.05, minModule=0.003, stepIte=FALSE, maxStep=4, moduleSigMethod="cutoff", modularityThr=0.2, ZRanNum=10, PerRanNum=100, ranSig=0.05,  idNumThr=(-1), nThreads=3)

Arguments

inputMat

inputMat should be a file name with the extension "cct" or "cbt", or a matrix or data.frame object in R. The first column and first row of a "cct" or "cbt" file should contain the row and column names, respectively, and the remaining cells should contain numeric values. Detailed information on the "cct" and "cbt" formats can be found in the NetGestalt manual (www.netgestalt.org). A matrix or data.frame object should have row and column names and contain only numeric or integer values.

sampleAnn

sampleAnn should be a file name with the "tsi" extension (detailed information on the "tsi" format can be found in the NetGestalt manual, www.netgestalt.org) or a data.frame object in R. If the data have no sample annotation, this argument can be omitted. The first row contains the names of the sample features, the second row contains the type of each feature, and the third row contains the category of each feature. If there is no category information for the features, the sample information starts from the third row. The first column contains the sample names.

outputFileName

The name of the output file. The output file has the extension "msm" and can be uploaded to NetGestalt directly.

outputFormat

The format of the output file. The "msm" format can be used as input to NetGestalt. The "gmt" format can be used for other network analyses (e.g. as input to GSEA (Gene Set Enrichment Analysis) for module enrichment analysis). "multiple" means that MatSAM will output five files: a ruler file containing the gene order information, an hmi file containing the module information, a net file containing the correlation network, a cct file containing the filtered data matrix, and a tsi file containing the sample annotation in a standardized format. "none" means that the function will not output any file. The default is "msm".

organism

The organism of the input data. Currently, the package supports the following nine organisms: hsapiens, mmusculus, rnorvegicus, drerio, celegans, scerevisiae, cfamiliaris, dmelanogaster and athaliana. The default is "hsapiens".

map_to_symbol

If map_to_symbol is TRUE, the function will first convert the input ids to gene symbols and collapse multiple ids that map to the same gene symbol using the collapse_mode method before constructing the correlation network. The default is FALSE.

idType

The id type of the ids in the input matrix. MatSAM uses the biomaRt package to convert the input ids to gene symbols based on idType. Users can also set idType to "auto", in which case MatSAM will automatically detect the id type of the input data. However, this may take 10 minutes, depending on the user's internet speed. The default is "auto".

collapse_mode

The method to collapse duplicate ids. "mean", "median", "maxSD", "maxIQR", "max" and "min" represent the mean, median, max standard deviation, max interquartile range, maximum and minimum of values for ids in each sample. The default is "maxSD".
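
As a rough illustration of what collapsing duplicate ids can look like, the base R sketch below implements two of the modes on a toy matrix. It reflects one reading of the description above ("mean" averages duplicates within each sample, "maxSD" keeps the single id with the largest standard deviation) and is not the package's internal code. The matrix, the `symbol` mapping and all variable names are hypothetical.

```r
# Illustrative sketch only; assumed behaviour, not NetSAM's internal code.
mat <- rbind(probe1 = c(1, 2, 3),
             probe2 = c(1, 5, 9),
             probe3 = c(4, 4, 4))
colnames(mat) <- c("s1", "s2", "s3")
symbol <- c("GENEA", "GENEA", "GENEB")   # hypothetical id-to-symbol mapping

# "mean": average the duplicate ids within each sample.
# rowsum() and table() both order groups alphabetically, so they line up.
collapsedMean <- rowsum(mat, group = symbol) / as.vector(table(symbol))

# "maxSD": for each symbol, keep the id with the largest standard deviation.
idSD <- apply(mat, 1, sd)
best <- tapply(seq_len(nrow(mat)), symbol,
               function(i) i[which.max(idSD[i])])
collapsedMaxSD <- mat[best, , drop = FALSE]
rownames(collapsedMaxSD) <- names(best)

collapsedMean
collapsedMaxSD
```

In this toy case GENEA is represented by the per-sample average of probe1 and probe2 under "mean", and by probe2 alone under "maxSD".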

naPer

To remove ids with missing values in most samples, the function calculates the proportion of missing values across all samples for each id and removes ids whose proportion of missing values exceeds naPer. The default naPer is 0.7.

meanPer

To remove ids with low values, the function calculates the mean value of each id across all samples and retains the top meanPer proportion of ids ranked by mean. The default meanPer is 0.8.

varPer

From the ids retained by the meanPer filter, the function can also remove less variable ids by calculating the standard deviation of each id across all samples and retaining the top varPer proportion of ids ranked by standard deviation. The default varPer is 0.8.
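
The three filters controlled by naPer, meanPer and varPer can be sketched in base R as follows. This is a minimal illustration on simulated data, not the package's own implementation. In particular, rounding the number of retained ids down with `floor()` and treating ids exactly at the naPer boundary as retained are assumptions.

```r
# Illustrative sketch only; not NetSAM's internal code.
set.seed(1)
mat <- matrix(rnorm(200), nrow = 20,
              dimnames = list(paste0("id", 1:20), paste0("s", 1:10)))
mat[1, 1:8] <- NA            # id1 is missing in 80% of the samples

naPer <- 0.7; meanPer <- 0.8; varPer <- 0.8

# 1. Drop ids whose proportion of missing values exceeds naPer.
naFrac <- rowMeans(is.na(mat))
mat <- mat[naFrac <= naPer, , drop = FALSE]

# 2. Keep the top meanPer proportion of ids ranked by mean (floor() assumed).
idMean <- rowMeans(mat, na.rm = TRUE)
keep <- order(idMean, decreasing = TRUE)[seq_len(floor(nrow(mat) * meanPer))]
mat <- mat[keep, , drop = FALSE]

# 3. Of those, keep the top varPer proportion ranked by standard deviation.
idSD <- apply(mat, 1, sd, na.rm = TRUE)
keep <- order(idSD, decreasing = TRUE)[seq_len(floor(nrow(mat) * varPer))]
mat <- mat[keep, , drop = FALSE]

dim(mat)
```

Starting from 20 ids, the missing-value filter removes id1, the mean filter keeps 15 of the remaining 19 ids, and the variability filter keeps 12 of those 15.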

corrType

A character string indicating which correlation coefficient is to be computed for each pair of ids. The function supports "spearman" (default) or "pearson" method.

matNetMethod

The MatNet function supports three methods to construct the correlation network: "value", "rank" and "directed". The default is "rank".
1. "value" method: the correlation network retains only id pairs whose correlation exceeds the cutoff threshold valueThr.
2. "rank" method: for each id A, the function first selects the ids that are significantly correlated with id A and then extracts from this significant set the ids that are most similar to id A (the number of ids is calculated from rankBest). Then, for each id B in this set, the function extracts the same number of ids that are significantly correlated with and most similar to id B. If id A is in the set of id B, the edge between id A and id B is retained. Combining all retained edges gives the correlation network.
3. "directed" method: for each id, the function retains only the edge to its best significantly correlated id. Combining all edges gives a directed correlation network.
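
The reciprocal selection at the heart of the "rank" method can be sketched as below. This is a simplified illustration, not the MatNet implementation: the significance filter is skipped for brevity, and the set size `k`, the simulated matrix `expr` and the helper names are all hypothetical.

```r
# Illustrative sketch of reciprocal top-k selection; not MatNet's code.
# Assumption: the significance filter is omitted for brevity.
set.seed(2)
expr <- matrix(rnorm(8 * 30), nrow = 8,
               dimnames = list(paste0("id", 1:8), NULL))
corMat <- cor(t(expr), method = "spearman")
diag(corMat) <- NA           # an id is not its own neighbour

k <- 2                       # hypothetical set size derived from rankBest

# For each id, the names of its k most correlated ids
# (sort() drops the NA on the diagonal).
topK <- lapply(rownames(corMat), function(a) {
  names(sort(corMat[a, ], decreasing = TRUE))[seq_len(k)]
})
names(topK) <- rownames(corMat)

# Keep the edge A-B only if B is in A's set and A is in B's set.
edges <- do.call(rbind, lapply(names(topK), function(a) {
  b <- topK[[a]]
  mutual <- b[vapply(b, function(x) a %in% topK[[x]], logical(1))]
  mutual <- mutual[mutual > a]   # report each undirected edge once
  if (length(mutual)) data.frame(from = a, to = mutual) else NULL
}))
edges
```

Requiring the relationship to hold in both directions is what prevents a highly connected id from pulling in every id that merely lists it among its own nearest neighbours.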

valueThr

Correlation cutoff threshold for "value" method. The default is 0.5.

rankBest

The proportion of ids that are most similar to a given id for the "rank" method. The default is 0.003, which means the "rank" method will select the top 30 most similar ids for each id if the matrix contains 10,000 ids.

networkType

If networkType is "unsigned", the correlation of all pairs of ids will be changed to absolute values. The default is "signed".

netFDRMethod

The p value adjustment method for the "rank" and "directed" methods. The default is "BH".

netFDRThr

The FDR threshold for identifying significant pairs for the "rank" and "directed" methods. The default is 0.05.
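
The following base R sketch shows how pairwise correlation p values can be adjusted and thresholded in the way netFDRMethod and netFDRThr describe. It uses `cor.test()` and `p.adjust()` from the stats package on simulated data. Whether the package uses these exact calls, or a strict inequality at the threshold, is an assumption.

```r
# Illustrative sketch only; not NetSAM's internal code.
set.seed(3)
x <- rnorm(40)
expr <- rbind(a = x,
              b = x + rnorm(40, sd = 0.1),   # strongly correlated with a
              c = rnorm(40),
              d = rnorm(40))

# All unordered pairs of ids and their Spearman correlation p values.
pairs <- t(combn(rownames(expr), 2))
pval <- apply(pairs, 1, function(p) {
  cor.test(expr[p[1], ], expr[p[2], ], method = "spearman")$p.value
})

fdr <- p.adjust(pval, method = "BH")            # plays the role of netFDRMethod
sigPairs <- pairs[fdr < 0.05, , drop = FALSE]   # plays the role of netFDRThr
sigPairs
```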

minModule

The minimum percentage of nodes in a module. The minimum size of a module is calculated by multiplying minModule by the number of nodes in the whole network. If the size of a module identified by the function is less than the minimum size, the module will not be further partitioned into sub-modules. The default is 0.003 which means the minimum module size is 30 if there are 10,000 nodes in the whole network. If the minimum module size is less than 5, the minimum module size will be set as 5. The minModule should be less than 0.2.
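
The arithmetic behind the minimum module size can be written as a small helper. The function name `minModuleSize` is hypothetical, and the use of `round()` for non-integer products is an assumption. The floor of 5 and the upper bound of 0.2 come from the description above.

```r
# Illustrative sketch only; minModuleSize is a hypothetical helper.
minModuleSize <- function(nNodes, minModule = 0.003) {
  stopifnot(minModule < 0.2)          # minModule should be less than 0.2
  # Rounding is an assumption; sizes below 5 are raised to 5.
  max(5, round(minModule * nNodes))
}

minModuleSize(10000)   # 0.003 * 10000 nodes
minModuleSize(500)     # product is below 5, so the floor of 5 applies
```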

stepIte

Because NetSAM uses random walk distance-based hierarchical clustering to reveal the hierarchical organization of an input network, it requires a specified length for the random walks. If stepIte is TRUE, the function will test lengths ranging from 2 to maxStep to find the optimal length. Otherwise, the function will directly use maxStep as the length of the random walks. The default maxStep is 4. Because optimizing the length of the random walks takes a long time, we suggest setting stepIte to FALSE if the network is large (e.g. more than 200,000 edges). The default is FALSE.

maxStep

The length of the random walks (if stepIte is FALSE) or the maximum length to test (if stepIte is TRUE). The default is 4.

moduleSigMethod

To test whether a network under consideration has a non-random internal modular organization, the function provides three options: "cutoff", "zscore" and "permutation". "cutoff" means that if the modularity score of the network is above a specified cutoff value, the network is considered to have internal organization and will be further partitioned. For "zscore" and "permutation", the function first generates a set of random modularity scores. For an unweighted network, the function uses the edge switching method to generate a given number of random networks with the same number of nodes and an identical degree sequence, and calculates the modularity scores of these random networks. For a weighted network, the function shuffles the weights of all edges and calculates the modularity scores of the networks with random weights. The "zscore" method then transforms the real modularity score to a z score based on the random modularity scores and transforms the z score to a p value assuming a standard normal distribution. The "permutation" method compares the real modularity score with the random ones to calculate a p value. Finally, under a specified significance level, the function determines whether the network can be further partitioned. The default is "cutoff".
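
Given a real modularity score and a vector of random modularity scores, the "zscore" and "permutation" calculations can be sketched as follows. This is an illustration of the general technique, not the NetSAM code. The helper name `modularityP`, the one-sided test and the +1 correction in the permutation p value are assumptions, and the random scores here are simulated rather than obtained from rewired networks.

```r
# Illustrative sketch only; modularityP is a hypothetical helper.
modularityP <- function(realQ, randomQ,
                        method = c("zscore", "permutation")) {
  method <- match.arg(method)
  if (method == "zscore") {
    # Standardize against the random scores, then use the normal upper tail.
    z <- (realQ - mean(randomQ)) / sd(randomQ)
    pnorm(z, lower.tail = FALSE)
  } else {
    # Share of random scores at least as large as the real one.
    # The +1 correction (an assumption) avoids a p value of exactly 0.
    (sum(randomQ >= realQ) + 1) / (length(randomQ) + 1)
  }
}

set.seed(4)
randomQ <- rnorm(100, mean = 0.10, sd = 0.02)   # simulated random scores

ranSig <- 0.05
modularityP(0.30, randomQ, "zscore") < ranSig        # clearly modular
modularityP(0.30, randomQ, "permutation") < ranSig
modularityP(0.10, randomQ, "zscore") < ranSig        # indistinguishable from random
```

The "zscore" route needs only a few random networks because it relies on the normal approximation, whereas the resolution of the "permutation" p value is limited by the number of random networks, which is consistent with the smaller default for ZRanNum than for PerRanNum.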

modularityThr

The modularity score threshold for the "cutoff" method. The default is 0.2.

ZRanNum

The number of random networks that will be generated for the "zscore" calculation. The default is 10.

PerRanNum

The number of random networks that will be generated for the "permutation" p value calculation. The default is 100.

ranSig

The significance level for determining whether a network has non-random internal modular organization for the "zscore" or "permutation" methods. The default is 0.05.

idNumThr

If the matrix contains too many ids, it will take a long time and use a lot of memory to identify the modules. Thus, the function provides the option to set the threshold of number of ids for further analysis. After filtering by meanPer and varPer, if the number of ids is still larger than idNumThr, the function will select top idNumThr ids with the largest variance. The default is -1, which means there is no limitation for the matrix.
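
A minimal base R sketch of the idNumThr cap is shown below: if more ids remain than the threshold allows, only those with the largest variance are kept. The data are simulated and the variable names are hypothetical. The handling of the default value -1 as "no cap" follows the description above.

```r
# Illustrative sketch only; not NetSAM's internal code.
set.seed(5)
mat <- matrix(rnorm(60), nrow = 6,
              dimnames = list(paste0("id", 1:6), paste0("s", 1:10)))
mat["id3", ] <- mat["id3", ] * 10   # make id3 by far the most variable id

idNumThr <- 3                       # -1 would mean no limit

if (idNumThr > 0 && nrow(mat) > idNumThr) {
  idVar <- apply(mat, 1, var, na.rm = TRUE)
  top <- order(idVar, decreasing = TRUE)[seq_len(idNumThr)]
  mat <- mat[top, , drop = FALSE]
}
rownames(mat)
```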

nThreads

The MatSAM function supports parallel computing on multiple cores. nThreads is the number of threads to use. The default is 3.

Value

In addition to the "msm" file, the function returns a list object containing the module information, the gene order information, the correlation network and the filtered matrix restricted to the ids in the network. The function will also output two HTML files that contain the significant associations between sample features and modules and the associated GO terms for the modules.

Note

After identifying the modules, the MatSAM function will identify the associations between sample features and modules using the featureAssociation function and the associated GO terms for the modules using the GOAssociation function. For the featureAssociation function, MatSAM uses only the default parameters. For the GOAssociation function, MatSAM sets "outputType" to "top" and "topNum" to 1. Users can pass the list object returned by MatSAM to the featureAssociation and GOAssociation functions to perform further analysis with different parameters.

Author(s)

Jing Wang

See Also

MatNet NetSAM

Examples

	library(NetSAM)
	# Locate the example expression matrix and sample annotation shipped with NetSAM
	inputMatDir <- system.file("extdata","exampleExpressionData.cct",package="NetSAM")
	cat(inputMatDir)
	sampleAnnDir <- system.file("extdata","sampleAnnotation.tsi",package="NetSAM")
	cat(sampleAnnDir)
	# Write the output to "MatSAM" in the current working directory
	outputFileName <- file.path(getwd(), "MatSAM")
	# Build the correlation network with the "rank" method and identify modules
	matModule <- MatSAM(inputMat=inputMatDir, sampleAnn=sampleAnnDir, outputFileName=outputFileName, outputFormat="msm", organism="hsapiens", map_to_symbol=FALSE, idType="auto", collapse_mode="maxSD", naPer=0.7, meanPer=0.8, varPer=0.8, corrType="spearman", matNetMethod="rank", valueThr=0.6, rankBest=0.003, networkType="signed", netFDRMethod="BH", netFDRThr=0.05, minModule=0.003, stepIte=FALSE, maxStep=4, moduleSigMethod="cutoff", modularityThr=0.2, ZRanNum=10, PerRanNum=100, ranSig=0.05, idNumThr=(-1), nThreads=3)

bingzhang16/NetSAM documentation built on April 3, 2024, 3:35 a.m.