clusterOneR: An R function for calling the ClusterONE command-line tool

Description

ClusterONE strives to discover densely connected and possibly overlapping regions within the Cytoscape network you are working with. The interpretation of these regions depends on the context (i.e. what the network represents) and it is left up to you. For instance, in protein-protein interaction networks derived from high-throughput AP-MS experiments, these dense regions usually correspond to protein complexes or fractions of them. ClusterONE works by "growing" dense regions out of small seeds (typically one or two vertices), driven by a quality function called cohesiveness.

Usage

clusterOneR(inputFile = paste0(system.file("extdata", package =
  "ClusterOneR"), "/Weighted_edge_lists.tsv"),
  inputFormat = c("edge_list", "sif"), outputFormat = c("plain", "csv",
  "genepro"), minDensity = "auto", minSize = 3, fluff = NULL,
  haircut = NULL, maxOverlap = 0.8, mergeMethod = c("single",
  "multi"), similarity = "match", noFluff = TRUE, noMerge = FALSE,
  penalty = 2, seedMethod = NULL)

Arguments

inputFile

the name of the network edge file. The columns of this file are separated by tabs, and the elements in the first row are treated as column names.
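As an illustration of the layout described above, the following sketch writes a small tab-separated edge list whose first row holds the column names. The column names (node1, node2, weight) and the node IDs are assumptions made for this example only; the package may not require these particular names.

```r
# Hypothetical three-column edge list: two node IDs and an edge weight.
# The column names are illustrative assumptions, not a package requirement.
edges <- data.frame(
  node1  = c("A", "A", "B", "C"),
  node2  = c("B", "C", "C", "D"),
  weight = c(0.9, 0.8, 0.7, 0.2),
  stringsAsFactors = FALSE
)

# Write a tab-separated file whose first row holds the column names.
edge_file <- tempfile(fileext = ".tsv")
write.table(edges, edge_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout.
check <- read.delim(edge_file, stringsAsFactors = FALSE)
```

The resulting path in edge_file could then be passed as inputFile.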

inputFormat

specifies the format of the input file ("sif" or "edge_list"). Use this option only if ClusterONE failed to detect the format automatically.

outputFormat

specifies the format of the output file ("plain", "csv" or "genepro").

minDensity

sets the minimum density of predicted complexes. "auto" means that the density threshold is set automatically based on whether the graph is weighted and, if it is not, on its clustering coefficient. Weighted graphs get a default density threshold of 0.3. Unweighted graphs get a threshold of 0.5, unless their global clustering coefficient is less than 0.1, in which case the threshold is set to 0.6.
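The "auto" rule can be restated as a small decision function. This is only a sketch of the rule as described above; auto_min_density and its arguments are hypothetical names and are not part of the package.

```r
# Hypothetical helper that mirrors the "auto" rule described above.
# weighted: TRUE if the graph has edge weights.
# clustering_coefficient: global clustering coefficient (used for unweighted graphs only).
auto_min_density <- function(weighted, clustering_coefficient = NA_real_) {
  if (weighted) {
    return(0.3)
  }
  if (!is.na(clustering_coefficient) && clustering_coefficient < 0.1) {
    return(0.6)
  }
  0.5
}

auto_min_density(weighted = TRUE)
auto_min_density(weighted = FALSE, clustering_coefficient = 0.05)
auto_min_density(weighted = FALSE, clustering_coefficient = 0.40)
```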

minSize

sets the minimum size of the predicted complexes.

fluff

fluffs the clusters as a post-processing step. This is not used in the published algorithm, but it may be useful for your specific problem. The idea is to check whether each external boundary node of a cluster connects to more than two thirds of the internal nodes; if so, that external boundary node is added to the cluster. Fluffing is applied before the size and density filters.
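The fluffing idea can be sketched on a small adjacency matrix. This is a simplified illustration under stated assumptions, not the ClusterONE implementation: fluff_cluster is a hypothetical helper, the graph is given as a symmetric 0/1 matrix with node names, and a single pass is made over the outside nodes.

```r
# Hypothetical helper: add outside nodes that connect to more than
# two thirds of the cluster members (one pass, unweighted view).
fluff_cluster <- function(adj, cluster) {
  outside <- setdiff(rownames(adj), cluster)
  # Number of cluster members each outside node is connected to.
  links <- vapply(outside, function(v) sum(adj[v, cluster] > 0), numeric(1))
  c(cluster, outside[links > (2 / 3) * length(cluster)])
}

# Toy graph: A, B, C form a triangle; D touches all three; E touches only A.
nodes <- c("A", "B", "C", "D", "E")
adj <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"),
               c("D", "A"), c("D", "B"), c("D", "C"), c("E", "A"))
adj[edges] <- 1          # index by (row name, column name) pairs
adj[edges[, 2:1]] <- 1   # mirror to keep the matrix symmetric

fluffed <- fluff_cluster(adj, c("A", "B", "C"))
```

In this toy graph D is pulled into the cluster because it touches all three members, while E, which touches only one, stays out.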

haircut

apply a haircut transformation as a post-processing step on the detected clusters. This is not used in the published algorithm either, but it may be useful for your specific problem. A haircut transformation removes dangling nodes from a cluster: if the total weight of connections from a node to the rest of the cluster is less than x times the average node weight in the cluster (where x is the argument of the switch), the node will be removed. The process is repeated iteratively until there are no more nodes to be removed. Haircut is applied before the size and density filters.
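A minimal sketch of the haircut loop follows. It assumes that the "node weight" of a cluster member is its total connection weight to the other members of the cluster; that reading, the helper name haircut_cluster, and the toy weights are assumptions for illustration and may differ from what ClusterONE does internally.

```r
# Hypothetical helper: iteratively drop nodes whose internal weight is
# below x times the average internal weight of the cluster members.
# w: symmetric weight matrix with node names and a zero diagonal.
haircut_cluster <- function(w, cluster, x) {
  repeat {
    if (length(cluster) < 2) break
    sub <- w[cluster, cluster, drop = FALSE]
    internal <- rowSums(sub)                 # weight from each node to the rest of the cluster
    dangling <- internal < x * mean(internal)
    if (!any(dangling)) break
    cluster <- cluster[!dangling]            # remove dangling nodes, then re-check
  }
  cluster
}

# Toy graph: a triangle A, B, C (weight 1 each) with D hanging off A (weight 0.2).
nodes <- c("A", "B", "C", "D")
w <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"), c("A", "D"))
weights <- c(1, 1, 1, 0.2)
w[edges] <- weights
w[edges[, 2:1]] <- weights

trimmed <- haircut_cluster(w, nodes, x = 0.5)
```

With x = 0.5 the weakly attached node D is removed in the first round, and the remaining triangle is stable in the second.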

maxOverlap

specifies the maximum allowed overlap between two clusters, as measured by the match coefficient, which takes the size of the overlap squared, divided by the product of the sizes of the two clusters being considered, as in the paper of Bader and Hogue.

mergeMethod

specifies the method to be used to merge highly overlapping complexes. The following values are accepted:

  • "single" calculates similarity scores between all pairs of complexes and creates a graph where the nodes are the complexes and two nodes are connected if the corresponding complexes are highly overlapping. Complexes in the same connected component of the graph will then be merged. As its name suggests, this is a single-pass method.

  • "multi" calculates similarity scores between all pairs of complexes and stores those pairs that have a score larger than a given threshold. The highest scoring pair is then merged and the similarity of the merged complex towards its neighbors is re-calculated. This is repeated until there are no more highly overlapping complexes in the result. As its name suggests, this is a multi-pass method where similarities are re-calculated after each merge.
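The "single" method described above can be sketched as follows. This is an illustrative re-implementation under assumptions, not the ClusterONE code: merge_single_pass and match_score are hypothetical helpers, the match coefficient is used as the similarity, and a pair counts as highly overlapping when its score exceeds the threshold.

```r
# Match coefficient: intersection size squared over the product of the sizes.
match_score <- function(a, b) {
  length(intersect(a, b))^2 / (length(a) * length(b))
}

# Hypothetical single-pass merge: link complexes whose score exceeds the
# threshold, then merge every connected component of that overlap graph.
merge_single_pass <- function(complexes, threshold = 0.8) {
  n <- length(complexes)
  component <- seq_len(n)                 # each complex starts in its own component
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i < j && match_score(complexes[[i]], complexes[[j]]) > threshold) {
        # Join the two components by relabelling one of them.
        component[component == component[j]] <- component[i]
      }
    }
  }
  # Scores are computed once on the original complexes (single pass).
  unname(lapply(split(complexes, component),
                function(g) sort(unique(unlist(g)))))
}

complexes <- list(
  c("A", "B", "C", "D", "E"),
  c("A", "B", "C", "D", "E", "F"),
  c("X", "Y", "Z")
)
merged <- merge_single_pass(complexes, threshold = 0.8)
```

The first two complexes score 25 / 30 and are merged, while the third stays separate. A "multi" variant would instead merge only the best-scoring pair and recompute the scores of the merged complex before continuing.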

similarity

specifies the similarity function to be used in the merging step. More precisely, this switch controls which scoring function is used to decide whether two complexes overlap significantly or not. The following values are accepted:

  • "match" calculates the intersection size squared, divided by the product of the sizes of the two complexes. This is also called the **matching score**. This is the default.

  • "simpson" or meet/min calculates the Simpson coefficient, i.e. the intersection size over the size of the smaller complex.

  • "jaccard" calculates the Jaccard similarity, i.e. the intersection size over the size of the union of the two complexes.

  • "dice" calculates the Dice similarity, i.e. twice the intersection size over the sum of the sizes of the two complexes.
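The four scores listed above can be compared side by side on a pair of node sets. The helper overlap_scores and the protein IDs are hypothetical and serve only to illustrate the formulas.

```r
# Hypothetical helper computing the four overlap scores for two node sets.
overlap_scores <- function(a, b) {
  common <- length(intersect(a, b))
  c(
    match   = common^2 / (length(a) * length(b)),
    simpson = common / min(length(a), length(b)),
    jaccard = common / length(union(a, b)),
    dice    = 2 * common / (length(a) + length(b))
  )
}

a <- c("P1", "P2", "P3", "P4")
b <- c("P3", "P4", "P5", "P6", "P7", "P8")
scores <- overlap_scores(a, b)
# The two sets share 2 nodes:
# match = 4 / 24, simpson = 2 / 4, jaccard = 2 / 8, dice = 4 / 10
```

For the same pair the Simpson coefficient is the most permissive and the match coefficient the strictest, so the choice of similarity interacts with the maxOverlap threshold.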

noFluff

don't fluff the clusters. This is the default. For more details about fluffing, see the fluff argument above.

noMerge

don't merge highly overlapping clusters (in other words, skip the last merging phase). This is useful for debugging purposes only.

penalty

sets a penalty value for the inclusion of each node. When you set this option to x, ClusterONE will assume that each node has an extra boundary weight of x when it considers the addition of the node to a cluster. It can be used to model the possibility of uncharted connections for each node, so nodes with only a single weak connection to a cluster will not be added to the cluster as the penalty value will outweigh the benefits of adding the node. The default penalty value is 2.
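The effect of the penalty can be illustrated with a small calculation. The sketch assumes that cohesiveness is the internal weight of a cluster divided by the sum of its internal weight, its boundary weight, and the penalty counted once per member; the helper cohesiveness and the toy weights are illustrative assumptions, not the package's code.

```r
# Hypothetical helper: cohesiveness with a per-node penalty.
# w: symmetric weight matrix with node names and a zero diagonal.
cohesiveness <- function(w, cluster, penalty = 2) {
  inside  <- rownames(w) %in% cluster
  w_in    <- sum(w[inside, inside]) / 2   # each internal edge appears twice in a symmetric matrix
  w_bound <- sum(w[inside, !inside])      # weight of edges leaving the cluster
  w_in / (w_in + w_bound + penalty * length(cluster))
}

# Toy graph: a triangle A, B, C (weight 1 each) with D weakly attached to A (weight 0.2).
nodes <- c("A", "B", "C", "D")
w <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"), c("A", "D"))
weights <- c(1, 1, 1, 0.2)
w[edges] <- weights
w[edges[, 2:1]] <- weights

with_penalty    <- c(abc  = cohesiveness(w, c("A", "B", "C"), penalty = 2),
                     abcd = cohesiveness(w, nodes, penalty = 2))
without_penalty <- c(abc  = cohesiveness(w, c("A", "B", "C"), penalty = 0),
                     abcd = cohesiveness(w, nodes, penalty = 0))
```

With a penalty of 2, adding the weakly connected node D lowers the score (3 / 9.2 versus 3.2 / 11.2), so D would be left out. With no penalty, adding D raises the score (3 / 3.2 versus 3.2 / 3.2), so D would be absorbed.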

seedMethod

specifies the seed generation method to use. The following values are accepted:

  • "nodes": every node will be used as a seed.

  • "unused_nodes": nodes will be tried in the descending order of their weights (where the weight of a node is the sum of the weights on its incident edges), and whenever a cluster is found, the nodes in that cluster will be excluded from the list of potential seeds. In other words, the node with the largest weight that does not participate in any of the clusters found so far will be selected as the next seed.

  • "edges": every edge will be considered once, each yielding a seed consisting of the two endpoints of the edge.

  • "cliques": every maximal clique of the graph will be considered once as a seed.

  • "file"(*filename*): seeds will be generated from the given file. Each line of the file must contain a space-separated list of node IDs that will be part of the seed (and of course each line encodes a single seed). If a line contains a single * character only, this means that besides the seeds given in the file, every node that is not part of any of the seeds will also be considered as a potential seed on its own.

  • "'single(*node1*,*node2*,...)'": a single seed will be used with the given nodes as members. Node names must be separated by commas or spaces.

  • "stdin": seeds will be given on the standard input, one per line. Each line must contain a space-separated list of node IDs that will be part of the seed. It may be useful to combine this method with noMerge = TRUE if you don't want the result of earlier seedings to influence the result of later ones.
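For the file-based seed method, a seed file in the format described above can be prepared as follows. The node IDs are hypothetical, and the sketch only writes the file; it does not show the seedMethod value itself.

```r
# Hypothetical seeds: each element becomes one line of space-separated node IDs.
seeds <- list(c("A", "B"), c("C", "D", "E"))
seed_lines <- vapply(seeds, paste, character(1), collapse = " ")

# A line holding only "*" asks for every node not covered by the listed
# seeds to be considered as a seed on its own as well.
seed_lines <- c(seed_lines, "*")

seed_file <- tempfile(fileext = ".txt")
writeLines(seed_lines, seed_file)
```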

Details

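The following input file formats are recognised: "edge_list" and "sif" (see the inputFormat argument).

The following output file formats are available: "plain", "csv" and "genepro" (see the outputFormat argument).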

Value

A matrix of complexes, where each row lists the proteins in a single complex.

Examples

{
## Not run: 
# Run on the example network edge list shipped with the package
file = paste0(system.file('extdata', package = 'ClusterOneR'),
              "/Weighted_edge_lists.tsv")
head(read.delim(file))  # preview the first rows of the edge list
y = clusterOneR(file)
View(y)

# Run on your own file "/my/path/myEdgeFile.tsv", which is a
# "weighted edge list" file.
file = "/my/path/myEdgeFile.tsv"
y = clusterOneR(file, inputFormat = "edge_list")
View(y)

# Run on a SIF file (Simple Interaction Format)
file = "/my/path/myNetwork.sif"
y = clusterOneR(file, inputFormat = "sif")
View(y)

## End(Not run)
}

paodan/ClusterOneR documentation built on May 9, 2019, 5:56 a.m.