R/clusterOneR.R

#' @title An R function for calling the ClusterONE command-line tool
#' @description ClusterONE strives to discover densely connected and possibly
#' overlapping regions within the Cytoscape network you are working with.
#' The interpretation of these regions depends on the context (i.e. what the
#' network represents) and it is left up to you. For instance,
#' in protein-protein interaction networks derived from high-throughput
#' AP-MS experiments, these dense regions usually correspond to protein
#' complexes or fractions of them. ClusterONE works by "growing" dense regions
#' out of small seeds (typically one or two vertices), driven by a quality
#' function called cohesiveness.
#' @param inputFile the name of the network edge file. Columns in this file
#' are separated by tabs, and the elements in the first row are treated as
#' column names.
#' @param inputFormat specifies the format of the input file ("sif" or
#' "edge_list"). Use this option only if ClusterONE failed to detect the
#' format automatically.
#' @param outputFormat specifies the format of the output file ("plain",
#' "csv" or "genepro").
#' @param minDensity sets the minimum density of predicted complexes.
#' "auto" means that the density threshold will be set automatically
#' based on whether the graph is weighted or not, and if not, what its
#' clustering coefficient is. Weighted graphs will have a default density
#' threshold of 0.3, unweighted graphs will have a density threshold of 0.5,
#' unless their global clustering coefficient is less than 0.1, in which
#' case the density threshold is set to 0.6.
#' @param minSize sets the minimum size of the predicted complexes.
#' @param fluff fluffs the clusters as a post-processing step.
#' This is not used in the published algorithm, but it may be useful
#' for your specific problem. The idea is to check whether the external
#' boundary nodes of each cluster connect to more than two thirds of the
#' internal nodes; if so, such external boundary nodes are added to the
#' cluster. Fluffing is applied before the size and density filters.
#' @param haircut apply a haircut transformation as a post-processing
#' step on the detected clusters. This is not used in the published
#' algorithm either, but it may be useful for your specific problem.
#' A haircut transformation removes dangling nodes from a cluster:
#' if the total weight of connections from a node to the rest of the
#' cluster is less than x times the average node weight in the cluster
#' (where x is the argument of the switch), the node will be removed.
#' The process is repeated iteratively until there are no more nodes to
#' be removed. Haircut is applied before the size and density filters.
#' @param maxOverlap specifies the maximum allowed overlap between two
#' clusters, as measured by the match coefficient, which takes the size
#' of the overlap squared, divided by the product of the sizes of the
#' two clusters being considered, as in the paper of Bader and Hogue.
#' @param mergeMethod specifies the method to be used to merge highly
#' overlapping complexes. The following values are accepted: \cr
#' \itemize{
#'   \item "single" calculates similarity scores between all pairs of
#'   complexes and creates a graph where the nodes are the complexes
#'   and two nodes are connected if the corresponding complexes are
#'   highly overlapping. Complexes in the same connected component
#'   of the graph will then be merged. As its name suggests,
#'   this is a single-pass method. \cr
#'   \item "multi" calculates similarity scores between all pairs of complexes
#'   and stores those pairs that have a score larger than a given threshold.
#'   The highest scoring pair is then merged and the similarity of the
#'   merged complex towards its neighbors is re-calculated. This is repeated
#'   until there are no more highly overlapping complexes in the result.
#'   As its name suggests, this is a multi-pass method where similarities
#'   are re-calculated after each merge. \cr
#'   }
#' @param similarity specifies the similarity function to be used in
#' the merging step. More precisely, this switch controls which scoring
#' function is used to decide whether two complexes overlap significantly
#' or not. The following values are accepted: \cr
#' \itemize{
#'   \item "match" calculates the intersection size squared, divided by
#'   the product of the sizes of the two complexes. This is also called
#'   the **matching score**. This is the default. \cr
#'   \item "simpson" or meet/min calculates the Simpson coefficient, i.e. the
#'   intersection size over the size of the smaller complex. \cr
#'   \item "jaccard" calculates the Jaccard similarity, i.e. the intersection
#'   size over the size of the union of the two complexes. \cr
#'   \item "dice" calculates the Dice similarity, i.e. twice the intersection
#'   size over the sum of the sizes of the two complexes. \cr
#'   }
#' @param noFluff don't fluff the clusters. This is the default.
#' For more details about fluffing, see the fluff parameter above.
#' @param noMerge don't merge highly overlapping clusters (in other words,
#' skip the last merging phase). This is useful for debugging purposes only.
#' @param penalty sets a penalty value for the inclusion of each node.
#' When you set this option to x, ClusterONE will assume that each node has
#' an extra boundary weight of x when it considers the addition of the node
#' to a cluster. It can be used to model the possibility of uncharted
#' connections for each node, so nodes with only a single weak connection
#' to a cluster will not be added to the cluster as the penalty value will
#' outweigh the benefits of adding the node. The default penalty value is 2.
#' @param seedMethod specifies the seed generation method to use.
#' The following values are accepted: \cr
#' \itemize{
#'   \item "nodes": every node will be used as a seed.
#'   \item "unused_nodes": nodes will be tried in the descending
#'   order of their weights
#'   (where the weight of a node is the sum of the weights on its incident
#'   edges), and whenever a cluster is found, the nodes in that cluster will
#'   be excluded from the list of potential seeds. In other words, the node
#'   with the largest weight that does not participate in any of the clusters
#'   found so far will be selected as the next seed. \cr
#'   \item "edges": every edge will be considered once, each yielding a seed
#'   consisting of the two endpoints of the edge. \cr
#'   \item "cliques": every maximal clique of the graph will be considered
#'   once as a seed. \cr
#'   \item "file"(*filename*): seeds will be generated from the given file.
#'   Each line of the file must contain a space-separated list of node
#'   IDs that will be part of the seed (and of course each line encodes
#'   a single seed). If a line contains a single * character only, this
#'   means that besides the seeds given in the file, every node that is not
#'   part of any of the seeds will also be considered as a potential seed
#'   on its own. \cr
#'   \item "'single(*node1*,*node2*,...)'": a single seed will be used with the given
#'   nodes as members. Node names must be separated by commas or spaces. \cr
#'   \item "stdin": seeds will be given on the standard input, one per line. Each
#'   line must contain a space-separated list of node IDs that will be
#'   part of the seed. It may be useful to use this method in conjunction
#'   with --no-merge if you don't want the result of earlier seedings to
#'   influence the result of later ones. \cr
#' }
#' @details The following input file formats are recognised: \cr
#' \itemize{
#'   \item *Cytoscape SIF files* \cr
#'   When the extension of the input file is .sif, ClusterONE will
#'   automatically try to parse the file according to the SIF format of
#'   Cytoscape. Each line of the file must have the following
#'   format: \cr
#'   id1 type id2 \cr
#'   where id1 and id2 are the IDs of the two interacting proteins and
#'   type is the interaction type (which will silently be ignored by
#'   ClusterONE). Each edge will have unit weight. The columns of the
#'   input file may be separated by spaces or tabs; however, make sure
#'   that you do not mix these separator characters. \cr
#'   \item *Weighted edge lists* \cr
#'   This is the default file format assumed by ClusterONE unless the
#'   file extension suggests otherwise. Each line of the file has the
#'   following format: \cr
#'     id1 id2 weight \cr
#'   where id1 and id2 are the IDs of the interacting proteins and weight
#'   is the associated confidence value between 0 and 1. If the weight is
#'   omitted, it is considered to be equal to 1. Lines starting with hash
#'   marks (#) or percentage signs (\%) are considered as comments and they
#'   are silently ignored. \cr \cr
#' If ClusterONE fails to recognise the input format of your file, feel
#' free to specify it using the "inputFormat" option.
#' }
#' The following output file formats are available:
#'  \itemize{
#'  \item *Plain text output (plain)* \cr
#'  A simple and easy-to-parse output format, where each line represents a
#'  cluster. Members of the clusters are separated by Tab characters.
#'  \item *CSV output (csv)* \cr
#'  This format is suitable if you need more details about each cluster
#'  and/or you want to import the clusters to Microsoft Excel or OpenOffice.
#'  Each line corresponds to a cluster and contains the size, density, total
#'  internal and boundary weight, the value of the quality function, a P-value
#'  and the list of members for each cluster. Columns are separated by commas,
#'  and each individual column may optionally be quoted within quotation marks
#'  if necessary.
#'  \item *GenePro output (genepro)* \cr
#'  Use this format if you want to visualize the clusters later on using the
#'  [GenePro](http://wodaklab.org/genepro) plugin of Cytoscape.
#'  }
#' @return A matrix of complexes, where each row holds the proteins of
#' a single complex.
#' @export
#' @examples {
#' \dontrun{
#' # Run on the example network edge file shipped with the package
#' file = paste0(system.file('extdata', package = 'ClusterOneR'),
#'               "/Weighted_edge_lists.tsv")
#' head(file)
#' y = clusterOneR(file)
#' View(y)
#'
#' # Run on your own file "/my/path/myEdgeFile.tsv", which is a
#' # "weighted edge lists" file type.
#' file = "/my/path/myEdgeFile.tsv"
#' y = clusterOneR(file, inputFormat = "edge_list")
#' View(y)
#'
#' # Run on a SIF file (Simple Interaction Format)
#' file = "/my/path/myNetwork.sif"
#' y = clusterOneR(file, inputFormat = "sif")
#' View(y)
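#'
#' # A sketch with non-default options: multi-pass merging, a stricter
#' # overlap threshold and a larger minimum complex size. The path is a
#' # placeholder, as above.
#' y = clusterOneR("/my/path/myEdgeFile.tsv", mergeMethod = "multi",
#'                 maxOverlap = 0.5, minSize = 4)
#' View(y)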
#' }
#' }

clusterOneR = function(inputFile = paste0(system.file('extdata', package = 'ClusterOneR'),
                                          "/Weighted_edge_lists.tsv"),
                       inputFormat = c("edge_list", "sif"),
                       outputFormat = c("plain", "csv", "genepro"),
                       minDensity = "auto",
                       minSize = 3, fluff = NULL, haircut = NULL,
                       maxOverlap = 0.8,
                       mergeMethod = c("single", "multi"),
                       similarity = c("match", "simpson", "jaccard", "dice"),
                       noFluff = TRUE, noMerge = FALSE,
                       penalty = 2, seedMethod = NULL){
  inputFormat = match.arg(inputFormat)
  stopifnot(inputFormat %in% c("sif", "edge_list"))

  outputFormat = match.arg(outputFormat)
  stopifnot(outputFormat %in% c("plain", "csv", "genepro"))

  if (is.character(minDensity)){
    stopifnot(minDensity == "auto")
  } else{
    stopifnot(is.numeric(minDensity))
  }

  mergeMethod = match.arg(mergeMethod)
  similarity = match.arg(similarity)

  if (noFluff){
    noFluff = ""
  } else {
    noFluff = NULL
  }

  if (noMerge){
    noMerge = ""
  } else {
    noMerge = NULL
  }

  args = as.list(environment())

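  # Drop inputFile by name rather than by position: the order of the list
  # returned by as.list(environment()) is not documented as fixed.
  optNames = setdiff(names(args), "inputFile")
  endCMD = unlist(lapply(setNames(optNames, optNames), function(x){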
    params = c(inputFormat = "f", outputFormat = "F",
               minDensity = "d", minSize = "s", fluff = "-fluff",
               haircut = "-haircut", maxOverlap = "-max-overlap",
               mergeMethod = "-merge-method",
               similarity = "-similarity", noFluff = "-no-fluff",
               noMerge = "-no-merge", penalty = "-penalty",
               seedMethod = "-seed-method")
    params = setNames(paste0("-", params), names(params))
    if(!is.null(args[[x]])){
      y = paste(params[x], args[[x]])
    } else{ y = NULL }
    return(y)}))
  endCMD = paste0(endCMD[endCMD != ""], collapse = " ")

  jarFile = paste0(system.file('extdata', package = 'ClusterOneR'), "/cluster_one.jar")
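  # Quote both paths so that directories containing spaces do not break the command.
  preCMD = paste("java -jar", shQuote(jarFile))
  CMD = paste(preCMD, endCMD, shQuote(inputFile))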
  resJar = system(CMD, intern = TRUE)

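  # Split each tab-separated output line into the members of one cluster.
  # NOTE: strSplit() is not a base R function; it is assumed here to come from
  # a package dependency and to return a matrix.
  resMat = strSplit(resJar, split = "\t")
  return(resMat)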
}
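
# Illustrative sketch only: this is not ClusterONE's own code and is not
# exported. It spells out the four overlap scores described under the
# "similarity" parameter above, for two complexes given as character vectors
# of node IDs. The helper name overlapScoreSketch is hypothetical, and both
# complexes are assumed to be non-empty.
overlapScoreSketch = function(a, b,
                              method = c("match", "simpson", "jaccard", "dice")){
  method = match.arg(method)
  a = unique(a)
  b = unique(b)
  nCommon = length(intersect(a, b))
  switch(method,
         # intersection size squared over the product of the two sizes
         match   = nCommon^2 / (length(a) * length(b)),
         # intersection size over the size of the smaller complex
         simpson = nCommon / min(length(a), length(b)),
         # intersection size over the size of the union
         jaccard = nCommon / length(union(a, b)),
         # twice the intersection size over the sum of the two sizes
         dice    = 2 * nCommon / (length(a) + length(b)))
}

# Usage: two complexes that share two proteins. Under the description of
# maxOverlap above, a pair would count as highly overlapping when such a
# score exceeds the chosen threshold.
# overlapScoreSketch(c("A", "B", "C"), c("B", "C", "D", "E"), "jaccard")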
paodan/ClusterOneR documentation built on May 9, 2019, 5:56 a.m.