wrapperRunClustering: Run a clustering pipeline of protein/peptide abundance...
In DAPAR: Tools for the Differential Analysis of Proteins Abundance with R

This function runs all the steps needed to obtain a clustering model and its plot from the average abundances of proteins/peptides. Either a kmeans model or an affinity propagation model can be fitted. See the Details section for the exact steps.

wrapperRunClustering(
  obj,
  clustering_method,
  conditions_order = NULL,
  k_clusters = NULL,
  adjusted_pvals,
  ttl = "",
  subttl = "",
  FDR_thresholds = NULL
)

`obj`	ExpressionSet or MSnSet object.
`clustering_method`	character string. Three possible values: "kmeans", "affinityProp" and "affinityPropReduced". See the Details section for more explanation.
`conditions_order`	vector specifying the order of the Condition factor levels in the phenotype data. The default value is NULL, which means that the profiles are built using the order in which the conditions appear in the phenotype data of "obj".
`k_clusters`	integer or NULL. Number of clusters for the kmeans algorithm. If 'clustering_method' is set to "kmeans" and this parameter is NULL, a kmeans model is fitted with an optimal number of clusters 'k' estimated by the Gap statistic method. Ignored for the affinity propagation models.
`adjusted_pvals`	vector of adjusted p-values returned by [wrapperClassic1wayAnova()].
`ttl`	the title for the final plot
`subttl`	the subtitle for the final plot
`FDR_thresholds`	vector containing the threshold values used to color the profiles according to their adjusted p-value. The default value (NULL) generates 4 thresholds: [0.001, 0.005, 0.01, 0.05]. This gives 5 intervals and therefore 5 colors: p-values < 0.001, those between 0.001 and 0.005, those between 0.005 and 0.01, those between 0.01 and 0.05, and those > 0.05. The highest given value is treated as the threshold of insignificance: profiles with a p-value above it are colored in gray.
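
The interval logic described for `FDR_thresholds` can be illustrated with base R. This is only a sketch on made-up p-values, not the code used inside the function; in particular, the way values falling exactly on a threshold are assigned is an assumption.

```r
# Hypothetical adjusted p-values for seven profiles (illustrative values only).
adjusted_pvals <- c(0.0004, 0.003, 0.008, 0.02, 0.049, 0.2, 0.9)

# The four default thresholds listed in the argument description.
FDR_thresholds <- c(0.001, 0.005, 0.01, 0.05)

# Four thresholds delimit five intervals on [0, 1].
# Assumption: intervals are closed on the left, open on the right.
breaks <- c(0, sort(FDR_thresholds), 1)
pval_class <- cut(adjusted_pvals, breaks = breaks,
                  right = FALSE, include.lowest = TRUE)

# Profiles above the highest threshold are the ones drawn in gray.
is_gray <- adjusted_pvals > max(FDR_thresholds)

print(table(pval_class))
print(is_gray)
```

Each level of `pval_class` would correspond to one color in the final plot, with the last level being the gray one.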

The first step averages the abundances of proteins/peptides over the conditions defined in the phenotype data of the ExpressionSet / MSnSet. The data are then standardized if there are more than 2 conditions.

If the user asks for a kmeans model without specifying the desired number of clusters ('clustering_method = "kmeans"' and 'k_clusters = NULL'), the function checks the clusterability of the data and estimates a number of clusters k using the gap statistic method. It is advised, however, to specify k for the kmeans, because the gap statistic gives the smallest possible k, whereas in biology a small number of clusters can turn out to be uninformative. If you want to run a kmeans but do not know which number of clusters to give, you can first let the pipeline run without specifying 'k_clusters', look at the resulting profiles, and then choose a more appropriate value of k for a second run.

If the data are expected to be structured into a large number of clusters, it is recommended to use the affinity propagation model instead. This method simultaneously considers all data points as potential exemplars, unlike hard clustering (kmeans), which is initialized with k points taken at random. The "affinityProp" model uses a q parameter set to NA, meaning that exemplar preferences are set to the median of the non-Inf values in the similarity matrix (setting q to 0.5 gives the same result). The "affinityPropReduced" model uses q set to 0, meaning that exemplar preferences are set to the sample quantile with threshold 0 of the non-Inf values. This should lead to a smaller number of final clusters.
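
The steps described above can be sketched with base R on a toy matrix. This is an illustration under stated assumptions, not the package's implementation: the data, the object names, the row-wise standardization, the fixed k = 2, and the use of negative squared Euclidean distances (diagonal excluded) as similarities are all assumptions made for the example.

```r
set.seed(1)

# Toy log-abundances: 6 proteins x 6 samples, 3 conditions x 2 replicates.
# All names and values are hypothetical.
conditions <- factor(rep(c("A", "B", "C"), each = 2), levels = c("A", "B", "C"))
base_profiles <- rbind(
  c(1.0, 2.0, 3.0), c(1.1, 2.1, 3.2), c(0.9, 2.0, 2.9),  # increasing profiles
  c(3.0, 2.0, 1.0), c(3.2, 2.1, 0.9), c(2.9, 1.9, 1.1)   # decreasing profiles
)
abund <- base_profiles[, as.integer(conditions)] + 20 +
  matrix(rnorm(36, sd = 0.05), nrow = 6)
rownames(abund) <- paste0("prot", 1:6)

# Step 1: average the abundances within each condition.
avg <- sapply(levels(conditions), function(lv) {
  rowMeans(abund[, conditions == lv, drop = FALSE])
})

# Step 2: standardize only when there are more than 2 conditions.
# Assumption: each protein profile (row) is centered and scaled.
profiles <- if (nlevels(conditions) > 2) t(scale(t(avg))) else avg

# Step 3: kmeans with a user-chosen k (k_clusters = 2 in this toy case).
km <- kmeans(profiles, centers = 2, nstart = 10)
print(km$cluster)

# Exemplar preferences for affinity propagation, as quantiles of the
# pairwise similarities (assumed: negative squared Euclidean distance).
sim <- -as.matrix(dist(profiles))^2
off_diag <- sim[row(sim) != col(sim)]
pref_median <- quantile(off_diag, 0.5)  # analogue of "affinityProp"
pref_min <- quantile(off_diag, 0)       # analogue of "affinityPropReduced"
print(c(pref_median, pref_min))
```

The lower preference obtained with a quantile threshold of 0 makes each point less likely to become an exemplar, which is why the reduced variant tends to produce fewer clusters.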

A list of 2 elements: "model" is the clustering model and "ggplot" is the ggplot of the clustered profiles.

Helene Borges

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. *Journal of the Royal Statistical Society* B, 63, 411–423.

Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. *Science*, 315, 972–976. DOI: 10.1126/science.1136800

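library(DAPAR)
utils::data(Exp1_R25_prot, package='DAPARdata')
obj <- Exp1_R25_prot[1:1000]
# Keep only the rows that have no missing values
keepThat <- mvFilterGetIndices(obj, condition='WholeMatrix', threshold=ncol(obj))
obj <- mvFilterFromIndices(obj, keepThat)
expR25_ttest <- compute_t_tests(obj)
# clustering_method has no default in the usage, so it is given explicitly
res <- wrapperRunClustering(obj = obj, clustering_method = "kmeans",
  adjusted_pvals = expR25_ttest$P_Value$`25fmol_vs_10fmol_pval`)
res$ggplot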