CScluster: CScluster
In ewouddt/CSFA: Connectivity Scores with Factor Analysis


Apply connectivity scores to the result of a clustering with K clusters. More information can be found in the Details section below.


CScluster(data, clusterlabels, type = "CSmfa", WithinABS = TRUE,
  BetweenABS = TRUE, FactorABS = FALSE, verbose = FALSE, Within = NULL,
  Between = NULL, WithinSave = FALSE, BetweenSave = TRUE, ...)

`data`	A gene expression matrix with the compounds in the columns.
`clusterlabels`	A vector of integers that represents the cluster grouping of the columns (compounds) in `data`. The labels should be integers starting from 1 to the total number of clusters. (e.g. the output of `cutree`)
`type`	Type of CS analysis (default=`"CSmfa"`): `"CSmfa"` (MFA or PCA), `"CSsmfa"` (Sparse MFA or Sparse PCA), `"CSfabia"` (Fabia), `"CSzhang"` (Zhang and Gant). In the first two options, either MFA or PCA is used depending on the cluster size: if the query set contains only a single compound, PCA is used. Also note that if a cluster contains only a single compound, no Within-CS can be computed.
`WithinABS`	Boolean value to take the mean of the absolute values in the final step of the Within-Cluster CS (default=`TRUE`).
`BetweenABS`	Boolean value to take the mean of the absolute values in the final step of the Between-Cluster CS (default=`TRUE`).
`FactorABS`	Boolean value to take the absolute value of the query loadings when determining the best factor (= factor with highest query loadings) in a `CSanalysis` application (default=`FALSE`). This option might be helpful if the 'best factor' contains large positive and negative query loadings which would average to zero.
`verbose`	Boolean value to output warnings and information about which factor is chosen in a CS analysis (if applicable).
`Within`	A vector of cluster numbers for which the Within-Cluster CS should be computed. By default (=`NULL`) all within-cluster scores are computed, but this might not be feasible for larger data in which a single `CSanalysis` run might already take a considerable amount of computation time.
`Between`	A vector of cluster numbers for which the Between-Cluster CS (with the cluster as a query set) should be computed. By default (=`NULL`) all between-cluster scores are computed, but this might not be feasible for larger data in which a single `CSanalysis` run might already take a considerable amount of computation time.
`WithinSave`	Boolean value to save the `Within` object in the `Save` slot of the returned list (default=`FALSE`).
`BetweenSave`	Boolean value to save the `Between` object in the `Save` slot of the returned list (default=`TRUE`).
`...`	Additional parameters given to `CSanalysis` specific to a certain `type` of CS analysis.
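
The `clusterlabels` argument expects integer labels from 1 to K, such as the output of `cutree`. A minimal sketch of producing such labels with base R is shown here. The matrix `other_data` is synthetic and purely hypothetical; in practice it would be the second data source on which the compounds are clustered.

```r
# Synthetic stand-in for the "other" data source: 12 compounds x 5 features.
# (Hypothetical data, for illustration only.)
set.seed(123)
other_data <- matrix(rnorm(12 * 5), nrow = 12,
                     dimnames = list(paste0("cmpd", 1:12), NULL))

# Hierarchical clustering of the compounds, cut into K = 3 clusters.
K <- 3
tree <- hclust(dist(other_data), method = "average")
clusterlabels <- cutree(tree, k = K)

# One integer label in 1..K per compound.
table(clusterlabels)
```

The labels must be in the same order as the columns (compounds) of `data`, since each label describes the cluster of the corresponding column.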

After applying cluster analysis on the additional data matrix, K clusters are obtained. Each cluster will be seen as a potential query set (for CSanalysis) for which 2 connectivity score metrics can be computed, the Within-Cluster CS and the Between-Cluster CS.

Within-Cluster CS
This metric answers the question of whether the kth cluster is connected on a gene expression level (in addition to the samples being similar based on the other data source). The Within-Cluster CS for a cluster is computed as follows:

Repeatedly for the ith sample in the kth cluster, apply CSMFA with:
- Query Set: All cluster samples excluding the ith sample.
- Reference: All samples including the ith sample of the kth cluster.
- Retrieve the CS of the ith sample in the cluster.
The Within-Cluster CS for cluster k is now defined as the average of all retrieved CS.

The concept of this metric is to investigate the connectivity for each compound with the cluster. The average of the 'leave-one-out' connectivity scores, the Within-Cluster CS, gives an indication of the gene expression connectivity of this cluster. A high Within-Cluster CS implies that the cluster is both similar on the external data source and on the gene expression level. A low score indicates that the cluster does not share a similar latent gene profile structure.
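
The leave-one-out procedure described above can be sketched in base R. This is only an illustration under stated assumptions: `toy_cs` is a hypothetical stand-in that scores each compound by its correlation with the mean query profile, and it is not the CSMFA computation performed by the package. The `abs_mean` argument loosely mirrors the role of `WithinABS`.

```r
# Toy stand-in for a connectivity score: correlation of each column with the
# mean profile of the query columns. Hypothetical; this is NOT CSMFA.
toy_cs <- function(data, query_idx) {
  query_profile <- rowMeans(data[, query_idx, drop = FALSE])
  apply(data, 2, function(x) cor(x, query_profile))
}

# Leave-one-out Within-Cluster score for cluster k.
within_cs <- function(data, clusterlabels, k, abs_mean = TRUE) {
  members <- which(clusterlabels == k)
  if (length(members) < 2) return(NA_real_)  # no Within-CS for a single compound
  loo <- sapply(members, function(i) {
    query <- setdiff(members, i)   # query set: cluster without sample i
    toy_cs(data, query)[i]         # keep only the score of the left-out sample
  })
  if (abs_mean) mean(abs(loo)) else mean(loo)
}

# Synthetic data: cluster 1 shares a common signal, cluster 2 is pure noise.
set.seed(1)
signal <- rnorm(50)
data <- cbind(sapply(1:4, function(j) signal + rnorm(50, sd = 0.3)),
              matrix(rnorm(50 * 4), nrow = 50))
clusterlabels <- rep(1:2, each = 4)

within_cs(data, clusterlabels, k = 1)  # cluster sharing a signal
within_cs(data, clusterlabels, k = 2)  # noise cluster
```

In this toy setting the cluster that shares a signal should score clearly higher than the noise cluster, which is the behaviour the Within-Cluster CS is meant to capture.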

Between-Cluster CS
In this stage of the analysis, we focus on the lth cluster and use all compounds in this cluster as the query set. A CSMFA is performed in which all other clusters are the reference set. Next, the connectivity scores are calculated for all reference compounds and averaged within each reference cluster (= the Between-Cluster CS). A high Between-Cluster CS between the lth cluster and another cluster j implies that, while the two clusters are not similar based on the other data source, they do share a latent structure when considering the gene expression data.
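
As a rough sketch of this step, the code below uses one cluster as the query set, scores every compound, and averages the scores per cluster. The scoring function `toy_cs` is a hypothetical correlation-based stand-in and not the CSMFA used by the package; `abs_mean` loosely mirrors `BetweenABS`.

```r
# Toy stand-in for a connectivity score (hypothetical; NOT CSMFA):
# correlation of each column with the mean profile of the query columns.
toy_cs <- function(data, query_idx) {
  query_profile <- rowMeans(data[, query_idx, drop = FALSE])
  apply(data, 2, function(x) cor(x, query_profile))
}

# Between-Cluster scores with cluster l as the query set.
between_cs_row <- function(data, clusterlabels, l, abs_mean = TRUE) {
  query <- which(clusterlabels == l)
  cs <- toy_cs(data, query)                         # score all compounds
  if (abs_mean) cs <- abs(cs)
  out <- sapply(split(cs, clusterlabels), mean)     # average per cluster
  out[as.character(l)] <- NA                        # query cluster is not a reference
  out
}

# Synthetic data: clusters 1 and 2 share a signal, cluster 3 is noise.
set.seed(2)
signal <- rnorm(60)
noisy_copy <- function(n) sapply(seq_len(n), function(j) signal + rnorm(60, sd = 0.3))
data <- cbind(noisy_copy(3), noisy_copy(3), matrix(rnorm(60 * 3), nrow = 60))
clusterlabels <- rep(1:3, each = 3)

between_cs_row(data, clusterlabels, l = 1)
```

Repeating this for every cluster l would give the off-diagonal entries of one row per query cluster.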

A list object with components:

CSmatrix: A K × K matrix containing the Within scores on the diagonal and the Between scores elsewhere, with the rows being the query set clusters (e.g. m_{13} = Between CS between cluster 1 (as query set) and cluster 3).
CSRankmatrix: The same as CSmatrix, but with connectivity ranking scores (if applicable).
clusterlabels: The provided clusterlabels
Save: A list with components:
- Within: A list with a component for each cluster k that contains:
  - LeaveOneOutCS: Each leave-one-out connectivity score for cluster k.
  - LeaveOneOutCSRank: Each leave-one-out connectivity ranking score for cluster k (if applicable).
  - factorselect: A vector containing which factors/BCs were selected in each leave-one-out CS analysis (if applicable).
  - CS: A (columns (compounds) × size of cluster k) matrix that contains all the connectivity scores in a leave-one-out CS analysis for each left out compound.
  - CSRank: The same as CS, but with connectivity ranking scores (if applicable).
- Between: A list with components:
  - DataBetweenCS: A (columns (compounds) × clusters) matrix containing all compound connectivity scores for each query cluster set.
  - DataBetweenCSRank: The same as DataBetweenCS, but with connectivity ranking scores (if applicable).
  - queryindex: The column indices for each query set in all CS analyses.
  - factorselect: A vector containing which factors/BCs were selected in each CS analysis (if applicable).
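
To illustrate how a matrix with the `CSmatrix` layout can be read, the sketch below builds a small 3 × 3 matrix by hand. The values are invented for illustration and are not output of `CScluster`; only the layout (Within scores on the diagonal, Between scores elsewhere, rows as query clusters) follows the description above.

```r
# Hypothetical 3 x 3 matrix laid out like CSmatrix (values invented).
CSmatrix <- matrix(c(0.80, 0.10, 0.55,
                     0.12, 0.30, 0.05,
                     0.60, 0.07, 0.75),
                   nrow = 3, byrow = TRUE)

# Within-Cluster scores sit on the diagonal.
within <- diag(CSmatrix)

# Between-Cluster scores are the off-diagonal entries (row = query cluster).
between <- CSmatrix
diag(between) <- NA

# Locate the strongest between-cluster link as a (row, col) index pair.
best <- which(between == max(between, na.rm = TRUE), arr.ind = TRUE)
best
```

Because the rows are the query sets, the matrix need not be symmetric: the entry for cluster 1 as query and cluster 3 as reference can differ from the entry for cluster 3 as query and cluster 1 as reference.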

Ewoud De Troyer

 

  # Load the package and the example data set
  library(CSFA)
  data("dataSIM",package="CSFA")
  # Remove some no-connectivity compounds (column names containing "c-")
  nosignal <- grepl("c-",colnames(dataSIM))
  data <- dataSIM[,-which(nosignal)[1:250]]
  
  # Toy example with random cluster assignment:
  # Note: clusterlabels can be acquired through cutree(hclust(...))
  set.seed(1)  # make the random assignment reproducible
  clusterlabels <- sample(1:10,size=ncol(data),replace=TRUE)
  
  result1 <- CScluster(data,clusterlabels,type="CSmfa")
  result2 <- CScluster(data,clusterlabels,type="CSzhang")
  
  result1$CSmatrix
  result1$CSRankmatrix
  
  result2$CSmatrix