Introduction

Clustering algorithms are a staple exploratory tool for studying the omics profiles of patients in order to identify molecular patterns that correspond to patient groups and indicate different disease subtypes. Omics data tends to be very high-dimensional, which can be addressed with different strategies. In COPS we have included data-driven strategies based on dimensionality reduction as well as knowledge-driven strategies based on integrating omics data with pathway gene sets and biological networks. Another crucial aspect to consider when trying to identify disease subtypes is the evaluation of the subtype candidates (i.e., patient clusters). The best number of clusters and the best clustering strategy are highly dependent on the dataset and on the criteria used to evaluate clustering results.

While other R packages such as clValid and clusterCrit provide access to a plethora of internal metrics for evaluating clustering results, COPS provides a few unique functionalities for omics data in the life sciences. First, COPS uses repeated cross-validation to assess the stability of clusters and metrics on subsets of the data. Second, COPS implements several pathway-based clustering approaches that integrate biological knowledge with omics data. Third, COPS can be used to evaluate the mechanistic relevance of clusters with respect to a given set of meta-genes, identified for example with WGCNA. Fourth, COPS provides a robust test for survival differences between clusters while accounting for relevant covariates.

COPS has been built to parallelise most operations and has been optimised for memory efficiency, so that most analyses of relatively small datasets (up to 1000 samples) can be run on a personal computer.

Vignette required packages

In addition to the basic installation of COPS, this vignette uses the ggh4x package for additional ggplot2 formatting.

if (!requireNamespace("ggh4x", quietly = TRUE)) {
  install.packages("ggh4x", repos='http://cran.rstudio.com/')
}

Clustering functionality

Clustering analysis in COPS is divided into several steps, many of which are optional:

  1. subsampling generates data subsets using cross-validation or bootstrap.
  2. subsample_pathway_enrichment transforms the data from gene features to pathway features thereby integrating biological knowledge with omics data.
  3. subsample_dimred transforms the data into a lower dimensional representation that is more suitable for clustering.
  4. subsample_clustering_evaluation runs clustering algorithms and computes internal clustering metrics.
  5. stability_evaluation evaluates clustering stability by comparing cluster assignments obtained within each fold to a reference obtained from the full dataset.
  6. subsample_survival_evaluation evaluates survival differences between clusters.
  7. subsample_module_evaluation evaluates cluster-gene-module associations as in https://doi.org/10.1093/bib/bbab314.
  8. subsample_association_analysis calculates various statistics and indices for cluster association with external variables.

Function signatures

To simplify usage, the full pipeline can be called using the eponymous COPS function. Most steps are also associated with a separate function that processes each CV fold. Below is a list of arguments that are relevant to each step, starting from general arguments shared by many functions:

Quick example

The purpose of this tutorial is to go over the typical steps in the clustering analysis of multi-cohort gene-expression data, i.e., data gathered from multiple sources. As an example, we will use a psoriasis-related dataset and use COPS to check for batch-effect-related issues before applying COPS to evaluate different clustering strategies. Towards the end, we go over different metrics and how to compare methods based on multiple criteria.

In COPS we have included an example dataset of psoriasis RNA-Seq data from multiple studies, collected and harmonised by Federico et al. (2020). The full dataset, including atopic dermatitis studies, is available on Zenodo. In this tutorial we only consider psoriasis patients and protein-coding genes.

Biological knowledge integration and feature extraction

COPS offers a few high-performance pathway-based clustering approaches that can be used for large datasets while also utilizing resampling for evaluation. Specifically, COPS can use GSVA, DiffRank or RWR-FGSEA to transform gene-expression profiles into pathway activity profiles. The provided unsupervised pathway activity inference methods are summarised in the table below.

tabl <- "
| Method    | Description                     | Reference                                 |
|-----------|:--------------------------------|:------------------------------------------|
| GSVA      | Gene Set Variation Analysis summarises gene-expression at the level of pathways by using non-parametric gene statistics. | https://doi.org/10.1186/1471-2105-14-7    |
| DiffRank  | A ranking based approach to score pathway activity in individual samples.     | https://doi.org/10.1186/s12920-018-0449-4 |
| RWR-FGSEA | Combines Random Walk with Restart and Fast Gene Set Enrichment Analysis to first propagate dysregulated gene information through a network to generate gene-statistics which used to identify sample specific pathway enrichment.     | https://doi.org/10.1093/bib/bbab314       |
"
cat(tabl)

These pathway activity inference methods require pathway gene-sets. To download pathway gene sets we prefer to use MSigDB which combines many sources of pathway annotations. To keep the runtime short we will only use KEGG and filter the pathways based on size.

pw_db <- msigdbr::msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG")
pw_list <- lapply(split(pw_db, pw_db$gs_name), function(x) x$ensembl_gene)
pw_list <- pw_list[which(sapply(pw_list, length) <= 200 & sapply(pw_list, length) >= 100)]
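As a quick sanity check, we can look at how many gene sets remain after the size filter and how large they are; this count should match the number of pathways referred to later in the vignette.

# Number of retained KEGG gene sets and the distribution of their sizes
length(pw_list)
summary(sapply(pw_list, length))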

The RWR-FGSEA method also requires a gene network, since it uses random walk with restart to generate gene statistics for gene set enrichment analysis. We can use networks based on biological databases, such as protein-protein interactions (PPI), or we can generate a co-expression network from the gene-expression data.

COPS has a built-in function to retrieve PPIs from STRINGdb and process the network into a suitable form:

ppi_net <- COPS::getHumanPPIfromSTRINGdb(gene_id_mart_column = "ensembl_gene_id")

Internally, COPS defines a set of maximally dysregulated genes for each sample. These are used as seeds for the random walk to discover associated genes and to generate gene statistics for FGSEA. This is not a trivial task, but in practice simple strategies like selecting the genes with the highest expression in each sample yield reasonable results. RWRFGSEA can do this automatically given an expression matrix as input. Another possibility for selecting seed genes is to use non-parametric gene statistics similarly to GSVA, which can be achieved by setting rwr_ecdf=TRUE. The options of this method are explained in more detail on the help page (?RWRFGSEA).

With default settings RWRFGSEA behaves like a single-sample pathway scoring function. Therefore it can be run before subsampling to save time.

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
pso_rwrfgsea <- COPS::RWRFGSEA(
  log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), 
  gene_network = ppi_net, 
  parallel = 1, 
  rwr_seed_size = 50, 
  gene_set_list = pw_list, 
  fgsea_nperm = 1e2)

For this vignette we want to consider pathway transformation as an alternative to dimensionality reduction. Since the number of considered pathways is still reasonably high at 171, we choose to utilise Spearman correlation distance and divisive hierarchical clustering.

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
res_rwrfgsea <- COPS::COPS(
  pso_rwrfgsea, 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("none"), 
  cluster_methods = c("diana"), 
  distance_metric = "correlation", 
  n_clusters = 2:6,
  association_data = COPS::psoriasis_clinical_example[["lesional"]])

Result data structures

The COPS function runs and aggregates the outputs from each step and returns a named list of objects that contain the results for all runs, folds and method settings.
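The exact set of elements depends on which optional steps were run. As a minimal check (only the elements used later in this vignette, such as clusters and modules, are assumed here), we can inspect the returned object directly:

# Top-level elements of the result list and a glimpse of the cluster assignments
names(res_rwrfgsea)
str(res_rwrfgsea$clusters, max.level = 1)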

Ranking results

To summarise the results into one data.frame we can use the scoring function, which takes the list of results and returns a table with one line per input dataset and COPS setting. The wsum argument is used to define a composite metric, comprised of multiple metrics, for ranking the results. By default, the metrics are summarised across CV folds (mean and sd), but this can be disabled by setting summarise=FALSE. P-values are summarised based on the null-hypothesis rejection rate, which is parameterized by the significance_level argument:

scores_rwrfgsea <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

This returns a list containing the whole summarised table in the element "all", and only the line corresponding to the best result (according to wsum) in "best". COPS also includes tools for Pareto-based multi-objective evaluation, but these are discussed later.

score_cols <- c(
  "Approach", 
  "Embedding", 
  "Clustering", 
  "k", 
  "Silhouette", 
  "ClusteringStabilityJaccard", 
  "GSE.nmi")
scores_rwrfgsea$all[,score_cols]

For visualising the spread of metrics, we can use boxplots after reorganising the scores without summary:

scores_rwrfgsea_dist <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  summarise = FALSE)

Note that with summarise set to FALSE, the function also returns results for the reference fold (fold number nfolds + 1; fold 6 in this example). Hence, when plotting resampled metrics, the reference should be omitted manually.

require(ggplot2)
ggplot(
  scores_rwrfgsea_dist$all[scores_rwrfgsea_dist$all$fold != 6,], 
  aes(x = interaction(Embedding, Clustering),  
      y = ClusteringStabilityJaccard, 
      fill = interaction(Embedding, Clustering))) + 
  geom_boxplot() + 
  theme_bw() + 
  ggh4x::facet_grid2(
    Approach ~ k, 
    scales = "free_x", 
    independent = "x",
    labeller = labeller(k = function(x) paste0("k=", x))) +
  theme(legend.position="bottom", 
        axis.text.x = element_blank(), 
        axis.ticks.x = element_blank(), 
        axis.title.x = element_blank()) + 
  guides(fill = guide_legend(ncol = 3, title.position = "top"))

Comparing different methods

For benchmarking clustering algorithms, one of the most common metrics is the concordance of the grouping result with ground-truth labels. However, in the context of bioinformatics there are many different aspects that could be used to assess the usefulness of clustering results, including associations with clinical variables, survival, clustering stability and mechanistic relevance.

Internal clustering indices

By default, COPS computes the average silhouette index (or score) for each clustering result using distances calculated from the input matrix. The silhouette score summarises the cohesion and separability of clusters. In addition to the silhouette, it is possible to utilize any internal index implemented by clusterCrit::intCriteria. To do this, we can pass the desired index names in the internal_metrics argument of COPS, subsample_clustering_evaluation or clustering_metrics. Note that since the clusterCrit::intCriteria function takes the data matrix as input and the computations are repeated for each fold and setting, this can take much longer than the silhouette calculation by COPS, which calculates the distances only once.
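For example, a call requesting additional clusterCrit indices could look like the sketch below; the index names are those accepted by clusterCrit::intCriteria, and the remaining arguments simply mirror the earlier RWR-FGSEA run.

# Illustrative re-run that also computes Calinski-Harabasz and Davies-Bouldin indices
res_rwrfgsea_crit <- COPS::COPS(
  pso_rwrfgsea, 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("none"), 
  cluster_methods = c("diana"), 
  distance_metric = "correlation", 
  n_clusters = 2:6,
  internal_metrics = c("Calinski_Harabasz", "Davies_Bouldin"),
  association_data = COPS::psoriasis_clinical_example[["lesional"]])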

Clustering stability

A key clustering metric that is often omitted by other packages is clustering stability, which COPS estimates by using cross-validation. To be clear, the validation folds are not used for clustering or unsupervised dimensionality reduction; the purpose of CV is simply to re-sample the original data efficiently. Compared to bootstrapping, CV does not duplicate any data points, which can yield more accurate estimates of metrics within the folds. The downside is that the clustering data size is decreased within the folds, which can affect results and tends to affect metric values, especially those based on statistical tests.

Clustering stability, i.e., the grouping similarity within training folds, in COPS is calculated with respect to a reference clustering result obtained with identical settings on the full dataset. Three different indices are used, namely the Jaccard index, adjusted Rand index and normalized mutual information.

The CV scheme can also be used to estimate the uncertainty of the metrics. Using repeated cross-validation is recommended to obtain credible estimates of the expected values and their variance.
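To make the stability indices concrete, the base-R sketch below computes a pair-counting Jaccard index between two clusterings of the same samples; this illustrates the type of agreement that ClusteringStabilityJaccard summarises, although the internal COPS implementation may differ in details.

# Pair-counting Jaccard index: pairs grouped together in both clusterings
# divided by pairs grouped together in at least one clustering
jaccard_clusterings <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  ut <- upper.tri(same_a)  # consider each sample pair only once
  sum(same_a[ut] & same_b[ut]) / sum(same_a[ut] | same_b[ut])
}
# Toy example with five samples: the index is 1/3
jaccard_clusterings(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))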

Module score

The module score (https://doi.org/10.1093/bib/bbab314) is based on the association of clusters with gene meta-features. The score is defined as: $$ \text{score} = E\left[\frac{\min(1,\mathbf{S}^{+})+\min(1,\mathbf{S}^{-})}{\mathbf{S}^{+} + \mathbf{S}^{-}}\right] $$ where $$ \mathbf{S}^{\pm}_{i} = \sum_j{\delta^{\pm}\left(\rho(\mathbf{e}^{(i)}, \mathbf{c}^{(j)})\right)} $$ where $\rho$ corresponds to Spearman correlation, $\mathbf{e}^{(i)}$ corresponds to the $i^{th}$ column of the WGCNA eigen-gene matrix, $\mathbf{c}^{(j)}$ corresponds to the $j^{th}$ cluster indicator variable and
$$ \delta^{-}(x) = \begin{cases} 1 & \mbox{if } x \leq -\alpha \\ 0 & \mbox{otherwise} \end{cases} $$ $$ \delta^{+}(x) = \begin{cases} 1 & \mbox{if } x \geq \alpha \\ 0 & \mbox{otherwise} \end{cases} $$ where $\alpha$ is a parameter that defines a threshold for counting correlations. The fraction is undefined when the denominator is zero, which occurs when no clusters are strongly correlated to an eigen-gene, in which case we assign a score of $\beta$. When $\beta$ is close to zero, clustering results that do not exhibit differential expression between clusters in each gene module are penalized.

In the gene_module_score-function $\alpha$ corresponds to the argument module_cor_threshold while $\beta$ corresponds to the argument module_nan.substitute.
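As a conceptual illustration of the formula, the sketch below re-implements the score in plain R. The official implementation is COPS::gene_module_score; the function and argument names used here (module_score_sketch, module_eigs, clusters, alpha, beta) are illustrative only.

# Conceptual module score: mean over eigen-genes of
# (min(1, S+) + min(1, S-)) / (S+ + S-), substituting beta when S+ + S- = 0
module_score_sketch <- function(module_eigs, clusters, alpha = 0.3, beta = 0) {
  per_eigengene <- apply(module_eigs, 2, function(e) {
    # Spearman correlation of the eigen-gene with each cluster indicator
    rho <- sapply(unique(clusters), function(k) {
      cor(e, as.numeric(clusters == k), method = "spearman")
    })
    s_pos <- sum(rho >= alpha)
    s_neg <- sum(rho <= -alpha)
    if (s_pos + s_neg == 0) return(beta)
    (min(1, s_pos) + min(1, s_neg)) / (s_pos + s_neg)
  })
  mean(per_eigengene)
}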

To generate these for previously calculated results we can call subsample_module_evaluation. We have pre-computed disease gene-modules on this data by using the WGCNA eigen-gene method.

res_rwrfgsea$modules <- COPS::subsample_module_evaluation(
  clusters = res_rwrfgsea$clusters, 
  parallel = 1, 
  module_eigs = COPS::psoriasis_example_MEs,
  module_cor_threshold = 0.3,
  module_nan.substitute = 0)

Recompute summaries:

scores_rwrfgsea <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

Survival relevance evaluation

Typically, Kaplan-Meier estimates and the log-rank test are used to compare survival between two groups. This strategy is easy to understand and visualize. However, using KM curves can be misleading, as they do not account for covariates such as age and cancer stage, which are crucial for understanding the prognosis of a patient.

In COPS, we provide an alternative strategy for estimating the significance of a survival difference between groups by using the Cox proportional hazards model. With a likelihood-ratio test we can compare a baseline model of only covariates to an augmented model that also has clustering information. The survival related functionalities of COPS are discussed in detail in the multi-omic cancer clustering analysis vignette.
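As a generic illustration of this idea outside of COPS, one could fit nested Cox models with the survival package and compare them with a likelihood-ratio test; the data frame clin and its columns (time, status, age, stage, cluster) are hypothetical here.

library(survival)
# Baseline model with covariates only versus a model that also includes cluster labels
fit_covariates <- coxph(Surv(time, status) ~ age + stage, data = clin)
fit_clusters <- coxph(Surv(time, status) ~ age + stage + factor(cluster), data = clin)
# Likelihood-ratio test between the nested models
anova(fit_covariates, fit_clusters)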

Pairwise metric plots

One of the main purposes of COPS is to provide tools for comparing clustering results from multiple perspectives. For multi-objective evaluation we can plot the results in pairwise metric plots. If different objectives are in conflict, the related metrics may exhibit trade-offs, provided that the included clustering strategies adequately explore them. The solutions that lie on the edges of the distribution in each scatter plot may be Pareto optimal and are worth exploring further.

But first we should generate some alternative results to compare against:

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
res_dr <- COPS::COPS(
  log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("pca"), 
  pca_dims = 2:8,
  cluster_methods = c("kmeans", "diana", "model"), 
  distance_metric = "euclidean", 
  n_clusters = 2:6,
  association_data = COPS::psoriasis_clinical_example[["lesional"]],
  module_eigs = COPS::psoriasis_example_MEs)
scores_dr <- COPS::scoring(
  res_dr, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

In order to compare the scores, we can combine the two data.frames containing all scores using rbind:

scores_combined <- rbind(scores_dr$all, scores_rwrfgsea$all)

Finally, we can do the comparison by using the pareto_plot function. By setting plot_pareto_front = TRUE we can also visualise the first Pareto front, which consists of all the non-dominated solutions. In order to do so, we also need to provide a list of comparators that are used to determine whether one result is better than another. COPS can guess the comparators to some extent, but for variable associations it is necessary to define whether each metric should be minimized or maximized.

selection <- scores_combined[["drname"]] %in% paste0("pca", c(2,5,7,10))
selection <- selection | (scores_combined[["Approach"]] == "RWR-FGSEA")
selection <- selection & scores_combined[["Smallest_cluster_size"]] >= 10

p <- COPS::pareto_plot(
  scores_combined[selection,],
  metrics = c(
    "ClusteringStabilityJaccard", 
    "Silhouette", 
    "Smallest_cluster_size", 
    "GSE.nmi", 
    "Module_score"),
  plot_palette = RColorBrewer::brewer.pal(9, "Set1"),
  plot_pareto_front = TRUE,
  metric_comparators = list(`>`, `>`, `>`, `<`, `>`))

In terms of internal metrics like the average silhouette score and clustering stability, it is often the case that fewer clusters yield better values. For stability, we consider 0.8 a reasonable threshold for stable results. The silhouette score is often very low for high-dimensional input data; in the COPS pipeline the silhouette is, by default, defined based on distances in the input data. If the silhouette distances are measured in the low-dimensional embedded space instead, the score will tend to favour embeddings with fewer dimensions. This behaviour can be changed by setting the COPS argument pre_compute_silhouette_dissimilarity = TRUE.

Regardless, in this comparison there seems to be a trade-off between the silhouette and the module score. Two-cluster results form one group, while three- and four-cluster results vary considerably more in terms of metrics. Batch NMI is also very low overall, so it might not be considered a trade-off metric. Hence, we would probably select one of the PCA-based results with two clusters, since they have the highest silhouette and stability, as well as one three-cluster result with a high module score and stability.

Pareto fronts

To obtain a list of the Pareto optimal results, we can use the pareto_fronts function. Note that using a high number of metrics will typically result in more Pareto optimal results as it becomes more likely that at least one of the metrics is better when comparing two results to each other and thus neither result dominates the other.

scores_combined_subset <- scores_combined[scores_combined[["Smallest_cluster_size"]] >= 10,]
pfs <- COPS::pareto_fronts(scores_combined_subset,
                           metrics = c("ClusteringStabilityJaccard", "Silhouette", 
                                       "Smallest_cluster_size", "GSE.nmi", 
                                       "Module_score"),
                           metric_comparators = list(`>`, `>`, `>`, `<`, `>`))
table(pfs)
interest_cols <- c("Approach", "Embedding", "Clustering", "k", "pareto_front")
scores_combined_subset$pareto_front <- pfs
scores_combined_subset[scores_combined_subset$pareto_front == 1,interest_cols]

Batch effects and data preparation

Up to this point we have not discussed the pre-processing of the input data, which is crucial for obtaining good clustering results. Firstly, we expect the input data to be pre-processed appropriately: for example, RNA-Seq counts should be normalized and variance-stabilised (i.e., log-scaled TPM or CPM), and DNA methylation should be given as methylation percentages or M-values. Secondly, feature selection is an important factor for obtaining relevant results, although it is not always necessary. Finally, if present, effects related to known batches should be corrected using appropriate tools (such as ComBat). COPS does not provide tools for processing sequencing data or performing batch correction. However, COPS does provide tools for assessing batch effects prior to clustering.
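For reference, a batch-correction step with ComBat-seq from the sva package could look like the following sketch; raw_counts is a hypothetical gene-by-sample count matrix and study_id contains the corresponding batch labels.

# Adjust raw RNA-Seq counts for known study/batch labels with ComBat-seq
corrected_counts <- sva::ComBat_seq(counts = raw_counts, batch = study_id)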

Pre-clustering batch effect metrics

In this example, our purpose is to discover subtypes of psoriasis in the gene-expression profiles of patient skin samples affected by the disease, i.e. the lesional samples. Therefore it is important that the batch effect within the affected subset is small enough to be negligible in clustering analysis.

tabl <- "
| function               | Description                               |
|------------------------|-------------------------------------------|
| DSC                    | Dispersion Separability Criterion         |
| feature_associations   | Calculates several batch effect indices including DSC, silhouette and regression based for multiple categories with one function call. |
"
cat(tabl)

The Dispersion Separability Criterion (DSC) is used by the TCGA batch effects viewer. In their guidance, they suggest 0.5 as a threshold for some concern.

lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
study_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSE"]]
log_norm <- log2(COPS::psoriasis_rnaseq_normalized[,lesional_id]+1)
log_corrected <- log2(COPS::psoriasis_rnaseq_corrected[,lesional_id]+1)
print(paste0("DSC before correction: ", COPS::DSC(log_norm, study_id)))
print(paste0("DSC after  correction: ", COPS::DSC(log_corrected, study_id)))

Clearly, batch correction (using sva::ComBat_seq) was necessary and effective. The DSC is still a bit high for the batch-corrected data, but it probably will not affect clustering too much.

We can also use feature_associations to run DSC and two other PCA-based approaches for batch-effect estimation from the kBET package.

feats <- COPS::feature_associations(log_corrected, list(GSE = study_id), n_pc_max = 20)

The associations are based on the average silhouette score between categories, the maximum coefficient of determination (R2) between the PCs and a linear model based on the categories, and the aforementioned DSC. In the absence of batch effects, the silhouette should be below 0 and the DSC should be below 0.5. If the maximum R2 is close to 1, it means that some PCs are strongly associated with batches and should be analysed further. It is worth keeping in mind that the maximum R2 is more sensitive than the other metrics.

feats$associations

This function also returns the PCA eigenvalues and explained variance:

feats$eigenvalues[1:20,]

We can see that after the first 18 components, each remaining component explains less than 1% of the remaining variance. We presume that the latter components comprise biological and technological noise, and that the main biological effects we want to cluster are contained within the first 10 or so components. Selecting the number of components is a non-trivial problem, but we can test different numbers and see which yields the best clustering results.
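As an additional check, the cumulative proportion of explained variance can also be computed directly with base-R PCA on the corrected matrix (samples in columns, hence the transpose); this is only an illustration and should roughly mirror feats$eigenvalues.

# Proportion of variance explained by each principal component of the corrected data
pca_fit <- prcomp(t(log_corrected), center = TRUE, scale. = FALSE)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(cumsum(var_explained)[1:20], 3)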

Session info

sessionInfo()

