Introduction

Clustering algorithms are a staple exploratory tool for studying the omics profiles of patients in order to identify molecular patterns that correspond to patient groups and indicate different disease subtypes. Omics data tends to be very high-dimensional, which can be addressed with different strategies. In COPS we have included data-driven strategies based on dimensionality reduction as well as knowledge-driven strategies based on integrating omics data with pathway gene sets and biological networks. Another crucial aspect to consider when trying to identify disease subtypes is the evaluation of the subtype candidates (i.e., patient clusters). The best number of clusters and the best clustering strategy are highly dependent on the dataset and on the criteria used to evaluate clustering results.

While other R packages such as clValid and clusterCrit provide access to a plethora of internal metrics for evaluating clustering results, COPS provides a few unique functionalities for omics data in the life sciences. First, COPS uses repeated cross-validation to assess the stability of clusters and metrics on subsets of the data. Second, COPS implements several pathway-based clustering approaches that integrate biological knowledge with omics data. Third, COPS can be used to evaluate the mechanistic relevance of clusters with respect to a given set of meta-genes, identified for example with WGCNA. Fourth, COPS provides a robust test for survival differences between clusters while accounting for relevant covariates.

COPS has been built to parallelise most operations and has been optimised for memory efficiency, so that most analyses of relatively small datasets (up to 1000 samples) can be run on a personal computer.

Vignette required packages

In addition to the basic installation of COPS, this vignette uses the ggh4x package for additional ggplot2 formatting.

if (!requireNamespace("ggh4x", quietly = TRUE)) {
  install.packages("ggh4x", repos='http://cran.rstudio.com/')
}

Clustering functionality

Clustering analysis in COPS is divided into several steps, many of which are optional:

  1. subsampling generates data subsets using cross-validation or bootstrap.
  2. subsample_pathway_enrichment transforms the data from gene features to pathway features thereby integrating biological knowledge with omics data.
  3. subsample_dimred transforms the data into a lower dimensional representation that is more suitable for clustering.
  4. subsample_clustering_evaluation runs clustering algorithms and computes internal clustering metrics.
  5. stability_evaluation evaluates clustering stability by comparing cluster assignments obtained within each fold to a reference obtained from the full dataset.
  6. subsample_survival_evaluation evaluates survival differences between clusters.
  7. subsample_module_evaluation evaluates cluster-gene-module associations as in https://doi.org/10.1093/bib/bbab314.
  8. subsample_association_analysis calculates various statistics and indices for cluster association with external variables.

Function signatures

To simplify usage, the full pipeline can be called using the eponymous COPS function. Most steps are also associated with a separate function that processes each CV fold. Below is a list of arguments that are relevant to each step, starting from general arguments shared by many functions:

Quick example

The purpose of this tutorial is to go over the typical steps in the clustering analysis of multi-cohort gene-expression data, i.e., data gathered from multiple sources. As an example, we will use a psoriasis-related dataset and use COPS to check for batch-effect-related issues before applying COPS to evaluate different clustering strategies. Towards the end, we go over different metrics and how to compare methods based on multiple criteria.

In COPS we have included an example dataset of psoriasis RNA-Seq data from multiple studies, collected and harmonised by Federico et al. (2020). The full dataset, including atopic dermatitis studies, is available on Zenodo. In this tutorial we only consider psoriasis patients and protein-coding genes.

Biological knowledge integration and feature extraction

COPS offers a few high-performance pathway-based clustering approaches that can be used for large datasets while also utilizing resampling for evaluation. Specifically, COPS can use GSVA, DiffRank or RWR-FGSEA to transform gene-expression profiles into pathway activity profiles. The provided unsupervised pathway activity inference methods are summarised in the table below.

tabl <- "
| Method    | Description                     | Reference                                 |
|-----------|:--------------------------------|:------------------------------------------|
| GSVA      | Gene Set Variation Analysis summarises gene-expression at the level of pathways by using non-parametric gene statistics. | https://doi.org/10.1186/1471-2105-14-7    |
| DiffRank  | A ranking based approach to score pathway activity in individual samples.     | https://doi.org/10.1186/s12920-018-0449-4 |
| RWR-FGSEA | Combines Random Walk with Restart and Fast Gene Set Enrichment Analysis to first propagate dysregulated gene information through a network to generate gene-statistics which used to identify sample specific pathway enrichment.     | https://doi.org/10.1093/bib/bbab314       |
"
cat(tabl)

These pathway activity inference methods require pathway gene-sets. To download pathway gene sets we prefer to use MSigDB which combines many sources of pathway annotations. To keep the runtime short we will only use KEGG and filter the pathways based on size.

pw_db <- msigdbr::msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG")
pw_list <- lapply(split(pw_db, pw_db$gs_name), function(x) x$ensembl_gene)
pw_list <- pw_list[which(sapply(pw_list, length) <= 200 & sapply(pw_list, length) >= 100)]
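As a quick sanity check, we can look at how many gene sets remain after the size filter and how large they are; this count should match the number of pathways referred to later in the vignette.

# Number of retained KEGG gene sets and the distribution of their sizes
length(pw_list)
summary(sapply(pw_list, length))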

The RWR-FGSEA method also requires a gene network, since it uses random walk with restart to generate gene statistics for gene set enrichment analysis. We can use networks based on biological databases, such as protein-protein interactions (PPI), or we can generate a co-expression network from the gene-expression data.

COPS has a built-in function to retrieve PPIs from STRINGdb and process the network into a suitable form:

ppi_net <- COPS::getHumanPPIfromSTRINGdb(gene_id_mart_column = "ensembl_gene_id")

Internally, COPS defines a set of maximally dysregulated genes for each sample. These are used as seeds for the random walk to discover associated genes and to generate gene statistics for FGSEA. This is not a trivial task, but in practice simple strategies like selecting the genes with the highest expression in each sample yield reasonable results. RWRFGSEA can do this automatically given an expression matrix as input. Another possibility for selecting seed genes is to use non-parametric gene statistics similarly to GSVA, which can be achieved by setting rwr_ecdf=TRUE. The options of this method are explained in more detail on the help page (?RWRFGSEA).

With default settings RWRFGSEA behaves like a single-sample pathway scoring function. Therefore it can be run before subsampling to save time.

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
pso_rwrfgsea <- COPS::RWRFGSEA(
  log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), 
  gene_network = ppi_net, 
  parallel = 1, 
  rwr_seed_size = 50, 
  gene_set_list = pw_list, 
  fgsea_nperm = 1e2)

For this vignette we want to consider pathway transformation as an alternative to dimensionality reduction. Since the number of considered pathways is still reasonably high at 171, we choose to utilise Spearman correlation distance and divisive hierarchical clustering.

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
res_rwrfgsea <- COPS::COPS(
  pso_rwrfgsea, 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("none"), 
  cluster_methods = c("diana"), 
  distance_metric = "correlation", 
  n_clusters = 2:6,
  association_data = COPS::psoriasis_clinical_example[["lesional"]])

Result data structures

The COPS function runs and aggregates the outputs from each step and returns a named list of objects that contain the results for all runs, folds and method settings.
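The exact set of elements depends on which optional steps were run. As a minimal check (only the elements used later in this vignette, such as clusters and modules, are assumed here), we can inspect the returned object directly:

# Top-level elements of the result list and a glimpse of the cluster assignments
names(res_rwrfgsea)
str(res_rwrfgsea$clusters, max.level = 1)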

Ranking results

To summarise the results into one data.frame we can use the scoring function, which takes the list of results and returns a table with one line per input dataset and COPS setting. The wsum argument is used to define a composite metric, comprised of multiple metrics, for ranking the results. By default, the metrics are summarised across CV folds (mean and sd), but this can be disabled by setting summarise=FALSE. P-values are summarised based on the null-hypothesis rejection rate, which is parameterized by the significance_level argument:

scores_rwrfgsea <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

This returns a list containing the whole summarised table in the element "all", and only the line corresponding to the best result (according to wsum) in "best". COPS also includes tools for Pareto-based multi-objective evaluation, but these are discussed later.

score_cols <- c(
  "Approach", 
  "Embedding", 
  "Clustering", 
  "k", 
  "Silhouette", 
  "ClusteringStabilityJaccard", 
  "GSE.nmi")
scores_rwrfgsea$all[,score_cols]

For visualising the spread of metrics, we can use boxplots after reorganising the scores without summary:

scores_rwrfgsea_dist <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  summarise = FALSE)

Note that with summarise set to FALSE, the function also returns results for the reference fold (fold number nfolds + 1; fold 6 in this example). Hence, when plotting resampled metrics, the reference should be omitted manually.

require(ggplot2)
ggplot(
  scores_rwrfgsea_dist$all[scores_rwrfgsea_dist$all$fold != 6,], 
  aes(x = interaction(Embedding, Clustering),  
      y = ClusteringStabilityJaccard, 
      fill = interaction(Embedding, Clustering))) + 
  geom_boxplot() + 
  theme_bw() + 
  ggh4x::facet_grid2(
    Approach ~ k, 
    scales = "free_x", 
    independent = "x",
    labeller = labeller(k = function(x) paste0("k=", x))) +
  theme(legend.position="bottom", 
        axis.text.x = element_blank(), 
        axis.ticks.x = element_blank(), 
        axis.title.x = element_blank()) + 
  guides(fill = guide_legend(ncol = 3, title.position = "top"))

Comparing different methods

For benchmarking clustering algorithms, one of the most common metrics is the concordance of the grouping result with ground-truth labels. However, in the context of bioinformatics there are many different aspects that could be used to assess the usefulness of clustering results, including associations with clinical variables, survival, clustering stability and mechanistic relevance.

Internal clustering indices

By default, COPS computes the average silhouette index (or score) for each clustering result using distances calculated from the input matrix. The silhouette score summarises the cohesion and separability of clusters. In addition to the silhouette, it is possible to utilize any internal index implemented by clusterCrit::intCriteria. To do this, we can pass the desired index names in the internal_metrics argument of COPS, subsample_clustering_evaluation or clustering_metrics. Note that since the clusterCrit::intCriteria function takes the data matrix as input and the computations are repeated for each fold and setting, this can take much longer than the silhouette calculation by COPS, which calculates the distances only once.
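For example, a call requesting additional clusterCrit indices could look like the sketch below; the index names are those accepted by clusterCrit::intCriteria, and the remaining arguments simply mirror the earlier RWR-FGSEA run.

# Illustrative re-run that also computes Calinski-Harabasz and Davies-Bouldin indices
res_rwrfgsea_crit <- COPS::COPS(
  pso_rwrfgsea, 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("none"), 
  cluster_methods = c("diana"), 
  distance_metric = "correlation", 
  n_clusters = 2:6,
  internal_metrics = c("Calinski_Harabasz", "Davies_Bouldin"),
  association_data = COPS::psoriasis_clinical_example[["lesional"]])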

Clustering stability

A key clustering metric that is often omitted by other packages is clustering stability, which COPS estimates by using cross-validation. To be clear, the validation folds are not used for clustering or unsupervised dimensionality reduction; the purpose of CV is simply to re-sample the original data efficiently. Compared to bootstrapping, CV does not duplicate any data points, which can yield more accurate estimates of metrics within the folds. The downside is that the clustering data size is decreased within the folds, which can affect results and tends to affect metric values, especially those based on statistical tests.

Clustering stability, i.e., the grouping similarity within training folds, in COPS is calculated with respect to a reference clustering result obtained with identical settings on the full dataset. Three different indices are used, namely the Jaccard index, adjusted Rand index and normalized mutual information.

The CV scheme can also be used to estimate the uncertainty of the metrics. Using repeated cross-validation is recommended to obtain credible estimates of the expected values and their variance.
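To make the stability indices concrete, the base-R sketch below computes a pair-counting Jaccard index between two clusterings of the same samples; this illustrates the type of agreement that ClusteringStabilityJaccard summarises, although the internal COPS implementation may differ in details.

# Pair-counting Jaccard index: pairs grouped together in both clusterings
# divided by pairs grouped together in at least one clustering
jaccard_clusterings <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  ut <- upper.tri(same_a)  # consider each sample pair only once
  sum(same_a[ut] & same_b[ut]) / sum(same_a[ut] | same_b[ut])
}
# Toy example with five samples: the index is 1/3
jaccard_clusterings(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))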

Module score

The module score (https://doi.org/10.1093/bib/bbab314) is based on the association of clusters with gene meta-features. The score is defined as: $$ \text{score} = E\left[\frac{\min(1,\mathbf{S}^{+})+\min(1,\mathbf{S}^{-})}{\mathbf{S}^{+} + \mathbf{S}^{-}}\right] $$ where $$ \mathbf{S}^{\pm}_{i} = \sum_j{\delta^{\pm}\left(\rho(\mathbf{e}^{(i)}, \mathbf{c}^{(j)})\right)} $$ where $\rho$ corresponds to Spearman correlation, $\mathbf{e}^{(i)}$ corresponds to the $i^{th}$ column of the WGCNA eigen-gene matrix, $\mathbf{c}^{(j)}$ corresponds to the $j^{th}$ cluster indicator variable and
$$ \delta^{-}(x) = \begin{cases} 1 & \mbox{if } x \leq -\alpha \\ 0 & \mbox{otherwise} \end{cases} $$ $$ \delta^{+}(x) = \begin{cases} 1 & \mbox{if } x \geq \alpha \\ 0 & \mbox{otherwise} \end{cases} $$ where $\alpha$ is a parameter that defines a threshold for counting correlations. The fraction is undefined when the denominator is zero, which occurs when no clusters are strongly correlated to an eigen-gene, in which case we assign a score of $\beta$. When $\beta$ is close to zero, clustering results that do not exhibit differential expression between clusters in each gene module are penalized.

In the gene_module_score-function $\alpha$ corresponds to the argument module_cor_threshold while $\beta$ corresponds to the argument module_nan.substitute.
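As a conceptual illustration of the formula, the sketch below re-implements the score in plain R. The official implementation is COPS::gene_module_score; the function and argument names used here (module_score_sketch, module_eigs, clusters, alpha, beta) are illustrative only.

# Conceptual module score: mean over eigen-genes of
# (min(1, S+) + min(1, S-)) / (S+ + S-), substituting beta when S+ + S- = 0
module_score_sketch <- function(module_eigs, clusters, alpha = 0.3, beta = 0) {
  per_eigengene <- apply(module_eigs, 2, function(e) {
    # Spearman correlation of the eigen-gene with each cluster indicator
    rho <- sapply(unique(clusters), function(k) {
      cor(e, as.numeric(clusters == k), method = "spearman")
    })
    s_pos <- sum(rho >= alpha)
    s_neg <- sum(rho <= -alpha)
    if (s_pos + s_neg == 0) return(beta)
    (min(1, s_pos) + min(1, s_neg)) / (s_pos + s_neg)
  })
  mean(per_eigengene)
}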

To generate these for previously calculated results we can call subsample_module_evaluation. We have pre-computed disease gene-modules on this data by using the WGCNA eigen-gene method.

res_rwrfgsea$modules <- COPS::subsample_module_evaluation(
  clusters = res_rwrfgsea$clusters, 
  parallel = 1, 
  module_eigs = COPS::psoriasis_example_MEs,
  module_cor_threshold = 0.3,
  module_nan.substitute = 0)

Recompute summaries:

scores_rwrfgsea <- COPS::scoring(
  res_rwrfgsea, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

Survival relevance evaluation

Typically, Kaplan-Meier estimates and the log-rank test are used to compare survival between two groups. This strategy is easy to understand and visualize. However, using KM curves can be misleading, as they do not account for covariates such as age and cancer stage, which are crucial for understanding the prognosis of a patient.

In COPS, we provide an alternative strategy for estimating the significance of a survival difference between groups by using the Cox proportional hazards model. With a likelihood-ratio test we can compare a baseline model of only covariates to an augmented model that also has clustering information. The survival related functionalities of COPS are discussed in detail in the multi-omic cancer clustering analysis vignette.
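As a generic illustration of this idea outside of COPS, one could fit nested Cox models with the survival package and compare them with a likelihood-ratio test; the data frame clin and its columns (time, status, age, stage, cluster) are hypothetical here.

library(survival)
# Baseline model with covariates only versus a model that also includes cluster labels
fit_covariates <- coxph(Surv(time, status) ~ age + stage, data = clin)
fit_clusters <- coxph(Surv(time, status) ~ age + stage + factor(cluster), data = clin)
# Likelihood-ratio test between the nested models
anova(fit_covariates, fit_clusters)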

Pairwise metric plots

One of the main purposes of COPS is to provide tools for comparing clustering results from multiple perspectives. For multi-objective evaluation we can plot the results in pairwise metric plots. If different objectives are in conflict, the related metrics may exhibit trade-offs, provided that the included clustering strategies adequately explore them. The solutions that lie on the edges of the distribution in each scatter plot may be Pareto optimal and are worth exploring further.

But first we should generate some alternative results to compare against:

set.seed(0)
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
pso_degs <- COPS::psoriasis_degs_example
res_dr <- COPS::COPS(
  log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), 
  parallel = 1, nruns = 2, nfolds = 5, 
  dimred_methods = c("pca"), 
  pca_dims = 2:8,
  cluster_methods = c("kmeans", "diana", "model"), 
  distance_metric = "euclidean", 
  n_clusters = 2:6,
  association_data = COPS::psoriasis_clinical_example[["lesional"]],
  module_eigs = COPS::psoriasis_example_MEs)
scores_dr <- COPS::scoring(
  res_dr, 
  wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi,
  significance_level = 0.05)

In order to compare the scores, we can combine the two data.frames containing all scores using rbind:

scores_combined <- rbind(scores_dr$all, scores_rwrfgsea$all)

Finally, we can do the comparison by using the pareto_plot function. By setting plot_pareto_front = TRUE we can also visualise the first Pareto front, which consists of all the non-dominated solutions. In order to do so, we also need to provide a list of comparators that are used to determine whether one result is better than another. COPS can guess the comparators to some extent, but for variable associations it is necessary to define whether each metric should be minimized or maximized.

selection <- scores_combined[["drname"]] %in% paste0("pca", c(2,5,7,10))
selection <- selection | (scores_combined[["Approach"]] == "RWR-FGSEA")
selection <- selection & scores_combined[["Smallest_cluster_size"]] >= 10

p <- COPS::pareto_plot(
  scores_combined[selection,],
  metrics = c(
    "ClusteringStabilityJaccard", 
    "Silhouette", 
    "Smallest_cluster_size", 
    "GSE.nmi", 
    "Module_score"),
  plot_palette = RColorBrewer::brewer.pal(9, "Set1"),
  plot_pareto_front = TRUE,
  metric_comparators = list(`>`, `>`, `>`, `<`, `>`))

In terms of internal metrics like the average silhouette score and clustering stability, it is often the case that fewer clusters yield better values. For stability, we consider 0.8 a reasonable threshold for stable results. The silhouette score is often very low for high-dimensional input data; in the COPS pipeline the silhouette is, by default, defined based on distances in the input data. If the silhouette distances are measured in the low-dimensional embedded space instead, the score will tend to favour embeddings with fewer dimensions. This behaviour can be changed by setting the COPS argument pre_compute_silhouette_dissimilarity = TRUE.

Regardless, in this comparison there seems to be a trade-off between the silhouette and the module score. Two-cluster results form one group, while three- and four-cluster results vary considerably more in terms of metrics. Batch NMI is also very low overall, so it might not be considered a trade-off metric. Hence, we would probably select one of the PCA-based results with two clusters, since they have the highest silhouette and stability, as well as one three-cluster result with a high module score and stability.

Pareto fronts

To obtain a list of the Pareto optimal results, we can use the pareto_fronts function. Note that using a high number of metrics will typically result in more Pareto optimal results as it becomes more likely that at least one of the metrics is better when comparing two results to each other and thus neither result dominates the other.

scores_combined_subset <- scores_combined[scores_combined[["Smallest_cluster_size"]] >= 10,]
pfs <- COPS::pareto_fronts(scores_combined_subset,
                           metrics = c("ClusteringStabilityJaccard", "Silhouette", 
                                       "Smallest_cluster_size", "GSE.nmi", 
                                       "Module_score"),
                           metric_comparators = list(`>`, `>`, `>`, `<`, `>`))
table(pfs)
interest_cols <- c("Approach", "Embedding", "Clustering", "k", "pareto_front")
scores_combined_subset$pareto_front <- pfs
scores_combined_subset[scores_combined_subset$pareto_front == 1,interest_cols]

Batch effects and data preparation

Up to this point we have not discussed the pre-processing of the input data, which is crucial for obtaining good clustering results. Firstly, we expect the input data to be pre-processed appropriately: for example, RNA-Seq counts should be normalized and variance-stabilised (i.e., log-scaled TPM or CPM), and DNA methylation should be given as methylation percentages or M-values. Secondly, feature selection is an important factor for obtaining relevant results, although it is not always necessary. Finally, if present, effects related to known batches should be corrected using appropriate tools (such as ComBat). COPS does not provide tools for processing sequencing data or performing batch correction. However, COPS does provide tools for assessing batch effects prior to clustering.
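For reference, a batch-correction step with ComBat-seq from the sva package could look like the following sketch; raw_counts is a hypothetical gene-by-sample count matrix and study_id contains the corresponding batch labels.

# Adjust raw RNA-Seq counts for known study/batch labels with ComBat-seq
corrected_counts <- sva::ComBat_seq(counts = raw_counts, batch = study_id)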

Pre-clustering batch effect metrics

In this example, our purpose is to discover subtypes of psoriasis in the gene-expression profiles of patient skin samples affected by the disease, i.e. the lesional samples. Therefore it is important that the batch effect within the affected subset is small enough to be negligible in clustering analysis.

tabl <- "
| function               | Description                               |
|------------------------|-------------------------------------------|
| DSC                    | Dispersion Separability Criterion         |
| feature_associations   | Calculates several batch effect indices including DSC, silhouette and regression based for multiple categories with one function call. |
"
cat(tabl)

The Dispersion Separability Criterion (DSC) is used by the TCGA batch effects viewer. In their guidance, they suggest 0.5 as a threshold for some concern.

lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]]
study_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSE"]]
log_norm <- log2(COPS::psoriasis_rnaseq_normalized[,lesional_id]+1)
log_corrected <- log2(COPS::psoriasis_rnaseq_corrected[,lesional_id]+1)
print(paste0("DSC before correction: ", COPS::DSC(log_norm, study_id)))
print(paste0("DSC after  correction: ", COPS::DSC(log_corrected, study_id)))

Clearly, batch correction (using sva::ComBat_seq) was necessary and effective. The DSC is still a bit high for the batch-corrected data, but it probably will not affect clustering too much.

We can also use feature_associations to run DSC and two other PCA-based approaches for batch-effect estimation from the kBET package.

feats <- COPS::feature_associations(log_corrected, list(GSE = study_id), n_pc_max = 20)

The associations are based on the average silhouette score between categories, the maximum coefficient of determination (R2) between the PCs and a linear model based on the categories, and the aforementioned DSC. In the absence of batch effects, the silhouette should be below 0 and the DSC should be below 0.5. If the maximum R2 is close to 1, it means that some PCs are strongly associated with batches and should be analysed further. It is worth keeping in mind that the maximum R2 is more sensitive than the other metrics.

feats$associations

This function also returns the PCA eigenvalues and explained variance:

feats$eigenvalues[1:20,]

We can see that after the first 18 components, each remaining component explains less than 1% of the remaining variance. We presume that the latter components comprise biological and technological noise, and that the main biological effects we want to cluster are contained within the first 10 or so components. Selecting the number of components is a non-trivial problem, but we can test different numbers and see which yields the best clustering results.
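As an additional check, the cumulative proportion of explained variance can also be computed directly with base-R PCA on the corrected matrix (samples in columns, hence the transpose); this is only an illustration and should roughly mirror feats$eigenvalues.

# Proportion of variance explained by each principal component of the corrected data
pca_fit <- prcomp(t(log_corrected), center = TRUE, scale. = FALSE)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(cumsum(var_explained)[1:20], 3)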

Session info

sessionInfo()

