Clustering algorithms are a staple exploratory tool in studying omics profiles of patients in order to identify molecular patterns corresponding to groups of patients that indicate different subtypes of disease. Omics data tends to be very high-dimensional, which can be addressed with different strategies. In COPS we have included data-driven strategies based on dimensionality reduction as well as knowledge-driven strategies based on integrating omics data with pathway gene sets and biological networks. Another crucial aspect to consider when trying to identify disease subtypes is the evaluation of the subtype candidates (i.e., patient clusters). The best number of clusters and the best clustering strategy are highly dependent on the dataset and the criteria used to evaluate clustering results.
While other R-packages such as clValid
and clusterCrit
provide access to a plethora of internal metrics to evaluate clustering results,
COPS
provides a few unique functionalities for omics data in life
sciences. First, COPS
uses repeated cross-fold validation to assess
the stability of clusters and metrics for subsets of the data. Second,
COPS
implements several pathway based clustering approaches that
integrate biological knowledge with omics data. Third, COPS
can be
used to evaluate mechanistic relevance for a given set of meta-genes identified
using WGCNA for example. Fourth, COPS
provides robust test for
survival differences between clusters while accounting for relevant covariates.
COPS
has been built to parallelise most operations and has been
optimised to run relatively efficiently in terms of memory so that most analyses
with relatively small datasets (up to 1000 samples) can be run on a personal
computer.
In addition to the basic installation of COPS
, this vignette makes
use of additional formatting for ggplot.
if (!requireNamespace("ggh4x", quietly = TRUE)) { install.packages("ggh4x", repos='http://cran.rstudio.com/') }
Clustering analysis in COPS
is divided into several steps, many of
which are optional:
subsampling
generates data subsets using cross-validation or
bootstrap.subsample_pathway_enrichment
transforms the data from gene features
to pathway features thereby integrating biological knowledge with omics data.subsample_dimred
transforms the data into a lower dimensional
representation that is more suitable for clustering.subsample_clustering_evaluation
runs clustering algorithms and
computes clustering
internal clustering metrics. stability_evaluation
evaluates clustering stability by comparing
cluster assignments obtained within each fold to a reference obtained from the
full dataset. subsample_survival_evaluation
evaluates survival differences between
clusters. subsample_module_evaluation
evaluates cluster-gene-module associations as
in https://doi.org/10.1093/bib/bbab314.subsample_association_analysis
calculates various statistics and indices
for cluster association with external variables. To simplify the usage the full pipeline can be called using the eponymous
COPS
-function. Most steps are also associated with a separate
function that processes each CV fold. Below is a list of arguments that are
relevant to each step, starting from general arguments shared by many functions:
dat
: Input data which can be a matrix or list of matrices,
columns will be treated as samples and rows as features. dat_list
: List of data.frames or data.tables with features
on columns where feature names have been replaced with 'dim[0-9]+', they may
also contain other columns with metadata such as sample id, CV fold etc.
Managed by the COPS
function. parallel
: Number of parallel threads to execute, defaults to 1.by
: Column names by which data is split into smaller tasks
corresponding to different CV folds and method settings. Managed by the
COPS
function. COPS
argumentspathway_enrichment_method
: (Optional) Name of the pathway
enrichment method to use, passed to subsample_pathway_enrichment. Affects certain
routines in the pipeline.multi_omic_methods
: (Optional) Names of the multi-view
clustering algorithms to use. Affects certain routines in the pipeline.vertical_parallelization
: Controls whether each analysis
step waits for threads to finish on all folds and settings before moving to
next step (horizontal) or to create one thread per fold and execute all
steps and method settings in sequentially within each thread (vertical).
The latter can help lower memory utilization when using fewer threads than
folds.verbose
: Controls, whether progress and time taken is
printed to the console....
: All additional arguments are passed on to each step in
the pipeline so that all relevant functionality is accessible via just one
function call. See ?COPS
for more information. subsampling
, cv_fold
nfolds
: Number of CV folds.nruns
: Number of CV repeats or bootstrap iterationsnruns
: Number of CV repeats or bootstrap iterationsstratified_cv
: Sets use of stratified CV, sample given
stratification variable as evenly as possible across the folds.anti_stratified
: Sets use of anti-stratified CV, sample
given stratification variable as unevenly as possible across the folds.cv_stratification_var
: Categorical variable for stratified CV.extra_fold
: If set to TRUE, the function will return an
additional k+1th fold containing all data points, this fold is used as
reference by stability_evaluation
.subsample_pathway_enrichment
, genes_to_pathways
sub_index
: List of data.frames corresponding to
subset indices produced by subsampling
.
Managed by the COPS
function. gene_id_list
: List of gene id vectors of the corresponding
columns in dat_list
used to translate 'dim[0-9]+' back to genes
for pathway analysis. Managed by the COPS
function. x
: Gene feature matrix, samples on columns and genes on rows.
Only used when calling genes_to_pathways
directly. enrichment_method
: The options are: "DiffRank", "GSVA",
"RWRFGSEA". Only one method is allowed at a time. gene_set_list
: List of gene sets corresponding to pathways
with gene names that correspond to rows in expr
(i.e., rownames
in input data). batch_label_pw
: Batch labels for batch-wise enrichment which
is not performed if this is NULL
. ...
: See ?genes_to_pathways
for more options. subsample_dimred
, dimensionality_reduction
sub_index
: List of data.frames corresponding to
subset indices produced by subsampling
.
Managed by the COPS
function. sub_split_data
: Can be set to FALSE
if
dat_list
elements already contain the columns "run" and "fold".
The first transformation step in the analysis (either pathway-based or DR-based)
splits the data according to the folds, so if DR is performed after PW
transformation the data will be already split. Managed by the
COPS
function. dimred_methods
: Character vector of method names. Options
are: "none", "pca", "tsne" and "umap".x
: Data matrix, features on columns and samples on rows.
Only used when calling dimensionality_reduction
directly. ...
: See ?dimensionality_reduction
for more
options. subsample_clustering_evaluation
, clustering_analysis
,
clustering_metrics
dat_embedded
: Same format as dat_list
.
Managed by the COPS
function. x
: A data.frame with features on columns labeled as
"dim[0-9]+", must also contain "id" column. Used only when calling
clustering_analysis
directly. n_clusters
: A vector of integers defining the number of
clusters.cluster_methods
: A vector of clustering method names.
Options are: "hierarchical", "diana", "kmeans", "model", "knn_communities",
"spectral", "SC3", "kkmeans" and "kkmeanspp". silhouette_dissimilarity
: Dissimilarity matrix used for
silhouette evaluation, should contain all samples in the data. If this is
not given, silhouette will be estimated using the extracted features which
tends to favour lower dimensional representations. internal_metrics
: Names of internal indices corresponding
to clusterCrit::intCriteria crit
argument. Note that using this
interface is much slower than the inbuilt average silhouette metric....
: See ?clustering_analysis
and
?clustering_metrics
for more options. stability_evaluation
clusters
: A clustering data.frame such as returned by
subsample_clustering_evaluation
. Managed by the COPS
function. reference_fold
: Fold number that corresponds to reference
which other folds are compared against, inferred from input by default.subsample_survival_evaluation
, subsample_survival_evaluation
event_data
: A data.frame that contains survival times, event
data and covariates.. clusters
: A clustering data.frame such as returned by
subsample_clustering_evaluation
. Managed by the COPS
function. survival_time_col
: Name of the column in
event_data
that contains survival time.survival_event_col
: Name of the column in
event_data
that contains event indicators. survival_covariate_names
: Names of covariate columns in
event_data
.row_id
: Name of column in event_data
that
matches "id" column in clusters
.subsample_module_evaluation
, gene_module_score
clusters
: A clustering data.frame such as returned by
subsample_clustering_evaluation
. Managed by the
COPS
function. x
: A data.frame with columns "id" and "cluster"
corresponding to a single clustering result.module_eigs
: A matrix of meta-features, e.g. gene module
eigen-genes for each sample (samples x modules) from COPS
....
: See ?gene_module_score
for more options. subsample_association_analysis
, cluster_associations
clusters
: A clustering data.frame such as returned by
subsample_clustering_evaluation
. Managed by the COPS
function. association_data
: A data.frame with external variables
corresponding to the input samples (categoric or numeric), rownames are
matched from clusters
"id" column (i.e., dat
colnames), missing values are removed automatically.The purpose of this tutorial is to go over typical steps in the clustering analysis of multi-cohort gene-expression data, i.e., data gathered from multiple sources. As an example we will use a Psoriasis related dataset and use COPS to check for batch effect related issues, before applying COPS for evaluation of different clustering strategies. Towards the end we go over different metrics and how to compare methods based on multiple criteria.
In COPS we have included an example dataset related to psoriasis RNA-Seq data from multiple studies collected and harmonised by Federico et al. (2020). The full dataset including atopic dermatitis studies is available at Zenodo. We also only consider Psoriasis patients and protein coding genes in this tutorial.
COPS offers a few high performance pathway based clustering approaches that can be used for large datasets while also utilizing resampling for evaluation. Specifically, COPS can use GSVA, DiffRank or RWR-FGSEA to transform gene-expression profiles into pathway activity profiles. The provided the unsupervised pathway activation inference methods are summarised the in table below.
tabl <- " | Method | Description | Reference | |-----------|:--------------------------------|:------------------------------------------| | GSVA | Gene Set Variation Analysis summarises gene-expression at the level of pathways by using non-parametric gene statistics. | https://doi.org/10.1186/1471-2105-14-7 | | DiffRank | A ranking based approach to score pathway activity in individual samples. | https://doi.org/10.1186/s12920-018-0449-4 | | RWR-FGSEA | Combines Random Walk with Restart and Fast Gene Set Enrichment Analysis to first propagate dysregulated gene information through a network to generate gene-statistics which used to identify sample specific pathway enrichment. | https://doi.org/10.1093/bib/bbab314 | " cat(tabl)
These pathway activity inference methods require pathway gene-sets. To download pathway gene sets we prefer to use MSigDB which combines many sources of pathway annotations. To keep the runtime short we will only use KEGG and filter the pathways based on size.
pw_db <- msigdbr::msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG") pw_list <- lapply(split(pw_db, pw_db$gs_name), function(x) x$ensembl_gene) pw_list <- pw_list[which(sapply(pw_list, length) <= 200 & sapply(pw_list, length) >= 100)]
The RWR-FGSEA method also requires a gene-network, since it uses random walk with restart to generate gene statistics for gene set enrichment analysis. We can use networks based on biological databases such as protein protein interactions (PPI) or we can generate a co-expression network from the gene-expression data.
COPS has a built-in function to retrieve PPIs STRINGdb and process the network into suitable form:
ppi_net <- COPS::getHumanPPIfromSTRINGdb(gene_id_mart_column = "ensembl_gene_id")
Internally, COPS defines a set of maximally dysregulated genes for each sample.
These are used as seeds for the random walk to discover associated genes and to
generate gene statistics for FGSEA. This is not a trivial task, but in practice
simple strategies like selecting the genes with highest expression in each sample
yield reasonable results. RWRFGSEA
can do this automatically given an
expression matrix as input. Another possibility to select gene seeds is to use
non-parametric gene statistics similarly to GSVA, which can be achieved by using
rwr_ecdf=TRUE
. The options to this method are are explained on the
help-page (?RWRFGSEA
) in more detail.
With default settings RWRFGSEA behaves like a single-sample pathway scoring function. Therefore it can be run before subsampling to save time.
set.seed(0) lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]] pso_degs <- COPS::psoriasis_degs_example pso_rwrfgsea <- COPS::RWRFGSEA( log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), gene_network = ppi_net, parallel = 1, rwr_seed_size = 50, gene_set_list = pw_list, fgsea_nperm = 1e2)
For this vignette we want to consider pathway transformation as an alternative to dimensionality reduction. Since the number of considered pathways is still reasonably high at 171, we choose to utilise Spearman correlation distance and divisive hierarchical clustering.
set.seed(0) lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]] pso_degs <- COPS::psoriasis_degs_example res_rwrfgsea <- COPS::COPS( pso_rwrfgsea, parallel = 1, nruns = 2, nfolds = 5, dimred_methods = c("none"), cluster_methods = c("diana"), distance_metric = "correlation", n_clusters = 2:6, association_data = COPS::psoriasis_clinical_example[["lesional"]])
The COPS
-function runs and aggregates the outputs from each step
and returns a named list of the following objects that contain results for all
runs, folds and method settings:
clusters
data.frame defining clustersinternal_metrics
data.frame of internal metricsstability
data.frame of stability scoressurvival
data.frame of survival analysis resultsmodules
data.frame of gene module association scoresassociation
data.frame of association results to variables of interestcluster_sizes
data.frame giving the sizes of clustersTo summarise the results into one data.frame we can use the
scoring
-function that takes the list of results and
returns a table with one line per input data and COPS
setting.
The wsum
argument is used to define a composite metric comprised of
multiple metrics for ranking the results. By default, the metrics are summarised
across CV folds (mean and sd), but this can be disabled by setting
summarise=FALSE
. P-values are summarised based on the null hypothesis
rejection rate which is parameterized by the significance_level
:
scores_rwrfgsea <- COPS::scoring( res_rwrfgsea, wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi, significance_level = 0.05)
This returns a list containing the whole summarised table in the element "all",
and only the line corresponding to the best result (according to wsum) in "best".
COPS
also includes tools for Pareto-based multi-objective evaluation,
but these are discussed later.
score_cols <- c( "Approach", "Embedding", "Clustering", "k", "Silhouette", "ClusteringStabilityJaccard", "GSE.nmi") scores_rwrfgsea$all[,score_cols]
For visualising the spread of metrics, we can use boxplots after reorganising the scores without summary:
scores_rwrfgsea_dist <- COPS::scoring( res_rwrfgsea, wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi, summarise = FALSE)
Note that with summarise set to FALSE, the function will also return results for the reference fold (k+1). Hence, when plotting resampled metrics, the reference should be omitted manually.
require(ggplot2) ggplot( scores_rwrfgsea_dist$all[scores_rwrfgsea_dist$all$fold != 6,], aes(x = interaction(Embedding, Clustering), y = ClusteringStabilityJaccard, fill = interaction(Embedding, Clustering))) + geom_boxplot() + theme_bw() + ggh4x::facet_grid2( Approach ~ k, scales = "free_x", independent = "x", labeller = labeller(k = function(x) paste0("k=", x))) + theme(legend.position="bottom", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.x = element_blank()) + guides(fill = guide_legend(ncol = 3, title.position = "top"))
For benchmarking clustering algorithms, one of the most common metrics is the concordance of the grouping result with ground truth labels. However, in the context of bioinformatics there are many different aspects could be used to assess the usefulness of clustering results, including associations to clinical variables, survival, clustering stability and mechanistic relevance.
By default, COPS
computes the average silhouette index (or score)
for each clustering result using distances calculated for the input matrix.
The silhouette score is a summary of the cohesion and separability of clusters.
In addition to the silhouette, it is possible to utilize any internal index
implemented by clusterCrit::intCriteria
. To do this we can pass the
desired index names in the internal_metrics
argument to
COPS
, subsample_clustering_evaluation
or
clustering_metrics
.
Note that since the clusterCrit::intCriteria
-function takes the
data matrix as input and computations will be repeated for each fold and setting,
this can take much longer than the silhouette calculation by COPS
which calculates the distances only once.
A key clustering metric that is often omitted by other packages, is the
clustering stability which COPS
estimates by using cross-validation.
To be clear, the validation folds are not used for clustering or unsupervised
dimensionality reduction, the purpose of CV is just to re-sample the original
data efficiently. Compared to bootstrapping, CV does not duplicate any data
points which could yield more accurate estimates of metrics within the folds.
The downside is that clustering data size is decreased within the folds which
can affect results and tends to affect metric values, especially those based on
statistical tests.
Clustering stability, i.e., the grouping similarity within training folds, in
COPS
is calculated with respect to a reference clustering result
obtained with identical settings on the full dataset. Three different indices
are used, namely the Jaccard index, adjusted Rand index and normalized mutual
information.
The CV schema can also be used to estimate the uncertainty of the metrics. Using repeated cross-fold validation is recommended to obtain credible estimates for expected values and their variance.
The module score (https://doi.org/10.1093/bib/bbab314) is based on the
association of clusters to gene meta-features. The score is defined as:
$$
\text{score} = E\left[\frac{\min{(1,\mathbf{S^{+})}}+\min{(1,\mathbf{S^{-})}}}{\mathbf{S^{+}} + \mathbf{S^{-}}}\right]
$$
where
$$
\mathbf{S^\pm_{i}} = \sum_j{\delta^\pm\left(\rho(\mathbf{e}^{(i)}, \mathbf{c}^{(j)})\right)}
$$
where $\rho$ corresponds to spearman correlation, $\mathbf{e}^{(i)}$ corresponds
to the $i^{th}$ column of the WGCNA eigen-gene matrix, $\mathbf{c}^{(j)}$
corresponds to the $j^{th}$ cluster indicator variable and
$$
\delta^{-}(x) = \begin{cases}
1 & \mbox{if } x \leq -\alpha \
0 & \mbox{otherwise}
\end{cases}
$$
$$
\delta^{+}(x) = \begin{cases}
1 & \mbox{if } x \geq \alpha \
0 & \mbox{otherwise}
\end{cases}
$$
where $\alpha$ is a parameter that defines a threshold for counting correlations.
The fraction is undefined when the denominator is zero which occurs when no
clusters are strongly correlated to an eigen-gene, in which case we assign
a score of $\beta$. When $\beta$ is close to zero, clustering results that
do not exhibit differential expression between clusters in each gene module
are penalized.
In the gene_module_score
-function $\alpha$ corresponds to the
argument module_cor_threshold
while $\beta$ corresponds to the
argument module_nan.substitute
.
To generate these for previously calculated results we can call
subsample_module_evaluation
. We have pre-computed disease gene-modules
on this data by using the WGCNA eigen-gene method.
res_rwrfgsea$modules <- COPS::subsample_module_evaluation( clusters = res_rwrfgsea$clusters, parallel = 1, module_eigs = COPS::psoriasis_example_MEs, module_cor_threshold = 0.3, module_nan.substitute = 0)
Recompute summaries:
scores_rwrfgsea <- COPS::scoring( res_rwrfgsea, wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi, significance_level = 0.05)
Typically, Kaplan-Meier estimates and the log-rank test are used to compare survival between two groups. This strategy is easy to understand and visualize. However, using the KM curves can be misleading as it does not account for any covariates such as age and cancer stage which are crucial for understanding the prognosis of a patient.
In COPS
, we provide an alternative strategy for estimating the
significance of a survival difference between groups by using the Cox proportional
hazards model. With a likelihood-ratio test we can compare a baseline model of
only covariates to an augmented model that also has clustering information.
The survival related functionalities of COPS
are discussed in
detail in the multi-omic cancer clustering analysis vignette.
One of the main purposes of COPS it to provide tools for comparing clustering results from multiple perspectives. To do multi-objective evaluation we can plot the results in pair-wise metric plots. In case different objectives are in conflict, the related metrics may exhibit trade-offs given that the included clustering strategies adequately explore these trade-offs. The solutions that are on the edges of the distribution in each scatter plot may be Pareto optimal, and are worth exploring further.
But first we should generate some alternative results to compare against:
set.seed(0) lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]] pso_degs <- COPS::psoriasis_degs_example res_dr <- COPS::COPS( log2(COPS::psoriasis_rnaseq_corrected[pso_degs,lesional_id]+1), parallel = 1, nruns = 2, nfolds = 5, dimred_methods = c("pca"), pca_dims = 2:8, cluster_methods = c("kmeans", "diana", "model"), distance_metric = "euclidean", n_clusters = 2:6, association_data = COPS::psoriasis_clinical_example[["lesional"]], module_eigs = COPS::psoriasis_example_MEs)
scores_dr <- COPS::scoring( res_dr, wsum = ClusteringStabilityJaccard + Silhouette - GSE.nmi, significance_level = 0.05)
In order to compare the scores, we can combine the two data.frames containing all scores using rbind:
scores_combined <- rbind(scores_dr$all, scores_rwrfgsea$all)
Finally, we can do a comparison by using the pareto_plot
function.
By setting plot_pareto_front = TRUE
we can also visualise the first
Pareto front which consists of all the non-dominated solutions. In order to do so,
we also need to provide a list of comparators that are used to determine if a
result is better than another one. COPS
can guess the comparators to
some extent, but for variable associations it is necessary to define whether it
should be minimized or maximized.
selection <- scores_combined[["drname"]] %in% paste0("pca", c(2,5,7,10)) selection <- selection | (scores_combined[["Approach"]] == "RWR-FGSEA") selection <- selection & scores_combined[["Smallest_cluster_size"]] >= 10 p <- COPS::pareto_plot( scores_combined[selection,], metrics = c( "ClusteringStabilityJaccard", "Silhouette", "Smallest_cluster_size", "GSE.nmi", "Module_score"), plot_palette = RColorBrewer::brewer.pal(9, "Set1"), plot_pareto_front = TRUE, metric_comparators = list(`>`, `>`, `>`, `<`, `>`))
In terms of internal metrics like average silhouette score and clustering
stability, it is often the case that fewer clusters yield better metrics.
For stability we consider 0.8 as reasonable threshold for stable results.
The silhouette score is often very low for high-dimensional input data, in the
COPS
pipeline silhouette is defined based on distance in the input
data by default. If silhouette distances are measured in the low dimensional
embedded space the score will tend to favour embeddings with fewer dimensions.
This behaviour can be changed by setting the COPS
argument
pre_compute_silhouette_dissimilarity = TRUE
.
Regardless, in this comparison there seems to be a trade-off between silhouette and module score. Two cluster results form one group while three and four cluster results vary significantly more in terms of metrics. Batch NMI is also very low overall so it might not be considered a trade-off metric. Hence, we would probably select one of the PCA-based results with two clusters, since they have the highest silhouette and stability and one three cluster result which have the high module score and stability.
To obtain a list of the Pareto optimal results, we can use the pareto_fronts
function. Note that using a high number of metrics will typically result in more
Pareto optimal results as it becomes more likely that at least one of the metrics is
better when comparing two results to each other and thus neither result dominates
the other.
scores_combined_subset <- scores_combined[scores_combined[["Smallest_cluster_size"]] >= 10,] pfs <- COPS::pareto_fronts(scores_combined_subset, metrics = c("ClusteringStabilityJaccard", "Silhouette", "Smallest_cluster_size", "GSE.nmi", "Module_score"), metric_comparators = list(`>`, `>`, `>`, `<`, `>`)) table(pfs)
interest_cols <- c("Approach", "Embedding", "Clustering", "k", "pareto_front") scores_combined_subset$pareto_front <- pfs scores_combined_subset[scores_combined_subset$pareto_front == 1,interest_cols]
Up to this point we have not discussed the pre-processing of input data,
which is crucial for obtaining good clustering results. Firstly, we expect
the input data to be pre-processed appropriately. For example, RNA-Seq counts should be
normalized and variance stabilised, i.e. log-scaled TPM or CPM and DNA-methylation
should be given as methylation percentages or M-values. Secondly, feature selection is
an important factor for obtaining relevant results, although not always necessary.
Finally, if present, effects related to known batches should be corrected by
using the appropriate tools (such as ComBat). COPS
does not provide
tools for processing sequencing data or performing batch-correction.
However, COPS
does provide tools for assessing batch-effects prior
to clustering.
In this example, our purpose is to discover subtypes of psoriasis in the gene-expression profiles of patient skin samples affected by the disease, i.e. the lesional samples. Therefore it is important that the batch effect within the affected subset is small enough to be negligible in clustering analysis.
tabl <- " | function | Description | |------------------------|-------------------------------------------| | DSC | Dispersion Separability Criterion | | feature_associations | Calculates several batch effect indices including DSC, silhouette and regression based for multiple categories with one function call. | " cat(tabl)
Dispersion Separability Criterion is used by TCGA batch viewer. In their guidance they suggest 0.5 as a threshold for some concern.
lesional_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSM"]] study_id <- COPS::psoriasis_clinical_example[["lesional"]][["GSE"]] log_norm <- log2(COPS::psoriasis_rnaseq_normalized[,lesional_id]+1) log_corrected <- log2(COPS::psoriasis_rnaseq_corrected[,lesional_id]+1) print(paste0("DSC before correction: ", COPS::DSC(log_norm, study_id))) print(paste0("DSC after correction: ", COPS::DSC(log_corrected, study_id)))
Clearly, batch correction (using sva::ComBat_seq) was necessary and effective. The DSC is a bit high for batch corrected data, but probably will not affect clustering too much.
We can also use feature_associations
to run DSC and a two other
PCA-based approaches for batch effect estimation from the kBET
-package.
feats <- COPS::feature_associations(log_corrected, list(GSE = study_id), n_pc_max = 20)
The associations are based on the average silhouette score between categories, the maximum coefficient of determination (R2) between PCs and a linear model based on the categories, and the aforementioned DSC. In the absence of batch-effects, the silhouette should be below 0 while DSC should be below 0.5. If the R2 max is close to 1 it means there are PCs that are strongly associate with batches and they should be analysed further. It is worth keeping in mind that the R2 max is more sensitive than the other metrics.
feats$associations
This function also returns the PCA eigenvalues and explained variance:
feats$eigenvalues[1:20,]
We can see that after the 18 first components, each remaining component is explaining less than 1% of the remaining variance. We presume that the latter components comprise biological and technological noise and that the main biological effects that we want to cluster are contained within the first 10 or so components. Selecting the number of components is a non-trivial problem, but we can test different numbers and see which yields the best clustering results.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.