cluster_goodness: Take a dataset along with its cluster assignments and cluster...
In corwms/ZunderPipelineFunctions: Zunder Lab pipeline helper functions

View source: R/ClusterStabilityAnalysis.R

Outputs

cluster_goodness(clust_dataset = NULL, clust_assigns = NULL,
  cutoff_values = NULL, cutoff_trim_factor = 0.8,
  knn_to_sum_factor = 0.3, knn_to_sum_min = 5)

`clust_dataset`	original dataset which was clustered
`clust_assigns`	cluster assignments from consensus clustering
`cutoff_values`	cluster stability confidence values
`cutoff_trim_factor`	minimum cluster stability confidence value to keep cell
`knn_to_sum_factor`	fraction of clusters to be compared to each cluster
`knn_to_sum_min`	minimum number of clusters to be compared to each cluster

In the outputted list, "cluster_stats"

"min_dist_btwn_means" Take the centroid of every cluster, and then for each cluster, look at how far away its closest neighbor is in n-dimensional space. If a cluster is far away from its closest neighbor, then it's well separated more likely to be a "good cluster." Distances here are calculated by city block/manhattan.

"min_dist_btwn_means_single_param" Same as above, but calculate for each individual dimension, and then output the biggest distance to a closest neighbor in 1D space. The purpose of this is to identify if there's one cluster that's very far away from all the clusters (well separated) by just one parameter - this is another way of identifying a "good" cluster

"sum_dists_btwn_means" This is the same as "min_dist_btwn_means", but instead of taking just the closest neighbor, it looks at the n-closest neighbors, and sums/averages the distance to all of them. This statistic is intended to identify "good" clusters that are very close to each other, but far from everything else. Uses a diminishing scale, so 1st neighbor is counted more than 2nd neighbor, which is counted more than 3rd neighbor, etc. This statistic will be useful for automated parameter detection (where we're trying to see if we get "good" clustering with a given set of parameters) but will not be as useful as a general readout of individual cluster goodness.

"combined_stdevs" Standard deviation for each cluster. This is calculated for each parameter/dimension individually, and then summed/averaged. Clusters with lower standard deviation are "better" (if everything else is equal).

"combined_dist_from_mean_by_cluster" This is another way of looking at variance within a cluster in addition to standard deviation. Find the centroid of each cluster, and then calculate how far every point in the cluster is away from its centroid. Take the average of these distances and that tells you a bit about how spread out the cluster is in n-dimensional space. Clusters with low variance here are "better" (when everything else is equal).

"c_dist_to_nearest_point" This is similar to "min_dist_btwn_means" in that it wants to know the distances between clusters, but it calculates it in a different way. Instead of looking at means, this looks at two clusters and finds the shortest distance between a single point in each cluster. So what this does is for each cluster, calculate the distance to the nearest point outside that cluster. If this distance is far, then the cluster is "well separated" and this makes it a better cluster.

"c_sum_dists_to_nearest_points" Like above, but this should sum/average the distance not just to the nearest point/cluster, but find the n-nearest clusters (by individual points) and then sum average over all these distances. Uses a diminishing scale like "sum_dists_btwn_means." Rationale for looking over multiple closest clusters instead of just the single closest is the same as for "sum_dists_btwn_means".

"c_dist_to_nearest_point_trim" Same as "c_dist_to_nearest_point", but using clusters that are trimmed by stability here. The idea for this is that you may have two clusters that are very far apart (well separated), except that they're connected by a few low stability points that fall in between, so it seems like they're actually very close to each other.

"c_sum_dists_to_nearest_points_trim" Like "c_dist_to_nearest_point_trim", but summed/averaged over the n-nearest points, rather than looking just at the single nearest cluster. See other parameters for more explanation.

"stability_by_cluster" See "stability_by_cell" and average this over the whole cluster. 1) Iterate clustering over many different subsamples. 2) Find the universal cluster/groups and re-assign. 3) For each cell, what percentage of the time does it fall into it's universal cluster? That's the cluster stability by cell. 4) Average over the whole cluster. Higher stability means better/more reproducible clusters.

"c_dist_to_nearest_point_single_param" Same as above - but looking one parameter at a time. To take into account the scenario where a cluster is VERY different from its neighbors in one parameter, but this gets drowned out by being similar in everything else.

"c_dist_to_nearest_point_trim_single_param" Same as above, but trimmed to remove low stability cells.

"combined_dist_from_mean_by_cell" Same as above ("combined_dist_from_mean_by_cluster") but looking at each cell individually rather than by cluster

"stability_by_cell" Same as described above in "stability_by_cluster", but by cell. This is what we calculated and output in the older version of this script.

In the outputted list, "param_names"

These are to identify which marker is driving the biggest difference between clusters. This should give an idea about which marker is most important for defining each cell type.

"c_dist_to_nearest_point_single_param_name" The parameter that drives "c_dist_to_nearest_point_single_param" from above, for each cluster

"c_dist_to_nearest_point_trim_single_param_name" The parameter that drives "c_dist_to_nearest_point_trim_single_param" from above, for each cluster

"min_dist_btwn_means_single_param_name" The parameter that drives "min_dist_btwn_means_single_param" from above, for each cluster

In the outputted list, "mst_gap_params"

This section is mainly to be used for selecting optimal clustering parameters, and is perhaps not as useful as a general tool for cluster description. The idea is to find the biggest "gap" in the dataset: if there is a big gap between sections of the dataset, then the parameters must be identifying some useful modality/variation. It works as follows: identify the cluster means, treat these as individual points, and connect them with a minimum spanning tree (mst). The longest edge in the mst should then identify the biggest gap between regions of cell identity in the dataset.

"params_ranked" Once the biggest "gap" is identified, look at the two vertices (cell clusters) in the mst that form that edge, and see what parameters they're most different for. I just want to know what parameters are driving the biggest separation between neighboring clusters.

"v1_stats" and "v2_stats" One concern about using the MST-gap to identify differences between clusters is that it might be dominated by a "garbage" cluster that is made up of say, debris and not real cells. This could be very different from everything else, but not useful for our analysis. To protect against this, check for each of the connected cell clusters 1) how big they are (as a percentage of total cells), and 2) how stable they are. Presumably, spurious/debris clusters would be small and/or have low cluster stability. . .

corwms/ZunderPipelineFunctions documentation built on Aug. 29, 2019, 4:17 p.m.