tof_assess_clusters_entropy: Assess a clustering result by calculating the Shannon entropy...

View source: R/quality_control.R

tof_assess_clusters_entropy R Documentation

Assess a clustering result by calculating the Shannon entropy of each cell's Mahalanobis distance to all cluster centroids and flagging outliers.

Description

This function evaluates the result of a clustering procedure by calculating the Mahalanobis distance between each cell and the centroids of all clusters in the dataset and finding the Shannon entropy of the resulting vector of distances. All cells with an entropy above a user-specified threshold are flagged as potentially anomalous. Entropy is minimized (to 0) when a cell is close to one cluster (or a small number of clusters) but far from the rest. If a cell is close to multiple cluster centroids (i.e. has an ambiguous phenotype), its entropy will be large.
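The relationship between a cell's distance vector and its entropy can be illustrated with a minimal base R sketch. The helper names 'shannon_entropy' and 'distances_to_weights' are hypothetical, and the inverse-distance normalization is an assumption made only for illustration; this page does not state how tidytof converts distances into the values whose entropy is taken.

```r
# Hypothetical helper: Shannon entropy (natural log) of a probability vector.
shannon_entropy <- function(p) {
    p <- p[p > 0] # treat 0 * log(0) as 0
    -sum(p * log(p))
}

# Hypothetical helper: turn distances into probability-like weights.
# ASSUMPTION: inverse-distance weighting, not confirmed as tidytof's method.
distances_to_weights <- function(d) {
    w <- 1 / d
    w / sum(w)
}

# a cell near one centroid and far from the other two
near_one <- c(0.1, 50, 60)

# a cell roughly equidistant from all three centroids
ambiguous <- c(5, 5.2, 4.9)

entropy_near_one <- shannon_entropy(distances_to_weights(near_one))
entropy_ambiguous <- shannon_entropy(distances_to_weights(ambiguous))
```

Under these assumptions, 'entropy_near_one' is close to 0 and 'entropy_ambiguous' approaches log(3), the maximum for three clusters, which matches the behavior described above.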

Usage

tof_assess_clusters_entropy(
  tof_tibble,
  cluster_col,
  marker_cols = where(tof_is_numeric),
  entropy_threshold,
  entropy_quantile = 0.9,
  num_closest_clusters,
  augment = FALSE
)

Arguments

tof_tibble

A 'tof_tbl' or 'tibble'.

cluster_col

An unquoted column name indicating which column in 'tof_tibble' stores the cluster ids for the cluster to which each cell belongs. Cluster labels can be produced by any method the user chooses, including manual gating, any of the functions in the 'tof_cluster_*' function family, or any other approach.

marker_cols

Unquoted column names indicating which columns in 'tof_tibble' should be interpreted as markers to be used in the Mahalanobis distance calculation. Defaults to all numeric columns. Supports tidyselection.

entropy_threshold

A scalar indicating the entropy threshold above which a cell should be considered anomalous. If unspecified, a threshold will be computed using 'entropy_quantile' (see below). (Note: Entropy is often between 0 and 1, but can be larger with many classes/clusters).

entropy_quantile

A scalar between 0 and 1 indicating the entropy quantile above which a cell should be considered anomalous. Defaults to 0.9, which means that cells with an entropy above the 90th percentile will be flagged. Ignored if entropy_threshold is specified directly.

num_closest_clusters

An integer indicating how many of a cell's closest cluster centroids should have their Mahalanobis distance included in the entropy calculation. Adjusting this argument allows you to ignore distances to clusters that are far away from each cell. Many distant centroids with large distances can artificially inflate a cell's entropy value and thus distort the result, although this is rarely an issue empirically. Defaults to all clusters in tof_tibble.

augment

A boolean value indicating if the output should column-bind the computed flags for each cell (see below) as new columns in 'tof_tibble' (TRUE) or if a tibble including only the computed flags should be returned (FALSE, the default).
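How 'entropy_threshold', 'entropy_quantile', and 'num_closest_clusters' interact can be sketched in base R. The helpers 'keep_closest' and 'flag_cells' are hypothetical names invented for this illustration, the entropy values are made up, and using NULL to mean "unspecified" is an assumption of the sketch rather than a description of tidytof's internals.

```r
# Hypothetical helper: keep only the k smallest distances for one cell.
keep_closest <- function(d, k = length(d)) {
    sort(d)[seq_len(min(k, length(d)))]
}

# Hypothetical helper: flag cells whose entropy exceeds a threshold.
# If no threshold is supplied, derive one from the requested quantile.
flag_cells <- function(entropy, entropy_threshold = NULL, entropy_quantile = 0.9) {
    if (is.null(entropy_threshold)) {
        entropy_threshold <-
            stats::quantile(entropy, probs = entropy_quantile, names = FALSE)
    }
    entropy > entropy_threshold
}

# made-up entropy values for ten cells
entropies <- c(0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.90, 1.05)

# threshold derived from the 90th percentile of the observed entropies
flags_quantile <- flag_cells(entropies)

# an explicit threshold takes precedence over the quantile
flags_explicit <- flag_cells(entropies, entropy_threshold = 0.5)

# restrict one cell's distance vector to its 2 closest centroids
closest_two <- keep_closest(c(12, 0.4, 7, 30), k = 2)
```

A quantile-based threshold always flags roughly the same fraction of cells regardless of how well the clustering fits, whereas an explicit threshold flags only cells that exceed a fixed entropy value.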

Value

If augment = FALSE (the default), a tibble with 2 + NUM_CLUSTERS columns, where NUM_CLUSTERS is the number of unique clusters in cluster_col. Two of the columns will be "entropy" (the entropy value for each cell) and "flagged_cell" (a boolean value indicating if each cell had an entropy value above entropy_threshold). The other NUM_CLUSTERS columns will contain the Mahalanobis distances from each cell to each of the clusters in cluster_col (named ".mahalanobis_{cluster_name}"). If augment = TRUE, the same 2 + NUM_CLUSTERS columns will be column-bound to tof_tibble, and the resulting tibble will be returned.
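The column layout described above can be explored with a small base R sketch. 'mock_result' is a hand-built stand-in with invented values for two clusters named "b" and "c"; it is not real output from the function, and the column order of the actual result is not specified on this page.

```r
# Mock result mimicking the documented layout (values are invented).
mock_result <- data.frame(
    .mahalanobis_b = c(1.2, 4.8, 2.9),
    .mahalanobis_c = c(5.1, 0.9, 3.0),
    entropy = c(0.45, 0.40, 0.69),
    flagged_cell = c(FALSE, FALSE, TRUE),
    check.names = FALSE
)

# find the per-cluster distance columns by their documented prefix
distance_cols <- grep("^\\.mahalanobis_", names(mock_result), value = TRUE)

# recover the cluster names from the column names
cluster_names <- sub("^\\.mahalanobis_", "", distance_cols)

# keep only the cells that were flagged as potentially anomalous
flagged_rows <- mock_result[mock_result$flagged_cell, ]
```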

Examples


# simulate data
sim_data <-
    dplyr::tibble(
        cd45 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd38 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd34 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cd19 = c(rnorm(n = 1000, sd = 1.5), rnorm(n = 1000, mean = 2), rnorm(n = 1000, mean = -2)),
        cluster_id = c(rep("a", 1000), rep("b", 1000), rep("c", 1000))
    )

# imagine a "reference" dataset in which "cluster a" isn't present
sim_data_reference <-
    sim_data |>
    dplyr::filter(cluster_id %in% c("b", "c"))

# if we cluster into the reference dataset, we will force all cells in
# cluster a into a population where they don't fit very well
sim_data <-
    sim_data |>
    tof_cluster(
        healthy_tibble = sim_data_reference,
        healthy_label_col = cluster_id,
        method = "ddpr"
    )

# we can evaluate the clustering quality by calculating the entropy of the
# Mahalanobis distance vector from each cell to all cluster centroids
entropy_result <-
    sim_data |>
    tof_assess_clusters_entropy(
        cluster_col = .mahalanobis_cluster,
        marker_cols = starts_with("cd"),
        entropy_quantile = 0.8,
        augment = TRUE
    )

# most cells in "cluster a" are flagged, and few cells in the other clusters are
flagged_cluster_proportions <-
    entropy_result |>
    dplyr::group_by(cluster_id) |>
    dplyr::summarize(
        prop_flagged = mean(flagged_cell)
    )


keyes-timothy/tidytof documentation built on March 31, 2024, 12:01 p.m.