This vignette demonstrates predicting compound mechanism-of-action using
morphological profiling data. See the vignette single_cell_analysis
for
details about this dataset.
library(dplyr) library(magrittr) library(ggplot2) library(stringr) library(cytominergallery)
Per-well profiles computed in single_cell_analysis
are loaded, as well as
metadata associated with these profiles (obtained from BBBC021)
profiles <- readr::read_csv(system.file("extdata", "ljosa_jbiomolscreen_2013_per_well_mean.csv.gz", package = "cytominergallery")) moa <- readr::read_csv(system.file("extdata", "BBBC021_v1_moa.csv", package = "cytominergallery")) %>% rename(Image_Metadata_Compound = compound, Image_Metadata_Concentration = concentration, Image_Metadata_MoA = moa ) metadata <- readr::read_csv(system.file("extdata", "BBBC021_v1_image.csv.gz", package = "cytominergallery")) %>% rename(Image_Metadata_Plate = Image_Metadata_Plate_DAPI, Image_Metadata_Well = Image_Metadata_Well_DAPI ) %>% select(matches("^Image_Metadata")) %>% inner_join(moa) %>% distinct() profiles %<>% inner_join(metadata) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
How many compounds?
profiles %>% filter(Image_Metadata_Compound != "DMSO") %>% distinct(Image_Metadata_Compound) %>% tally() %>% rename(`Number of compounds` = n) %>% knitr::kable()
How many unique treatments (compound-concentration pairs)?
profiles %>% filter(Image_Metadata_Compound != "DMSO") %>% distinct(Image_Metadata_Compound, Image_Metadata_Concentration) %>% tally() %>% rename(`Number of unique treatments` = n) %>% knitr::kable()
How many replicates per unique treatment?
profiles %>% filter(Image_Metadata_Compound != "DMSO") %>% count(Image_Metadata_Compound, Image_Metadata_Concentration) %>% rename(`Number of replicates` = n) %>% knitr::kable()
How many DMSO wells per plate?
profiles %>% filter(Image_Metadata_Compound == "DMSO") %>% count(Image_Metadata_Plate) %>% rename(`Number of DMSO wells` = n) %>% knitr::kable()
Next, lets filter the set of features based on various measures of quality
Remove features that have near-zero variance. This dataset doesn't have any such features, so nothing is removed.
profiles <- cytominer::variable_select( population = profiles, variables = variables, sample = profiles, operation = "variance_threshold" ) %>% collect() variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Remove features that have poor correlation across replicates. To do so, lets first compute the correlations.
doParallel::registerDoParallel(cores = 2) feature_replicate_correlations <- profiles %>% cytominer::variable_importance( variables = variables, strata = c("Image_Metadata_Compound", "Image_Metadata_Concentration"), replicates = 3, cores = 2)
What the does the distribution look like?
ggplot(feature_replicate_correlations, aes(median)) + stat_ecdf() + geom_vline(xintercept = 0.5, color = "red") + xlab("median replicate correlation (Pearson)") + ylab("F(x)")
Here, we select a threshold and remove features that have a replicate correlation lower than that threshold
profiles %<>% select_(.dots = setdiff(x = colnames(profiles), y = feature_replicate_correlations %>% filter(median < 0.5) %>% magrittr::extract2("variable")) ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Filter based on correlation between features. The morphological features
extracted contain several highly correlated groups. We want to to prune the set
of features, retaining only one feature from each of these highly correlated
sets. The function correlation_threshold
provides an approximate (greedy)
solution to this problem. After excluding the features, no pair of features
have a correlation greater than cutoff
indicated below.
profiles <- cytominer::variable_select( population = profiles, variables = variables, sample = profiles, operation = "correlation_threshold", cutoff = 0.95) %>% collect() variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
There may be plate-to-plate variations, which can be compensated for to some extent by normalizing the features with respect to the DMSO wells per plate.
profiles <- cytominer::normalize( population = profiles, variables = variables, strata = c("Image_Metadata_Plate"), sample = profiles %>% filter(Image_Metadata_Compound == "DMSO") ) profiles <- cytominer::variable_select( population = profiles, variables = variables, operation = "drop_na_columns" ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
We have selected features and normalized the data. We can now compute treatment profiles by averaging across replicates.
profiles <- cytominer::aggregate( population = profiles, variables = variables, strata = c("Image_Metadata_Compound", "Image_Metadata_Concentration", "Image_Metadata_MoA"), operation = "mean" ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Let's visualize this data using t-SNE.
profiles %<>% filter(Image_Metadata_Compound != "DMSO") correlation <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t() %>% cor() mechanism <- as.character(profiles$Image_Metadata_MoA) set.seed(123) df <- tibble::as_data_frame( tsne::tsne(as.dist(1-correlation)) ) %>% mutate(mechanism = mechanism) p <- ggplot(df, aes(V1, V2, color=mechanism)) + geom_point() + ggtitle("t-SNE visualization of compound profiles") print(p)
The data clusters into mechanisms quite nicely. Let's quantify this
by evaluating how well we can predict mechanism-of-action by simply assigning
a treatment the mechanism of its nearest neighbor.
NOTE: A common mistake when analyzing this dataset is to not exclude other
concentrations of the same compound when looking up the nearest neighbor. That is cheating! mask
in the code below addresses this.
compound <- profiles$Image_Metadata_Compound mask <- as.integer(outer(compound, compound, FUN="!=")) mask[mask == 0] <- -Inf correlation_masked <- correlation * mask prediction <- sapply(1:nrow(correlation_masked), function(i) mechanism[order(correlation_masked[i,], decreasing = TRUE)[1]]) confusion_matrix <- caret::confusionMatrix(as.factor(prediction), as.factor(mechanism))
What's the classification accuracy?
tibble::frame_data( ~metric, ~value, "Accuracy", sprintf("%.2f", confusion_matrix$overall["Accuracy"]), "95% CI", sprintf("(%.2f, %.2f)", confusion_matrix$overall[["AccuracyLower"]], confusion_matrix$overall[["AccuracyUpper"]]) ) %>% knitr::kable(digits = 2)
What does the whole confusion matrix look like?
confusion_matrix$table %>% knitr::kable()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.