This vignette demonstrates predicting compound mechanism-of-action using
morphological profiling data. The data is "cleaned up" using the similarity
network fusion algorithm (SNF). This vignette bases on the vignette
predict_moa
. See the vignette single_cell_analysis
for
details about this dataset.
library(dplyr) library(magrittr) library(ggplot2) library(stringr) library(cytominergallery) library(SNFtool)
Per-well profiles computed in single_cell_analysis
are loaded, as well as
metadata associated with these profiles (obtained from BBBC021).
This is the same data as used in the vignette predict_moa
.
profiles <- readr::read_csv(system.file("extdata", "ljosa_jbiomolscreen_2013_per_well_mean.csv.gz", package = "cytominergallery")) moa <- readr::read_csv(system.file("extdata", "BBBC021_v1_moa.csv", package = "cytominergallery")) %>% rename(Image_Metadata_Compound = compound, Image_Metadata_Concentration = concentration, Image_Metadata_MoA = moa ) metadata <- readr::read_csv(system.file("extdata", "BBBC021_v1_image.csv.gz", package = "cytominergallery")) %>% rename(Image_Metadata_Plate = Image_Metadata_Plate_DAPI, Image_Metadata_Well = Image_Metadata_Well_DAPI ) %>% select(matches("^Image_Metadata")) %>% inner_join(moa) %>% distinct() profiles %<>% inner_join(metadata) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Next, lets filter the set of features based on various measures of quality
Remove features that have poor correlation across replicates. To do so, lets first compute the correlations.
doParallel::registerDoParallel(cores = 4) feature_replicate_correlations <- profiles %>% cytominer::variable_importance( variables = variables, strata = c("Image_Metadata_Compound", "Image_Metadata_Concentration"), replicates = 3, cores = 2)
Similar to the predict_moa
vignette, we select a threshold and remove features that
have a replicate correlation lower than that threshold.
profiles %<>% select_(.dots = setdiff(x = colnames(profiles), y = feature_replicate_correlations %>% filter(median < 0.5) %>% magrittr::extract2("variable")) ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Filter based on correlation between features, similar to predict_moa
.
profiles <- cytominer::variable_select( population = profiles, variables = variables, sample = profiles, operation = "correlation_threshold", cutoff = 0.95) %>% collect() variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
There may be plate-to-plate variations, which can be compensated for to some extent by normalizing the features with respect to the DMSO wells per plate.
profiles <- cytominer::normalize( population = profiles, variables = variables, strata = c("Image_Metadata_Plate"), sample = profiles %>% filter(Image_Metadata_Compound == "DMSO") ) profiles <- cytominer::variable_select( population = profiles, variables = variables, operation = "drop_na_columns" ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
We have selected features and normalized the data. We can now compute treatment profiles by averaging across replicates.
profiles <- cytominer::aggregate( population = profiles, variables = variables, strata = c("Image_Metadata_Compound", "Image_Metadata_Concentration", "Image_Metadata_MoA"), operation = "mean" ) variables <- colnames(profiles) %>% str_subset("^Nuclei_|^Cells_|^Cytoplasm_")
Let's visualize this data using t-SNE.
profiles %<>% filter(Image_Metadata_Compound != "DMSO") correlation <- profiles %>% select(one_of(variables)) %>% as.matrix() %>% t() %>% cor() mechanism <- as.character(profiles$Image_Metadata_MoA) plot_tsne <- function(correlation, mechanism){ set.seed(42) df <- tibble::as_data_frame( tsne::tsne(as.dist(1 - correlation)) ) %>% mutate(mechanism = mechanism) p <- ggplot(df, aes(V1, V2, color = mechanism)) + geom_point() + ggtitle("t-SNE visualization of compound profiles clean using SNF") print(p) } plot_tsne(correlation = correlation, mechanism = mechanism)
As we saw in predict_moa
, the data clusters into mechanisms quite nicely. Let's
quantify this by evaluating how well we can predict mechanism-of-action by simply
assigning a treatment the mechanism of its nearest neighbor.
NOTE: A common mistake when analyzing this dataset is to not exclude other
concentrations of the same compound when looking up the nearest neighbor. That is
cheating! mask
in the code below addresses this.
predict_moa <- function(correlation, compound, mechanism){ mask <- as.integer(outer(compound, compound, FUN = "!=")) mask[mask == 0] <- -Inf correlation_masked <- correlation * mask return(sapply(1:nrow(correlation_masked), function(i) mechanism[order(correlation_masked[i,], decreasing = TRUE)[1]]) ) } compound <- profiles$Image_Metadata_Compound prediction <- predict_moa(correlation, compound, mechanism) confusion_matrix <- caret::confusionMatrix(as.factor(prediction), as.factor(mechanism))
What's the classification accuracy?
evaluate_prediction <- function(confusion_matrix ){ tibble::frame_data( ~metric, ~value, "Accuracy", sprintf("%.2f", confusion_matrix$overall["Accuracy"]), "95% CI", sprintf("(%.2f, %.2f)", confusion_matrix$overall[["AccuracyLower"]], confusion_matrix$overall[["AccuracyUpper"]]) ) } evaluate_prediction(confusion_matrix) %>% knitr::kable(digits = 2)
What does the whole confusion matrix look like?
confusion_matrix$table %>% knitr::kable()
Next we test the similarity network fusion (SNF) algorithm to clean the data.
distance <- 1-correlation affinity <- affinityMatrix(distance, K = 20, sigma = 0.5) snf_distance <- SNF(list(affinity, affinity), K = 20, t = 20 )
Do we get better results using SNF?
prediction <- predict_moa(snf_distance, compound, mechanism) confusion_matrix <- caret::confusionMatrix(as.factor(prediction), as.factor(mechanism)) evaluate_prediction(confusion_matrix) %>% knitr::kable(digits = 2)
Visualize the results using t-SNE.
plot_tsne(correlation = snf_distance, mechanism = mechanism)
Now let's split the data by the feature categories, compute a similarity matrix per category, then combine these matrices using SNF.
feature_label = list("_AreaShape_", "_Intensity_","_Texture_","_Neighbors_") feature_names_list <- lapply(feature_label, function(x) ( profiles %>% select(matches(x)) %>% colnames()) ) correlation_matrices <- lapply(feature_names_list, function(x) ( profiles %>% select(one_of(x)) %>% as.matrix() %>% t() %>% cor()) ) affinity_list <- lapply(correlation_matrices, function(x) ( affinityMatrix(1 - x, K = 20, sigma = 0.5 )) ) snf_distance <- SNF(affinity_list, K = 20, t = 20 )
How is the performance?
prediction <- predict_moa(snf_distance, compound, mechanism) confusion_matrix <- caret::confusionMatrix(as.factor(prediction), as.factor(mechanism)) evaluate_prediction(confusion_matrix) %>% knitr::kable(digits = 2)
Again plot using t-SNE
plot_tsne(correlation = snf_distance, mechanism = mechanism)
Now let's split the data by the cell constituents, and as before, compute a similarity matrix per constituent, then combine these matrices using SNF.
#profiles %<>% # filter(Image_Metadata_Compound != "DMSO") constituents = list("Cells", "Nuclei","Cytoplasm") feature_list <- lapply(constituents, function(x) (profiles %>% select(matches(x)) %>% colnames()) ) correlation_matrices <- lapply(feature_list, function(x) ( profiles %>% select(one_of(x)) %>% as.matrix() %>% t() %>% cor()) ) affinity_list <- lapply(correlation_matrices, function(x) ( affinityMatrix(1 - x, K = 20, sigma = 0.5 ) )) snf_distance <- SNF(affinity_list, K = 20, t = 20 )
Do we get better results?
prediction <- predict_moa(snf_distance, compound, mechanism) confusion_matrix <- caret::confusionMatrix(as.factor(prediction), as.factor(mechanism)) evaluate_prediction(confusion_matrix) %>% knitr::kable(digits = 2) plot_tsne(correlation = snf_distance, mechanism = mechanism)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.