The primary dataset has 32 chemicals tested using the same protocol as the other dataset with 80 chemicals. The data is dichotomous incidence data and the same bootstrap procedue (n = 1000) was applied. For the endpoint percent_affected_96, the same baseline noise threshold was identified (25%) using bootstrap number = 1000 and all the chemicals (thresDist_raw see below)
However, it took a lot of time for the above procedue. It will be better to use fewer chemicals and/or fewer number of bootstrap samples to achieve the same outcome.
Currently, there are two methods to get the threshold based on bootstrap results. 1. max distance to line by original pooled variance values (y) and the thresholds (thresDist_raw) 2. max distance to line by pooled variance values after exponential fitting and the thresholds (thresDist_exp)
library(here) library(Rcurvep) library(tibble) library(dplyr) library(tidyr) library(purrr) library(ggplot2)
zfishdev_boot <- readRDS(here("data-raw", "big_dataset", "biobide32_percent_boot_1000.rds")) bootd <- zfishdev_boot[["percent_affected_96"]] #make old format as the new one bootd <- bootd %>% rename(threshold = thres) %>% separate(dduid, c("endpoint", "chemical", "direction"), sep = "#") %>% separate(direction, c("directionality", "direction"), sep = "@") %>% select(-directionality)
sum_act <- extract_curvep_data(bootd, "summary", "INVERSE") sum_act_hit <- sum_act %>% group_by(chemical, endpoint, direction) %>% summarize(hit_confidence = mean(hit_confidence)) actived <- sum_act_hit %>% filter(hit_confidence > 0.5) inactived <- sum_act_hit %>% filter(hit_confidence <= 0.5) p1 <- ggplot(sum_act_hit, aes(x = hit_confidence)) + geom_histogram() p1
In total, regardless of the curvep threshold used, there are r nrow(actived)
active compounds out of r nrow(sum_act_hit)
compounds. The inactive compound names are: r inactived %>% pull(chemical)
.
This function is to subset the previously stored bootstrap results and re-calculate the statistics. Replacement = FALSE
subset_boot_data <- function(dd, chem_frac, n_sample) { sel_sample_ids <- sample(1:1000, n_sample) chem_names <- dd %>% pull(chemical) %>% unique() sel_chem_names <- sample(chem_names, round(length(chem_names)*chem_frac)) dd_p <- dd %>% filter(chemical %in% sel_chem_names) %>% filter(repeat_id %in% sel_sample_ids) acts <- extract_curvep_data(dd_p, "act") %>% mutate(POD = ifelse(is.na(POD), conc_highest, POD)) result <- acts %>% identify_basenoise_threshold(., plot = FALSE) return(list(thres_d = result[[1]], thres_o = result[[2]], sel_chem = sel_chem_names, sel_sample = sel_sample_ids)) #return(list(acts, sel_chem_names, sel_sample_ids)) }
set.seed(100) test_boot_100 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 1, n_sample = 100)})
Reducing to n_sample = 100 but with all chemicals can still get the 25%. There are cases with threDist_raw as 15%. However, when visually inspected the diagnostic plot, the threshold should be 25% (see the plot below)
map(test_boot_100, ~ .x[[2]]) %>% bind_rows()
when there is a "bump", thresDist_exp can get more consistent results
generate_diagnostic_plot(test_boot_100[[20]][[1]])
In this case, the values for 25% and 30% in the dist2l_exp are pretty similar.
test_boot_100[[3]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)
set.seed(10) test_boot_10 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 1, n_sample = 10)})
Reducing to n_sample = 10 but with all chemicals can not get the 25%. The sign that indicates ill-sampling seems to be p2.
map(test_boot_10, ~ .x[[2]]) %>% bind_rows()
Reducing to n_sample = 10 but with all chemicals, interestingly, using the exponential fitting (thresDist_exp), 25% threshold is often seen.
generate_diagnostic_plot(test_boot_10[[1]][[1]])
test_boot_10[[7]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)
set.seed(50) test_frac_50p <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.5, n_sample = 1000)})
Reducing to 50% of chemicals with n_sample = 1000, the thresDist_raw has more issue when p2_raw is not the last threshold. In this case, thresDist_exp seems more reliable.
map(test_frac_50p, ~ .x[[2]]) %>% bind_rows()
generate_diagnostic_plot(test_frac_50p[[8]][[1]])
set.seed(550) test_frac_50p_boot_100 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.5, n_sample = 100)})
Reducing to 50% of chemicals with n_sample = 100 seems still OK if using thresDist_exp
map(test_frac_50p_boot_100, ~ .x[[2]]) %>% bind_rows()
However, there is not much sign to indicate the less optimal threshold is selected
generate_diagnostic_plot(test_frac_50p_boot_100[[3]][[1]])
test_frac_50p_boot_100[[3]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)
set.seed(350) #350 test_frac_20p <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.2, n_sample = 1000)})
Many "check" in thresComment. However, there are cases that the plots look OK (e.g., run 8 to 10)
map(test_frac_20p, ~ .x[[2]]) %>% bind_rows()
map(test_frac_20p, ~ .x[[2]]) %>% bind_rows() %>% rowid_to_column() %>% filter(cor_exp_fit < 0.95)
#generate_diagnostic_plot(test_frac_20p[[16]][[1]]) map(test_frac_20p, ~ generate_diagnostic_plot(.x[[1]]))
Check the picked compounds: how many of them are inactive compounds (out of 6). The number of inactive compounds are shown below.
There is not too much relationship between number of inactives and quality of the curve (cor_exp_fit)
map(test_frac_20p, ~ .x[[3]]) %>% map(., ~ length(setdiff(.x, actived %>% pull(chemical))))
Based on this dataset,
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.