Introduction

The primary dataset has 32 chemicals tested using the same protocol as the other dataset with 80 chemicals. The data is dichotomous incidence data and the same bootstrap procedue (n = 1000) was applied. For the endpoint percent_affected_96, the same baseline noise threshold was identified (25%) using bootstrap number = 1000 and all the chemicals (thresDist_raw see below)

However, it took a lot of time for the above procedue. It will be better to use fewer chemicals and/or fewer number of bootstrap samples to achieve the same outcome.

Currently, there are two methods to get the threshold based on bootstrap results. 1. max distance to line by original pooled variance values (y) and the thresholds (thresDist_raw) 2. max distance to line by pooled variance values after exponential fitting and the thresholds (thresDist_exp)

Goal

  1. to investigate how the change of the two parameters (number of bootstrap samples and fraction of chemicals) affects the baseline noise threshold identification.

load R packages

library(here)
library(Rcurvep)
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

prepare the dataset

zfishdev_boot <- readRDS(here("data-raw", "big_dataset", "biobide32_percent_boot_1000.rds"))
bootd <- zfishdev_boot[["percent_affected_96"]]
#make old format as the new one
bootd <- bootd %>%
  rename(threshold = thres) %>%
  separate(dduid, c("endpoint", "chemical", "direction"), sep = "#") %>%
  separate(direction, c("directionality", "direction"), sep = "@") %>%
  select(-directionality)
sum_act <- extract_curvep_data(bootd, "summary", "INVERSE")
sum_act_hit <- sum_act %>% 
  group_by(chemical, endpoint, direction) %>% 
  summarize(hit_confidence = mean(hit_confidence))

actived <- sum_act_hit %>% filter(hit_confidence > 0.5)
inactived <- sum_act_hit %>% filter(hit_confidence <= 0.5)

p1 <- ggplot(sum_act_hit, aes(x = hit_confidence)) + geom_histogram()
p1

In total, regardless of the curvep threshold used, there are r nrow(actived) active compounds out of r nrow(sum_act_hit) compounds. The inactive compound names are: r inactived %>% pull(chemical).

functions

This function is to subset the previously stored bootstrap results and re-calculate the statistics. Replacement = FALSE

subset_boot_data <- function(dd, chem_frac, n_sample) {
  sel_sample_ids <- sample(1:1000, n_sample)

  chem_names <- dd %>% pull(chemical) %>% unique()
  sel_chem_names <- sample(chem_names, round(length(chem_names)*chem_frac))

  dd_p <- dd %>% 
    filter(chemical %in% sel_chem_names) %>% 
    filter(repeat_id %in% sel_sample_ids)
  acts <- extract_curvep_data(dd_p, "act") %>% 
    mutate(POD = ifelse(is.na(POD), conc_highest, POD)) 
   result <- acts %>% identify_basenoise_threshold(., plot = FALSE)
  return(list(thres_d = result[[1]], thres_o = result[[2]], sel_chem = sel_chem_names, sel_sample = sel_sample_ids))
   #return(list(acts, sel_chem_names, sel_sample_ids))

}

Tests

n_sample = 100 (with all chemicals)

set.seed(100)
test_boot_100 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 1, n_sample = 100)})

Reducing to n_sample = 100 but with all chemicals can still get the 25%. There are cases with threDist_raw as 15%. However, when visually inspected the diagnostic plot, the threshold should be 25% (see the plot below)

map(test_boot_100, ~ .x[[2]]) %>% bind_rows() 

when there is a "bump", thresDist_exp can get more consistent results

generate_diagnostic_plot(test_boot_100[[20]][[1]])

In this case, the values for 25% and 30% in the dist2l_exp are pretty similar.

test_boot_100[[3]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)

n_sample = 10 (with all chemicals)

set.seed(10)
test_boot_10 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 1, n_sample = 10)})

Reducing to n_sample = 10 but with all chemicals can not get the 25%. The sign that indicates ill-sampling seems to be p2.

map(test_boot_10, ~ .x[[2]]) %>% bind_rows() 

Reducing to n_sample = 10 but with all chemicals, interestingly, using the exponential fitting (thresDist_exp), 25% threshold is often seen.

generate_diagnostic_plot(test_boot_10[[1]][[1]])
test_boot_10[[7]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)

n_sample = 1000 (with 50% of chemicals)

set.seed(50)
test_frac_50p <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.5, n_sample = 1000)})

Reducing to 50% of chemicals with n_sample = 1000, the thresDist_raw has more issue when p2_raw is not the last threshold. In this case, thresDist_exp seems more reliable.

map(test_frac_50p, ~ .x[[2]]) %>% bind_rows()
generate_diagnostic_plot(test_frac_50p[[8]][[1]])

n_sample = 100 (with 50% of chemicals)

set.seed(550)
test_frac_50p_boot_100 <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.5, n_sample = 100)})

Reducing to 50% of chemicals with n_sample = 100 seems still OK if using thresDist_exp

map(test_frac_50p_boot_100, ~ .x[[2]]) %>% bind_rows()

However, there is not much sign to indicate the less optimal threshold is selected

generate_diagnostic_plot(test_frac_50p_boot_100[[3]][[1]])
test_frac_50p_boot_100[[3]][[1]] %>% select(dist2l_exp, dist2l_raw, threshold)

n_sample = 1000 (with 20% of chemicals)

set.seed(350) #350
test_frac_20p <- map(1:20, function(x) {subset_boot_data(dd = bootd, chem_frac = 0.2, n_sample = 1000)})

Many "check" in thresComment. However, there are cases that the plots look OK (e.g., run 8 to 10)

map(test_frac_20p, ~ .x[[2]]) %>% bind_rows() 
map(test_frac_20p, ~ .x[[2]]) %>% bind_rows() %>% 
  rowid_to_column() %>%
  filter(cor_exp_fit < 0.95)
#generate_diagnostic_plot(test_frac_20p[[16]][[1]])
map(test_frac_20p, ~ generate_diagnostic_plot(.x[[1]]))

Check the picked compounds: how many of them are inactive compounds (out of 6). The number of inactive compounds are shown below.

There is not too much relationship between number of inactives and quality of the curve (cor_exp_fit)

map(test_frac_20p, ~ .x[[3]]) %>% map(., ~ length(setdiff(.x, actived %>% pull(chemical))))

Summary

Based on this dataset,

  1. It looks like it is OK to use a smaller number of bootstrap samples for the threshold finding such as n = 100, even n = 10 when using all chemicals
  2. Although it may be possible to use only 20% of chemicals (i.e., 6 compounds) to get the right threshold, the results are not predictable by using the number of inactive compounds selected. In other words, very poor quality of diagnostic plot could be seen even when using 5 active compounds. (e.g., run 20 in test_frac_20p) Therefore, another parameter is needed to predict the behavior of the diagnostic plot.
  3. cor_exp_fit, p1_raw, p2_raw columns seem to be useful in indicating the quality of the diagnostic plot in addition to the current thresComment column.


moggces/Rcurvep documentation built on Feb. 6, 2024, 3:30 a.m.