This notebook computes the bootstrap projections for a single model type whose features were previously learned and extracted from the TNBC spatial proteomics dataset.

The model type is specified by the subset parameter -- we assume that the raw_dir folder contains files of the form {name of subset}-{bootstrap number}*.tar.gz. Previously trained models in this format can be downloaded by following the README.md file in the data/raw_data/ folder. To generate faceted displays like those in Figure 8, this script must be run with different subset parameters -- each run of this notebook produces exactly one panel of those displays. An example of the loop that was used is given at the bottom of this script.
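
For reference, an individual run sets these options through the params argument of rmarkdown::render. The parameter names below are the ones referenced throughout this notebook; the values are placeholders only and should be adjusted to your setup.

rmarkdown::render(
  "bootstrap.Rmd",
  params = list(
    subset = "tnbc_cnn-k32",           # prefix of the {subset}-{bootstrap}*.tar.gz files
    raw_dir = "data/raw_data",         # placeholder path to the downloaded models
    derived_dir = "data/derived_data", # placeholder path for outputs
    K = 2,                             # number of dimensions kept after reduction
    B = 100,                           # number of parametric bootstrap samples
    transform = "log"                  # feature transformation (see the loop below)
  )
)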

Aside from the source of the learned features, this analysis script is identical to the one in analysis/simulations/spatial_process_simulation/: both apply each type of bootstrap and write a csv file of aligned projections to the params$derived_dir folder.

library(knitr)
opts_chunk$set(warning = FALSE, message = FALSE)

The block below loads the necessary libraries. reticulate is needed because the feature learners were trained in Python and the extracted features are saved as numpy arrays.

set.seed(1234)
library(LFBCR)
library(irlba)
library(tidyverse)
library(reticulate)
np <- import("numpy")
theme_set(min_theme())
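
As a small aside (not part of the analysis), this is roughly what reading a single saved array through reticulate looks like; the file name here is hypothetical, and read_learned_features() below takes care of this for all replicates.

# illustration only -- "example_best.npy" is a hypothetical file name
Z_example <- np$load("example_best.npy")
dim(Z_example)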

The block below unzips a folder containing features from each bootstrap replicate matching the subset parameter. We also create a subfolder of derived_data/ for the current model of interest.

model_paths <- list.files(params$raw_dir, str_c(params$subset, "*"), full = TRUE)
derived_dir <- file.path(params$derived_dir, params$subset)
dir.create(derived_dir, recursive = TRUE)
untar_all(model_paths, derived_dir)
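
untar_all() is a helper from the package; assuming it simply extracts every archive into the same directory, a rough base R equivalent would be the following.

# approximate base R equivalent of untar_all() (illustration only)
for (path in model_paths) {
  untar(path, exdir = derived_dir)
}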

Next we read in the learned features, together with metadata describing which split each sample belongs to. Each element of the list Zb is the learned feature embedding from a different bootstrap replicate (before any dimensionality reduction). The *best.npy argument specifies which of the extracted features to use -- features are saved across many epochs of training, and *best.npy refers to the set of features from the epoch at which the feature learning model had the lowest validation loss.

prefix <- ifelse(str_detect(params$subset, "rcf"), "full_best.npy", "*best.npy")
Zb <- read_learned_features(derived_dir, prefix, params$transform)
ix <- read_split_indices(file.path(params$derived_dir, params$subset))
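
A few optional checks can confirm what was read in: Zb should be a list with one feature matrix per bootstrap replicate (rows indexed by sample), and ix should record the split membership of each sample.

# optional sanity checks (illustration only)
length(Zb)       # number of bootstrap replicates
sapply(Zb, dim)  # samples x features for each replicate
table(ix$split)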

The block below performs the nonparametric bootstrap approach from Section 2.1. Zb_ contains the dimensionality-reduced features for each replicate, and ud_hats contains those features after alignment across replicates. We join with ix so that points can later be shaded by their associated y value. Finally, we save the aligned projections, together with this metadata, as a csv in the derived_data folder.

Zb_ <- Zb %>%
  map(~ .[ix$split %in% c("test", "dev"), ]) %>%
  map(~ {
    svf <- irlba(., nv=params$K) 
    svf$u %*% diag(svf$d)
  })

ud_hats <- align_to_list(Zb_) %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "nonparametric") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number())) 
write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_nonparametric.csv"))

The block below performs the parametric approach. Here we pretend that we had fit only one feature extraction model, by subsetting to the first element of the list Zb. We then generate params$B parametric bootstrap samples through repeated calls to boot_fun. The final alignment and joining steps are identical to those in the block above.

Zb_ <- Zb[[1]][ix$split %in% c("test", "dev"), ]
boot_fun <- param_boot(Zb_)
ud_hats <- rerun(params$B, boot_fun()$ub) %>%
  align_to_list(df = F) %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "parametric") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number()))
write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_parametric.csv"))
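
For intuition, a minimal sketch of a parametric bootstrap for a low-rank embedding is given below, assuming a rank-K SVD fit with homogeneous Gaussian residuals. This is only meant to convey the structure of the approach; it is not the implementation used by param_boot().

# sketch only -- not the package implementation
sketch_param_boot <- function(Z, K) {
  svf <- irlba(Z, nv = K)
  Z_hat <- svf$u %*% diag(svf$d) %*% t(svf$v)  # rank-K reconstruction
  sigma <- sd(Z - Z_hat)                       # pooled residual scale
  function() {
    Z_star <- Z_hat + matrix(rnorm(length(Z), sd = sigma), nrow(Z), ncol(Z))
    svf_star <- irlba(Z_star, nv = K)
    svf_star$u %*% diag(svf_star$d)            # projections for one bootstrap sample
  }
}
# e.g., sketch_fun <- sketch_param_boot(Zb_, params$K); dim(sketch_fun())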

The compromise approach of Section 2.3 is computed below. The param_boot_cmp function generates parametric bootstrap samples based on the multiple learned features in Zb.

Zb_ <- Zb %>%
  map(~ .[ix$split %in% c("test", "dev"), ])
boot_fun <- param_boot_cmp(Zb_)
ud_hats <- rerun(params$B, boot_fun()$ub) %>%
  align_to_list() %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "compromise") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number())) 
write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_compromise.csv"))
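
As an optional check, the three csv outputs for this subset should now all be present in params$derived_dir.

# optional: confirm that all three bootstrap outputs were written
list.files(params$derived_dir, pattern = str_c("^", params$subset, "_.*\\.csv$"))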

To run this notebook across all relevant TNBC settings, you can use the following shell commands from the folder containing this .Rmd file. The first two Rscript calls loop over the CNN and VAE features, and the last is for the RCF.

```{sh, eval = FALSE}
for k in 32 64 128; do
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_cnn-k${k}', transform='log'))"
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_vae-k${k}'))"
done;

for k in 256 512 1024; do
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_rcf-k${k}'))"
done;
```

To visualize the outputs, you can use the bootstrap-vis.Rmd notebook in this folder.


