This notebook outputs the bootstrap projections for a single model type whose features were previously learned and extracted for the TNBC spatial proteomic dataset. The model type is specified by the `subset` parameter -- we assume that the `raw_dir` folder contains files of the form `{name of subset}-{bootstrap number}*.tar.gz`. Previously trained models in this format can be downloaded by following the `README.md` file in the `data/raw_data/` folder. To generate faceted displays like those in Figure 8, this script must be run with several different `subset` parameters; each run of the notebook produces exactly one panel of those displays. An example of the loop that was used is given at the bottom of this script.
Aside from the source of the learned features, there is no difference between this analysis script and the one in `analysis/simulations/spatial_process_simulation/`: both apply each type of bootstrap and write a csv file of aligned projections to the `params$derived_dir` folder.
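For reference, the behavior of this notebook is controlled by the parameters in its YAML header (`subset`, `raw_dir`, `derived_dir`, `transform`, `K`, and `B`, all referenced in the code below). The header is not reproduced on this page; a sketch along the following lines is assumed, with the default values shown here being placeholders rather than the settings used for the paper's figures.

```yaml
params:
  subset: "tnbc_cnn-k32"           # model family / feature dimension to process
  raw_dir: "data/raw_data"         # holds the {name of subset}-{bootstrap number}*.tar.gz archives
  derived_dir: "data/derived_data" # where the aligned projection csv files are written
  transform: "log"                 # optional transform applied when reading features
  K: 2                             # number of dimensions kept after the SVD
  B: 100                           # number of parametric bootstrap samples
```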
```{r}
library(knitr)
opts_chunk$set(warning = FALSE, message = FALSE)
```
The block below loads the necessary libraries. `reticulate` is needed because the feature learners were trained in Python, and the extracted features are saved as numpy arrays.
```{r}
set.seed(1234)
library(LFBCR)
library(irlba)
library(tidyverse)
library(reticulate)
np <- import("numpy")
theme_set(min_theme())
```
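The `np` handle above gives access to numpy from R, which is how the saved `.npy` feature arrays are read. Purely as an illustration -- the path below is hypothetical -- loading one such array through `reticulate` looks like this:

```{r, eval = FALSE}
# Illustration only: read a single saved feature matrix with reticulate.
# File discovery and any transform are handled by read_learned_features;
# "path/to/full_best.npy" is a stand-in, not a real path in this project.
z_example <- np$load("path/to/full_best.npy")
dim(z_example)  # rows index samples, columns index learned features
```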
The block below unpacks the archives of features from each bootstrap replicate matching the `subset` parameter. We also create a subfolder of `derived_data/` for the current model of interest.
```{r}
model_paths <- list.files(params$raw_dir, str_c(params$subset, "*"), full = TRUE)
derived_dir <- file.path(params$derived_dir, params$subset)
dir.create(derived_dir, recursive = TRUE)
untar_all(model_paths, derived_dir)
```
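If the archives described in `data/raw_data/README.md` have not been downloaded, `model_paths` will be empty and the reads further down will fail with a less obvious error. An optional sanity check (not part of the original pipeline) makes that failure explicit:

```{r, eval = FALSE}
# Optional check: confirm that archives matching the requested subset were found
# before attempting to untar and read them.
if (length(model_paths) == 0) {
  stop("No archives matching '", params$subset, "' found in ", params$raw_dir)
}
```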
Next we read in the learned features and metadata describing which split each sample belongs to. Each element of the list `Zb` is a learned feature embedding from a different bootstrap replicate (before doing any dimensionality reduction). The `*best.npy` argument specifies which of the extracted features to use -- we save features from across many epochs of training, and `*best.npy` refers to the set of features at the epoch for which the feature learning model had the lowest validation loss.
```{r}
prefix <- ifelse(str_detect(params$subset, "rcf"), "full_best.npy", "*best.npy")
Zb <- read_learned_features(derived_dir, prefix, params$transform)
ix <- read_split_indices(file.path(params$derived_dir, params$subset))
```
The block below performs the nonparametric bootstrap approach from Section 2.1. `Zb_` holds the dimensionality-reduced features, and `ud_hats` holds those features after alignment. We join with `ix` so that we can shade points by their associated `y` value. Finally, we save the aligned projections, together with metadata, as a csv in the `derived_data` folder.
```{r}
Zb_ <- Zb %>%
  map(~ .[ix$split %in% c("test", "dev"), ]) %>%
  map(~ {
    svf <- irlba(., nv = params$K)
    svf$u %*% diag(svf$d)
  })

ud_hats <- align_to_list(Zb_) %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "nonparametric") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number()))

write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_nonparametric.csv"))
```
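The alignment itself is delegated to `align_to_list`, whose implementation is not shown in this notebook. As a rough sketch of the general idea -- an orthogonal Procrustes rotation of each replicate's projection onto a common reference, which is a standard way of aligning bootstrap projections and may differ in detail from what `align_to_list` does -- consider:

```{r, eval = FALSE}
# Sketch only: rotate each bootstrap projection onto the first replicate with an
# orthogonal Procrustes step. align_to_list may handle more (e.g., sign flips).
procrustes_rotate <- function(x, reference) {
  s <- svd(t(x) %*% reference)
  x %*% (s$u %*% t(s$v))  # rotation minimizing || x R - reference ||_F
}
aligned_sketch <- map(Zb_, procrustes_rotate, reference = Zb_[[1]])
```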
The block below performs the parametric approach. Here we pretend that we had fit only one feature extraction model by subsetting to the first element of the list `Zb`. We then generate `params$B` parametric bootstrap samples using repeated calls to `boot_fun`. The final alignment and joining steps are identical to those in the block above.
```{r}
Zb_ <- Zb[[1]][ix$split %in% c("test", "dev"), ]
boot_fun <- param_boot(Zb_)

ud_hats <- rerun(params$B, boot_fun()$ub) %>%
  align_to_list(df = F) %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "parametric") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number()))

write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_parametric.csv"))
```
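`param_boot` encapsulates the parametric sampler, and its internals are not reproduced here. As a loose illustration of how such a sampler can be built -- one common construction, assuming Gaussian residuals around a rank-`K` truncated SVD, and not necessarily what LFBCR implements -- a version with the same calling convention as `boot_fun()` might look like:

```{r, eval = FALSE}
# Hypothetical parametric sampler: fit a rank-K model, estimate residual noise,
# then simulate new feature matrices and re-project them. For illustration only.
param_boot_sketch <- function(z, K = params$K) {
  s <- irlba(z, nv = K)
  fit <- s$u %*% diag(s$d) %*% t(s$v)  # low-rank reconstruction of the features
  sigma <- sd(z - fit)                 # residual scale estimate
  function() {
    z_star <- fit + matrix(rnorm(length(z), sd = sigma), nrow(z), ncol(z))
    s_star <- irlba(z_star, nv = K)
    list(ub = s_star$u %*% diag(s_star$d))  # mirrors boot_fun()$ub
  }
}
```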
The compromise approach of Section 2.3 is computed below. The `param_boot_cmp` function generates parametric bootstrap samples based on all of the learned feature sets in `Zb`.
```{r}
Zb_ <- Zb %>% map(~ .[ix$split %in% c("test", "dev"), ])
boot_fun <- param_boot_cmp(Zb_)

ud_hats <- rerun(params$B, boot_fun()$ub) %>%
  align_to_list() %>%
  map_dfr(~ data.frame(.), .id = "b") %>%
  group_by(b) %>%
  mutate(i = row_number(), subset = params$subset, bootstrap = "compromise") %>%
  left_join(ix %>% filter(split %in% c("test", "dev")) %>% mutate(i = row_number()))

write_csv(ud_hats, str_c(params$derived_dir, "/", params$subset, "_compromise.csv"))
```
To run this notebook across all of the relevant `tnbc` settings, you can use the following in a shell, from the folder containing this `.Rmd` file. The first two `Rscript` calls loop over the CNN and VAE features, and the last is for the RCF.
```{sh, eval = FALSE}
for k in 32 64 128; do
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_cnn-k${k}', transform='log'))"
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_vae-k${k}'))"
done;

for k in 256 512 1024; do
  Rscript -e "rmarkdown::render('bootstrap.Rmd', params = list(subset='tnbc_rcf-k${k}'))"
done;
```
To visualize the outputs, you can use the `bootstrap-vis.Rmd` notebook in this folder.