This script generates the figures appearing in Section 3.1, comparing the three bootstrap approaches on a permuted low-rank model. First, we load necessary libraries.

set.seed(1234)
library(expm)
library(irlba)
library(tidyverse)
library(LFBCR)
theme_set(min_theme())
attach(params)

The block below simulates one realization from the mechanism in equations 4 - 7. $X$ is a random low rank matrix observed with noise, and $y$ is a linear combination of the latent coordinates from each row of $X$. In a more realistic application, each row of $X$ would be an image, text, or audio sample, for example, and we would use a feature extractor to convert it into a matrix. In this simplified setup, though, we simple assume $X$ is a matrix.

U <- r_ortho(N, D)
V <- r_ortho(D, D)
Sigma <- c(rep(100, 2), rep(0.0, D - K))
X <- U %*% diag(Sigma) %*% t(V) + rmat(N, D, 0.1)
beta <- c(c(1, -1), rep(0, D - K))
y <- U %*% diag(sqrt(Sigma)) %*% beta + rnorm(N, 0, .1)

Here is the strategy that extracts the features once and then applies the parametric bootstrap of Section 2.2. The function $f$ simulates the feature extraction step in equation 8 -- each call fo boot_fun() samples the truncated SVD coordinates with noise and randomly permutes the columns. The align_to_list function applies a Procrustes rotation to the dimensionality-reduced learned features, and align_with_truth applies a second rotation to align coordinates with the known coordinates from the previous block. This last step is just so that we can overlay the truth on the estimated coordinates, it would not be possible in a real application.

f <- features(X)
boot_fun <- param_boot(f(X), K)
ud_combined <- list()
ud_combined[["parametric"]] <- rerun(B, boot_fun()$ub) %>%
  align_to_list() %>%
  align_with_truth(U, Sigma) %>%
  left_join(data.frame(i = seq_along(y), y = y))

The block below instead applies the nonparametric bootstrap of Section 2.1. It computes $B$ feature extractors by calling f within the rerun loop (contrast with th eblock above, which ran f once). The Procrustes analysis steps are the same as above.

f <- features(X)
Zb <- rerun(B, {
  full_features <- f(X)
  svf <- irlba(full_features, nv=K) 
  svf$u %*% diag(svf$d)
})
ud_combined[["nonparametric"]] <- align_to_list(Zb)  %>%
  align_with_truth(U, Sigma) %>%
  left_join(data.frame(i = seq_along(y), y = y))

The step below gives the compromise bootstrap of Section 2.3. It offers a of compromise between repeated feature learning in the nonparametric bootstrap and the single run from the parametric bootstrap. The code is also a mix of the two approaches -- the first rerun step trains 100 feature extractors (not 1000 like before) and the second increases the total number of bootstraps to 1000 by simulating from the parametric model.

f <- features(X)
Zb <- rerun(100, f(X))
boot_fun <- param_boot_cmp(Zb)
ud_combined[["compromise"]] <- rerun(B, boot_fun()$ub) %>%
  align_to_list() %>%
  align_with_truth(U, Sigma) %>%
  left_join(data.frame(i = seq_along(y), y = y)) # adds a new list element with the aligned truth

Finally, we can compute the coverage of the confidence areas from each approach. We are using b == max(b) to extract the true coordinates, since in the ud_combined objects, the last element of each list is the aligned truth. It seems that the compromise approach is conservative, the parametric approach is potentially conservative, and the nonparametric approach is nearly exactly the expected coverage.

map(ud_combined, ~ {
  coverage(filter(., b < max(b)), filter(., b == max(b))) %>%
    mean()
})

The block below generates Figure 2. The plot_overlay_truth function is in R/vis.R in this compendium. We are showing only the first 100 samples, since showing all 1000 clusters the display.

bind_rows(ud_combined, .id = "bootstrap") %>% 
  mutate(bootstrap = factor(bootstrap, levels = c("parametric", "nonparametric", "compromise"))) %>%
  plot_overlay_truth(100)


krisrs1128/MSLF documentation built on March 21, 2023, 10:21 a.m.