In dipterix/focr: A False Overlapped-Cluster Rate Control ('FOCR') Framework

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In continuum data, we often want to find the support of the functions. This problem can be solved by simultaneously testing $H_{0}(s): f(s)=0$ and find $s$ such that null hypotheses are rejected. In many conventional multiple testing procedures, each $s$ is treated separately. However, in functional settings, $f(s)$ is affected by $s'$ nearby. It is more reasonable to take $H_{0}(s'), \forall s'$ near $s$ into account.

The FOCR framework provides a flexible framework allowing aggregating blocks/clusters of $s$. The clusters of different $s$ can be overlapped. Our first goal is to find connected regions that contain significant points. Then we perform post-selection inferences on the blocks and further select hypotheses that controls the FDR. This two-stage framework tends to be more powerful than applying FDR control methods directly to marginal unconditioned p-values. It also prevents inflated FDR for methods such as LAWS[3] and SABHA[4] under dependence.

This package focr provides a two-stage multiple testing control procedure:

In the first stage, clusters of hypotheses are rejected as regions of interests (ROI), with FOCR controlled at desired level; (see function focr_initial for documentation)
In the second stage, conditional p-values are calculated in a post-selection inference fashion. FDR control procedures are applied to these p-values to further select individual hypotheses. (See function focr for documentation)

In the rest of this vignette, we refer to FOCR framework or FOCR-XXX as this two-stage procedure where XXX is the name of post-selection procedure, and FOCR or FOCR error as the type-I error measure for overlapped blocks.

A basic example

We generate a functional data as follows. In the figure, the red solid line is the underlying step function. The blue line is the data sample mean, and ribbon around is the 95% point-wise confidence interval. Our goal is to find the support of the function.

library(focr)
set.seed(400)
generator <- simulation_data_1D(n_points = 1000, mu_type = 'step')
data <- generator$gen_data(snr = 0.5)
plot(generator, data = data, snr = 0.5)

We apply the FOCR-BH approach to control the FDR at level 0.05. FOCR-BH uses the Benjamini-Hochberg (BH) procedure as the FDR control method in stage-II.

res <- focr(data = data, block_size = 21, 
            alpha = 0.05, fdr_method = 'BH', verbose = TRUE)

Let's plot $s$ where the null hypotheses $\mu(s) = 0$ are rejected. The orange line segments indicate the initial rejections that control the FOCR (cluster-level ROI). The blue line segments are the FOCR-BH final rejections that control the FDR (individual level).

plot(generator$mu, type = 'l', col = 'red', ylim = c(-1,1))
abseg(res$rej_hypotheses, y = -0.3, col = 'orange3', lwd = 2)
abseg(res$post_selection$rejs, y = -0.5, col = 'blue', lwd = 2)

To calculate the false discovery proportion and statistical power of this run,

# power & FDP
fdp(res$post_selection$rejs, generator$support)
pwr(res$post_selection$rejs, generator$support)

2D, 3D spatial data

Let's generate a 2D sample data ($150\times 158$). The red colored pixels are non-zeros, and white pixels are true null pixels. Our goal is to extract non-zero pixels (red).

library(focr)
set.seed(400)
generator <- simulation_data_2D(n_obs = 100, rho = 0.8)
data <- generator$gen_data(snr = 0.4)
mu <- generator$mu
pal <- colorRampPalette(c('white', 'red'))(64)
pal2 <- colorRampPalette(c('red', 'white', 'white'))(64)
pal3 <- colorRampPalette(c("#FF8181", 'white', "#FF8181", 'red'))(64)
par(mfrow = c(1,2))
image(mu, zlim = c(0,1), col = pal, main = "Underlying image")
image(array(colMeans(data), dim(mu)), col = pal3, main = "Sample mean")

For each pixel $s$, we test $H_{0}(s): f(s)=0$ versus $H_{1}(s): f(s)\neq 0$. For FOCR-based methods, we need to provide block size (diameter in pixels + 1). A proper block size can be determined from cor(data[,1:50])[1,]. The correlation between the first pixel and the second pixel is 0.3, and the correlation between the first pixel and the third pixel is 0.009 (small). Therefore, only adjacent pixels are correlated, hence the block radius is 1, and block_size=1x2+1=3.

# Run FOCR-BH
dim(data)
block_size <- 15
bandwidth <- 3.5
res_LAWS <- focr(data = data, block_size = block_size, 
            alpha = 0.05, fdr_method = 'LAWS', bandwidth = bandwidth,
            dimension = dim(mu), verbose = TRUE)
res_SABHA <- focr(data = data, block_size = block_size, 
            alpha = 0.05, fdr_method = 'SABHA', bandwidth = bandwidth,
            dimension = dim(mu), verbose = TRUE)
res_BH <- focr(data = data, block_size = block_size,
               alpha = 0.05, fdr_method = 'BH', 
               dimension = dim(mu), verbose = TRUE)

Notice data is flattened as a matrix of dimension $100 \times 23700$ (100 observations and $150\times 158$ columns)

The following figure shows the FOCR conditional p-values and FOCR-LAWS final rejections side-by-side. As comparisons, we also show the results of LAWS[3] and BH[1] procedure on unconditional p-values. Under dependence, the FOCR-LAWS avoids inflated FDR compared to original LAWS method. Our approach also tends to select less isolated pixels outside of the cluster. It is also more powerful than the BH procedure.

par(mfrow = c(2, 4), mar = c(2.1, 4.1, 4.1, 2.1))
image_dimension <- dim(mu)
plot_rejections <- function(rej, method){
  img <- array(NA, image_dimension)
  focr_bh <- rej$rejs
  img[focr_bh] <- 1
  image(img, zlim = c(0,1), col = pal, asp = 1, axes = FALSE,
        main = sprintf('%s rejections\nFDP: %.1f%%, Power: %.1f%%',
                       method, fdp(focr_bh, generator$support)*100,
                       pwr(focr_bh, generator$support)*100))
}

# Plot conditional p-values
cond_pvals <- res_LAWS$cond_pvals
dim(cond_pvals) <- dim(mu)
image(cond_pvals, zlim = c(0,1), col = pal2, 
      main = 'FOCR conditional p-values', asp = 1, axes = FALSE)

# Plot final rejections
plot_rejections(res_LAWS$post_selection, res_LAWS$method)
plot_rejections(res_SABHA$post_selection, res_SABHA$method)
plot_rejections(res_BH$post_selection, res_BH$method)

# Compare with methods on unconditional p-values
pvals <- res_BH$uncond_pvals
dim(pvals) <- image_dimension
image(pvals, zlim = c(0,1), col = pal2, 
      main = 'Unconditional p-values', asp = 1, axes = FALSE)


# Under dependence, LAWS must use large bandwidth
# and small purity value (initial filter) to avoid 
# inflated FDR, therefore we cannot use small ones 
# as FOCR-LAWS
rej_LAWS <- LAWS(pvals, alpha = 0.05, bandwidth = bandwidth, 
                 dimension = 'two', verbose = TRUE)
rej_SABHA <- SABHA(pvals, alpha = 0.05, bandwidth = bandwidth, 
                   dimension = 'two', verbose = TRUE)
rej_BH <- BH(pvals, alpha = 0.05)

plot_rejections(rej_LAWS, 'LAWS')
plot_rejections(rej_SABHA, 'SABHA')
plot_rejections(rej_BH, 'BH')

# bib <- RefManageR::ReadBib(system.file('REFERENCES.bib', package = 'focr'))
# RefManageR::BibOptions(check.entries = FALSE, style = "markdown", bib.style = "alphabetic", cite.style = 'authoryear', hyperlink = "to.doc", super = TRUE)

# * `BH` `r RefManageR::Citep(bib, 'benjamini1995controlling')`: controls the FDR under independence or `PRDS`
# * `BY` `r RefManageR::Citep(bib, 'benjamini2001control')`: controls the FDR under arbitrary dependence
# * `SABHA` `r RefManageR::Citep(bib, 'li2019multiple')`: controls the FDR under independence (or asymptotically under weak dependence)
# * `LAWS` `r RefManageR::Citep(bib, 'cai2021laws')`: controls the FDR under independence (or asymptotically under weak dependence)

Choices of post-selection FDR control procedures

The built-in post-selection FDR procedures are:

BH (Benjamini and Hochberg, 1995): controls the FDR under independence or PRDS
BY (Benjamini and Yekutieli, 2001): controls the FDR under arbitrary dependence
SABHA (Li and Barber, 2019): controls the FDR under independence (or asymptotically under weak dependence)
LAWS (Cai, Sun, and Xia, 2021): controls the FDR under independence (or asymptotically under weak dependence)

The post-selection FDR method can be specified via fdr_method parameter. For example,

# Change `fdr_method` to BH, BY, SABHA, or LAWS
res <- focr(data = data, block_size = 3, 
            alpha = 0.05, fdr_method = 'BY')

Alternatively, fdr_method can be a custom function as follows:

#' @param pvals conditional p-values
#' @param alpha FDR level to control
#' @param ... other parameters needed
my_method <- function(pvals, alpha, ...){
  (your code here)
}
res <- focr(data = data, block_size = 3, 
            alpha = 0.05, fdr_method = my_method)

The returned values by my_method will be accessible as res$post_selection.

Advanced usage

The function focr() runs the post-selection FDR control procedure right after the stage-I. Although convenient, it might not be optimal in some scenarios. Under the following scenarios, it is highly recommended to use the focr_initial function:

cluster block is not sliding window and need to be specified manually
data correlation is given

The definition of focr_initial function is:

focr_initial <- function(
  data, data_corr, blocks, nblocks = ncol(data), mu = 0, alpha = 0.05, 
  verbose = FALSE, side = c('two', 'left', 'right'), ...)

The data_corr allows prespecified correlation matrix. blocks can be a list of column indices or a function that returns column indices. When blocks is a list, nblocks is ignored. When blocks is a function, nblocks defines the total number of blocks.

library(focr)
set.seed(10)
generator <- simulation_data_1D(n_points = 1000, mu_type = 'step', cov_type = 'iid')
data <- generator$gen_data(snr = 0.5)
dim(data)

# Run stage-I with customized blocks
res <- focr_initial(data, blocks = function(col){
  if(col %in% 1:100){ return(1:100) }
  if(col %in% 101:400){ return(101:600) }
  if(col %in% 401:800){ return(401:800) }
  if(col %in% 801:1000){ return(601:1000) }
})

There are 4 blocks in this case:

Block 1: column 1-100
Block 2: column 101-600 (kernel 101-400, overlapped with block 3)
Block 3: column 401-800 (kernel 401-800, overlapped with block 2 and 4)
Block 4: column 601-1000 (kernel 801-1000, overlapped with block 3)

par(mfrow=c(1,2))

plot(generator$mu, col = 'red', type = 'n', ylab = '', 
     ylim = c(-1,0), yaxt = 'n')
text(x = 50, y = -0.3, "Block 1", col = 'blue')
abseg(1:100, y = -0.4, col = 'blue')
text(x = 350, y = -0.5, "Block 2", col = 'green')
abseg(101:600, y = -0.6, col = 'green')
text(x = 600, y = -0.6, "Block 3", col = 'orange')
abseg(401:800, y = -0.8, col = 'orange')
text(x = 800, y = -0.9, "Block 4", col = 'purple')
abseg(601:1000, y = -1, col = 'purple')


plot(res$cond_pvals, type='l', ylab = '', 
     main = bquote('Conditional p-value & '~mu(s)))
lines(generator$mu, col = 'red')

The conditional p-values are calculated and stored at res$cond_pvals. It can be used for further inference (such as FDR control). For example:

rej <- LAWS(res$cond_pvals, bandwidth = 20, dimension = 'one')
plot(generator$mu, col = 'red', ylab = '', 
     ylim = c(-0.2,1), yaxt = 'n', type = 'l',
     main = "Undelying mean (red) and rejected hypotheses (blue)")
abseg(rej$rejs, y = -0.2, col = 'blue')
text(x = 750, y = -0.1, cex = 0.8, sprintf(
  "FDP: %.1f%%; Power: %.1f%%",
  fdp(rej$rejs, generator$support) * 100,
  pwr(rej$rejs, generator$support) * 100
))

# RefManageR::PrintBibliography(bib, .opts = list(check.entries = FALSE))

[1] Y. Benjamini and Y. Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: Journal of the Royal statistical society: series B (Methodological) 57.1 (1995), pp. 289-300.

[2] Y. Benjamini and D. Yekutieli. “The control of the false discovery rate in multiple testing under dependency”. In: Annals of statistics (2001), pp. 1165-1188.

[3] T. T. Cai, W. Sun, and Y. Xia. “Laws: A locally adaptive weighting and screening approach to spatial multiple testing”. In: Journal of the American Statistical Association (2021), pp. 1-14.

[4] A. Li and R. F. Barber. “Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81.1 (2019), pp. 45-74.

dipterix/focr documentation built on Dec. 20, 2021, 12:03 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com