In this module, we show how testing for multiple hypotheses (genes) can increase the chance of false positives, especially for small sample sizes.

knitr::opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE)
library(ComplexHeatmap)
library(ComplexHeatmap)
Nrow <- 10000
Ncol <- 200

Sample Size's effect on heatmap visualization

Here we show the scenario presented in class (slide "Gene markers selection: better than chance?"). In the examples below, we show heatmaps corresponding to random noise, and we show that, if enough hypotheses are tested (in this case, r Nrow), and the sample size is sufficiently small (e.g., n=6), we can easily identify 'genes' whose expression pattern seems to be strongly associated with the phenotype (in this case, a random head/tail), as suggested by the heatmap with a clear blue-to-red pattern. As the sample size increases (e.g., n=r Ncol), it is more difficult to be 'fooled', as the corresponding heatmap shows a less clear blue-to-red pattern.

Data generation

We start by generating a large [r Nrowxr Ncol] matrix filled with random values drawn from a Gaussian distribution with mean=0 and stdev=0.5.

set.seed(123) # for reproducible results
DAT <- matrix(rnorm(Ncol*Nrow,mean=0,sd=0.5),nrow=Nrow,ncol=Ncol)
hist(DAT)

We then pick a small subset of columns from this matrix and randomly assign them a binary (head-tail) phenotype. We then pick the top 25 markers associated to head and tail, and plot the corresponding heatmaps.

Heatmaps ~ sample size {.tabset}

heatmap_wrapper <- function(DAT, Ncol, ndraw) {
  ## randomly select Ncol columns from the full matrix
  DATi <- DAT[, colDraw <- sample(Ncol, size = ndraw)]
  ## generate a (head/tail) phenotype of proper size
  pheno <- factor(rep(c("head", "tail"), each = ndraw / 2))
  ## perform t.test on each data row with respect to the random phenotype
  DIFi <- t(apply(DATi, 1, tscore, x = pheno))

  ## pick top 25 markers in each direction
  topMarkers <- c(
    order(DIFi[, 1], decreasing = FALSE)[1:25],
    order(DIFi[, 1], decreasing = TRUE)[1:25]
  )
  ## visualize the corresponding heatmap of 50 markers
  annot_col <- ComplexHeatmap::HeatmapAnnotation(
    toss = pheno,
    col = list(toss = c("head" = "green", "tail" = "orange"))
  )
  print(ComplexHeatmap::Heatmap(DATi[topMarkers, ],
    name = paste("sample size =", ndraw),
    col = circlize::colorRamp2(c(-1, 0, 1), c("#072448", "white", "#ff6150")),
    top_annotation = annot_col,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    row_title = "",
    show_column_names = FALSE,
    show_row_names = FALSE
  ))
  ## show the top markers (by p-value)
  print(head(cbind(DIFi, FDR = p.adjust(DIFi[, 2], method = "BH"))[topMarkers, ]))
}
## creating a black-to-red palette for heatmap display
ramp.br <- grDevices::colorRamp(c( "blue","white","red"))
palette.blue2red <- rgb( ramp.br(seq(0, 1, length = 14)), max = 255)

## wrapper function extracting t-statistic and p-value from a call to function t.test
tscore <- function(y,x) { 
  tmp <- t.test(y~x)
  c(score=tmp$statistic,pval=tmp$p.value)
}

Sample size = 6

heatmap_wrapper( DAT=DAT, Ncol=Ncol, ndraw = 6)

Sample size = 14

heatmap_wrapper( DAT=DAT, Ncol=Ncol, ndraw = 14)

Sample size = 30

heatmap_wrapper( DAT=DAT, Ncol=Ncol, ndraw = 30)

Sample size = 100

heatmap_wrapper( DAT=DAT, Ncol=Ncol, ndraw = 100)

Sample size = 200

heatmap_wrapper( DAT=DAT, Ncol=Ncol, ndraw = 200)

{-}



montilab/BS831 documentation built on April 17, 2024, 4:51 p.m.