class: fullscreen, inverse, top, center, text-white background-image: url("../inst/images/many-black-chair-backs.jpg")
.font150[Multiplicity]
```r
knitr::opts_chunk$set(fig.width = 4.25, fig.height = 3.5, fig.retina = 3,
                      message = FALSE, warning = FALSE, cache = TRUE,
                      autodep = TRUE, hiline = TRUE)
knitr::opts_hooks$set(fig.callout = function(options) {
  if (options$fig.callout) {
    options$echo <- FALSE
    options$out.height <- "99%"
    options$fig.width <- 16
    options$fig.height <- 8
  }
  options
})
hook_source <- knitr::knit_hooks$get('source')
knitr::knit_hooks$set(source = function(x, options) {
  if (!is.null(options$hiline) && options$hiline) {
    x <- stringr::str_replace(x, "^ ?(.+)\\s?#<<", "*\\1")
  }
  hook_source(x, options)
})
options(htmltools.dir.version = FALSE, width = 90)
as_table <- function(...) knitr::kable(..., format = 'html', digits = 3)
library(tidyverse)
```
Traditional vs. OMICS hypothesis testing
Where does multiplicity arise?
What is the family-wise error rate (FWER)?
What is the false discovery rate (FDR)?
layout: false
Experiment:
- 250 subjects chosen "randomly".
- Diet for 1 week.
- Repeated measurement (data in kg):
  - Weight at the start of the week.
  - Weight at the end of the week.
The average weight loss is $0.13$ kg.
A paired t-test for weight loss gives a
t-statistic of $t = 0.126/0.068 = 1.84$,
corresponding to a p-value of $0.067$ (two-sided test).
Not significant at the $5\%$ level!
.img-right[
]
.footnote[Example from Andersen (1990) ]
.left-code[
```r
2*(1 - pt(1.84, df = 250 - 1))
# Asymptotic test
2*(1 - pnorm(1.84, 0, 1)) #<<
# one-sided test
(1 - pt(1.84, df = 250 - 1)) #<<
```
]
.right-plot[
]
.footnote[
Asymptotic test: does not "help" (and is biased); the smaller the sample size, the larger the bias.
]
layout: false
You use a one-tailed test to improve the test's ability to detect whether the new diet is better.
However, by testing in only one direction you risk missing valuable information: the test cannot detect whether the diet leads to weight gain.
Can anything be done to get a significant result
out of this study?
--
.img-right[
]
--
What is the problem with this approach?
--
Requests for subgroup analyses are common in clinical studies.
exclude: true
Interim Analysis
Multiple Regression
.footnote[Example: 50 control samples, 50 treatment samples, no significant result. Split the data into female and male, and young and old groups: four tests instead of one, and maybe one of them turns out significant.]
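A quick back-of-the-envelope illustration (not in the original slides) of why this is problematic: even if all four null hypotheses are true and the tests are independent, the chance of at least one nominally significant result is already about 19%.

```r
# probability of at least one p-value < 0.05 among four independent tests
# when every null hypothesis is true
1 - (1 - 0.05)^4   # approx. 0.185
```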
A type I error (false positive) occurs when
the null hypothesis (H0) is true but is rejected.
The type I error rate, or significance level $\alpha$,
is the probability of rejecting the
null hypothesis given that it is true.
A type II error (false negative) occurs when
the null hypothesis is false
but erroneously fails to be rejected.
The type II error rate is denoted by the Greek letter $\beta$
and is related to the power of a test (which equals $1-\beta$).
For a given test, the only way to reduce both error rates
is to increase the sample size, and this may not be feasible.
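As a small illustration (a sketch with made-up numbers, not from the original slides), R's power.t.test() shows this trade-off: at a fixed significance level, the type II error rate only drops when the sample size (or the effect size) increases.

```r
# power = 1 - beta for a two-sample t-test at alpha = 0.05,
# assuming a hypothetical effect of 0.5 standard deviations
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power  # ~0.34
power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05)$power  # ~0.88
```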
.img-right[
]
.left-code[
```r
# improved statistic
set.seed(2); N = 20; N_obs = 4; mu = 0; sigma = 1
bb <- function(y){
  x <- rnorm(N_obs, mu, sigma)
  # compute mean and sd
  data.frame(mean = mean(x), sd = sd(x))
}
res <- purrr::map_df(1:N, bb)
T0 <- (mu - res$mean) / (sigma/sqrt(N_obs)) #<<
plot(T0, rep(0.4, 20), type = "h", xlim = c(-3, 3), ylim = c(0, 0.4))
x <- seq(-10, 10, 0.01)
lines(x, dnorm(x, sd = 1), #<<
      col = 2)
abline(v = c(qnorm(0.025), qnorm(0.975)), col = "orange", lwd = 2)
```

$T|H_0 \sim N(0,1)$
]
.right-plot[
]
In statistics, the family-wise error rate (FWER) is the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests.
If multiple hypotheses are tested, the chance of a rare event increases, and therefore the likelihood of incorrectly rejecting a null hypothesis (making a type I error) increases.
The Bonferroni correction compensates for that increase by testing each individual hypothesis at a significance level of $\epsilon = \alpha/k$, where $\alpha$ is the desired overall size of the test and $k$ is the number of hypotheses.
$$k=20;~ \alpha = 0.05;~~~~~~~\epsilon = 0.05/20 = 0.0025$$
Bonferroni adjustments are typically very conservative (they ignore the dependence structure of the tests, which are frequently correlated), and more sophisticated methods are usually used (e.g. multcomp).
p.adjust transforms the p-values (makes them larger) instead of transforming the threshold.

.footnote[Wikipedia; Medical Statistics, Sheffield University 2018]
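A minimal sketch with made-up p-values showing both views of the Bonferroni correction: adjusting the p-values with p.adjust, or equivalently shrinking the threshold to $\alpha/k$.

```r
# five hypothetical p-values
p <- c(0.001, 0.012, 0.030, 0.200, 0.600)
p.adjust(p, method = "bonferroni")   # each p-value multiplied by k = 5, capped at 1
# equivalent decision rule: compare the raw p-values to alpha / k
p < 0.05 / length(p)
```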
Family-wise error rate (FWER): control the probability of at least one type I error.
$$
\begin{align}
\alpha = Pr(\textrm{at least one type I error}\mid H_0) &= 1 - Pr(\textrm{no rejections}\mid H_0)\\
&= 1 - \prod^k Pr(p_i > \epsilon)\\
&= 1 - \prod^k (1-\epsilon)\\
&= 1 - (1-\epsilon)^k
\end{align}
$$

Solving for $\epsilon$ gives the approximation $\epsilon \approx \alpha/k$ or, exactly, $\epsilon = 1 - \exp\left(\tfrac{1}{k}\log(1-\alpha)\right)$.
.footnote[http://genomicsclass.github.io/book/pages/multiple_testing.html]
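A quick numerical check of the two thresholds for the $k = 20$, $\alpha = 0.05$ example above (a sketch, assuming independent tests as in the derivation):

```r
k <- 20; alpha <- 0.05
exact  <- 1 - exp(log(1 - alpha) / k)   # 1 - (1 - alpha)^(1/k)
approx <- alpha / k                     # Bonferroni
c(exact = exact, approx = approx)       # 0.00256 vs 0.00250
1 - (1 - approx)^k                      # FWER achieved with the approximate threshold, ~0.049
```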
exclude: true
multcomp
.left-code[
Model with NO interactions.

```r
#help(swiss) #<<
lmod <- lm(Fertility ~ ., #<<
           data = swiss)
lmtable <- broom::tidy(lmod)
K <- diag(length(coef(lmod)))[-1, ] #<<
rownames(K) <- names(coef(lmod))[-1] #<<
adjusted <- broom::tidy(summary(
  multcomp::glht(lmod, linfct = K))) #<<
comp <- inner_join(lmtable[-1, -c(2, 3, 4)],
                   adjusted[-c(2, 3, 4, 5)],
                   by = c("term" = "contrast"))
colnames(comp)[c(2, 3)] <- c("p.value", "p.adjusted")
```
]
.pull-right[
```r
knitr::kable(comp, format = "html",
             caption = "compare p.value and adjusted p.value")
```
]
.footnote[multcomp: Simultaneous Inference in General Parametric Models by Torsten Hothorn.
The mathematics and theory are complex; asymptotic properties are used to make it tractable.]
Controlling the FWER will demand an unrealistically small p-value.
Use multcomp to correct p-values.
exclude: true
.footnote[Simulated p-values. Figure A) 1000 p-values where H0 is true; B) 600 p-values where H0 is true and 400 where HA is true; C) close-up of B.]
.left-code[
```r
m <- 1000
simulate.p.values <- function(i, delta = 2, fraction = 0.1){
  control <- rnorm(6, 0, 1)
  treatment <- rnorm(6, 0, 1)
  if (runif(1) < fraction) treatment <- treatment + delta
  return(t.test(treatment, control)$p.value)
}
pvals00 <- sapply(1:m, simulate.p.values, delta = 0, fraction = 0)
pvals24 <- sapply(1:m, simulate.p.values, delta = 2, fraction = 0.4)
```
```r
par(mfrow = c(1, 3))
hist(pvals00, breaks = 20, ylim = c(0, 300), main = "A")
abline(h = 1000/20, col = 3)
abline(v = 0.05, col = 2)
text(x = c(0.02, 0.02, 0.2, 0.2), y = c(20, 100, 20, 100),
     labels = c("FP", "TP", "TN", "FN"))
hist(pvals24, breaks = 20, ylim = c(0, 300), main = "B")
abline(h = (1000 - 400)/20, col = 3)
abline(v = 0.05, col = 2)
text(x = c(0.02, 0.02, 0.2, 0.2), y = c(20, 100, 20, 100),
     labels = c("FP", "TP", "TN", "FN"))
hist(pvals24, breaks = 20, ylim = c(0, 300), xlim = c(0, 0.2), main = "C")
abline(h = (1000 - 400)/20, col = 3)
abline(v = 0.05, col = 2)
text(x = c(0.02, 0.02, 0.1, 0.1), y = c(20, 100, 20, 100),
     labels = c("FP", "TP", "TN", "FN"))
```
]
.right-plot[
]
In Figures B and C, some p-values below the significance threshold come from tests where H0 is true (FP) and some from tests where HA is true (TP).
FDR-controlling procedures are designed to control the expected proportion of "discoveries" (rejected null hypotheses) that are false (incorrect rejections).
$$FDR = \frac{FP}{FP + TP}$$
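To make the definition concrete, here is a minimal sketch (a variation of the simulation above, not the original code) that also records which hypotheses are truly null, so the empirical FDR at the 0.05 cutoff can be computed directly:

```r
set.seed(1)
m <- 1000
# 40% of the m tests have a true effect of delta = 2, as in panel B
sim <- replicate(m, {
  ha <- runif(1) < 0.4
  control   <- rnorm(6, 0, 1)
  treatment <- rnorm(6, ha * 2, 1)
  c(p = t.test(treatment, control)$p.value, ha = ha)
})
reject <- sim["p", ] < 0.05
FP <- sum(reject & sim["ha", ] == 0)
TP <- sum(reject & sim["ha", ] == 1)
FP / (FP + TP)   # empirical FDR at the 0.05 cutoff
```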
layout: false
.pull-left[
- TP true positives (H0 rejected if HA true)
- FP false positives (H0 rejected if H0 true)
- FN false negatives (H0 accepted if HA true)
- TN true negatives (H0 accepted if H0 true)
$$FDR = \frac{FP}{FP + TP}$$ ]
.right-plot[
```r
hist(pvals24, breaks = 20, ylim = c(0, 300), xlim = c(0, 0.2), main = "D")
abline(h = (1000 - 400)/20, col = 3)
abline(v = 0.05, col = 2)
text(x = c(0.02, 0.02, 0.1, 0.1), y = c(20, 100, 20, 100),
     labels = c("FP", "TP", "TN", "FN"))
```
]
.footnote[John D.Storey 2002; Storey and Tibshirani 2003; Prummer 2012]
Definition of the FDR as given in the Benjamini and Hochberg (1995) paper.
```r
table <- data.frame(
  c("Reject H0", "Accept H0", "Total"),
  matrix(c("V (FP)", "S (TP)", "R",
           "U (TN)", "T (FN)", "m-R",
           "m_0",    "m-m_0",  "m"),
         ncol = 3, byrow = TRUE))
colnames(table) <- c("R/C", "H0 TRUE", "HA", "Total")
knitr::kable(table, format = "html")
```
The FDR is the proportion of false discoveries among the discoveries (rejections of the null hypothesis):
$Q = V/R = V/(V+S)$, where $Q = 0$ if $R = 0$.
$FDR = Q_e = E[Q]$ (expected value of $Q$).
For any given FDR level $\alpha$,
the Benjamini-Hochberg (1995) procedure is very practical
because it only requires the p-values of the individual tests: sort them in increasing order $p_{(1)} \le \dots \le p_{(m)}$ and reject the hypotheses with the $k$ smallest p-values, where $k$ is the largest $i$ for which $p_{(i)} \le \frac{i}{m}\alpha$.
.footnote[Highlighted code illustrates the Benjamini-Hochberg procedure (top line) and how you would compute FDR-adjusted p-values in R using p.adjust.]
.left-code[
```r
alpha <- 0.05
i <- seq_along(pvals24)
k <- max(which(sort(pvals24) < i/m*alpha)) #<<
padj <- p.adjust(pvals24, method = "BH") #<<
# both approaches give the same number of discoveries
stopifnot(k == sum(padj < 0.05)) #<<
par(mfrow = c(2, 2))
hist(pvals24, breaks = 20)
hist(padj, breaks = 20)
plot(i, sort(pvals24))
abline(0, alpha/m, col = 2)
plot(i[1:k], sort(pvals24)[1:k], type = "b", main = "Close-up")
abline(0, alpha/m, col = 2)
```
]
.right-plot[
]
exclude: true
.right-code[
```r
p.adjust.BH <- function(p, n = length(p)) {
  nm <- names(p)
  p <- as.numeric(p)
  p0 <- setNames(p, nm)
  if (all(nna <- !is.na(p))) nna <- TRUE
  p <- p[nna]
  lp <- length(p)
  stopifnot(n >= lp)
  if (n <= 1) return(p0)
  i <- lp:1L
  o <- order(p, decreasing = TRUE)
  ro <- order(o)
  p0[nna] <- pmin(1, cummin(n/i * p[o]))[ro]
  return(p0)
}
```
]
.footnote[Do not use this; run p.adjust instead.]
Although we will end up with more false positives, controlling the FDR (rather than the FWER) gives us much more power. This makes it particularly appropriate for discovery-phase experiments, where we may accept FDR levels much higher than 0.05.
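For example, with the BH-adjusted p-values padj computed above, relaxing the FDR threshold yields more discoveries (the exact counts depend on the simulated p-values):

```r
sum(padj < 0.05)   # discoveries at FDR 5%
sum(padj < 0.25)   # discoveries at a more liberal FDR threshold
```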
The FDR has a direct interpretation in a multiple-testing setting, while an unadjusted p-value in a high-throughput setting has NO such interpretation.
The BH procedure is valid when the $m$ tests are independent, and also under various forms of dependence, but it is not universally valid (e.g. for gene sets).
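As a side note (not covered in the slides above), p.adjust also implements the Benjamini-Yekutieli method ("BY"), which controls the FDR under arbitrary dependence at the price of being more conservative:

```r
# compare the number of discoveries under the two procedures (pvals24 from above)
sum(p.adjust(pvals24, method = "BH") < 0.05)
sum(p.adjust(pvals24, method = "BY") < 0.05)   # fewer (or equal), since BY is more conservative
```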
.footnote[@JG: do not report p-values for OMICS data; use a higher FDR threshold instead.]
layout: false
In practice we can observe various shapes of p-value distributions.
How to interpret a p-value histogram
This blog post discusses what types of p-value
distributions you might encounter when analysing
data and how to treat them.
.img-right[
]
If you do subgroup analysis, use
exploratory or descriptive data analysis
(see the sketch after this list):
- tabulating (e.g. Venn diagrams)
- dimensionality reduction (e.g. PCA)
- clustering of samples and proteins
(e.g. time-series clustering)
- GSEA or ORA analysis to contrast subgroups.
- Do not over-interpret your findings.
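A purely illustrative sketch of two of the tools in this list (PCA and hierarchical clustering), using the built-in swiss data already used in the multcomp example; the median split of Fertility is an ad-hoc stand-in for a subgroup:

```r
# PCA of the predictors, coloured by a median split of the response
pc  <- prcomp(swiss[, -1], scale. = TRUE)
grp <- ifelse(swiss$Fertility > median(swiss$Fertility), "high", "low")
plot(pc$x[, 1:2], col = factor(grp), pch = 19)
# hierarchical clustering of the samples
plot(hclust(dist(scale(swiss[, -1]))))
```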
.img-small[
]
.footnote[https://xkcd.com/882/]