Revisiting Knudson's 2 Hit Hypothesis: A Lesson in Data Reproducibility

Knudson's "Mutation and cancer: statistical study of retinoblastoma" (1971) forms the basis for the 2-hit hypothesis in cancer and has 8000+ citations in the biomedical literature.

The goal of this vignette is to reproduce Figure 1 from this paper using the data provided in Table 1, which is also provided within this package as the object k.

Loading the knudson library

knudson is a data package that contains a single object, k, which is a machine-readable form of Table 1 from Knudson, 1971.

We'll also load some other packages that we'll need for the rest of the analysis.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning=FALSE,
  error=FALSE, 
  message=FALSE
)
library(knudson)
library(tidyverse)
library(patchwork)
library(png)
library(knitr)
library(kableExtra)

Looking at Table 1

Let's load a screenshot of Table 1 taken from the original article. We'll also peek at a machine readable form of the Table 1, k.

t1_png <- png::readPNG("knudson_tab1_1971_PNAS.png")

t1_png_gg <- ggplot() +  
  annotation_custom(grid::rasterGrob(t1_png, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
t1_png_gg
# `k` is a machine-readable version of Table 1 
kable(knudson::k) %>%
  kable_styling(position="center") %>%
  scroll_box(width = "100%", height = "200px")

Comparing Figure 1 with Table 1

Let's load a screenshot of Figure 1 taken from Knudson, 1971. I've added the word "Original" to the top of the figure so that we know it's the figure directly from the manuscript.

knud_png <- png::readPNG("knudson_fig1_1971_PNAS.png")

fig1_orig <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_png, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)

fig1_orig

Of note, the x-axis in Figure 1 ends at 60 months. However, there are 2 patients in Table 1 who are diagnosed at or after age 60 months.

To make sure we have the right number of data points, let's count the number of data points represented in this figure and compare this with the expected number of data points.

I've annotated a screenshot of Figure 1 to count the number of points:

knud_png_annot <- png::readPNG("knudson_fig1_1971_PNAS_annot.png")

knud_fig1_annot <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_png_annot, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
knud_fig1_annot

There are fewer data points than expected based on the legend (20 unilateral and 17 bilateral cases compared with 25 and 23 expected, respectively).

Why are there fewer points in the plot than expected? The y-axis describes the "fraction of cases not yet diagnosed at the age indicated." Therefore, it is possible that the author collapsed cases occuring in the same month into a single data point.

Let's recreate figure 1 with this approach in mind.

Recreating Figure 1

Let's put the reproduced (labeled as "Reproduced in R") and original figures side by side so that we can compare the two.

k2 <- k %>%
  arrange(side, -age_at_dx) %>%
  group_by(side) %>%
  mutate(idx_sum = 1 / length(side), 
         prop_of_tot = cumsum(idx_sum) - idx_sum) %>%
  ungroup %>%
  mutate(side=factor(side, levels=c("unilat", "bilat"))) %>%
  group_by(side, age_at_dx) %>%
  ## collapse cases occuring in the same month into a single data point
  filter(prop_of_tot == min(prop_of_tot)) 

y_axis_seq <- c(0.01, seq(0.02, 0.1, 0.02), seq(0.2, 1, 0.2))
y_axis_seq_char = format(y_axis_seq, digits=2)

fig1 <- ggplot(k2, aes(x=age_at_dx, y = prop_of_tot, group=side, shape=side)) +
  scale_shape_manual(values=c(21,22), 
                     labels=c("Unilateral Cases (25)","Bilateral Cases (23)")) + 
  scale_y_log10(breaks = y_axis_seq, 
                limits = c(0.01, 1.04), 
                labels = as.character(gsub("^0\\.", "\\.", y_axis_seq_char)),
                expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(0,60, 10), 
                     limits = c(-0.1, 75), 
                     expand = c(0, 0)) +
  xlab("AGE IN MONTHS") + 
  ylab("FRACTION OF CASES NOT YET DIAGNOSED AT AGE INDICATED") +
  theme_classic() + 
  theme(aspect.ratio = 1.5, 
        legend.position = c(0.3,0.16), 
        legend.title=element_blank(),
        axis.title = element_text(size=8)) +
  ggtitle("Reproduced in R") + 
  geom_abline(intercept = 0, slope = -1/30, color="black", 
                 linetype="solid", size=0.5) + 
  stat_function(fun = function(t) -4 * 10^{-4} *t^2, linetype="longdash", color="black") +
  geom_point(fill="white")

k2_missing <- k2 %>% filter(age_at_dx >= 60 | {age_at_dx == 47 & side == "unilat"})

fig1_w_missing <- fig1 + 
  geom_point(data = k2_missing, aes(x=age_at_dx, y = prop_of_tot), pch=21, fill=NA, color = "red", size=4.5)


fig1_w_missing + fig1_orig + plot_layout(widths = c(1, 1.45))

Not bad. We can see that the figure is largely reproducible. However, there are a few differences that we'll highlight in the following sections.

Missing data

If we look closely, there are 3 data points (highlighted with red circles) in the reproduced figure that differ from the original Figure 1.

Fitted curves

The black solid line in the reproduced plot above describes data from bilateral cases and could be reproduced exactly. This equation corresponds to the line $logS = -t/30$, as described in the figure 1 legend:

knud_fig1_lgnd <- png::readPNG("knudson_fig1_legend.png")

knud_fig1_lgnd_plt <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_fig1_lgnd, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
knud_fig1_lgnd_plt

However, the mathematical equation describing the unilateral cases did not match the curve. The curve $logS = -4 \times 10^{-4}t^2$ (black dotted line) is a much better fit than the curve $logS = -4 \times 10^{-5}t^2$ (red dotted line) described in figure 1 legend.

fig1 + 
  stat_function(fun = function(t) -4 * 10^{-5} *t^2, linetype="longdash", color="red") +
  annotate("text", x = 60, y = 0.9, label = "-4~x~10^{-5}~t^{2}", parse = TRUE, color="red") + 
  annotate('text', x = 45, y = 0.4, label = "-4~x~10^{-4}~t^{2}", parse = TRUE)

This suggests that the exponent from the equation reported in the figure 1 legend is off by a digit (should be $10^{-4}$ instead of $10^{-5}$).

Conclusions

Please feel free to contact me at dpique1@gmail.com with any questions or comments!



dpique/knudson documentation built on March 17, 2020, 3:55 p.m.