In dpique/knudson: Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

Revisiting Knudson's 2 Hit Hypothesis: A Lesson in Data Reproducibility

Knudson's "Mutation and cancer: statistical study of retinoblastoma" (1971) forms the basis for the 2-hit hypothesis in cancer and has 8000+ citations in the biomedical literature.

The goal of this vignette is to reproduce Figure 1 from this paper using the data provided in Table 1, which is also provided within this package as the object k.

Loading the `knudson` library

knudson is a data package that contains a single object, k, which is a machine-readable form of Table 1 from Knudson, 1971.

We'll also load some other packages that we'll need for the rest of the analysis.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning=FALSE,
  error=FALSE, 
  message=FALSE
)

library(knudson)
library(tidyverse)
library(patchwork)
library(png)
library(knitr)
library(kableExtra)

Looking at Table 1

Let's load a screenshot of Table 1 taken from the original article. We'll also peek at a machine readable form of the Table 1, k.

t1_png <- png::readPNG("knudson_tab1_1971_PNAS.png")

t1_png_gg <- ggplot() +  
  annotation_custom(grid::rasterGrob(t1_png, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
t1_png_gg

# `k` is a machine-readable version of Table 1 
kable(knudson::k) %>%
  kable_styling(position="center") %>%
  scroll_box(width = "100%", height = "200px")

Comparing Figure 1 with Table 1

Let's load a screenshot of Figure 1 taken from Knudson, 1971. I've added the word "Original" to the top of the figure so that we know it's the figure directly from the manuscript.

knud_png <- png::readPNG("knudson_fig1_1971_PNAS.png")

fig1_orig <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_png, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)

fig1_orig

Of note, the x-axis in Figure 1 ends at 60 months. However, there are 2 patients in Table 1 who are diagnosed at or after age 60 months.

To make sure we have the right number of data points, let's count the number of data points represented in this figure and compare this with the expected number of data points.

I've annotated a screenshot of Figure 1 to count the number of points:

knud_png_annot <- png::readPNG("knudson_fig1_1971_PNAS_annot.png")

knud_fig1_annot <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_png_annot, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
knud_fig1_annot

There are fewer data points than expected based on the legend (20 unilateral and 17 bilateral cases compared with 25 and 23 expected, respectively).

Why are there fewer points in the plot than expected? The y-axis describes the "fraction of cases not yet diagnosed at the age indicated." Therefore, it is possible that the author collapsed cases occuring in the same month into a single data point.

Let's recreate figure 1 with this approach in mind.

Recreating Figure 1

Let's put the reproduced (labeled as "Reproduced in R") and original figures side by side so that we can compare the two.

k2 <- k %>%
  arrange(side, -age_at_dx) %>%
  group_by(side) %>%
  mutate(idx_sum = 1 / length(side), 
         prop_of_tot = cumsum(idx_sum) - idx_sum) %>%
  ungroup %>%
  mutate(side=factor(side, levels=c("unilat", "bilat"))) %>%
  group_by(side, age_at_dx) %>%
  ## collapse cases occuring in the same month into a single data point
  filter(prop_of_tot == min(prop_of_tot)) 

y_axis_seq <- c(0.01, seq(0.02, 0.1, 0.02), seq(0.2, 1, 0.2))
y_axis_seq_char = format(y_axis_seq, digits=2)

fig1 <- ggplot(k2, aes(x=age_at_dx, y = prop_of_tot, group=side, shape=side)) +
  scale_shape_manual(values=c(21,22), 
                     labels=c("Unilateral Cases (25)","Bilateral Cases (23)")) + 
  scale_y_log10(breaks = y_axis_seq, 
                limits = c(0.01, 1.04), 
                labels = as.character(gsub("^0\\.", "\\.", y_axis_seq_char)),
                expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(0,60, 10), 
                     limits = c(-0.1, 75), 
                     expand = c(0, 0)) +
  xlab("AGE IN MONTHS") + 
  ylab("FRACTION OF CASES NOT YET DIAGNOSED AT AGE INDICATED") +
  theme_classic() + 
  theme(aspect.ratio = 1.5, 
        legend.position = c(0.3,0.16), 
        legend.title=element_blank(),
        axis.title = element_text(size=8)) +
  ggtitle("Reproduced in R") + 
  geom_abline(intercept = 0, slope = -1/30, color="black", 
                 linetype="solid", size=0.5) + 
  stat_function(fun = function(t) -4 * 10^{-4} *t^2, linetype="longdash", color="black") +
  geom_point(fill="white")

k2_missing <- k2 %>% filter(age_at_dx >= 60 | {age_at_dx == 47 & side == "unilat"})

fig1_w_missing <- fig1 + 
  geom_point(data = k2_missing, aes(x=age_at_dx, y = prop_of_tot), pch=21, fill=NA, color = "red", size=4.5)


fig1_w_missing + fig1_orig + plot_layout(widths = c(1, 1.45))

Not bad. We can see that the figure is largely reproducible. However, there are a few differences that we'll highlight in the following sections.

Missing data

If we look closely, there are 3 data points (highlighted with red circles) in the reproduced figure that differ from the original Figure 1.

First, look along the base of the x-axis in the reproduced figure above. Two data points from patients age >=60 -- the oldest patients to have been diagnosed from each group -- are missing.
Note that in the original figure, the x-axis stops at 60, which can explain for the first two missing data points. Perhaps Knudson intentionally excluded these two data points whose y value equals 0, as plotting a 0 value on a log scale is difficult (because log(0) is undefined).
For some reason, in the reproduced plot, R plots the top half of these two data points, when really they should not be showing up here at 0.01 (their y value equals 0 on a linear scale, and is undefined on a log scale).
A third data point for unilateral retinoblastoma patient 08997, age 47 months, is also missing. However, the reason why this data point is missing is unclear.

Fitted curves

The black solid line in the reproduced plot above describes data from bilateral cases and could be reproduced exactly. This equation corresponds to the line $logS = -t/30$, as described in the figure 1 legend:

knud_fig1_lgnd <- png::readPNG("knudson_fig1_legend.png")

knud_fig1_lgnd_plt <- ggplot() +  
  annotation_custom(grid::rasterGrob(knud_fig1_lgnd, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                    -Inf, Inf, -Inf, Inf)
knud_fig1_lgnd_plt

However, the mathematical equation describing the unilateral cases did not match the curve. The curve $logS = -4 \times 10^{-4}t^2$ (black dotted line) is a much better fit than the curve $logS = -4 \times 10^{-5}t^2$ (red dotted line) described in figure 1 legend.

fig1 + 
  stat_function(fun = function(t) -4 * 10^{-5} *t^2, linetype="longdash", color="red") +
  annotate("text", x = 60, y = 0.9, label = "-4~x~10^{-5}~t^{2}", parse = TRUE, color="red") + 
  annotate('text', x = 45, y = 0.4, label = "-4~x~10^{-4}~t^{2}", parse = TRUE)

This suggests that the exponent from the equation reported in the figure 1 legend is off by a digit (should be $10^{-4}$ instead of $10^{-5}$).

Conclusions

Overall, it was satisfying to see that Figure 1 from Knusdon 1971 is highly reproducible from the data in Table 1.
However, there were a few key exceptions:
- There was 1 missing and unexplained data point in Figure 1.
- The fitted curve for the unilateral cases has an error in the exponent.
These inconsistencies do not change the conclusions of this paper.
Also, notice that this paper was 'communicated' to PNAS and did not undergo standard peer review. This could have opened the door for the minor unchecked mistakes we observed.
- Fortunately, PNAS discontinued communicated manuscripts in 2010

Please feel free to contact me at dpique1@gmail.com with any questions or comments!

dpique/knudson documentation built on March 17, 2020, 3:55 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dpique/knudson
Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

In dpique/knudson: Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

Revisiting Knudson's 2 Hit Hypothesis: A Lesson in Data Reproducibility

Loading the `knudson` library

Looking at Table 1

Comparing Figure 1 with Table 1

Recreating Figure 1

Missing data

Fitted curves

Conclusions

R Package Documentation

Browse R Packages

We want your feedback!

dpique/knudson Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

In dpique/knudson: Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

Revisiting Knudson's 2 Hit Hypothesis: A Lesson in Data Reproducibility

Loading the knudson library

Looking at Table 1

Comparing Figure 1 with Table 1

Recreating Figure 1

Missing data

Fitted curves

Conclusions

R Package Documentation

Browse R Packages

We want your feedback!

dpique/knudson
Data from Table 1 of Knudson, 1971 (Two-Hit Hypothesis of Cancer)

Loading the `knudson` library