Knudson's "Mutation and cancer: statistical study of retinoblastoma" (1971) forms the basis for the 2-hit hypothesis in cancer and has 8000+ citations in the biomedical literature.
The goal of this vignette is to reproduce Figure 1 from this paper using the data provided in Table 1, which is also provided within this package as the object k
.
knudson
libraryknudson
is a data package that contains a single object, k
, which is a machine-readable form of Table 1 from Knudson, 1971.
We'll also load some other packages that we'll need for the rest of the analysis.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning=FALSE, error=FALSE, message=FALSE )
library(knudson) library(tidyverse) library(patchwork) library(png) library(knitr) library(kableExtra)
Let's load a screenshot of Table 1 taken from the original article. We'll also peek at a machine readable form of the Table 1, k
.
t1_png <- png::readPNG("knudson_tab1_1971_PNAS.png") t1_png_gg <- ggplot() + annotation_custom(grid::rasterGrob(t1_png, width = unit(1,"npc"), height = unit(1,"npc")), -Inf, Inf, -Inf, Inf) t1_png_gg
# `k` is a machine-readable version of Table 1 kable(knudson::k) %>% kable_styling(position="center") %>% scroll_box(width = "100%", height = "200px")
Let's load a screenshot of Figure 1 taken from Knudson, 1971. I've added the word "Original" to the top of the figure so that we know it's the figure directly from the manuscript.
knud_png <- png::readPNG("knudson_fig1_1971_PNAS.png") fig1_orig <- ggplot() + annotation_custom(grid::rasterGrob(knud_png, width = unit(1,"npc"), height = unit(1,"npc")), -Inf, Inf, -Inf, Inf) fig1_orig
Of note, the x-axis in Figure 1 ends at 60 months. However, there are 2 patients in Table 1 who are diagnosed at or after age 60 months.
To make sure we have the right number of data points, let's count the number of data points represented in this figure and compare this with the expected number of data points.
I've annotated a screenshot of Figure 1 to count the number of points:
knud_png_annot <- png::readPNG("knudson_fig1_1971_PNAS_annot.png") knud_fig1_annot <- ggplot() + annotation_custom(grid::rasterGrob(knud_png_annot, width = unit(1,"npc"), height = unit(1,"npc")), -Inf, Inf, -Inf, Inf) knud_fig1_annot
There are fewer data points than expected based on the legend (20 unilateral and 17 bilateral cases compared with 25 and 23 expected, respectively).
Why are there fewer points in the plot than expected? The y-axis describes the "fraction of cases not yet diagnosed at the age indicated." Therefore, it is possible that the author collapsed cases occuring in the same month into a single data point.
Let's recreate figure 1 with this approach in mind.
Let's put the reproduced (labeled as "Reproduced in R") and original figures side by side so that we can compare the two.
k2 <- k %>% arrange(side, -age_at_dx) %>% group_by(side) %>% mutate(idx_sum = 1 / length(side), prop_of_tot = cumsum(idx_sum) - idx_sum) %>% ungroup %>% mutate(side=factor(side, levels=c("unilat", "bilat"))) %>% group_by(side, age_at_dx) %>% ## collapse cases occuring in the same month into a single data point filter(prop_of_tot == min(prop_of_tot)) y_axis_seq <- c(0.01, seq(0.02, 0.1, 0.02), seq(0.2, 1, 0.2)) y_axis_seq_char = format(y_axis_seq, digits=2) fig1 <- ggplot(k2, aes(x=age_at_dx, y = prop_of_tot, group=side, shape=side)) + scale_shape_manual(values=c(21,22), labels=c("Unilateral Cases (25)","Bilateral Cases (23)")) + scale_y_log10(breaks = y_axis_seq, limits = c(0.01, 1.04), labels = as.character(gsub("^0\\.", "\\.", y_axis_seq_char)), expand = c(0, 0)) + scale_x_continuous(breaks = seq(0,60, 10), limits = c(-0.1, 75), expand = c(0, 0)) + xlab("AGE IN MONTHS") + ylab("FRACTION OF CASES NOT YET DIAGNOSED AT AGE INDICATED") + theme_classic() + theme(aspect.ratio = 1.5, legend.position = c(0.3,0.16), legend.title=element_blank(), axis.title = element_text(size=8)) + ggtitle("Reproduced in R") + geom_abline(intercept = 0, slope = -1/30, color="black", linetype="solid", size=0.5) + stat_function(fun = function(t) -4 * 10^{-4} *t^2, linetype="longdash", color="black") + geom_point(fill="white") k2_missing <- k2 %>% filter(age_at_dx >= 60 | {age_at_dx == 47 & side == "unilat"}) fig1_w_missing <- fig1 + geom_point(data = k2_missing, aes(x=age_at_dx, y = prop_of_tot), pch=21, fill=NA, color = "red", size=4.5) fig1_w_missing + fig1_orig + plot_layout(widths = c(1, 1.45))
Not bad. We can see that the figure is largely reproducible. However, there are a few differences that we'll highlight in the following sections.
If we look closely, there are 3 data points (highlighted with red circles) in the reproduced figure that differ from the original Figure 1.
First, look along the base of the x-axis in the reproduced figure above. Two data points from patients age >=60 -- the oldest patients to have been diagnosed from each group -- are missing.
Note that in the original figure, the x-axis stops at 60, which can explain for the first two missing data points. Perhaps Knudson intentionally excluded these two data points whose y value equals 0, as plotting a 0 value on a log scale is difficult (because log(0) is undefined).
For some reason, in the reproduced plot, R plots the top half of these two data points, when really they should not be showing up here at 0.01 (their y value equals 0 on a linear scale, and is undefined on a log scale).
A third data point for unilateral retinoblastoma patient 08997, age 47 months, is also missing. However, the reason why this data point is missing is unclear.
The black solid line in the reproduced plot above describes data from bilateral cases and could be reproduced exactly. This equation corresponds to the line $logS = -t/30$, as described in the figure 1 legend:
knud_fig1_lgnd <- png::readPNG("knudson_fig1_legend.png") knud_fig1_lgnd_plt <- ggplot() + annotation_custom(grid::rasterGrob(knud_fig1_lgnd, width = unit(1,"npc"), height = unit(1,"npc")), -Inf, Inf, -Inf, Inf) knud_fig1_lgnd_plt
However, the mathematical equation describing the unilateral cases did not match the curve. The curve $logS = -4 \times 10^{-4}t^2$ (black dotted line) is a much better fit than the curve $logS = -4 \times 10^{-5}t^2$ (red dotted line) described in figure 1 legend.
fig1 + stat_function(fun = function(t) -4 * 10^{-5} *t^2, linetype="longdash", color="red") + annotate("text", x = 60, y = 0.9, label = "-4~x~10^{-5}~t^{2}", parse = TRUE, color="red") + annotate('text', x = 45, y = 0.4, label = "-4~x~10^{-4}~t^{2}", parse = TRUE)
This suggests that the exponent from the equation reported in the figure 1 legend is off by a digit (should be $10^{-4}$ instead of $10^{-5}$).
Overall, it was satisfying to see that Figure 1 from Knusdon 1971 is highly reproducible from the data in Table 1.
However, there were a few key exceptions:
There was 1 missing and unexplained data point in Figure 1.
The fitted curve for the unilateral cases has an error in the exponent.
These inconsistencies do not change the conclusions of this paper.
Also, notice that this paper was 'communicated' to PNAS and did not undergo standard peer review. This could have opened the door for the minor unchecked mistakes we observed.
Please feel free to contact me at dpique1@gmail.com with any questions or comments!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.