```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r}
knitr::opts_chunk$set(echo = TRUE, tidy = FALSE)
options(width = 80)

library(knitr)
library(rmarkdown)
library(rmcorr)
library(ggplot2)
library(dplyr)
library(patchwork)
library(esc)
library(psych)
```
In this vignette, we demonstrate the potential consequences of a statistical analysis that ignores dependencies in the data. Such overfitting may produce misestimated effect sizes with suppressed variability, leading to underestimated p-values. This phenomenon is also widely referred to as pseudoreplication [@hurlbert1984pseudoreplication; @lazic2010problem; @eisner2021pseudoreplication] or a "unit of analysis error" [@hurlbert2009ancient]. For an accessible introduction to this topic, see @reinhart2015statistics.
This is a simplified illustration of the consequences of overfitting/pseudoreplication; it is not comprehensive. Overfitting is not limited to simple regression/correlation, and there are multiple types of pseudoreplication, including data collected over space and time [@hurlbert1984pseudoreplication; @hurlbert2009ancient].
```{r, eval = FALSE}
install.packages("dplyr")
require(dplyr)

install.packages("esc")
require(esc)

install.packages("patchwork")
require(patchwork)
```
```{r}
N.Marusich2016 <- length(unique(marusich2016_exp2$Pair))
k.Marusich2016 <- length(unique(marusich2016_exp2$SourceReliablity))
total.obs.Marusich2016 <- length(marusich2016_exp2$Pair)
```
We illustrate the consequences of overfitting/pseudoreplication using our data from @marusich2016. This dataset has a sample size of N = `r N.Marusich2016` dyads (two participants working together), where each dyad has k = `r k.Marusich2016` repeated measures of two variables: a total of `r total.obs.Marusich2016` paired observations (N x k). The two measured variables were scores on the Mission Awareness Rating Scale (MARS) and target-capture time, both assessed in three separate experimental blocks.
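To make the repeated-measures structure concrete, the quick check below (an illustrative sketch, not part of the original analysis) counts the observations per dyad with dplyr; each dyad should contribute three rows, one per experimental block.

```{r}
# Quick structural check (illustrative sketch): each dyad should have
# three repeated observations, one per experimental block
marusich2016_exp2 %>%
  group_by(Pair) %>%
  summarize(n_blocks = n()) %>%
  head()
```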
Repeated measures correlation is a technique for assessing the within-subject (in this example, within-dyad) relationship between two variables; for example, does a dyad tend to demonstrate higher awareness scores in blocks where it also generates faster target-capture times? Sometimes, however, the relevant research question is about between-subject (dyad) associations; for example, do dyads who tend to perform better on one variable also tend to perform better on the other? To answer this second question, it may be tempting to treat the repeated measures data as independent. Below we demonstrate the potential consequences of this approach.
In the left figure, we fit a simple regression/correlation that improperly treats the `r total.obs.Marusich2016` total observations as independent units of analysis. In the right figure, we demonstrate one approach to analyzing the between-dyad relationship without overfitting: averaging the data by the unit of analysis (dyad). Note this approach removes all within-dyad information.
The x-axis depicts scores on the Mission Awareness Rating Scale (MARS), with higher values representing better situation awareness. The y-axis is target-capture times, the time in seconds to capture High Value Targets (HVT) during the task. Smaller values (faster times) represent better task performance. The band around each regression line is a 95\% confidence interval.
```{r}
overfit.plot <- ggplot(data = marusich2016_exp2,
                       aes(x = MARS, y = HVT_capture)) +
  geom_point(aes(colour = factor(Pair))) +
  geom_smooth(method = "lm", level = 0.95) +
  coord_cartesian(xlim = c(2, 4), ylim = c(0, 30)) +
  ylab("Capture Time (seconds)") +
  theme_minimal() +
  theme(legend.position = "none")

marusich2016_avg <- marusich2016_exp2 %>%
  group_by(Pair) %>%
  summarize(Mean_MARS = mean(MARS),
            Mean_HVT_capture = mean(HVT_capture))

average.plot <- ggplot(data = marusich2016_avg,
                       aes(x = Mean_MARS, y = Mean_HVT_capture)) +
  geom_smooth(fullrange = TRUE, method = "lm", level = 0.95) +
  coord_cartesian(xlim = c(2, 4), ylim = c(0, 30)) +
  geom_point(aes(colour = factor(Pair))) +
  xlab("Mean MARS") +
  ylab("Mean Capture Time (seconds)") +
  scale_colour_discrete(name = "Dyad") +
  theme_minimal()

overfit.cor <- cor.test(marusich2016_exp2$MARS,
                        marusich2016_exp2$HVT_capture)
average.cor <- cor.test(marusich2016_avg$Mean_MARS,
                        marusich2016_avg$Mean_HVT_capture)

df.s <- rbind(overfit.cor$parameter, average.cor$parameter)
r.s  <- rbind(round(rbind(overfit.cor$estimate, average.cor$estimate),
                    digits = 2))
CI.s <- formatC(rbind(overfit.cor$conf.int, average.cor$conf.int),
                digits = 2, format = 'f')
p.vals <- rbind(round(overfit.cor$p.value, digits = 3),
                prettyNum(average.cor$p.value, digits = 2,
                          drop0trailing = TRUE))

overfit.plot + average.plot
```
The inferential statistics for the above plots are:
Overfit (left): r(`r df.s[1]`) = `r r.s[1]`, 95\% CI [`r CI.s[1,]`], p = `r p.vals[1]` (Exact p-value = `r overfit.cor$p.value`)
Average (right): r(`r df.s[2]`) = `r r.s[2]`, 95\% CI [`r CI.s[2,]`], p = `r p.vals[2]` (Exact p-value = `r average.cor$p.value`)
For the overfit results, note the Pearson correlation has `r df.s[1]` degrees of freedom, which is excessive because it implies a sample size of N = `r df.s[1] + 2` dyads (N - 2 degrees of freedom for a correlation [@cohen2013applied]). That is, the actual sample size (N = `r df.s[2] + 2` dyads) is erroneously modeled as `r df.s[1] + 2` independent units by ignoring the three paired, repeated measures per dyad.
These two analyses produce markedly different results. The overfit model erroneously produces a seemingly precise, moderate negative correlation, with higher MARS values associated with lower target-capture times (better performance). In contrast, the averaged model indicates a weak negative correlation between dyads, with high uncertainty.
As a comparison, we show the results of conducting a repeated measures correlation analysis on the same dataset. As described above, this method addresses a different question: the nature of the within-dyad relationship, rather than the relationship between dyads. In this analysis, averaging is not required because the dependencies in the data are taken into account.
```{r}
marusich.rmc <- rmcorr(participant = Pair, measure1 = MARS,
                       measure2 = HVT_capture, data = marusich2016_exp2)

rmc.plot <- ggplot(data = marusich2016_exp2,
                   aes(x = MARS, y = HVT_capture,
                       group = factor(Pair), color = factor(Pair))) +
  geom_point(aes(colour = factor(Pair))) +
  geom_line(aes(y = marusich.rmc$model$fitted.values)) +
  theme_minimal() +
  scale_colour_discrete(name = "Dyad") +
  ylab("Capture Time (seconds)")

rmc.r  <- round(marusich.rmc$r, 2)
rmc.CI <- round(marusich.rmc$CI, 2)
rmc.p  <- formatC(marusich.rmc$p, format = "fg", digits = 2)
# prettyNum(average.cor$p.value, digits = 2, drop0trailing = TRUE)

rmc.plot
```
The inferential statistics for the rmcorr analysis are:
r~rmcorr~(`r marusich.rmc$df`) = `r rmc.r`, 95\% CI [`r rmc.CI`], p = `r rmc.p`.
The above analysis indicates a strong within-dyad, or relative, relationship for the two measures. At the within level of analysis, higher MARS values have a large to very large relative (within-dyad) association with better performance (lower target-capture times), and vice versa. However, dyad-to-dyad comparisons of MARS or performance values are only weakly predictive at the between level of analysis. In other words, using a specific MARS (or performance) value to predict the other is quite limited between dyads. Results at the between level of analysis may not necessarily generalize to the within level of analysis [@curran2011disaggregation; @fisher2018lack; @molenaar2009new].
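For convenience, the sketch below (not part of the original analysis) gathers the three point estimates computed above so the overfit, between-dyad (averaged), and within-dyad (rmcorr) correlations can be compared side by side.

```{r}
# Side-by-side comparison of the three point estimates (illustrative sketch)
data.frame(
  analysis = c("Overfit (all observations)",
               "Between dyads (averaged)",
               "Within dyads (rmcorr)"),
  r = round(c(overfit.cor$estimate, average.cor$estimate, marusich.rmc$r), 2)
)
```

To summarize, the overfit analysis illustrates three problems: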
1) Misestimated effect size: The point estimate for the overfit analysis is a medium negative correlation, whereas the point estimate for the averaged analysis is slightly smaller in magnitude than a small negative correlation.
2) Spurious precision: The confidence interval for the overfit analysis (visible in the plot and in the matching inferential statistics) is narrower than it should be, compared to the averaged data.
3) Suppressed p-value: Because overfitting produces a medium negative effect size and underestimates variance, the overfit example is highly statistically significant.
To compare the suppression of variability in the overfit model versus a correct model, we use the Fisher Z approximation for the standard deviation (sd):
$$
\frac{1}{\sqrt{N-3}}
$$
This formula is calculated using the respective sample sizes for each model: N = `r df.s[1] + 2` (overfit) and N = `r df.s[2] + 2` (correct).
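As a quick arithmetic check (an illustrative sketch using the objects computed earlier, not part of the original analysis), the approximation can be evaluated directly for these two sample sizes:

```{r}
# Direct evaluation of the Fisher Z sd approximation for each model
1 / sqrt(N.Marusich2016 - 3)          # correct model: N = number of dyads
1 / sqrt(total.obs.Marusich2016 - 3)  # overfit model: N = total observations
```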
```{r}
N.vals <- seq(5, 100, by = 1)
sd.Z <- 1/sqrt(N.vals - 3)
sd.overfit <- data.frame(cbind(N.vals, sd.Z))

sd.cor <- sd.Z[N.Marusich2016 - 4]           # Subtract 4 because N.vals starts at 5, not 1
sd.overf <- sd.Z[total.obs.Marusich2016 - 4]

# We could determine the inflated sample size for the overfit model
# using its degrees of freedom + 2
df.s + 2 == total.obs.Marusich2016

sd.overfit.plot <- ggplot(data = sd.overfit, aes(x = N.vals, y = sd.Z)) +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 0.75)) +
  geom_point(alpha = 0.25) +
  geom_segment(x = 28, y = sd.cor, xend = 84, yend = sd.cor, linetype = 2) +
  geom_segment(x = 84, y = sd.cor, xend = 84, yend = sd.overf,
               colour = "red", linetype = 2) +
  xlab("Analysis Sample Size") +
  ylab(expression(paste("Fisher ", italic("Z"), " standard deviation"))) +
  annotate(geom = "point", x = 28, y = sd.cor, colour = "black", size = 2) +
  annotate(geom = "point", x = 84, y = sd.overf, colour = "red", size = 2) +
  theme_minimal()

sd.overfit.plot
```
In the graph, the red dot indicates the overfit model and the black dot shows a correct model with averaged data. Note the magnitude of sd underestimation with pseudoreplication is depicted by the dashed vertical red line: the Fisher Z sd = `r round(sd.overf, digits = 2)` for the overfit model is about half the value of the correct model's Fisher Z sd = `r format(round(sd.cor, digits = 2), nsmall = 2)`.
The easiest way to detect overfitting is to compare the sample size (degrees of freedom) in the statistical results to the actual sample size, as shown above.
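A minimal sketch of this check (not part of the original analysis) compares the sample size implied by the reported degrees of freedom with the actual number of dyads:

```{r}
# Implied sample size from the overfit correlation's degrees of freedom (df + 2)
overfit.cor$parameter + 2

# Actual sample size (number of dyads)
N.Marusich2016
```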
However, if degrees of freedom for reported results are unavailable, it may still be possible to detect overfitting. For example, the actual sample size N and the reported (ideally exact) p-value can be used to calculate the corresponding expected effect size. The expected effect and the reported effect can then be compared.
We demonstrate this using the esc package [@esc2019] to determine the expected correlation (Fisher Z) for an actual sample size of N = `r N.Marusich2016` with a reported (exact) p-value = `r overfit.cor$p.value` from the overfit analysis. We then perform the Fisher Z-to-r transformation, using the psych package [@psych2024], to compare the expected correlation with the reported correlation.
```{r}
# Calculate the expected correlation using the reported p-value and
# the actual sample size
# esc_t assumes the p-value is two-tailed and uses a Fisher Z transformation
calc.z <- esc_t(p = as.numeric(p.vals[1]),
                totaln = N.Marusich2016,
                es.type = "r")$es

reported.r <- as.numeric(r.s[1])

# Overfit?
fisherz2r(calc.z) != abs(reported.r)
# Note calc.z will always be positive using esc_t() this way
# If reported.r is not a Fisher Z value, either it should be transformed or
# calc.z should be transformed (as done here with fisherz2r) so both are
# on the same scale

fisherz2r(calc.z)
reported.r
```
See the last code chunk in this R Markdown document for examples of detecting overfit correlations from results reported in published papers (https://osf.io/cnfjt) from a systematic review and meta-analysis [@bakdash2022SAmeta].
Detection of overfitting should be used with caution because mismatches between reported and expected values could be due to missing or excluded data in the reported statistics. Missing data can produce a lower sample size (based on the reported statistics) than the actual sample size, as opposed to the inflated sample size seen with true pseudoreplication.
Additionally, overfit models can be adjusted or penalized to account for dependencies in the data. Examples of such techniques include cluster-robust standard errors, regularization, and ensemble models.
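As an illustration of the first option, the sketch below shows how cluster-robust standard errors could be computed for a naive regression of capture time on MARS, clustering by dyad. This is a hedged example that assumes the sandwich and lmtest packages (not used elsewhere in this vignette); it is not part of the original analysis.

```{r, eval = FALSE}
# Illustrative sketch (assumes the sandwich and lmtest packages are installed):
# a naive regression fit to all observations, with standard errors adjusted
# for clustering by dyad
library(sandwich)
library(lmtest)

naive.fit <- lm(HVT_capture ~ MARS, data = marusich2016_exp2)

# Cluster-robust standard errors, clustering on Pair (dyad)
coeftest(naive.fit, vcov. = vcovCL, cluster = ~ Pair)
```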
Last, we contend there may be some circumstances where overfitting is acceptable (e.g., exploratory data analysis, including visualization; separate models for each participant with a large number of repeated trials, such as in psychophysics; or no meaningful difference between the overfit model with and without adjustment techniques).
The prevalence of overfitting/pseudoreplication in published papers suggests it is a frequent statistical error across disciplines. For results reported in a single issue of the journal Nature Neuroscience, 36\% of papers had one or more results with suspected overfitting and 12\% of papers had definitive evidence of pseudoreplication [@lazic2010problem]. In human factors, using only results meeting inclusion criteria for a systematic review, overfitting was found in 28\% of papers: 29 out of 103^[26 papers had all overfit results and 3 papers had partial results with overfitting. This calculation had 103 total papers = 74 (no overfitting) + 3 (partial overfitting) + 26 (all overfitting).] [@bakdash2022SAmeta].