knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
library(ggplot2)
library(dplyr)
Here we show a simple example of how a confounder (gender, in this example) can induce a spurious association between two variables.
![Models without and with confounder (gender).](../figs/confounder.png)
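We generate simulated data from two models. In model 1, cause follows a standard normal distribution (mean=0, stdev=1), and effect also follows a normal distribution, but with its mean a linear function of cause (i.e., effect depends on cause). In model 2, both cause and effect follow normal distributions whose means depend on gender, and neither depends on the other. More formally, assuming independent noise terms $\epsilon, \epsilon_1, \epsilon_2 \sim N(0,\sigma_{\epsilon})$:

$$
\begin{aligned}
\text{Model 1:}\quad & \text{cause} \sim N(0,1), & \text{effect} &= \alpha + \beta\,\text{cause} + \epsilon \\
\text{Model 2:}\quad & \text{cause} = \alpha + \beta\,I(\text{gender}=\text{male}) + \epsilon_1, & \text{effect} &= \alpha + \beta\,I(\text{gender}=\text{male}) + \epsilon_2
\end{aligned}
$$

In the simulation, $\sigma_{\epsilon}$ is 1 for model 1 and 0.5 for model 2. Hence, in model 2, cause and effect are marginally dependent, but conditionally independent given gender. Translated into R code: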
ssize <- 1000
alpha <- 0    # intercept
beta <- 1.0   # slope
stdev <- 1.0

## model 1
set.seed(123) # for reproducible results
model1 <- data.frame(cause=rnorm(ssize,mean=0,sd=stdev)) %>%
  dplyr::mutate(effect=alpha + beta*cause + rnorm(ssize,mean=0,sd=stdev))

## model 2
set.seed(345) # for reproducible results
model2 <- data.frame(gender=factor(sample(c("female","male"),size=ssize,replace=TRUE))) %>%
  dplyr::mutate(cause=rnorm(ssize,mean=alpha+ c(0,beta)[gender],sd=stdev/2)) %>%
  dplyr::mutate(effect=rnorm(ssize,mean=alpha+ c(0,beta)[gender],sd=stdev/2))
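The claim that cause and effect are marginally dependent but conditionally independent given gender can be checked directly with correlations. The following is a minimal sketch in base R, assuming the same model 2 parameters as above; the variable names (`shift`, `marginal_cor`, `within_cor`) are illustrative and not part of the original code. With these parameters the marginal correlation is 0.5 in expectation, while the within-gender correlations should be close to zero.

```r
# Re-create a model 2 style data set in base R (illustrative names).
set.seed(345)
n <- 1000
beta <- 1.0
gender <- factor(sample(c("female", "male"), size = n, replace = TRUE))

# Both variables share the same gender-dependent mean, but are drawn independently.
shift  <- ifelse(gender == "male", beta, 0)
cause  <- rnorm(n, mean = shift, sd = 0.5)
effect <- rnorm(n, mean = shift, sd = 0.5)

# Marginal correlation, ignoring gender.
marginal_cor <- cor(cause, effect)

# Correlation within each gender stratum.
within_cor <- sapply(split(data.frame(cause, effect), gender),
                     function(d) cor(d$cause, d$effect))

print(marginal_cor)
print(within_cor)
```

Stratifying by gender is the simplest way to condition on a categorical confounder; including gender as a regression covariate achieves the same adjustment within a linear model.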
We now fit a linear model on the model 1 data, where we regress effect on cause. Notice the clear positive slope relating cause to effect, and the correspondingly highly significant estimated $\beta_{\text{cause}}$ coefficient.
ggplot2::ggplot(model1,aes(x=cause,y=effect)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method="lm") +
  labs(title="Model1")

lm1 <- lm( effect ~ cause, data=model1 )
summary(lm1)
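Rather than reading the slope off the printed summary, the estimate, its confidence interval, and its p-value can be extracted programmatically and compared with the true slope used in the simulation. This is a small self-contained sketch in base R, assuming the model 1 parameters above (intercept 0, slope 1); the names `fit1`, `slope_est`, `slope_ci`, and `slope_p` are illustrative.

```r
# Re-create a model 1 style data set (effect depends linearly on cause).
set.seed(123)
n <- 1000
alpha <- 0
beta <- 1.0
cause  <- rnorm(n, mean = 0, sd = 1)
effect <- alpha + beta * cause + rnorm(n, mean = 0, sd = 1)

fit1 <- lm(effect ~ cause)

# Point estimate, 95% confidence interval, and p-value for the slope.
slope_est <- coef(fit1)["cause"]
slope_ci  <- confint(fit1, "cause", level = 0.95)
slope_p   <- summary(fit1)$coefficients["cause", "Pr(>|t|)"]

print(slope_est)
print(slope_ci)
print(slope_p)
```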
If we fit a similar linear model on model 2 data, we obtain similar results, with a significant $\beta_{\text{cause}}$ coefficient.
ggplot2::ggplot(model2,aes(x=cause,y=effect)) +
  ggplot2::geom_point(aes(x=cause,y=effect,col=gender)) +
  ggplot2::geom_smooth(method="lm") +
  labs(title="Model2")

lm2 <- lm( effect ~ cause, data=model2 )
summary(lm2)
However, if we now include gender as a covariate in the regression model, we see that the association between cause and effect disappears, as it is entirely explained by both variables' association with gender.
ggplot2::ggplot(model2,aes(x=cause,y=effect,col=gender)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(method="lm") +
  labs(title="Model2 (controlling for gender)")

lm3 <- lm( effect ~ cause + gender, data=model2 )
summary(lm3)
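To make the comparison explicit, the cause coefficient from the unadjusted and gender-adjusted fits can be placed side by side. The sketch below is self-contained base R under the same model 2 assumptions as above; `dat`, `unadjusted`, `adjusted`, and `coef_table` are illustrative names rather than objects from the original code. One would expect the unadjusted slope to be around 0.5, the adjusted slope to be near zero, and the gender coefficient to be near the simulated shift of 1.

```r
# Re-create a model 2 style data set (illustrative names).
set.seed(345)
n <- 1000
beta <- 1.0
gender <- factor(sample(c("female", "male"), size = n, replace = TRUE))
shift  <- ifelse(gender == "male", beta, 0)
dat <- data.frame(
  gender = gender,
  cause  = rnorm(n, mean = shift, sd = 0.5),
  effect = rnorm(n, mean = shift, sd = 0.5)
)

# Fit with and without the confounder as a covariate.
unadjusted <- lm(effect ~ cause, data = dat)
adjusted   <- lm(effect ~ cause + gender, data = dat)

# Side-by-side coefficients; gender is not in the unadjusted model, hence NA.
coef_table <- rbind(
  unadjusted = c(coef(unadjusted)["cause"], gendermale = NA),
  adjusted   = coef(adjusted)[c("cause", "gendermale")]
)
print(round(coef_table, 3))
```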