A difficult part of doing cohort analyses is controlling for potential confounders.
{width=80%}
What is a confounder? It is a variable that could influence both the exposure and the outcome. You may have learned about confounding in other courses; understanding it is essential to making valid inferences.
You control for confounding by including confounders in your models. Adequately adjusting for confounding requires careful consideration. Do the best you can, but know that you'll never adjust for everything.
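As a minimal sketch of what adjusting for a confounder looks like in code (simulated data and hypothetical variable names, not the course's dataset), compare an unadjusted and an adjusted logistic regression:
# Simulated toy data: the outcome depends on sex only, not on height
set.seed(1)
sex <- rbinom(200, 1, 0.5)                        # confounder
height <- 165 + 10 * sex + rnorm(200, sd = 6)     # exposure, influenced by sex
cancer <- rbinom(200, 1, plogis(-3 + 1.2 * sex))  # outcome, influenced by sex
# Unadjusted: height picks up a spurious association through sex
coef(glm(cancer ~ height, family = binomial))
# Adjusted: including the confounder attenuates that spurious association
coef(glm(cancer ~ height + sex, family = binomial))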
Confounder adjustment is part of any thorough analysis and is covered by the STROBE statement on best practices. STROBE, or Strengthening the Reporting of Observational Studies in Epidemiology, is a reporting standard for cohort research and should be adhered to.
We should use several approaches to identify confounders, as each has strengths and weaknesses. Use biological and domain knowledge along with formal methods like directed acyclic graphs, called DAGs, and information criteria.
{width=80%}
DAGs make hypothetical causal links explicit and provide a powerful approach to finding confounders.
Let's break the name down. Directed indicates directionality: cause and effect. Acyclic means that a pathway doesn't loop backward: the effect can't also cause the cause. Lastly, a graph is a visual representation of links between objects.
An example: Height with colon cancer
library(dagitty)  # for building and querying DAGs
confounders <- dagitty("dag { Height -> ColonCancer }")
plot(graphLayout(confounders))
Here's an example of using DAGs. Let's say we want to determine if height associates with colon cancer.
We can use the dagitty package to help find confounders. We give it a character string of a DAG specification. The string begins with the name dag, then curly brackets, followed by hypothetical variable names and links. Write links with a minus and a greater-than sign, ->.
We plot it using the graphLayout() function.
An example: Height with colon cancer
But...
confounders <- dagitty("dag { Height -> ColonCancer Sex -> {Height ColonCancer} }")
plot(graphLayout(confounders))
But we know there are confounders: men tend to be taller, and men are more likely to get colon cancer.
We then add these links into dagitty. Since sex is linked with both height and cancer, we include both by wrapping them in curly brackets.
Plotting it shows the links we've added.
An example: Height with colon cancer
But...
confounders <- dagitty("dag { Height -> ColonCancer Sex -> {Height ColonCancer} }")
adjustmentSets(confounders, exposure = "Height", outcome = "ColonCancer")
#> { Sex }
dagitty is used mainly to help with causal reasoning and to decide what to include in our models. The adjustmentSets() function tells us which variables, at a minimum, we should adjust for. In the function we need to set the exposure and outcome arguments.
This is a simple example, but it says we should at least adjust for sex.
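As a hypothetical extension (the Age variable is assumed here for illustration and is not part of the course example), adding a second confounder to the DAG grows the minimal adjustment set:
confounders_extended <- dagitty("dag { Height -> ColonCancer Sex -> {Height ColonCancer} Age -> {Height ColonCancer} }")
adjustmentSets(confounders_extended, exposure = "Height", outcome = "ColonCancer")
#> { Age, Sex }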
Another method uses information criteria, which identify adjustment variables by comparing the fit of multiple models. The Akaike information criterion, or AIC, is commonly used for ranking. A smaller AIC is better.
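As a quick illustration of AIC ranking outside the course data (using R's built-in mtcars purely as a stand-in), AIC() compares candidate models directly:
model_small <- glm(am ~ wt, data = mtcars, family = binomial)
model_large <- glm(am ~ wt + hp, data = mtcars, family = binomial)
AIC(model_small, model_large)  # the model with the smaller AIC is preferred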
library(lme4)  # glmer() comes from lme4
full_model <- glmer(
  got_cvd ~ body_mass_index_scaled + total_cholesterol_scaled + participant_age +
    currently_smokes + education_combined + sex + (1 | subject_id),
  data = tidied_framingham, family = binomial, na.action = "na.fail"
)
library(MuMIn)
# Models with every combination of predictors
model_selection <- dredge(full_model, rank = "AIC", subset = "total_cholesterol_scaled")
The MuMIn package has model selection functions like dredge(). dredge() needs a model fitted with all the variables we think may bias the results, and the na.action argument must be set to "na.fail".
We give the model object to dredge to run models with all possible variable combinations. We'll use AIC in the rank argument and specify with the subset argument that we want models with at least cholesterol included.
A quick warning: computation time can become very long if you're not careful.
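One possible way to rein in the run time (an assumption about your needs, not part of the course code) is to cap the number of terms per model with dredge()'s m.lim argument:
# Only consider models with at most four fixed-effect terms
model_selection_capped <- dredge(full_model, rank = "AIC", subset = "total_cholesterol_scaled", m.lim = c(0, 4))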
# Top 3 models
as.data.frame(model_selection) %>% head(3)
   (Intercept) body_mass_index_scaled currently_smokes
57   -1.218899              0.7494514             <NA>
59   -1.522049              0.7652833                +
49   -1.258198              0.7356253             <NA>
   education_combined participant_age sex total_cholesterol_scaled
57               <NA>       0.5155339   +                0.2742008
59               <NA>       0.5858787   +                0.2709116
49               <NA>              NA   +                0.2748649
   df    logLik      AIC     delta     weight
57  6 -203.0653 418.1307 0.0000000 0.36560377
59  7 -202.2876 418.5751 0.4444555 0.29275100
49  5 -205.4571 420.9143 2.7836167 0.09089835
To show the top three models, we first convert the dredge object to a data frame and use head() with three.
dredge() outputs several columns, with each row representing a model. A missing value in a variable's column means that variable was not included in that model. The delta and weight columns indicate which models are relatively better: delta is the difference in AIC from the top-ranked model, and weight is the relative likelihood that the model is the best one in the candidate set.
The top model adjusts for body mass, age, and sex.
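As a possible next step (a sketch, not shown in the course), you can extract and refit the top-ranked model with MuMIn's get.models():
# subset = 1 picks the row with the lowest AIC from the dredge table
top_model <- get.models(model_selection, subset = 1)[[1]]
summary(top_model)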
Alright, let's identify confounders!