layout: true


# Here you can place global options for the entire document.
# Mostly used for knitr settings, but can also load data etc.
# Whatever happens here will not show in the presentation.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(lme4)

Linear mixed models - why?

.pull-left[ In statistics we are usually taught only to do linear regressions, like t-tests and ANOVA. These are good options when you have a single observation per entity, where an entity can for instance be a person or a location. If you have repeated observations from the same entity, or there is some hierarchical structure to your data, i.e. your dependent observations correlate in some way, a linear mixed model might suit your needs better. ]

.pull-right[ Mixed models have many names, like our favourite pets:

When searching for information on mixed models, try all of these terms; together they will give you good coverage of the subject. The terms are not completely synonymous with each other, but searching for them will help you identify which model applies to your data. ]


Linear mixed models - why?

.pull-left[ Linear models

.pull-right[ Linear mixed models


class: dark, center, middle

Get started

Inspecting the data


Getting started

Let's dig into it. We will be using the gapminder_sub data for this, as it is already nicely cleaned and comes in long format, which is very well suited for LMEs.

gapminder_sub <- gapminder::gapminder 

gapminder_sub

Getting started

Hierarchical, long data

.pull-left[ Long-format data is where an entity has as many rows of data as it has observations. As you can see in the top part of the data, each country has one row of data per year of measurement. gapminder_sub is a great dataset for this workshop because it has a nice hierarchy: we do not only have countries, the countries are also nested within continents. This is information we should use as we build our models.

As always, we should get to know our data a little before we start: a quick structural check follows below, and then let's make some plots. ]

.pull-right[

knitr::include_graphics("https://neilpatel-qvjnwj7eutn3.netdna-ssl.com/wp-content/uploads/2014/08/2-site-breakdown.png")

]
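A quick way to confirm this structure, a minimal sketch using dplyr's count(): each country should have one row per measurement year, and each country belongs to a single continent.

library(dplyr)
# One row per country: n is the number of measurement years for that country
gapminder_sub %>% 
  count(continent, country) %>% 
  head()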


Inspecting the data

.pull-left[

library(tidyverse)
gapminder_sub %>% 
  ggplot(aes(x=lifeExp, 
             colour=continent)) + 
  geom_density()

Checking the distribution of the data to understand what it looks like across natural groups in the data.

]

.pull-right[


]


Inspecting the data

.pull-left[

gapminder_sub %>% 
  filter(year == min(year)) %>% 
  arrange(lifeExp) %>% 
  mutate(
    country = fct_inorder(unique(
      as.character(country)
    ))
  ) %>%

  ggplot(aes(x=country, y=lifeExp, 
             fill=continent, 
             colour=continent)) +
  geom_col() +
  coord_flip() 

Checking the distribution of the data to understand what it looks like across natural groups in the data. ]

.pull-right[


]


Inspecting the data

.pull-left[

gapminder_sub %>% 
  ggplot(aes(x=year, y=lifeExp)) + 
  geom_jitter(alpha=.2) +
  geom_smooth(method = "lm", colour="black") 

]

.pull-right[


]


Inspecting the data

.pull-left[

gapminder_sub %>% 
  ggplot(aes(x=year, y=lifeExp)) + 
  geom_jitter(alpha=.2) +
  geom_line(show.legend = FALSE, #<<
            aes(group = country, #<< 
                colour = country)) + #<<
  geom_smooth(method = "lm", colour="black") 

]

.pull-right[


]


class: dark, middle, center

Let's do some modelling!

Exploring standard linear models


Some R-syntax information - formula

.pull-left[ To run models in R, we use something called a formula. This is basically an unquoted expression of your model specification.

If you want to predict life expectancy by the year of measurement, the corresponding formula would be

lifeExp ~ year

What is on the left side of the tilde (~) is your dependent variable, and on the right you place your predictors. ] .pull-right[ Main effects are added using +. Here we predict life expectancy with year and country as main effects.

lifeExp ~ year + country

Interactions are specified with :. Here we predict life expectancy only by the interaction of year and country.

lifeExp ~ year:country

A 'full factorial' model, with all main effects and interactions, can be written as:

lifeExp ~ year + country + year:country

or with the asterisk (*) shorthand:

lifeExp ~ year * country ]
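To convince yourself that the asterisk is just shorthand, here is a quick sketch of a check (using continent instead of country purely to keep the output small; m_long and m_short are names introduced for this example):

m_long  <- lm(lifeExp ~ year + continent + year:continent, data = gapminder_sub)
m_short <- lm(lifeExp ~ year * continent, data = gapminder_sub)
# Both formulas expand to the same design, so the coefficients are identical
all.equal(coef(m_long), coef(m_short))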


Linear models - running a simple model

.pull-left[

library(broom)
lifeEx_year <- lm(lifeExp ~ year, 
                  data=gapminder_sub)

tidy(lifeEx_year) %>% 
  knitr::kable(format="html", #<<
               digits=3) #<<

From here on, highlighted code is code you can ignore; it is just there to make the slides prettier. ]

.pull-right[


]

--

knitr::include_graphics("https://assets.rbl.ms/14068114/980x.jpg")

Linear models - plot the regression

.pull-left[

lifeEx_year_fit <- augment(lifeEx_year)
beta <- paste0("beta = ", 
               round(
                 lifeEx_year$coefficients[2],2
               ))

lifeEx_year_fit %>% 
  ggplot(aes(x=year, y=.fitted)) + 
  geom_line() + 
  geom_ribbon(alpha=.5,
              aes(ymin=.fitted-.se.fit, 
                  ymax=.fitted+.se.fit)) +
  geom_label(x=1960, y=60,
             label=beta)

]

.pull-right[


]


Linear models - checking assumption violations

.pull-left[

plot(lifeEx_year, which = 1) # residuals vs fitted values

] .pull-right[

plot(lifeEx_year, which = 2) # normal QQ plot of the residuals

]


Linear models - checking assumption violations

.pull-left[

gapminder_sub %>% 
  ggplot(aes(x=country, y=lifeExp)) + 
  geom_boxplot() +
  coord_flip()

By making boxplots for each country we can see that the data vary substantially across countries. We should somehow incorporate this information into our models. The previous diagnostic plots clearly indicate autocorrelation in our data, which makes sense: each observation for a country is correlated with that country's previous observation (a sketch quantifying this follows below). ]

.pull-right[


]
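One rough way to put a number on this serial dependence, a sketch using dplyr (lag1_cor is a name introduced here): within each country, compute the correlation between life expectancy and its previous measurement.

gapminder_sub %>% 
  group_by(country) %>% 
  arrange(year, .by_group = TRUE) %>% 
  # correlation between each observation and the previous one, per country
  summarise(lag1_cor = cor(lifeExp, lag(lifeExp), use = "complete.obs")) %>% 
  summarise(mean_lag1_cor = mean(lag1_cor))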


Linear models - adding a covariate

.pull-left[

lifeEx_year_c <- lm(lifeExp ~ year + country, 
                    data=gapminder_sub)

tidy(lifeEx_year_c) %>% 
  knitr::kable(format="html", #<<
               digits=3) #<<

]

.pull-right[


]


Linear models - checking assumption violations

.pull-left[ Adding a covariate is not helping with our assumption violations. Neither will adding an interaction term; you can try it yourself (a sketch follows below) and see that the QQ-plot in particular looks maddeningly bad.

We need to handle the autocorrelation somehow. One alternative is to aggregate the data across countries and analyse the aggregated data. But that is data reduction: you will be losing a lot of power and cannot neatly account for the variability in the data. ]

.pull-right[

plot(lifeEx_year_c, which=2)

]
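As mentioned, you can try the interaction model yourself; a minimal sketch (lifeEx_year_ci is a name introduced for this example):

lifeEx_year_ci <- lm(lifeExp ~ year * country, data = gapminder_sub)
# As noted above, the QQ-plot of the interaction model still looks bad
plot(lifeEx_year_ci, which = 2)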


class: dark, middle, center

Let's do some modelling!

Exploring linear mixed models


Linear mixed models

.pull-left[

library(lme4)
lifeEx_y_c <- lmer(lifeExp ~ year + 
                     (1|country), 
                   data=gapminder_sub)

]

.pull-right[ In the lme4 package, a random effect (the autocorrelation specification) is added with the formula expression (1|entity), which fits a separate intercept per entity. In this case, our entity is country.

You may also use the nlme package for linear mixed models, but there the random effect is specified differently. ]
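For reference, a minimal sketch of the same random-intercept model in nlme (assuming nlme is installed; lifeEx_y_c_nlme is a name introduced here):

library(nlme)
# In nlme the random effect is a separate argument, not part of the main formula
lifeEx_y_c_nlme <- lme(lifeExp ~ year, random = ~ 1 | country,
                       data = gapminder_sub)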

broom::tidy(lifeEx_y_c) %>% 
  knitr::kable(format="html") #<<

Linear mixed models - inspecting residuals

.pull-left[

plot(lifeEx_y_c, which=1)

With LMEs the default plot method no longer gives us a QQ plot to inspect (though you can build one yourself from the residuals, as sketched below), but a fitted-versus-residuals plot you may still inspect. You should see a more or less random distribution of dots around a completely straight line at 0. This plot looks pretty nice, though there is a downward tail that is not ideal.

There is still something in the data we are not accounting for. Countries are nested within continents, and perhaps specifying this will help. There might be continental trends that could help our model fit better, and we can also utilise partial pooling, as the intercepts of countries within continents will be shrunk towards the continent estimate. This increases our degrees of freedom. ] .pull-right[


]
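If you do want a QQ plot for a merMod object, you can build one from the residuals yourself; a base-R sketch:

# QQ plot of the model residuals, built by hand
qqnorm(resid(lifeEx_y_c))
qqline(resid(lifeEx_y_c))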


Linear mixed models - random intercepts

Let's first try just adding continent as a random effect. This will not nest the countries within continents, and we will not get partial pooling. We ignore the hierarchical structure of the data, but would like to see whether this makes any difference to our model.

lifeEx_y_cc <- lmer(lifeExp ~ year + 
                      (1|country) + (1|continent), data=gapminder_sub)

broom::tidy(lifeEx_y_cc) %>% 
  knitr::kable(format="html") #<<

Linear mixed models - inspecting residuals

.pull-left[

plot(lifeEx_y_cc, which=1)

] .pull-right[


]


Linear mixed models - random intercepts

lifeEx_y_ccn <- lmer(lifeExp ~ year + (1|continent/country), data=gapminder_sub)

broom::tidy(lifeEx_y_ccn) %>% 
  knitr::kable(format="html") #<<
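For clarity, the / nesting shorthand expands to two random-intercept terms. The model below (lifeEx_y_ccn_expanded is a name introduced for this example) is equivalent to lifeEx_y_ccn:

# (1 | continent/country) is shorthand for
# (1 | continent) + (1 | continent:country)
lifeEx_y_ccn_expanded <- lmer(
  lifeExp ~ year + (1 | continent) + (1 | continent:country),
  data = gapminder_sub
)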

Linear mixed models - inspecting residuals

.pull-left[

plot(lifeEx_y_ccn, which=1)

]

.pull-right[


]


Linear mixed models - random intercepts

Comparing models is where mixed models, and the likelihood function they are fitted with, really shine. We want to find the model with the lowest AIC and/or BIC (these can also be extracted directly, as sketched below), while also keeping in mind which model best reflects the known structure of the data.

In this case, all models perform more or less equally, so we will stick with the nested model, which preserves more of the data structure.

anova(lifeEx_y_c, 
      lifeEx_y_cc, 
      lifeEx_y_ccn) %>% 
  tidy() %>% 
  knitr::kable(format="html") #<<
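A quick way to pull out the information criteria directly (note that anova() refits the models with ML before comparing, while AIC() and BIC() here use the REML fits, so the numbers will differ slightly):

AIC(lifeEx_y_c, lifeEx_y_cc, lifeEx_y_ccn)
BIC(lifeEx_y_c, lifeEx_y_cc, lifeEx_y_ccn)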

Linear mixed models - random intercepts

.pull-left[ Let's have a look at our fit. Make a data.frame with the variable of interest, year, spanning the period we want to predict over, then run predict() to get the fitted values. We also add a standard linear smooth to see whether it differs (country-specific fits are sketched after this slide).

predict_data <- data.frame(
  year = seq(min(gapminder_sub$year), 
             max(gapminder_sub$year))
)
predict_data$fit = predict(lifeEx_y_ccn, 
                           newdata=predict_data, 
                           re.form=NA)
ggplot(gapminder_sub, 
       aes(x=year, y=lifeExp)) + 
  geom_jitter(alpha=.2) +
  geom_smooth(method="lm") +
  geom_line(data=predict_data,
            colour="forestgreen",
            aes(y=fit))

]

.pull-right[


]
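The green line in the plot is the population-level prediction (re.form = NA). If you also want to see the country-specific fits, which include the random intercepts, a sketch (fit_country is a column name introduced here):

gapminder_sub %>% 
  # fitted() includes the random intercepts, giving one line per country
  mutate(fit_country = fitted(lifeEx_y_ccn)) %>% 
  ggplot(aes(x = year, y = lifeExp)) + 
  geom_jitter(alpha = .2) +
  geom_line(aes(y = fit_country, group = country, colour = continent),
            show.legend = FALSE)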


Linear mixed models - random slopes

.pull-left[ So far we have only been fitting random intercepts. We can also fit random slopes, meaning we allow countries to have different trajectories in life expectancy over time, which we know is a feature of these data (a variant without the intercept-slope correlation is sketched below).

lifeEx_y_s <- lmer(lifeExp ~ year + 
                     (year|country),
                   data=gapminder_sub)

]
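A side note: (year|country) estimates a random intercept and a random slope per country, plus the correlation between them. If you would rather leave out that correlation, lme4 has a double-bar shorthand; a sketch (lifeEx_y_s_uncor is a name introduced here):

# Uncorrelated random intercept and slope per country
lifeEx_y_s_uncor <- lmer(lifeExp ~ year + (1 + year || country),
                         data = gapminder_sub)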


Linear mixed models - random slopes

.pull-left[

plot(lifeEx_y_s, which = 1)

]

.pull-right[

lifeEx_y_s %>% 
  tidy() %>% 
  knitr::kable(format="html") #<<

]


Linear mixed models - random slopes

.pull-left[

lifeEx_y_sn <- lmer(lifeExp ~ year + 
                      (year|continent/country),
                    data=gapminder_sub)

]

.pull-right[

lifeEx_y_sn %>% 
  tidy() %>% 
  knitr::kable(format="html") #<<

]


Linear mixed models - comparing all models

.pull-left[

anova(lifeEx_y_c, 
      lifeEx_y_cc, 
      lifeEx_y_ccn,
      lifeEx_y_s,
      lifeEx_y_sn) %>% 
  tidy() %>% 
  knitr::kable(format="html") #<<

]

.pull-right[ If we compare all the models we have tried, they all seem to perform more or less the same. As such, it is up to us to pick the model that best reflects the structure of the underlying data. In this case, as both models with random slopes threw a convergence warning (one common first remedy is sketched below), I would stick to the one with nested intercepts alone. ]
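One common first remedy for such convergence warnings is to rescale the time variable so the intercept falls inside the observed range; a sketch (year0 and lifeEx_y_sn0 are names introduced here, and this is not guaranteed to remove the warning):

gapminder_sub <- gapminder_sub %>% 
  mutate(year0 = year - min(year))  # years since first measurement
lifeEx_y_sn0 <- lmer(lifeExp ~ year0 + (year0 | continent/country),
                     data = gapminder_sub)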



Playing with another dataset

Get data from Gabriela Hadjuk's post on linear mixed models

Get data at: http://gkhajduk.d.pr/9GPn/3nbbPoK6


More resources on LMMs

- Julia Pilowski's practical guide

- Bodo Winters' tutorials and data

- Jared Knowles' getting started

- B. Bolker's mixed models in R

- Gabriela Hadjuk's introduction


class: center

Shameless self-promotion

knitr::include_graphics("lme_rladies_london_files/Screenshot-2019-03-26-at-16.41.58.png")

drmowinckels.io


