In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 12,
  fig.height = 8
)

In this tutorial, we will go through a basic health association analysis using movement behaviour data.

Load in data

First we will load required packages and some helper functions.

# First we need to install packages that aren't already present
pkgs <- c("ggplot2", "data.table", "survival", "table1", "scales", "performance", "see", "patchwork")
pkgs_inst <- pkgs[!{pkgs %in% rownames(installed.packages())}]
install.packages(pkgs_inst, repos = "https://www.stats.bris.ac.uk/R/" )
rm(pkgs, pkgs_inst) # clean up after ourselves

# Load packages
invisible(lapply(c("ggplot2", "data.table", "survival"), library, character.only = TRUE)) # only loading packages that we'll use a lot

As well as external packages, we'll use some helper functions we've prepared previously. If you want to modify them, you can find them in the file R/utils.R on the GitHub page.

# Load helper functions
source("../R/utils.R")

We then load the data prepared in the last tutorial.

To do this with example data, replace the file location below with "/cdtshared/wearables/health_data_files/dataset-with-preprocessing-done.csv".

df <- fread("../data_and_data_prep/dataset-with-preprocessing-done.csv", data.table = FALSE)

Here are some bits of extra prep based on what we're doing next:

I added in variables scaling the movement behaviour variables to the number of hours or minutes in a day.
We will use an overall activity variable chopped by quintile in the data.

# Calculate proportion of time in different behaviours as min/day and hr/day
vars <- c("MVPA", "light", "sedentary", "sleep")
df[, paste0(vars, "_min_per_day")] <-
  lapply(df[, paste0(vars, ".overall.avg")], function(x) 24 * 60 * x)
df[, paste0(vars, "_hr_per_day")] <-
  lapply(df[, paste0(vars, ".overall.avg")], function(x) 24 * x)

# Cut by quintile
## Home-brewed cut-by-quantile function:
qtile <-
  function(x,
           probs = seq(0, 1, 0.25),
           labels = NULL,
           ordered = FALSE,
           na.rm = TRUE) {
    breaks <- quantile(x = x, probs = probs, na.rm = na.rm)
    out <-
      cut(
        x = x,
        breaks = breaks,
        labels = labels,
        include.lowest = TRUE,
        ordered_result = ordered
      )
    return(out)
  }
## Cut acc by quintile
fifths <- seq(0, 1, by = 0.2)
df$acc_quintiles <- qtile(df$acc.overall.avg,
                          fifths,
                          labels = paste0("Q", 1:5))
df$acc_quintiles_values <-  qtile(df$acc.overall.avg,
                                  fifths)

rm(vars, fifths) # clean up

Describe and explore the data

This section provides a few examples of descriptive analyses.

As always, it's worth spending some time here getting to know the data. Are the patterns as you would expect? Does it raise any potential issues? (e.g. when thinking about confounding)

The analyses here are just a sample of what you could do --- please do add to them.

Tables to describe the data

The following code constructs a table to describe participant characteristics by quintile of overall acitvity.

We are cheating a bit here by using a package that generates a nicely formatted table. You can of course write your own code to make a table of what you're interested in.

table1::table1(~ sex + age_gp + smoking | acc_quintiles_values, data=df)

Plots to describe the data

We'll also plot some of the variables to get a feel for how they're distributed.

# overall physical activity
## deciles
quantile(df$acc.overall.avg, prob = seq(0, 1, by = 0.1), na.rm = TRUE)
## histogram
hist(df$acc.overall.avg, breaks=1e3, xlim=c(0,100))

# MVPA
## deciles
quantile(df$MVPA_min_per_day, prob = seq(0, 1, by = 0.1), na.rm = TRUE)
## histogram
hist(df$MVPA_min_per_day, breaks=1000, xlim=c(0,300))

We can have a look at some of the machine-learned variables, to see if we believe them! Here we'll look at plots of probability of being in a particular behaviour by time of day:

plot_average_day(
  data = df,
  exposure_prefix = "sleep.hourOfDay.",
  exposure_suffix = ".avg",
  y_label = "Sleep (probability)"
) # This is a custom function. To see what it's doing see R/utils.R

Now let's look descriptively at something we might expect to vary as activity status varies: BMI. This is just a descriptive plot i.e. it's not adjusted for any other behaviours.

plot_var_and_quintile(data = df,
                   exposure = 'acc.overall.avg',
                   outcome = 'BMI'
                  ) # This is a custom function. To see what it's doing see R/utils.R

Run a simple health association analysis

Let's run a minimally (age group and sex) adjusted linear model for BMI, against fifths of average acceleration vector magnitude. This is attempting to model statistically the association we were looking at descriptively in the end of the last section.

min_adj_lm_BMI <- lm(BMI ~ acc_quintiles + age_gp + sex, data = df)

We can also look at the model diagnostics to understand more about the fit of the model:

performance::check_model(min_adj_lm_BMI) # It's perfectly possible to do these checks in base R - but this gives us some explanations 
# See https://easystats.github.io/performance/ and Lüdecke et al., 2021, JOSS

The fact the residuals aren't very normal is probably because the distribution of BMI is a bit skewed.

We can look at the model summary:

summary(min_adj_lm_BMI)

Next steps for modelling

Here we've looked at a very simple linear model. Next steps might be:

adjusting for confounders, e.g. perhaps socioeconomic status, smoking, or education affects both level of activity and BMI
refining the definition of the exposure or looking at different exposure variables
looking at different outcomes
logistic regression for prevalent disease (use the glm() function)
Cox regression for incident disease (see for example the survival package and the coxph() function)

Getting started with Cox regression

For example, here's an initial Cox regression analysis associating quintiles of overall acceleration with incident ischaemic heart disease.

From the data preparation step, we have an event status indicator at exit and a follow-up time variable. Using these, we can run a Cox model using the survival package in R:

cox_model <- coxph(Surv(fu_time, inc_ihd) ~ age_gp + sex + acc_quintiles, data = df)# This line uses functions from the 'survival' package

As discussed in the lecture, a key assumption of Cox regression is the proportional hazards assumption. There are several ways to assess this. One way is through plots of the Schoenfeld residuals. Read more here (see section 23.2.5 and 23.2.6).

plot(cox.zph(cox_model))

We can look at the model summary:

summary(cox_model)

The exp(coef) column gives the hazard ratio. Not surprisingly, older age and male sex are associated with higher risk of ischaemic heart disease, whereas a higher level of activity is associated with a lower risk of ischaemic heart disease.

We could plot the results:

data_for_plot <- as.data.frame(
  exp( cbind(coef(cox_model), confint(cox_model)) )
)
names(data_for_plot) <- c("HR", "lower_CI", "upper_CI")
data_for_plot$var_name <- rownames(data_for_plot)

ncoef <- nrow(data_for_plot)
data_for_plot <- data_for_plot[(ncoef - 3):ncoef, ]

ref_row <-
  data.frame(
    "var_name" = "Q1",
    "HR" = 1,
    "lower_CI" = 1,
    "upper_CI"  = 1
  )
data_for_plot <- rbind(ref_row, data_for_plot)
data_for_plot$quintile <-
  sub("acc_quintiles", "", data_for_plot$var_name)
data_for_plot$event_number <-
  sapply(
    X = as.factor(data_for_plot$quintile),
    FUN = function(x)
      sum(df$inc_ihd[df$acc_quintiles == x])
  )

ggplot(data_for_plot, aes(x = HR, y = quintile)) + 
  scale_x_continuous(trans = "log") + 
  scale_y_discrete(limits=rev) + 
  labs(title = "Association of activity with incident IHD", y = "Quintile of average acceleration")+
  geom_vline(xintercept = 1) + 
  geom_errorbar(aes(xmin = lower_CI, xmax = upper_CI), width = 0, size = 1) +
  geom_point(aes(size = event_number), shape = 15) +
  theme_classic()  + 
  theme(axis.line.y = element_blank())

Next steps

Again, there's lots you could do to build on and refine this example:

it would be good to adjust for other possible confounders.
here, we've excluded people who have prevalent disease recorded in hospital. What about people who self-reported prevalent disease at the UK Biobank baseline assessment?
you might want look at sensitivity analyses. For example, could the association be driven by participants who have reduced their activity due to poor health? How could you assess that?
this model is probably under-adjusted for age: age is the major determinant of IHD risk, and here we've adjusted for baseline age in very crude groups. A first step would be to use more granular groups or a continuous term.
A bonus: we've adjusted for baseline age and used time-on-study as the timescale in the Cox regression analysis. This is what most introductory texts do. However, in Cox regression we can use different timescales. In epidemiological studies, time-on-study might not be the most relevant timescale: for example, we could also use age as the timescale.

Have fun! :)

activityMonitoring/week2DataChallenge documentation built on May 30, 2022, 1:54 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

activityMonitoring/week2DataChallenge
Data Challenge - Week 2 (Health Association Analysis)

In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

Load in data

Describe and explore the data

Tables to describe the data

Plots to describe the data

Run a simple health association analysis

Next steps for modelling

Getting started with Cox regression

Next steps

R Package Documentation

Browse R Packages

We want your feedback!

activityMonitoring/week2DataChallenge Data Challenge - Week 2 (Health Association Analysis)

In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

Load in data

Describe and explore the data

Tables to describe the data

Plots to describe the data

Run a simple health association analysis

Next steps for modelling

Getting started with Cox regression

Next steps

R Package Documentation

Browse R Packages

We want your feedback!

activityMonitoring/week2DataChallenge
Data Challenge - Week 2 (Health Association Analysis)