knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 12, fig.height = 8 )
In this tutorial, we will go through a basic health association analysis using movement behaviour data.
First we will load required packages and some helper functions.
# First we need to install packages that aren't already present pkgs <- c("ggplot2", "data.table", "survival", "table1", "scales", "performance", "see", "patchwork") pkgs_inst <- pkgs[!{pkgs %in% rownames(installed.packages())}] install.packages(pkgs_inst, repos = "https://www.stats.bris.ac.uk/R/" ) rm(pkgs, pkgs_inst) # clean up after ourselves # Load packages invisible(lapply(c("ggplot2", "data.table", "survival"), library, character.only = TRUE)) # only loading packages that we'll use a lot
As well as external packages, we'll use some helper functions we've prepared previously. If you want to modify them, you can find them in the file R/utils.R on the GitHub page.
# Load helper functions source("../R/utils.R")
We then load the data prepared in the last tutorial.
To do this with example data, replace the file location below with "/cdtshared/wearables/health_data_files/dataset-with-preprocessing-done.csv".
df <- fread("../data_and_data_prep/dataset-with-preprocessing-done.csv", data.table = FALSE)
Here are some bits of extra prep based on what we're doing next:
# Calculate proportion of time in different behaviours as min/day and hr/day vars <- c("MVPA", "light", "sedentary", "sleep") df[, paste0(vars, "_min_per_day")] <- lapply(df[, paste0(vars, ".overall.avg")], function(x) 24 * 60 * x) df[, paste0(vars, "_hr_per_day")] <- lapply(df[, paste0(vars, ".overall.avg")], function(x) 24 * x) # Cut by quintile ## Home-brewed cut-by-quantile function: qtile <- function(x, probs = seq(0, 1, 0.25), labels = NULL, ordered = FALSE, na.rm = TRUE) { breaks <- quantile(x = x, probs = probs, na.rm = na.rm) out <- cut( x = x, breaks = breaks, labels = labels, include.lowest = TRUE, ordered_result = ordered ) return(out) } ## Cut acc by quintile fifths <- seq(0, 1, by = 0.2) df$acc_quintiles <- qtile(df$acc.overall.avg, fifths, labels = paste0("Q", 1:5)) df$acc_quintiles_values <- qtile(df$acc.overall.avg, fifths) rm(vars, fifths) # clean up
This section provides a few examples of descriptive analyses.
As always, it's worth spending some time here getting to know the data. Are the patterns as you would expect? Does it raise any potential issues? (e.g. when thinking about confounding)
The analyses here are just a sample of what you could do --- please do add to them.
The following code constructs a table to describe participant characteristics by quintile of overall acitvity.
We are cheating a bit here by using a package that generates a nicely formatted table. You can of course write your own code to make a table of what you're interested in.
table1::table1(~ sex + age_gp + smoking | acc_quintiles_values, data=df)
We'll also plot some of the variables to get a feel for how they're distributed.
# overall physical activity ## deciles quantile(df$acc.overall.avg, prob = seq(0, 1, by = 0.1), na.rm = TRUE) ## histogram hist(df$acc.overall.avg, breaks=1e3, xlim=c(0,100))
# MVPA ## deciles quantile(df$MVPA_min_per_day, prob = seq(0, 1, by = 0.1), na.rm = TRUE) ## histogram hist(df$MVPA_min_per_day, breaks=1000, xlim=c(0,300))
We can have a look at some of the machine-learned variables, to see if we believe them! Here we'll look at plots of probability of being in a particular behaviour by time of day:
plot_average_day( data = df, exposure_prefix = "sleep.hourOfDay.", exposure_suffix = ".avg", y_label = "Sleep (probability)" ) # This is a custom function. To see what it's doing see R/utils.R
Now let's look descriptively at something we might expect to vary as activity status varies: BMI. This is just a descriptive plot i.e. it's not adjusted for any other behaviours.
plot_var_and_quintile(data = df, exposure = 'acc.overall.avg', outcome = 'BMI' ) # This is a custom function. To see what it's doing see R/utils.R
Let's run a minimally (age group and sex) adjusted linear model for BMI, against fifths of average acceleration vector magnitude. This is attempting to model statistically the association we were looking at descriptively in the end of the last section.
min_adj_lm_BMI <- lm(BMI ~ acc_quintiles + age_gp + sex, data = df)
We can also look at the model diagnostics to understand more about the fit of the model:
performance::check_model(min_adj_lm_BMI) # It's perfectly possible to do these checks in base R - but this gives us some explanations # See https://easystats.github.io/performance/ and Lüdecke et al., 2021, JOSS
The fact the residuals aren't very normal is probably because the distribution of BMI is a bit skewed.
We can look at the model summary:
summary(min_adj_lm_BMI)
Here we've looked at a very simple linear model. Next steps might be:
glm()
function)coxph()
function)For example, here's an initial Cox regression analysis associating quintiles of overall acceleration with incident ischaemic heart disease.
From the data preparation step, we have an event status indicator at exit and a follow-up time variable. Using these, we can run a Cox model using the survival package in R:
cox_model <- coxph(Surv(fu_time, inc_ihd) ~ age_gp + sex + acc_quintiles, data = df)# This line uses functions from the 'survival' package
As discussed in the lecture, a key assumption of Cox regression is the proportional hazards assumption. There are several ways to assess this. One way is through plots of the Schoenfeld residuals. Read more here (see section 23.2.5 and 23.2.6).
plot(cox.zph(cox_model))
We can look at the model summary:
summary(cox_model)
The exp(coef)
column gives the hazard ratio. Not surprisingly, older age and male sex are associated with higher risk of ischaemic heart disease, whereas a higher level of activity is associated with a lower risk of ischaemic heart disease.
We could plot the results:
data_for_plot <- as.data.frame( exp( cbind(coef(cox_model), confint(cox_model)) ) ) names(data_for_plot) <- c("HR", "lower_CI", "upper_CI") data_for_plot$var_name <- rownames(data_for_plot) ncoef <- nrow(data_for_plot) data_for_plot <- data_for_plot[(ncoef - 3):ncoef, ] ref_row <- data.frame( "var_name" = "Q1", "HR" = 1, "lower_CI" = 1, "upper_CI" = 1 ) data_for_plot <- rbind(ref_row, data_for_plot) data_for_plot$quintile <- sub("acc_quintiles", "", data_for_plot$var_name) data_for_plot$event_number <- sapply( X = as.factor(data_for_plot$quintile), FUN = function(x) sum(df$inc_ihd[df$acc_quintiles == x]) ) ggplot(data_for_plot, aes(x = HR, y = quintile)) + scale_x_continuous(trans = "log") + scale_y_discrete(limits=rev) + labs(title = "Association of activity with incident IHD", y = "Quintile of average acceleration")+ geom_vline(xintercept = 1) + geom_errorbar(aes(xmin = lower_CI, xmax = upper_CI), width = 0, size = 1) + geom_point(aes(size = event_number), shape = 15) + theme_classic() + theme(axis.line.y = element_blank())
Again, there's lots you could do to build on and refine this example:
Have fun! :)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.