In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 12,
  fig.height = 8
)

# First we need to install packages that aren't already present
## CRAN Packages
pkgs <- c(
  "ggplot2",
  # Plotting works better with ggplot2
  "forestplot",
  # Plotting works better with forestplot
  "remotes" # To install epicoda from GitHub, the remotes package is required.
)

pkgs_inst <- pkgs[!{
  pkgs %in% rownames(installed.packages())
}]
install.packages(pkgs_inst)

## GitHub Packages
remotes::install_github(
    repo = "activityMonitoring/epicoda",
    build_opts = "--no-resave-data",
    build_vignettes = TRUE,
    build_manual = TRUE
  ) # TROUBLESHOOTING if this doesn't work, use build_vignettes = FALSE. You have the info from the vignette here!

rm(pkgs, pkgs_inst) # clean up after ourselves

Load packages:

# Load packages
invisible(lapply(
  c("ggplot2", "forestplot", "epicoda"),
  library,
  character.only = TRUE
))

This tutorial is an optional additional tutorial on Compositional Data Analysis for movement behaviour data. Compositional Data Analysis is an approach to the analysis of compositional data which has become increasingly popular in movement behaviour epidemiology in recent years. This tutorial illustrates some steps of an epidemiological analysis with a compositional exposure. It uses a package, epicoda, we have developed for this purpose. More information about any function we use can be found by running ?function_name (as usual). Please note that the package is under active development, and has been tested in a limited range of scenarios, so you may well find bugs. If you do, or if anything isn't clear, please do get in touch: rosemary.walmsley@gtc.ox.ac.uk.

This tutorial is closely based on one of the package vignettes... Which was itself based on an earlier version of this tutorial!

It is highly recommended to deal with missing data both in compositional variables and in covariate or outcome data prior to using the package: functions were designed assuming complete data and may not behave predictably where data is not complete.

Getting started with compositional data

Loading simulated data

simdata is a (simulated) compositional dataset in the epicoda package. It has 5 columns which make up a hypothetical movement behaviour composition: "vigorous" (activity), "moderate" (activity), "light" (activity), "sedentary" (behaviour), "sleep" It also contains sex, age group and outcome columns. As the data is simulated, all results are hypothetical.

head(simdata)

What is compositional data?

Data is compositional when the relative (not absolute) values of the variables are relevant to the problem being studied [@Filzmoser2018]. Another way to describe this is that compositional data is data that can be expressed as proportions of a whole without losing information.

Here, the compositional variables sum to the same total (24 --- the number of hours in the day), which is a common situation. As this means it is not possible to simply increase time in one behaviour --- it is only possible to spend time in one behaviour rather than in another behaviour --- it is the relative values that are relevant.

rowSums(simdata[1:10, c("vigorous", "moderate", "light", "sedentary", "sleep")])

Other examples of compositional data don't have a fixed sum. For example, consider the vote counts in an election for different parties in different constituencies: although the overall number of votes cast might differ between constituencies, it is the proportion of the vote going to each party, and not the total number of votes going to that party in a given constituency, that is relevant.

Notes

The variables in compositional data are often known as parts.
In Compositional Data Analysis, expressing the data as proportions of a whole is known as taking the closure, written and defined as:

\begin{equation} \mathcal{C}(part_{1}, part_{2}, ..., part_{D}) = \Bigg(\frac{part_{1}}{\sum_{i =1}^{D}{part_i}}, \frac{part_{2}}{\sum_{i =1}^{D}{part_i}}, \ldots, \frac{part_{D}}{\sum_{i =1}^{D}{part_i}}\Bigg) \end{equation}

Exploratory data analysis

The compositional mean

The compositional mean is a frequently-used summary statistic for compositional data [@Filzmoser2018]. It is found by taking the geometric mean of each component, then taking the closure so they sum to the overall total. Like the standard (multivariate) mean it is a measure of location, or central tendency. Unlike the standard mean, it interacts in an appropriate way with compositional data.

To calculate the compositional mean of columns "vigorous", "moderate", "light", "sedentary", "sleep" using epicoda, we run:

comp_mean(data = simdata, # this is the dataset
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep") # this is the labels of the compositional columns, 
          # which we specified above
          )

Unfortunately, it's thrown an error, with a message about zeroes. Compositional data analysis can't automatically include zero values. For more information about this issue, see [@Palarea-Albaladejo2015] and [@Rasmussen2020]. There are a couple of options:

They can be dropped. This is the right option if they're so called "true zeroes" i.e. not related to a limit in the precision with which we can measure.
They can be imputed. This is the right option if they're so-called "rounded zeroes" i.e. they're related to a limit in the precision with which we can measure. For example, with movement behaviour data, it might be that if we recorded data for long enough, we would eventually see some vigorous activity, but our measurement period wasn't long enough. It might also be that our measurement device can't record values lower than a certain level.

For convenience, many functions in this package let you specify what to do with zeroes as part of the function, and will then perform this preprocessing step as part of the process. They have two arguments to do this: rounded_zeroes (default: TRUE) and det_limit (default: NULL).

Setting rounded_zeroes = TRUE and det_limit to a numeric value on the same scale as the data imputes zeroes with that detection limit. Under the hood, it uses the lrEM() algorithm from the zCompositions package. Note that currently the same detection limit must apply to all compositional columns. [Note that if the input data scale and the output scale (units argument), are different, it should match the input data scale.]
Setting rounded_zeroes = FALSE will drop any rows with zeroes (i.e. use the "true zeroes" approach above)
The default in the function is to run with rounded_zeroes = TRUE and det_limit = NULL. If there are no zeroes in the data, either as they've already been processed or because there just are no zero values, this will run fine: it doesn't need to do anything to the data. However, if there are zeroes it will throw an error message.

These convenience functions should be used with care. You can choose to perform zero imputation or drop zero values prior to doing any analysis on your data. This has several advantages:

It can speed up code as these calculations are done once, rather than repeatedly.
You can inspect the data post-processing of zeroes. For example, you can check that zero imputation has worked as expected. If dropping zeroes, you will know how many rows have been dropped.
You can do more complex zero imputation than implemented here (e.g. different detection limits for different parts; appropriate methods in the case of data with no complete columns)
You can use different zero imputation methods (see the zCompositions package)
If you are using compositions with non-constant sum, automated det_limit rescaling used in the package will fail. Again, these cases can be dealt with by pre-processing the zeroes using zCompositions

As well as comp_mean(), some other functions in the package also have rounded_zeroes and det_limit arguments: they also use calculations which need zeroes to be imputed or excluded. For all of them, it works the same as for comp_mean().

So we can now find the mean again, this time imputing the zeroes with a specified detection limit. In the simulated data, the shortest period of a behaviour we could detect was $0.0012\ hr/day$ (due to the length of an epoch for the accelerometer used to measure the data on which the simulated data is based).

comp_mean(data = simdata, # this is the dataset
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), # this is the labels of the compositional columns, 
          # which we specified above
          rounded_zeroes = TRUE, # this option specifies that we'll treat the zeroes 
          # as rounded zeroes i.e. we'll impute them
          det_limit = 0.0012, # this is the smallest value observable in the data
          )

Even though we started out with sums which added to 24, the mean seems to add to 1. This is because we didn't specify the units --- and so the package didn't know what units to return it to us in. But it could still return it as a sum to 1, without loss of information, as per the definition of compositional data.

To change, this we can specify the output units:

comp_mean(data = simdata,
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
          rounded_zeroes = TRUE, 
          det_limit = 0.0012, 
          units = "hr/day" # this is the units. There are pre-specified options "hr/day",
          # "hr/wk", "min/day", "min/wk" and "unitless". 
          # If you set units = "specified", you can also specify your own units using 
          # specified = c("my_units_name", sum of a composition in these units)
          )

Other descriptive statistics

While the compositional mean is a fairly clear compositional equivalent to the mean for non-compositional data, other standard descriptive statistics (such as the variance), do not have straightforward compositional equivalents for all purposes for which they are used (there may be several alternatives, depending on the purpose of the analysis) [@Filzmoser2018]. These are beyond the current scope of this package.

Plotting ternary diagrams

Ternary plots provide a way to plot compositional data with three parts. The R package ggtern is an extension of ggplot2 to carry out this plotting. Unfortunately, the current version causes some unfortunate masking of ggplot2 methods, so isn't shown in this vignette. But if you want to plot ternary plots, I (Rosemary) can share code and tips!

Regression modelling

Setting up regression models

The package has a wrapper for linear regression, logistic regression and Cox regression modelling using compositional variables, with adjustment for covariates. Compositional Data Analysis regression modelling works by transforming the variables making up the composition, and using the transformed variables (known as "log-ratio coordinates") in the model (see "Mechanics of Compositional Data Analysis" below). The package automates these steps. For example, to set up a linear regression model for the association between behaviour variables and BMI, adjusted for age group and sex:

lm_BMI <- comp_model(
  type = "linear",
  outcome = "BMI",
  covariates = c("agegroup", "sex"),
  data = simdata,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  det_limit = 0.0012
)

Logistic models can be produced in exactly the same way by setting type = "logistic" and using an appropriate binary outcome variable. Cox models can be produced similarly by setting type = "cox" and either setting outcome as a Surv object from the package survival or setting the follow_up_time and event arguments. For examples, see "Regression modelling: further examples".

Understanding the model output

The model object is a standard lm object, and we can inspect it in the usual way:

plot(lm_BMI)

The model diagnostics can be interpreted as usual, and in this case all looks fine.

summary(lm_BMI)

The summary is hard to interpret because the transformed variables or 'coordinates' used in the model do not relate in a straightforward way to the original composition.

Presenting the results: tabulate pivot-coordinate coefficients

One way to present the results is to present coefficients as usual for covariates, and then produce as many models as there are compositional variables. From each of these models, the coefficient for the first 'ilr pivot coordinate' is taken (see "Mechanics of Compositional Data Analysis" below). These ilr pivot coordinate coefficients can be interpreted in terms of transfer between the particular compositional part and all other parts proportionally. tab_coefs() is a function which wraps generating one model per compositional variable and outputting a table of model coefficients.

tab_coefs(scale_type = "lp", # This argument can be "lp" or "exp" and determines whether coefficients are presented on the scale of the linear predictors ("lp") or are exponentiated ("exp"). 
          # Exponentiation gives the Odds Ratio for logistic regression models and the Hazard Ratio for Cox regression models.
          level = 0.95,
          type = "linear",
          outcome = "BMI", 
          data = simdata, 
          covariates = c("agegroup", "sex"),
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
          det_limit = 0.0012)

Notes:

As these coefficients come from different models they do not parametrise a single model.
Arguments which would be needed for comp_model() must be given to this function too, as it regenerates the regression models.
The function tab_covariate_coefs() is called within this function, and can be used to extract the coefficients of the covariates only from a model.

Presenting the results: reallocation plots

Substitution or reallocation plots show the model-estimated difference in the outcome (relative to at the compositional mean) as the balance between two parts of the composition varies. In the case of time use/movement behaviours, they are sometimes described as isotemporal substitution plots and show the model-estimated difference in the outcome at higher time in one behaviour at the expense of time in another behaviour (the model-estimated difference in outcome for a hypothetical reallocation of time from one behaviour to another). For example:

plot_transfers(
  from_part = "sedentary",
  to_part = "moderate",
  model = lm_BMI ,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  y_label = "Model-estimated difference in BMI",
  units = "hr/day"
)

Notes

There is more information on how confidence intervals for this plot are calculated in a vignette (vignette("derivation_of_CIs_used")).
To customise the appearance of the plot see "Customising plots".

Presenting the results: estimates at particular compositions

Another set of estimates that might be of interest are estimates of the difference in outcome associated with particular compositions, again relative to the mean. These can be examined by plotting these in a "forest plot" style plot. This allows assessment of the model-estimated associations with more complex substitutions of time. Note that you may wish to more flexibly plot this yourself (see section: "Getting the numbers out: estimates for particular compositions").

Generating perturbations of a composition

While there are hugely varied compositions which might be possible, one kind of composition is often of interest. This is compositions where one particular part is perturbed at the expense of all other parts proportionally. As this is a common use case, there is a function implemented to do this:

cm <- as.data.frame(comp_mean(
  data = simdata,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  det_limit = 0.0012, 
  units = "hr/day" # Note whatever the units are here is the units
  # you should specify the changes in in the next two lines.
))
new_comp <-
  change_composition(
    composition = cm,
    main_part = "moderate",
    main_change = +0.5,
    comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
  )
new_comp2 <-
  change_composition(
    composition = cm,
    main_part = "sedentary",
    main_change = -3.5,
    comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
  )
list_for_plot <-
  list("Extra 0.5 hr/day moderate" = new_comp,
       "3.5 hr/day less sedentary" = new_comp2)
print(list_for_plot)

Plotting estimates at particular compositions

We now need to plot these, which can be done as follows.

forest_plot_comp(
  composition_list = list_for_plot,
  model = lm_BMI,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
)

Please note that on the plot with terms = TRUE the reference is always the compositional mean in the dataset used to create the model. This will agree with the compositional mean calculated above and which we calculated perturbations relative to if there is no missing data in the dataset used to create the model. To check this, there's a utility function in the package to get the compositional mean directly from the model (although it's a bit ugly):

# Set up of the function means it uses this internal from the modelling. Hopefully will be streamlined in future... :
tl_for_check <-
  transf_labels(
    comp_labels  = c("vigorous", "moderate", "light", "sedentary", "sleep"),
    transformation_type = "ilr"
  )
# The comp mean calculated from the model:
get_cm_from_model(
  lm_BMI,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  transf_labels = tl_for_check
)$cm
# The comp mean calculated above transformed to unitless scale:
cm / 24

The forest_plot_comp() function can also be used to compare models. To do this, leave model at its default NULL value, and set models_list as a named list of models. For example, to compare lm_BMI with a model adjusted for age_group only and an unadjusted model (very boring in the simulated data because the correlation structure for the simulation wasn't sufficiently rich!):

lm_BMI_unadjusted <- comp_model(
  type = "linear",
  outcome = "BMI",
  data = simdata,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  det_limit = 0.0012
)
lm_BMI_age_group_only <- lm_BMI <- comp_model(
  type = "linear",
  outcome = "BMI",
  covariates = c("agegroup"),
  data = simdata,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  det_limit = 0.0012
)
forest_plot_comp(
  composition_list = list_for_plot,
  models_list = list(
    "Unadjusted" = lm_BMI_unadjusted,
    "Age-adjusted" = lm_BMI_age_group_only,
    "Age- and sex- adjusted" = lm_BMI
  ),
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
)

Additional information

This section contains lots of additional information, on:

Mechanics of Compositional Data Analysis: Log-ratio transformations
Presenting results via model predictions
Customising plots
Regression modelling: further examples
Getting the numbers out: estimates/ predictions for particular compositions

Mechanics of Compositional Data Analysis: Log-ratio transformations

The functions above wrap all the "log-ratio mechanics" of using a Compositional Data Analysis approach.

There are several different kinds of transformations which can be used in order to work with compositional data (alr, ilr, clr). These transform the variables into coordinates which are capable of representing all the variation in the data. Different transformations have different pros/cons, and alr and clr can only be used for certain applications. All the functions default to ilr (isometric log ratio) transformation using so-called "pivot coordinates", which is fairly standard in the time use/movement behaviour literature (and is appropriate for the applications which the other transformations are used for). If you want to learn more, see [@Filzmoser2018][@Pawlowsky-Glahn2011].

The purpose of this package is not to provide this functionality, which is implemented in other R packages, such as compositions and robCompositions. (The main aim of this package is shown above, in the functions to perform standard epidemiological analyses). This functionality is reimplemented here simply to ensure that future changes to the defaults in those packages do not affect this package.

data_ilr_impute_zeroes <- transform_comp(
  data = simdata,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  transformation_type = "ilr",
  det_limit = 0.0012
)
print(head(data_ilr_impute_zeroes))

There are also a few other things we can do with this function, like specifying a given "first part" (which is useful when it comes to reporting, and is used therefore in tab_coefs()). To learn more, run ?transform_comp.

Presenting results via model predictions

An alternative way to get round the challenges of interpreting the model coefficients for these models is to look at model-predicted outcomes for different input variable values (rather than model-estimated differences). The plot_transfers() function can do this (for linear, logistic and Cox regression models) by setting terms = FALSE:

plot_transfers(
  from_part = "sedentary",
  to_part = "moderate",
  model = lm_BMI ,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  y_label = "Model-predicted outcome",
  units = "hr/day",
  terms = FALSE
)

Notes

The predictions and confidence intervals on this plot capture the predictions and uncertainty on them due to all variables in the model, including the covariates at the set fixed values.
The fixed_values argument can be used to set the values at which covariates (but not compositional variables) are fixed (which if not set default to mean or modal values for continuous and categorical variables respectively)
Other functions, such as forest_plot_comp(), can also output predictions by setting terms = FALSE.
If terms = FALSE is used with a Cox model, note results are still presented as HRs/ risk scores.

Customising plots

All the plotting functions can have settings changed to make more visually attractive plots (e.g. changing text size, line size, text colour). plot_transfers() is based on ggplot2.

To specify non-data aspects of the plot (e.g. line, axis and text size and colours), set the optional theme argument. Note that you will need the ggplot2 package loaded in order to do this.
For plot_transfers(), to specify characteristics of the plotted prediction line, set the point_specification argument as a geom_point() object.
For plot_transfers(), to specify colours of the error bars, set the error_bar_colour argument as the name of a colour.
For plot_transfers(), if the line looks 'gappy' (likely at the ends), increase granularity (and if it takes too long to plot, decrease granularity).

For an extreme example of implementing some of this:

# it's easier to specify themes if ggplot2 is loaded.
plot_transfers(
  from_part = "sedentary",
  to_part = "moderate",
  model = lm_BMI ,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
  y_label = "Model-estimated difference in outcome",
  units = "hr/day",
  point_specification = geom_point(size = 0.5, colour = "purple"),
  error_bar_colour = "yellow",
  theme = theme_bw() + theme(
    line = element_line(colour = "red", size = 2),
    text = element_text(
      family = "mono",
      face = "bold.italic",
      colour = "green"
    ),
    axis.line = element_line(colour = "blue", size = 3)
  )
)

Slightly differently, forest_plot_comp() is based on the forestplot package, so takes arguments specifying the plot in the same way as that function. There are several arguments which set particular aspects of the forest plot. In addition, any argument that the basic forestplot function can take, which isn't already set, can be set as an argument to forest_plot_comp().

# it's easier to specify options such as fpColors if the forestplot package is loaded.
forest_plot_comp(composition_list = list_for_plot,
                 model = lm_BMI,
                 comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
                 pred_name = "The predicted difference at this composition", # This sets the name with which to label the column. ,
                 title = "Analysis of physical behaviour data", # This sets the overall plot title.
                 col = forestplot::fpColors(zero = "black") # This sets the colour of the line at 0 (or 1, if it's the plot of a logistic or Cox model)
)

For additional information on how to customise the functions, see the manual by running e.g. ?plot_transfers.

Regression modelling: further examples

As well as linear models, we can produce logistic and Cox models:

log_outcome <- comp_model(type = "logistic",
                          outcome = "disease",
                          covariates = c("agegroup", "sex"),
                          data = simdata,
                          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
                          det_limit = 0.0012)
summary(log_outcome)

cox_outcome <- comp_model(type = "cox",
                          event = "event",
                          follow_up_time = "follow_up_time",
                          covariates = c("agegroup", "sex"),
                          data = simdata,
                          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
                          det_limit = 0.0012)
summary(cox_outcome)

Getting the numbers out: estimates for particular compositions

To get the numbers out for estimates at particular compositions, we can use the predict_fit_and_ci() function (which is wrapped in the plotting functions).This takes a data frame argument for new_data.

predict_fit_and_ci(
  model = lm_BMI,
  new_data = new_comp,
  comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
)[, c("fit", "lower_CI", "upper_CI")]

References

activityMonitoring/week2DataChallenge documentation built on May 30, 2022, 1:54 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

activityMonitoring/week2DataChallenge
Data Challenge - Week 2 (Health Association Analysis)

In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

Getting started with compositional data

Loading simulated data

What is compositional data?

Notes

Exploratory data analysis

The compositional mean

Other descriptive statistics

Plotting ternary diagrams

Regression modelling

Setting up regression models

Understanding the model output

Presenting the results: tabulate pivot-coordinate coefficients

Notes:

Presenting the results: reallocation plots

Notes

Presenting the results: estimates at particular compositions

Generating perturbations of a composition

Plotting estimates at particular compositions

Additional information

Mechanics of Compositional Data Analysis: Log-ratio transformations

Presenting results via model predictions

Notes

Customising plots

Regression modelling: further examples

Getting the numbers out: estimates for particular compositions

References

R Package Documentation

Browse R Packages

We want your feedback!

activityMonitoring/week2DataChallenge Data Challenge - Week 2 (Health Association Analysis)

In activityMonitoring/week2DataChallenge: Data Challenge - Week 2 (Health Association Analysis)

Getting started with compositional data

Loading simulated data

What is compositional data?

Notes

Exploratory data analysis

The compositional mean

Other descriptive statistics

Plotting ternary diagrams

Regression modelling

Setting up regression models

Understanding the model output

Presenting the results: tabulate pivot-coordinate coefficients

Notes:

Presenting the results: reallocation plots

Notes

Presenting the results: estimates at particular compositions

Generating perturbations of a composition

Plotting estimates at particular compositions

Additional information

Mechanics of Compositional Data Analysis: Log-ratio transformations

Presenting results via model predictions

Notes

Customising plots

Regression modelling: further examples

Getting the numbers out: estimates for particular compositions

References

R Package Documentation

Browse R Packages

We want your feedback!

activityMonitoring/week2DataChallenge
Data Challenge - Week 2 (Health Association Analysis)