knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 9,
  fig.height = 6
)
devtools::load_all()

Load the package

If you have installed the package following the instructions on GitHub, it can be loaded by:

library(epicoda)

A basic analysis

This vignette illustrates some of the things this package can be used to do in an epidemiological analysis with a compositional exposure. More information about any function can be found by running ?function_name (as usual).

Please note this package is shared as-is and without extensive review/testing. Use is at your own risk (https://github.com/activityMonitoring/epicoda). Bug reports are welcome.

It is highly recommended to deal with missing data, both in compositional variables and in covariate or outcome data, prior to using the package: the functions were designed assuming complete data and may not behave predictably where data are incomplete.

Getting started with compositional data

Loading simulated data

simdata is a (simulated) compositional dataset in the package. It has 5 columns which make up a hypothetical movement behaviour composition: "vigorous" (activity), "moderate" (activity), "light" (activity), "sedentary" (behaviour), "sleep". It also contains sex, age group and outcome columns. As the data is simulated, all results are hypothetical.

head(simdata)

What is compositional data?

Data is compositional when it is positive-valued and the relative (not absolute) values of the variables are relevant to the problem being studied [@Filzmoser2018]. Another way to describe this is that compositional data is data that can be expressed as proportions of a whole without losing information.

Here, the compositional variables sum to the same total (24, the number of hours in the day), which is a common situation. This means it is not possible to simply increase time in one behaviour: time can only be spent in one behaviour rather than in another, so it is the relative values that are relevant.

head(apply(simdata[, c("vigorous", "moderate", "light", "sedentary", "sleep")], 1, sum))

Other examples of compositional data don't have a fixed sum. For example, consider the vote counts in an election for different parties in different constituencies: although the overall number of votes cast might differ between constituencies, it is the proportion of the vote going to each party, and not the total number of votes going to that party in a given constituency, that is relevant.
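To make this concrete, here is a minimal base-R sketch (not part of the package) re-expressing the simulated movement behaviour data as proportions of the day; because only relative values matter, nothing is lost:

comp_cols <- c("vigorous", "moderate", "light", "sedentary", "sleep")
# Dividing each row by its total turns hours/day into proportions of the day.
head(simdata[, comp_cols] / rowSums(simdata[, comp_cols]))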


Exploratory data analysis

The compositional mean

The compositional mean is a frequently-used summary statistic for compositional data [@Filzmoser2018]. Like the standard (multivariate) mean it is a measure of location, or central tendency. Unlike the standard mean, it interacts in an appropriate way with compositional data.

To calculate the compositional mean of columns "vigorous", "moderate", "light", "sedentary", "sleep" using epicoda, we can run:

comp_mean(data = simdata, # this is the dataset
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep") # this is the labels of the compositional columns, 
          # which we specified above
          )

Unfortunately, this throws an error, with a message about zeroes. Compositional data analysis can't automatically include zero values. For more information about this issue, see [@Palarea-Albaladejo2015] and [@Rasmussen2020]. There are a couple of options: zeroes can be treated as "true zeroes" (genuinely no time in that behaviour), in which case rows containing them are excluded, or as "rounded zeroes" (values below a detection limit), in which case they are imputed.

For convenience, many functions in this package let you specify what to do with zeroes, and will then perform this preprocessing step as part of the function call. They have two arguments for this: rounded_zeroes (default: TRUE) and det_limit (default: NULL).

- Setting rounded_zeroes = TRUE and det_limit to a numeric value on the same scale as the data imputes zeroes using that detection limit. Under the hood, this uses the lrEM algorithm from the zCompositions package. Note that currently the same detection limit must apply to all compositional columns, and that if the input data scale and the output scale (the units argument) differ, det_limit should match the input data scale.
- Setting rounded_zeroes = FALSE will drop any rows with zeroes (i.e. the "true zeroes" approach above).
- The default is rounded_zeroes = TRUE with det_limit = NULL. If there are no zeroes in the data, either because they've already been processed or because there simply are none, this will run fine: nothing needs to be done to the data. However, if there are zeroes, it will throw an error.

These convenience functions should be used with care. You can instead choose to impute or drop zero values once, before doing any analysis on your data (see the sketch after this list). This has several advantages:

- It can speed up code, as these calculations are done once rather than repeatedly.
- You can inspect the data after the zeroes have been processed. For example, you can check that zero imputation has worked as expected; if dropping zeroes, you will know how many rows have been dropped.
- You can do more complex zero imputation than is implemented here (e.g. different detection limits for different parts, or appropriate methods for data with no complete columns).
- You can use different zero imputation methods (see the zCompositions package).
- If you are using compositions with a non-constant sum, the automated det_limit rescaling used in the package will fail. Again, such cases can be dealt with by pre-processing the zeroes using zCompositions.
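As a hedged sketch of this pre-processing (the lrEM call mirrors what the package is described as using internally; check ?zCompositions::lrEM for the exact interface, and note the detection limit of 0.0012 hr/day is specific to this simulated data):

library(zCompositions)
comp_cols <- c("vigorous", "moderate", "light", "sedentary", "sleep")

# Option 1: impute rounded zeroes once, up front, then analyse the imputed data.
# dl supplies a detection limit per column (here the same limit for all parts).
simdata_imputed <- simdata
simdata_imputed[, comp_cols] <- lrEM(simdata[, comp_cols], label = 0,
                                     dl = rep(0.0012, length(comp_cols)))

# Option 2: treat zeroes as "true zeroes" and drop the rows containing them.
simdata_nonzero <- simdata[rowSums(simdata[, comp_cols] == 0) == 0, ]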

As well as comp_mean, some other functions in the package also have rounded_zeroes and det_limit arguments: they too use calculations which require zeroes to be imputed or excluded. For all of them, these arguments work as they do for comp_mean. Note: previously, functions in this package "guessed" the detection limit as the minimum non-zero value observed in the data if it wasn't set. This behaviour has been removed, as it may have unfortunate consequences in some datasets; the detection limit should be an explicit analytic decision.
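To make that decision explicit, it can help to inspect the zeroes first; a minimal base-R sketch:

comp_cols <- c("vigorous", "moderate", "light", "sedentary", "sleep")
# How many zeroes are there in each compositional column?
colSums(simdata[, comp_cols] == 0)
# The smallest non-zero value observed is one candidate detection limit, but
# the choice should reflect the measurement process, not just the data.
m <- as.matrix(simdata[, comp_cols])
min(m[m > 0])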

So we can now find the mean again, this time imputing the zeroes with a specified detection limit. In the simulated data, the shortest period of a behaviour we could detect was $0.0012\ hr/day$ (due to the length of an epoch for the accelerometer used to measure the data on which the simulated data is based).

comp_mean(data = simdata, # this is the dataset
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), # these are the labels of the 
          # compositional columns, which we listed above
          rounded_zeroes = TRUE, # this option specifies that we'll treat the zeroes 
          # as rounded zeroes i.e. we'll impute them
          det_limit = 0.0012 # this is the smallest value observable in the data
          )

Even though we started out with sums which added to 24, the mean seems to sum to 1. This is because we didn't specify the units, so the package didn't know what units to return the result in. It can still return the mean summing to 1 without loss of information, as per the definition of compositional data.
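Equivalently, a unitless mean can be rescaled by hand; a minimal sketch (assuming, as the as.data.frame() usage later in this vignette suggests, that comp_mean returns a named list of proportions):

# Multiplying the unitless compositional mean by the day's total (24 hr)
# recovers the mean in hr/day without any loss of information.
m <- comp_mean(data = simdata,
               comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
               rounded_zeroes = TRUE,
               det_limit = 0.0012)
unlist(m) * 24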

Alternatively, we can have the package do this rescaling by specifying the output units:

comp_mean(data = simdata,
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
          rounded_zeroes = TRUE, 
          det_limit = 0.0012, 
          units = "hr/day" # this is the units. There are pre-specified options "hr/day",
          # "hr/wk", "min/day", "min/wk" and "unitless". 
          # If you set units = "specified", you can also specify your own units using 
          # specified = c("my_units_name", sum of a composition in these units)
          )
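For illustration, a hedged sketch of the units = "specified" route, following the pattern described in the comment above (the exact type expected for specified, e.g. character vector vs list, is an assumption here; check ?comp_mean):

comp_mean(data = simdata,
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
          rounded_zeroes = TRUE,
          det_limit = 0.0012,
          units = "specified",
          # Hypothetical custom units: ask for the composition as percentages
          # of the day, i.e. summing to 100.
          specified = c("% of day", 100))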

Other descriptive statistics

While the compositional mean is a fairly clear compositional equivalent to the mean for non-compositional data, other standard descriptive statistics (such as the variance), do not have straightforward compositional equivalents for all purposes for which they are used (there may be several alternatives, depending on the purpose of the analysis) [@Filzmoser2018]. These are beyond the current scope of this package.

Plotting ternary diagrams

Ternary plots provide a way to plot compositional data with three parts. The R package ggtern is an extension of ggplot2 which carries out this plotting. In this package, there is a wrapper for ggtern functions to implement common use cases: plotting the data density, and plotting prediction regions for the data, both overall and within subgroups. (Note that while "geom_confidence_region" from ggtern is used, the regions produced seem to be more properly described as prediction regions.)

Note (18.03.2021): A conflict between ggtern and ggplot2 means using ggtern can stop axis labels appearing on ggplot2 graphs. Pending a more permanent solution, these functions remain in the package, but ggtern is not automatically loaded and should be used with care.

Frequently, data has more than three compositional parts (as here). There are two alternatives in that case: using three of the existing parts (i.e. looking at the subcomposition formed by those parts), or summing parts to consider a different composition (where that is informative). In the simulated data, "vigorous", "moderate" and "light" represent three different intensities of activity, so it is meaningful to combine them:

simdata$all_activity <- simdata$vigorous + simdata$moderate + simdata$light
plot_prediction_region_ternary(data = simdata, 
                     parts_to_plot = c("all_activity", "sedentary", "sleep"),
                     probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95)
                      )

Another frequent use case is considering this plot separated by subgroups. For example, to look at different sexes (which in fact don't differ much in the simulated data):

plot_prediction_region_ternary(data = simdata, 
                     parts_to_plot = c("all_activity", "sedentary", "sleep"),
                     groups = "sex", 
                     probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95)
                      )

The mark_points argument can be used to annotate particular points using crosshairs. For example, to mark the compositional mean:

points_to_mark <- as.data.frame(comp_mean(data = simdata, comp_labels = c("all_activity", "sedentary", "sleep")))
plot_prediction_region_ternary(data = simdata, 
                     parts_to_plot = c("all_activity", "sedentary", "sleep"),
                     mark_points = points_to_mark,
                     probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95)
                      )                              
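Earlier in this section, plotting the data density was mentioned as a use case. A hedged sketch of that (the function name plot_density_ternary and its interface are assumed to mirror plot_prediction_region_ternary; check the package index for the exact name):

# Density of the data over the ternary diagram, rather than prediction regions.
plot_density_ternary(data = simdata,
                     parts_to_plot = c("all_activity", "sedentary", "sleep"))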

Regression modelling

Setting up regression models

The package has a wrapper for linear regression, logistic regression and Cox regression modelling using compositional variables, with adjustment for covariates. Compositional Data Analysis regression modelling works by transforming the variables making up the composition (see "Mechanics of Compositional Data Analysis" below), and using the transformed variables (known as "log-ratio coordinates") in the model. The package automates these steps. For example, to set up a linear regression model for the association between behaviour variables and BMI, adjusted for age and sex:

lm_BMI <- comp_model(type = "linear",
         outcome = "BMI",
         covariates = c("agegroup", "sex"),
         data = simdata,
         comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
         det_limit = 0.0012)

Logistic models can be produced in exactly the same way by setting type = "logistic" and using an appropriate binary outcome variable. Cox models can be produced similarly by setting type = "cox" and either setting outcome as a Surv object from the package survival or setting the follow_up_time and event arguments. For examples, see "Regression modelling: further examples".
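As a quick illustration of the Surv-object route, a hedged sketch (it assumes the follow_up_time and event columns of simdata used later in this vignette; the exact way comp_model expects the Surv object to be supplied may differ, so check ?comp_model):

library(survival)
# Equivalent in intent to setting follow_up_time and event directly (see
# "Regression modelling: further examples" below).
cox_via_surv <- comp_model(type = "cox",
                           outcome = Surv(simdata$follow_up_time, simdata$event),
                           covariates = c("agegroup", "sex"),
                           data = simdata,
                           comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
                           det_limit = 0.0012)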

Understanding the model output

The model object is a standard lm object, and we can inspect it in the usual way:

plot(lm_BMI)
summary(lm_BMI)

The model diagnostics can be interpreted as usual, and in this case all looks fine. However, the summary is hard to interpret because the transformed variables or "coordinates" used in the model do not relate in a straightforward way to the original composition.

Presenting the results: tabulating pivot-coordinate coefficients

A common way to present the results is to present coefficients as usual for covariates, and then produce as many models as there are compositional variables. From each of these models, the coefficient for the first "ilr pivot coordinate" is taken (see "Mechanics of Compositional Data Analysis" below). These ilr pivot coordinate coefficients can be interpreted in terms of reallocation between the particular compositional part and all other parts proportionally. tab_coefs is a function which wraps generating one model per compositional variable and outputting a table of model coefficients.

tab_coefs(scale_type = "lp", # This argument can be "lp" or "exp" and determines whether 
          # coefficients are presented on the scale of the linear predictors ("lp") 
          # or are exponentiated ("exp"). Exponentiation gives the Odds Ratio for 
          # logistic regression models and the Hazard Ratio for Cox regression models. 
          level = 0.95, 
          type = "linear",
          outcome = "BMI",
          covariates = c("agegroup", "sex"),
          data = simdata,
          comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
          det_limit = 0.0012)


Presenting the results: substitution plots

Substitution or reallocation plots show the model-estimated difference in the outcome (relative to the outcome at the compositional mean) as the balance between two parts of the composition varies. In the case of time use/movement behaviours, they are sometimes called isotemporal substitution plots. They can be thought of as showing the difference in outcome associated with a hypothetical reallocation of time from one behaviour to another, as estimated by the model. For example:

plot_transfers(from_part = "sedentary",
               to_part = "moderate",
               model = lm_BMI,
               comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
               y_label = "Model-estimated difference in BMI",
               units = "hr/day")


Presenting the results: estimates at particular compositions

Another set of estimates that might be of interest are estimates at particular compositions. These can be examined in a "forest plot" style figure, which allows assessment of the model-estimated associations for much more complex reallocations between parts.

Generating perturbations of a composition

While hugely varied compositions might be possible, one kind is often of interest: compositions where one particular part is perturbed at the expense of all other parts proportionally. As this is a common use case, there is a function implemented to do this:

df <- as.data.frame(
  comp_mean(
    data = simdata,
    comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
    det_limit = 0.0012,
    units = "hr/day" # Note: whatever the units are here, the changes in the
    # next two calls should be specified in the same units.
  )
)
new_comp <-
  change_composition(
    composition = df,
    main_part = "moderate",
    main_change = +0.5,
    comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
  ) 

new_comp2 <-
    change_composition(
    composition = df,
    main_part = "sedentary",
    main_change = -3.5,
    comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep")
  )
list_for_plot <- list("Extra 0.5 hr/day moderate" = new_comp, "3.5 hr/day less sedentary" = new_comp2)
print(list_for_plot)
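change_composition covers perturbations of a single part; for more complex reallocations, a composition can also be written out by hand as a one-row data frame (a sketch; the values below are hypothetical and chosen to sum to 24 hr/day):

# A hand-specified composition reallocating time between several parts at once.
custom_comp <- data.frame(vigorous = 0.2, moderate = 1.0, light = 5.8,
                          sedentary = 9.0, sleep = 8.0)
list_for_plot[["Custom reallocation"]] <- custom_comp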

Plotting estimates at particular compositions

We now need to plot these, which can be done as follows.

forest_plot_comp(composition_list = list_for_plot, 
                 model = lm_BMI, 
                 comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))

Please note that on the plot with terms = TRUE, the reference is always the compositional mean in the dataset used to create the model. This will agree with the compositional mean calculated above (relative to which we calculated the perturbations) provided there is no missing data in the dataset used to create the model. To check this, there's a utility function in the package to get the compositional mean directly from the model (although it's a bit ugly):

# The set-up of the function means it uses this internal object from the modelling. Hopefully this will be streamlined in future:
tl_for_check <- transf_labels(comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), transformation_type = "ilr")
# The compositional mean calculated from the model:
get_cm_from_model(lm_BMI, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), transf_labels = tl_for_check)$cm
# The compositional mean calculated above, transformed to the unitless scale:
df / 24

The forest_plot_comp function can also be used to compare models. To do this, leave model at its default NULL value, and set models_list as a named list of models. For example, to compare lm_BMI with a model adjusted for age_group only and an unadjusted model (very boring in this case as the simulated data doesn't have realistic correlation structure for covariates!):

lm_BMI_unadjusted <- comp_model(type = "linear",
         outcome = "BMI",
         data = simdata,
         comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
         det_limit = 0.0012)
lm_BMI_age_group_only <- comp_model(type = "linear",
         outcome = "BMI",
         covariates = c("agegroup"),
         data = simdata,
         comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
         det_limit = 0.0012)
forest_plot_comp(composition_list = list_for_plot, 
                 models_list = list("Unadjusted" = lm_BMI_unadjusted, 
                                    "Age-adjusted" = lm_BMI_age_group_only,
                                    "Age- and sex- adjusted" = lm_BMI), 
                 comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))

Additional information

This section contains additional information on:

- the mechanics of Compositional Data Analysis (log-ratio transformations)
- presenting results via full model predictions
- customising plots
- further regression modelling examples
- getting the numbers out: estimates/predictions for particular compositions

Mechanics of Compositional Data Analysis: Log-ratio transformations

The functions above wrap all the "log-ratio mechanics" of using a Compositional Data Analysis approach.

There are several different kinds of transformation which can be used to work with compositional data (alr, ilr, clr). These transform the variables into coordinates which are capable of representing all the variation in the data. Different transformations have different pros and cons, and alr and clr can only be used for certain applications. All the functions default to the ilr (isometric log-ratio) transformation using so-called "pivot coordinates", which is fairly standard in the movement behaviour/time use literature (and is appropriate for the applications the other transformations are used for). To learn more, see [@Filzmoser2018; @Pawlowsky-Glahn2011].
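For reference, for a composition $(x_1, \dots, x_D)$ the first pivot coordinate (the coordinate whose coefficient tab_coefs reports for the part placed first) is:

$$z_1 = \sqrt{\frac{D-1}{D}}\,\ln\frac{x_1}{\left(\prod_{j=2}^{D} x_j\right)^{1/(D-1)}}$$

i.e. it captures the log-ratio of the first part to the geometric mean of the remaining parts [@Filzmoser2018].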

The purpose of this package is not to provide this functionality, which is already implemented in other R packages such as compositions and robCompositions; the main aim of this package is the standard epidemiological analyses shown above. The transformations are reimplemented here simply to ensure that future changes to the defaults in those packages do not affect this package.

data_ilr_impute_zeroes <- transform_comp(data = simdata,
                                   comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
                                   transformation_type = "ilr", 
                                   det_limit = 0.0012)
head(data_ilr_impute_zeroes)

There are also a few other things we can do with this function, like specifying a given "first part" (which is useful when it comes to reporting, and is used therefore in tab_coefs). To learn more, run ?transform_comp.

Presenting results via full model predictions

An alternative which may sometimes be useful when interpreting models is to consider model-predicted outcomes for different values of the input variables, including values of covariates (rather than model-estimated differences associated with a particular difference in the compositional variables). Functions related to plotting (e.g. plot_transfers) can do this when terms = FALSE is set:

plot_transfers(from_part = "sedentary",
               to_part = "moderate",
               model = lm_BMI,
               comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
               y_label = "Model-predicted outcome",
               units = "hr/day",
               terms = FALSE)


Customising plots

All the plotting functions can have settings changed to make more visually attractive plots (e.g. changing text size, line size, text colour). plot_transfers and simplex_plotting are based on ggplot2 and its extension ggtern.

For an extreme example of implementing some of this:

library(ggplot2) # it's easier to specify e.g. themes if ggplot2 is loaded. 
plot_transfers(from_part = "sedentary",
               to_part = "moderate",
               model = lm_BMI,
               comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"),
               y_label = "Model-estimated difference in outcome",
               units = "hr/day",
               point_specification = geom_point(size = 0.5, colour = "purple"),
               error_bar_colour = "yellow",
               theme = theme_bw() + theme(line = element_line(colour = "red", size = 2),
                                          text = element_text(family = "mono", face = "bold.italic", colour = "green"),
                                          axis.line = element_line(colour = "blue", size = 3)))

Slightly differently, forest_plot_comp is based on the forestplot package, so takes arguments specifying the plot in the same way as that function. There are several arguments which set particular aspects of the forest plot. In addition, any argument that the basic forestplot function can take, which isn't already set, can be set as an argument to forest_plot_comp.

library(forestplot) # it's easier to specify options such as fpColors if the forestplot package is loaded. 
forest_plot_comp(composition_list = list_for_plot, 
                 model = lm_BMI, 
                 comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
                 pred_name = "The estimated difference at this composition", # This sets the name with which to label the column. , 
                 title = "Analysis of movement behaviour data", # This sets the overall plot title. 
                 col = forestplot::fpColors(zero = "black") # This sets the colour of the line at 0 (or 1, if it's the plot of a logistic or Cox model)
                 )

For additional information on how to customise the functions, see the manual by running e.g. ?plot_transfers.

Regression modelling: further examples

As well as linear models, we can produce logistic and Cox models:

logistic_outcome <- comp_model(type = "logistic",
         outcome = "disease",
         covariates = c("agegroup", "sex"),
         data = simdata,
         comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
         det_limit = 0.0012)
cox_outcome <- comp_model(type = "cox",
         event = "event", 
         follow_up_time = "follow_up_time",
         covariates = c("agegroup", "sex"),
         data = simdata,
         comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), 
         det_limit = 0.0012)

Getting the numbers out: estimates/predictions for particular compositions

To get the model estimates/predictions out at particular compositions, we can use the predict_fit_and_ci function. This takes a data frame argument for new_data and is the underlying function used in the reallocation plots.

predict_fit_and_ci(model = lm_BMI, new_data = new_comp, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))

As noted above, this can be used to look at any reallocations between parts which are of interest.
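For example, reusing the hand-specified composition from "Generating perturbations of a composition" above (a sketch; custom_comp was defined there):

# Estimate and confidence interval at the hand-specified composition.
predict_fit_and_ci(model = lm_BMI, new_data = custom_comp,
                   comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))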

References


