knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_chunk$set(fig.width=9, fig.height=6)
devtools::load_all()
If you have installed the package following the instructions on GitHub, it can be loaded by:
library(epicoda)
This vignette illustrates some of the things this package can be used to do in an epidemiological analysis with a compositional exposure. More information about any function can be found by running ?function_name
(as usual).
Please note this package is shared as-is and without extensive review/testing. Use is at your own risk (https://github.com/activityMonitoring/epicoda). Bug reports are welcome.
It is highly recommended to deal with missing data both in compositional variables and in covariate or outcome data prior to using the package: functions were designed assuming complete data and may not behave predictably where data is not complete.
simdata
is a (simulated) compositional dataset in the package. It has 5 columns which make up a hypothetical movement behaviour composition: "vigorous" (activity), "moderate" (activity), "light" (activity), "sedentary" (behaviour), "sleep". It also contains sex, age group and outcome columns. As the data is simulated, all results are hypothetical.
head(simdata)
Data is compositional when it is positive-valued and the relative (not absolute) values of the variables are relevant to the problem being studied [@Filzmoser2018]. Another way to describe this is that compositional data is data that can be expressed as proportions of a whole without losing information.
Here, the compositional variables sum to the same total (24- the number of hours in the day), which is a common situation. As this means it is not possible to simply increase time in one behaviour - it is only possible to spend time in one behaviour rather than in another behaviour - it is the relative values that are relevant.
head(apply(simdata[, c("vigorous", "moderate", "light", "sedentary", "sleep")], 1, sum))
Other examples of compositional data don't have a fixed sum. For example, consider the vote counts in an election for different parties in different constituencies: although the overall number of votes cast might differ between constituencies, it is the proportion of the vote going to each party, and not the total number of votes going to that party in a given constituency, that is relevant.
The compositional mean is a frequently-used summary statistic for compositional data [@Filzmoser2018]. Like the standard (multivariate) mean it is a measure of location, or central tendency. Unlike the standard mean, it interacts in an appropriate way with compositional data.
To calculate the compositional mean of columns "vigorous", "moderate", "light", "sedentary", "sleep"
using epicoda
, we can run:
comp_mean(data = simdata, # this is the dataset comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep") # this is the labels of the compositional columns, # which we specified above )
Unfortunately, it's thrown an error, with a message about zeroes. Compositional data analysis can't automatically include zero values. For more information about this issue, see [@Palarea-Albaladejo2015] and [@Rasmussen2020]. There are a couple of options:
They can be dropped. This is the right option if they're so called "true zeroes", and not related to a limit in the precision with which we can measure.
They can be imputed. This is the right option if they're so-called "rounded zeroes" i.e. they're related to a limit in the precision with which we can measure. For example, with movement behaviour data, it might be that if we recorded data for long enough, we would eventually see some vigorous activity, but our measurement period wasn't long enough. It might also be that our measurement device can't record values lower than a certain level.
For convenience, many functions in this package let you specify what to do with zeroes as part of the function, and will then perform this preprocessing step as part of the process. They have two arguments to do this: rounded_zeroes
(default: TRUE
) and det_limit
(default: NULL
).
- Setting rounded_zeroes = TRUE
and det_limit
to a numeric value on the same scale as the data imputes zeroes with that detection limit. Under the hood, it uses the lrEM
algorithm from the zCompositions
package. Note that currently the same detection limit must apply to all compositional columns. [Note that if the input data scale and the output scale (units
argument), are different, it should match the input data scale.]
- Setting rounded_zeroes = FALSE
will drop any rows with zeroes (i.e. use the "true zeroes" approach above)
- The default in the function is to run with rounded_zeroes = TRUE
and det_limit = NULL
. If there are no zeroes in the data, either as they've already been processed or because there just are no zero values, this will run fine: it doesn't need to do anything to the data. However, if there are zeroes it will throw an error message.
These convenience functions should be used with care. You can choose to perform zero imputation or drop zero values prior to doing any analysis on your data. This has several advantages:
- It can speed up code as these calculations are done once, rather than repeatedly.
- You can inspect the data post-processing of zeroes. For example, you can check that zero imputation has worked as expected. If dropping zeroes, you will know how many rows have been dropped.
- You can do more complex zero imputation than implemented here (e.g. different detection limits for different parts; appropriate methods in the case of data with no complete columns)
- You can use different zero imputation methods (see the zCompositions
package)
- If you are using compositions with non-constant sum, automated det_limit
rescaling used in the package will fail. Again, these cases can be dealt with by pre-processing the zeroes using zCompositions
As well as comp_mean
, some other functions in the package also have rounded_zeroes
and det_limit
arguments: they also use calculations which need zeroes to be imputed or excluded. For all of them, it works the same as for comp_mean
. Note: Previously, functions in this package "guessed" the detection limit as the minimum non-zero value observed in the data if it wasn't set. This behaviour has been removed as this may have unfortunate consequences in some datasets and should be an explicit analytic decision.
So we can now find the mean again, this time imputing the zeroes with a specified detection limit. In the simulated data, the shortest period of a behaviour we could detect was $0.0012\ hr/day$ (due to the length of an epoch for the accelerometer used to measure the data on which the simulated data is based).
comp_mean(data = simdata, # this is the dataset comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), # this is the labels of the compositional columns, # which we specified above rounded_zeroes = TRUE, # this option specifies that we'll treat the zeroes # as rounded zeroes i.e. we'll impute them det_limit = 0.0012, # this is the smallest value observable in the data )
Even though we started out with sums which added to 24, the mean seems to add to 1. This is because we didn't specify the units - and so the package didn't know what units to return it to us in. But it could still return it as a sum to 1, without loss of information, as per the definition of compositional data.
To change, this we can specify the output units:
comp_mean(data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), rounded_zeroes = TRUE, det_limit = 0.0012, units = "hr/day" # this is the units. There are pre-specified options "hr/day", # "hr/wk", "min/day", "min/wk" and "unitless". # If you set units = "specified", you can also specify your own units using # specified = c("my_units_name", sum of a composition in these units) )
While the compositional mean is a fairly clear compositional equivalent to the mean for non-compositional data, other standard descriptive statistics (such as the variance), do not have straightforward compositional equivalents for all purposes for which they are used (there may be several alternatives, depending on the purpose of the analysis) [@Filzmoser2018]. These are beyond the current scope of this package.
Ternary plots provide a way to plot compositional data with three parts. The R package ggtern
is an extension of ggplot2
to carry out this plotting. In this package, there is a wrapper for ggtern
functions to implement common use cases: plotting the data density, and plotting prediction regions for the data, both overall and within subgroups. (Note that while the "geom_confidence_region" from ggtern
is used, the regions produced seem to be more properly described as prediction regions)
Note (18.03.2021): A conflict between ggtern
and ggplot2
means using ggtern
can stop axis labels appearing on ggplot2
graphs. Pending a more permanent solution, these functions remain in the package, but ggtern
is not automatically loaded and should be used with care.
Frequently, data has more than three compositional parts (as here). There are two alternatives in that case: using three parts of the existing parts (looking at the subcomposition formed by these parts), or summing parts to consider a different composition (where that is informative). In the simulated data, "vigorous", "moderate"
and "light"
represent three different intensities of activity, so it is meaningful to combine them:
simdata$all_activity <- simdata$vigorous + simdata$moderate + simdata$light plot_prediction_region_ternary(data = simdata, parts_to_plot = c("all_activity", "sedentary", "sleep"), probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95) )
Another frequent use case is considering this plot separated by subgroups. For example, to look at different sexes (which in fact don't differ much in the simulated data):
plot_prediction_region_ternary(data = simdata, parts_to_plot = c("all_activity", "sedentary", "sleep"), groups = "sex", probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95) )
The mark_points
argument can be used to annotate particular points using crosshairs. For example, to mark the compositional mean:
points_to_mark <- as.data.frame(comp_mean(data = simdata, comp_labels = c("all_activity", "sedentary", "sleep"))) plot_prediction_region_ternary(data = simdata, parts_to_plot = c("all_activity", "sedentary", "sleep"), mark_points = points_to_mark, probs = c(0.25, 0.5, 0.75) # This will plot 25, 50 and 75% prediction regions (default is 0.5, 0.9, 0.95) )
The package has a wrapper for linear regression, logistic regression and Cox regression modelling using compositional variables, with adjustment for covariates. Compositional Data Analysis regression modelling works by transforming the variables making up the composition (see "Mechanics of Compositional Data Analysis" below), and using the transformed variables (known as "log-ratio coordinates") in the model. The package automates these steps. For example, to set up a linear regression model for the association between behaviour variables and BMI
, adjusted for age and sex:
lm_BMI <- comp_model(type = "linear", outcome = "BMI", covariates = c("agegroup", "sex"), data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012)
Logistic models can be produced in exactly the same way by setting type = "logistic"
and using an appropriate binary outcome variable. Cox models can be produced similarly by setting type = "cox"
and either setting outcome
as a Surv
object from the package survival
or setting the follow_up_time
and event
arguments. For examples, see "Regression modelling: further examples".
The model object is a standard lm
object, and we can inspect it in the usual way:
plot(lm_BMI) summary(lm_BMI)
The model diagnostics can be interpreted as usual, and in this case all looks fine. However, the summary is hard to interpret because the transformed variables or "coordinates" used in the model do not relate in a straightforward way to the original composition.
A common way to present the results is to present coefficients as usual for covariates, and then produce as many models as there are compositional variables. From each of these models, the coefficient for the first "ilr pivot coordinate" is taken (see "Mechanics of Compositional Data Analysis" below). These ilr pivot coordinate coefficients can be interpreted in terms of reallocation between the particular compositional part and all other parts proportionally. tab_coefs
is a function which wraps generating one model per compositional variable and outputting a table of model coefficients.
tab_coefs(scale_type = "lp", # This argument can be "lp" or "exp" and determines whether # coefficients are presented on the scale of the linear predictors ("lp") # or are exponentiated ("exp"). Exponentiation gives the Odds Ratio for # logistic regression models and the Hazard Ratio for Cox regression models. level = 0.95, type = "linear", outcome = "BMI", covariates = c("agegroup", "sex"), data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012)
comp_model
must be given to this function too, as it regenerates the regression models.tab_covariate_coefs
is called within this function, and can be used to extract the coefficients of the covariates only from a model. Substitution or reallocation plots show the model-estimated difference in the outcome (relative to at the compositional mean) as the balance between two parts of the composition varies. In the case of time use/movement behaviours, they are sometimes called isotemporal substitution plots. They can be thought of as showing the difference in outcome associated with a hypothetical reallocation of time from one behaviour to another, as estimated by the model. For example:
plot_transfers(from_part = "sedentary", to_part = "moderate", model = lm_BMI , comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), y_label = "Model-estimated difference in BMI", units = "hr/day")
plot_transfers
function. However, the predict_fit_and_ci
function can be used to estimate the outcome for any specified reallocation. vignette("derivation_of_CIs_used")
).Another set of estimates that might be of interest are estimates at particular compositions. These can be examined by plotting these in a "forest plot" style plot. This allows assessment of the model-estimated associations with much more complex reallocations between parts.
While there are hugely varied compositions which might be possible, one kind of composition is often of interest. This is compositions where one particular part is perturbed at the expense of all other parts proportionally. As this is a common use case, there is a function implemented to do this:
df <- as.data.frame( comp_mean( data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012, units = "hr/day", # Note whatever the units are here is the units # you should specify the changes in in the next two lines. ) ) new_comp <- change_composition( composition = df, main_part = "moderate", main_change = +0.5, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep") ) new_comp2 <- change_composition( composition = df, main_part = "sedentary", main_change = -3.5, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep") ) list_for_plot <- list("Extra 0.5 hr/day moderate" = new_comp, "3.5 hr/day less sedentary" = new_comp2) print(list_for_plot)
We now need to plot these, which can be done as follows.
forest_plot_comp(composition_list = list_for_plot, model = lm_BMI, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))
Please note that on the plot with terms = TRUE
the reference is always the compositional mean in the dataset used to create the model. This will agree with the compositional mean calculated above and which we calculated perturbations relative to if there is no missing data in the dataset used to create the model. To check this, there's a utility function in the package to get the compositional mean directly from the model (although it's a bit ugly):
# Set up of the function means it uses this internal from the modelling. Hopefully will be streamlined in future... : tl_for_check <- transf_labels(comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), transformation_type = "ilr") # The comp mean calculated from the model: get_cm_from_model(lm_BMI, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), transf_labels = tl_for_check)$cm # The comp mean calculated above transformed to unitless scale: df/24
The forest_plot_comp
function can also be used to compare models. To do this, leave model
at its default NULL
value, and set models_list
as a named list of models. For example, to compare lm_BMI
with a model adjusted for age_group
only and an unadjusted model (very boring in this case as the simulated data doesn't have realistic correlation structure for covariates!):
lm_BMI_unadjusted <- comp_model(type = "linear", outcome = "BMI", data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012) lm_BMI_age_group_only <- comp_model(type = "linear", outcome = "BMI", covariates = c("agegroup"), data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012) forest_plot_comp(composition_list = list_for_plot, models_list = list("Unadjusted" = lm_BMI_unadjusted, "Age-adjusted" = lm_BMI_age_group_only, "Age- and sex- adjusted" = lm_BMI), comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))
This section contains lots of additional information, on:
The functions above wrap all the "log-ratio mechanics" of using a Compositional Data Analysis approach.
There are several different kinds of transformations which can be used in order to work with compositional data (alr, ilr, clr). These transform the variables into coordinates which are capable of representing all the variation in the data. Different transformations have different pros/cons, and alr and clr can only be used for certain applications. All the functions default to ilr (isometric log ratio) transformation using so-called "pivot coordinates", which is fairly standard in the movement behaviour/time use literature (and is appropriate for the applications which the other transformations are used for). If you want to learn more, see [@Filzmoser2018][@Pawlowsky-Glahn2011].
The purpose of this package is not to provide this functionality, which is implemented in other R packages, such as compositions
and robCompositions
. (The main aim of this package is shown above, in the functions to perform standard epidemiological analyses). This functionality is reimplemented here simply to ensure that future changes to the defaults in those packages do not affect this package.
data_ilr_impute_zeroes <- transform_comp(data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), transformation_type = "ilr", det_limit = 0.0012) head(data_ilr_impute_zeroes)
There are also a few other things we can do with this function, like specifying a given "first part" (which is useful when it comes to reporting, and is used therefore in tab_coefs
). To learn more, run ?transform_comp
.
An alternative which may sometimes be useful when interpreting models is to consider model-predicted outcomes for different input variable values, including values of covariates (rather than model-estimated differences associated with a particular difference in compositional variables). Functions related to plotting (e.g. the plot_transfers
function) can do this by setting terms = FALSE
:
plot_transfers(from_part = "sedentary", to_part = "moderate", model = lm_BMI , comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), y_label = "Model-predicted outcome", units = "hr/day", terms = FALSE)
fixed_values
argument can be used to set the values at which covariates are fixed (which if not set default to mean or modal values for continuous and categorical variables respectively)forest_plot_comp
, can also output predictions by setting terms = FALSE
. All the plotting functions can have settings changed to make more visually attractive plots (e.g. changing text size, line size, text colour). plot_transfers
and simplex_plotting
are based on ggplot2
and its extension ggtern
.
theme
argument. Note that you will need the ggplot2
package loaded in order to do this.plot_transfers
, to specify characteristics of the plotted line, set the point_specification
argument as a geom_point
object.plot_transfers
, to specify colours of the error bars, set the error_bar_colour
argument as the name of a colour.plot_transfers
, if the line looks "gappy" (likely at the ends), increase granularity
(and if it takes too long to plot, decrease granularity
). For an extreme example of implementing some of this:
library(ggplot2) # it's easier to specify e.g. themes if ggplot2 is loaded. plot_transfers(from_part = "sedentary", to_part = "moderate", model = lm_BMI , comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), y_label = "Model-estimated difference in outcome", units = "hr/day", point_specification = geom_point(size = 0.5, colour = "purple"), error_bar_colour = "yellow", theme = theme_bw()+ theme(line = element_line(colour = "red", size = 2), text = element_text(family = "mono", face = "bold.italic", colour = "green"), axis.line = element_line(colour = "blue", size = 3)))
Slightly differently, forest_plot_comp
is based on the forestplot
package, so takes arguments specifying the plot in the same way as that function. There are several arguments which set particular aspects of the forest plot. In addition, any argument that the basic forestplot
function can take, which isn't already set, can be set as an argument to forest_plot_comp
.
library(forestplot) # it's easier to specify options such as fpColors if the forestplot package is loaded. forest_plot_comp(composition_list = list_for_plot, model = lm_BMI, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), pred_name = "The estimated difference at this composition", # This sets the name with which to label the column. , title = "Analysis of movement behaviour data", # This sets the overall plot title. col = forestplot::fpColors(zero = "black") # This sets the colour of the line at 0 (or 1, if it's the plot of a logistic or Cox model) )
For additional information on how to customise the functions, see the manual by running e.g. ?plot_transfers
.
As well as linear models, we can produce logistic and Cox models:
logistic_outcome <- comp_model(type = "logistic", outcome = "disease", covariates = c("agegroup", "sex"), data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012) cox_outcome <- comp_model(type = "cox", event = "event", follow_up_time = "follow_up_time", covariates = c("agegroup", "sex"), data = simdata, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"), det_limit = 0.0012)
To get the model estimates/predictions out at particular compositions, we can use the predict_fit_and_ci
function. This takes a data frame argument for new_data
and is the underlying function used in the reallocation plots.
predict_fit_and_ci(model = lm_BMI, new_data = new_comp, comp_labels = c("vigorous", "moderate", "light", "sedentary", "sleep"))
As noted above, this can be used to look at any reallocations between parts which are of interest.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.