Transforming your variables is a common activity when analyzing cohorts. In this lesson we will cover the what, why, and how of variable transformations.
Transforming a variable is the act of applying a mathematical function to a variable.
There are many reasons to transform a variable. We transform to fit the variable distribution to assumptions for a statistical technique, to examine linear relationships between variables, or create a common unit among variables for easier comparison. Ultimately, you use transformations to make it easier to interpret your analyses. This is particularly important in health research since your results could be used to improve people's health.
| Type | R code | When to use |
|:-----|:-------|:------------|
| natural log | log(x)
| Highly-right skewed, positive data |
| mean-center | scale(x, scale = FALSE)
| Zero is the mean; use for easier interpretation |
| scaling | scale(x)
| Unit is 1 standard deviation; be careful with longitudinal data |
| square root | sqrt(x)
| Great for count data; handles zeros |
| inverse |1 / x
| To invert values/unit, but no zeros |
There are many types of transformations. Which you use depends on your data, your choice of statistics, and how you want the results to be interpreted. Here is a table of a few common transformations. The natural logarithm is a very common transformation. You use it when your data is heavily right-skewed. Another common transformation is mean-centering, which is when the values are subtracted by the mean so zero represents the mean. Scaling is also fairly common, which is mean-centering and then dividing by the standard deviation giving units as standard deviations. The square root is great for count data since it handles zeros, unlike the logarithm or inverse. The inverse, also called the reciprocal, is useful for several purposes, such as to reverse the unit. For instance, if a unit is persons per doctor, the inverse is doctors per person.
# Typical way of transforming: transformed <- tidier2_framingham %>% mutate(body_mass_index_log = log(body_mass_index_log), heart_rate_log = log(heart_rate))
# More efficient way: invert <- function(x) 1 / x transformed <- tidier2_framingham %>% mutate_at(vars(body_mass_index, heart_rate), # Form: name_to_append = function list(scale = scale, log = log, invert = invert))
[1] "body_mass_index" "heart_rate" "body_mass_index_scale" [4] "heart_rate_scale" "body_mass_index_log" "heart_rate_log" [7] "body_mass_index_invert" "heart_rate_invert"
With the mutate function from dplyr, transforming is easy! The typical way of creating new columns from transformations is to use mutate and create variables one by one. This gets tedious fast when there are many more transformations or variables.
A faster way is to use the mutate_at()
function. It takes two
arguments, the variables specified by the vars function, and the transformation
functions given in a list. The list should have name-function pairings, with the
name being appended to the end of the variable name to form the newly
transformed variable column.
If a transformation function doesn't exist, I'd recommend creating a function of it so it's easier to provide in the list. This is what we did with invert here. So, with only one or two lines of code, you've transformed all these variables!
# Long data and facets to show all plots transformed %>% select(contains("heart_rate")) %>% gather(variable, value) %>% ggplot(aes(x = value, y = stat(density))) + geom_histogram(colour = "black", fill = "grey80") + geom_density() + facet_wrap(vars(variable), scales = "free")
Let's look at what the transformations do. Here is the code that will create the next figure. Notice the contains function in select. This function finds all variables with this string. We'll plot both a histogram and a density curve. Since both geoms are used, we need to use the stat function with density as the y argument. We convert the data to long form and combine with facet as we've done previously, to plot each transformation.
Here are the plotted transforms. Notice how scaling doesn't change the distribution, but now the middle is approximately zero and the units are in standard deviations. Compare this to the logarithm, which shrinks the distribution, so more extreme values are closer to the middle. Finally, see how the inverse literally inverts the distribution. Larger heart rates are now small and small are large. Consider how which transform you use will change the units and, in later analyses, the interpretation.
log()
and scale()
To summarize, we covered when to use transformations and some common types, such as the log or scale. As always, be careful and thoughtful about using transformations. Your findings may influence human health, so you must be sure to avoid possible miscommunication.
Let's practice these transformations!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.