In lwjohnst86/acdcourse: Course on Analyzing Cohort Datasets in R

Transforming your variables is a common activity when analyzing cohorts. In this lesson we will cover the what, why, and how of variable transformations.

Why transform variables?

Transformation: Applying math functions to data values
Multiple reasons to transform variables:
- To fit with statistical assumptions
- To analyze for linear relationships
- To convert to shared variable unit
- To improve interpretation of model results

Transforming a variable is the act of applying a mathematical function to a variable.

There are many reasons to transform a variable. We transform to fit the variable distribution to assumptions for a statistical technique, to examine linear relationships between variables, or create a common unit among variables for easier comparison. Ultimately, you use transformations to make it easier to interpret your analyses. This is particularly important in health research since your results could be used to improve people's health.

Common types of variable transformations

Which to use depends on data, analysis, and desired interpretation

| Type | R code | When to use | |:-----|:-------|:------------| | natural log | log(x) | Highly-right skewed, positive data | | mean-center | scale(x, scale = FALSE) | Zero is the mean; use for easier interpretation | | scaling | scale(x) | Unit is 1 standard deviation; be careful with longitudinal data | | square root | sqrt(x) | Great for count data; handles zeros | | inverse |1 / x| To invert values/unit, but no zeros |

There are many types of transformations. Which you use depends on your data, your choice of statistics, and how you want the results to be interpreted. Here is a table of a few common transformations. The natural logarithm is a very common transformation. You use it when your data is heavily right-skewed. Another common transformation is mean-centering, which is when the values are subtracted by the mean so zero represents the mean. Scaling is also fairly common, which is mean-centering and then dividing by the standard deviation giving units as standard deviations. The square root is great for count data since it handles zeros, unlike the logarithm or inverse. The inverse, also called the reciprocal, is useful for several purposes, such as to reverse the unit. For instance, if a unit is persons per doctor, the inverse is doctors per person.

Transforming your data in R

# Typical way of transforming:
transformed <- tidier2_framingham %>%
    mutate(body_mass_index_log = log(body_mass_index_log),
           heart_rate_log = log(heart_rate))

# More efficient way:
invert <- function(x) 1 / x
transformed <- tidier2_framingham %>%
    mutate_at(vars(body_mass_index, heart_rate),
              # Form: name_to_append = function
              list(scale = scale, log = log, invert = invert))

[1] "body_mass_index"        "heart_rate"          "body_mass_index_scale" 
[4] "heart_rate_scale"       "body_mass_index_log" "heart_rate_log"        
[7] "body_mass_index_invert" "heart_rate_invert"

With the mutate function from dplyr, transforming is easy! The typical way of creating new columns from transformations is to use mutate and create variables one by one. This gets tedious fast when there are many more transformations or variables.

A faster way is to use the mutate_at() function. It takes two arguments, the variables specified by the vars function, and the transformation functions given in a list. The list should have name-function pairings, with the name being appended to the end of the variable name to form the newly transformed variable column.

If a transformation function doesn't exist, I'd recommend creating a function of it so it's easier to provide in the list. This is what we did with invert here. So, with only one or two lines of code, you've transformed all these variables!

Visualizing the transformations

# Long data and facets to show all plots
transformed %>%
    select(contains("heart_rate")) %>%
    gather(variable, value) %>%
    ggplot(aes(x = value, y = stat(density))) +
    geom_histogram(colour = "black", fill = "grey80") +
    geom_density() +
    facet_wrap(vars(variable), scales = "free")

Let's look at what the transformations do. Here is the code that will create the next figure. Notice the contains function in select. This function finds all variables with this string. We'll plot both a histogram and a density curve. Since both geoms are used, we need to use the stat function with density as the y argument. We convert the data to long form and combine with facet as we've done previously, to plot each transformation.

Visualizing the transformations

Transformation distributions for heart rate.

Here are the plotted transforms. Notice how scaling doesn't change the distribution, but now the middle is approximately zero and the units are in standard deviations. Compare this to the logarithm, which shrinks the distribution, so more extreme values are closer to the middle. Finally, see how the inverse literally inverts the distribution. Larger heart rates are now small and small are large. Consider how which transform you use will change the units and, in later analyses, the interpretation.

Lesson summary

Transform for many reasons, with many choices
- Be thoughtful and careful about why
Common ones include log() and scale()
Transforms can strongly influence distribution

To summarize, we covered when to use transformations and some common types, such as the log or scale. As always, be careful and thoughtful about using transformations. Your findings may influence human health, so you must be sure to avoid possible miscommunication.