Cohort datasets often contain many discrete variables, for example with data obtained from questionnaires. In this lesson, you'll learn when and how to tidy up discrete data.
Dichotomania: The obsession to convert continuous data into discrete or binary data
First I want to mention a problem common in health research, known as dichotomania. Dichotomania is an obsession to convert continuous data into discrete data. It is also called discretizing or dichotomizing. For example, obesity is defined as a BMI greater than thirty.
Discretizing can cause a lot of problems. For instance it reduces statistical power, provides little clinical value, and there is a higher chance of misclassifying an individual's health risk. So, avoid it as much as you can.
| Category | BMI range | |:---------|:----------| | "Underweight" | < 18.5 | | "Normal weight" | 18.5 - 25 | | "Overweight" | 25 - 30 | | "Obese" | > 30 |
# For next figure: tidier_framingham %>% ggplot(aes(x = body_mass_index)) + geom_histogram(colour = "black", fill = "grey80") + geom_vline(xintercept = c(20, 25, 30), linetype = "dashed")
You may be familiar with the different weight classes used according to your body mass index, or BMI. These classes, shown in the table, are often used in health research. Let's visually show why this is a problem. The code provided creates the image shown next.
BMI continuously occurs at every value from 10 to more than 60. The three vertical lines are when discretization is applied to the data. Huge chunks of people are classified with others even if they are very close to one of the lines. For instance, someone with a BMI of twenty would be classified equally as someone with a BMI of twenty-four point nine. Or, if someone gains a BMI of zero point two, they could move from one category to the next. Biologically, this makes no sense. So, as best you can, avoid dichotomizing variables!
Naturally/generally discrete: Usually no fractions
Reasons to reduce:
While discretizing a continuous variable is often discouraged, some variables tend to be more naturally discrete. For example, you usually take one or two pills, not one point five. Sometimes, reducing the number of levels in these discrete variables makes sense, such as when there is large error in measurement or data entry, small sample sizes in some levels, or easier interpretation.
tidier_framingham %>% count(cigarettes_per_day)
# A tibble: 46 x 2 cigarettes_per_day n <dbl> <int> 1 0 6598 2 1 162 3 2 98 4 3 183 5 4 65 6 5 181 7 6 77 8 7 47 9 8 52 10 9 149 # … with 36 more rows
Let's use cigarettes per day as an example and use the count function on the variable. We can see that the values are integers. People usually report smoking one or two, not one and a half. From the previous exercise you'll have seen that more people report a rounded or estimated number, such as about ten or fifteen, but rarely numbers like seventeen. And because there are few people in any given integer, we could simplify the variable.
tidier_framingham <- tidier_framingham %>% mutate(cig_packs_per_day = case_when( cigarettes_per_day == 0 ~ "None", cigarettes_per_day >= 1 & cigarettes_per_day <= 20 ~ "Up to one", cigarettes_per_day >= 21 & cigarettes_per_day <= 40 ~ "One to two", cigarettes_per_day > 40 ~ "More than two", TRUE ~ NA_character_ )) tidier_framingham %>% count(cig_packs_per_day)
# A tibble: 5 x 2 cig_packs_per_day n <chr> <int> 1 NA 79 2 More than two 145 3 None 6598 4 One to two 1133 5 Up to one 3672
We can tidy up and reduce a variable's levels by using the case_when()
function from dplyr. Since packs of cigarettes usually come with 20 cigarettes,
let's create another variable for number of packs smoked. case_when()
takes multiple arguments, each is in the form of a condition on the left of the
tilde and the output value on the right. So we set the first condition as when
cigarettes is zero and assign the value none. The next is when cigarettes is
from one to twenty, assigning the value up to one, and so on. The final
condition should be formatted as shown when outputting characters. When counting
the new variable, it now has less levels.
library(forcats) tidier_framingham %>% mutate(cig_packs_per_day = fct_recode( cig_packs_per_day, # Form: "New level" = "Old level" "More than two" = "One to two" )) %>% count(cig_packs_per_day)
# A tibble: 4 x 2 cig_packs_per_day n <fct> <int> 1 More than two 1278 2 None 6598 3 Up to one 3672 4 NA 79
Case when is useful for many situations, but the forcats package is specific to
tidying up factor variables. Use the fct_recode()
function to edit and
rename levels directly. The argument has the form of the new level name then
equals old names. In this case, we now merged two levels together.
count()
to check categorical data case_when()
or fct_recode()
to reduce levelsIn summary, keep continuous data continuous. Use count, case_when()
,
and fct_recode()
functions to check and reduce categorical data.
Alright, let's do some tidying!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.