library(knitr)
knitr::opts_chunk$set(error = TRUE)
knitr::opts_chunk$set(out.width = "750px", dpi = 300)
knitr::opts_chunk$set(dev = "png", fig.width = 8, fig.height = 4.8889, dpi = 300)
knitr::opts_chunk$set(fig.path = "introduction_files/")
library(pacman)
if (!require("tntpr")) p_install_gh("tntp/tntpr") # pacman's GitHub installer; install_github() is not attached here
library(tntpr)
p_load(tidyverse, janitor, praise)
qtype_1 <- c("Strongly Disagree", "Disagree", "Somewhat disagree", "Somewhat agree", "Agree", "Strongly agree")
qtype_2 <- c("Strongly Disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")
qtype_3 <- c("No", "Yes")

survey_dat <- tibble(
  response_id = 1:100,
  years_of_experience = round(runif(100) * 10, digits = 0),
  q1 = sample(qtype_1, 100, replace = TRUE),
  # q2 never uses the lowest level, so its values are a strict subset of qtype_1
  q2 = sample(setdiff(qtype_1, "Strongly Disagree"), 100, replace = TRUE),
  q3 = sample(qtype_1, 100, replace = TRUE),
  q4 = replicate(100, praise()),
  q5 = sample(qtype_2, 100, replace = TRUE),
  q6 = sample(qtype_2, 100, replace = TRUE),
  q7 = replicate(100, praise("${Exclamation}! ${EXCLAMATION}!-${EXCLAMATION}! This is just ${adjective}!")),
  q8 = sample(qtype_3, 100, replace = TRUE),
  q9 = sample(qtype_3, 100, replace = TRUE)
)

Worked Example

Imagine you conduct a survey and are about to analyze the data. One of the first steps in the data cleaning process is to "factorize" the dataset. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Usually this is one of the most manual, error-prone parts of the data-cleaning process.

Below is a workflow I use that might be helpful for others. It relies on a couple of functions that may be new to you.

Step 1: Identify which variables should be factors.

Admittedly, this part usually takes some knowledge of the dataset and/or some exploratory browsing. Let's first use glimpse to take a look at the dataset.

survey_dat %>%
  glimpse()

The first thing I notice is that all the question variables are character vectors. Since I know some of the questions were multiple-choice questions, it's likely some of these should be converted to factors.

One thing you could try is seeing how many unique values each variable has.

survey_dat %>%
  map(unique) %>%
  map(length)

Here I notice response_id, q4, and q7 have 50+ unique values and should likely not be transformed to factors. Knowing the dataset, I think response_id should probably be a character data type, but I will not change that in this example. Finally, since q1, q2, q3, q5, q6, q8, and q9 are each "variables that have a fixed and known set of possible values", they are likely candidates to be factors.
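If the nested list from map(unique) followed by map(length) is hard to scan, a more compact variant is to count distinct values directly. This is only a sketch on a made-up toy data frame (toy and its columns are hypothetical, not part of survey_dat), using dplyr's n_distinct with purrr's map_int:

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for survey_dat
toy <- data.frame(
  id = 1:6,
  q_scale = c("Agree", "Disagree", "Agree", "Agree", "Disagree", "Agree"),
  q_text = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Named integer vector: one count of distinct values per column
distinct_counts <- map_int(toy, n_distinct)
distinct_counts
```

Columns with only a handful of distinct values (q_scale here) are the ones worth inspecting as factor candidates.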

Next, I'll see what the unique values are for these variables.

survey_dat %>%
  select(q1, q2, q3, q5, q6, q8, q9) %>%
  map(unique)

Aha! These do indeed look like factors.

Step 2: Transform to factors with appropriate level ordering.

Here a lot of people would mutate their dataset with a series of factor(., levels = c(...)) mutations. This works, but let's use factorize_df to help us do this faster with less typing.
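For comparison, the manual approach mentioned above might look roughly like the sketch below. The toy data frame and the agree_lvls vector are hypothetical and only illustrate how the same levels vector has to be repeated for every column:

```r
library(dplyr)

# Hypothetical level ordering and toy data, for illustration only
agree_lvls <- c("Disagree", "Neutral", "Agree")

toy <- data.frame(
  q_a = c("Agree", "Disagree", "Neutral"),
  q_b = c("Neutral", "Agree", "Agree"),
  stringsAsFactors = FALSE
)

# One factor() call per column, each repeating the levels argument
toy_manual <- toy %>%
  mutate(
    q_a = factor(q_a, levels = agree_lvls),
    q_b = factor(q_b, levels = agree_lvls)
  )

levels(toy_manual$q_b)
```

With two columns this is manageable, but with dozens of survey questions the repetition is where typos and mismatched level orders tend to creep in.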

First, let's see what the "levels" of q1 are.

survey_dat$q1 %>% unique()

Okay, looks like a 6-point Likert agreement scale. Let's pass these levels, in order, to the factorize_df function as the lvls argument.

survey_dat <- survey_dat %>%
  factorize_df(lvls = c("Strongly Disagree", "Disagree", "Somewhat disagree", "Somewhat agree", "Agree", "Strongly agree"))

survey_dat %>%
  glimpse()

survey_dat %>%
  map(levels)

Okay, that doesn't look too impressive, but notice that the function: a) searched each column in the dataset and tested whether its unique values were a subset of the values in the lvls argument, b) identified q1, q2, and q3 as fitting that criterion, and c) transformed those, and only those, three columns to factors with the appropriate levels.
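To make the subset test described above concrete, here is a minimal base R sketch of the idea. The helper factorize_sketch is hypothetical and is not the tntpr implementation; in particular, restricting the check to character columns and ignoring NA values are my assumptions:

```r
# Hypothetical helper illustrating the subset check; NOT tntpr's factorize_df
factorize_sketch <- function(df, lvls) {
  for (col in names(df)) {
    x <- df[[col]]
    vals <- x[!is.na(x)]
    # Convert only character columns whose observed values all appear in lvls
    if (is.character(x) && length(vals) > 0 && all(vals %in% lvls)) {
      df[[col]] <- factor(x, levels = lvls)
    }
  }
  df
}

# Hypothetical toy data
toy <- data.frame(
  id = 1:3,
  q_yes_no = c("Yes", "No", "Yes"),
  q_only_yes = c("Yes", "Yes", "Yes"),
  q_free = c("great", "fine", "Yes"),
  stringsAsFactors = FALSE
)

out <- factorize_sketch(toy, lvls = c("No", "Yes"))
sapply(out, class)
```

Note that q_only_yes is converted even though it never contains "No": its values are a subset of lvls, so it still receives the full set of levels. q_free is left alone because "great" and "fine" are not in lvls.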

Next, I can move on to the next factor, q5.

survey_dat$q5 %>% unique()
survey_dat <- survey_dat %>%
  factorize_df(lvls = c("Strongly Disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree"))

And finally q8.

survey_dat$q8 %>% unique()
survey_dat <- survey_dat %>%
  factorize_df(lvls = c("No", "Yes"))

Step 3: Check if factorizing worked like you intended.

Okay, we said we needed to change q1, q2, q3, q5, q6, q8, and q9 to factors. Let's verify this worked.

survey_dat %>%
  select(q1, q2, q3, q5, q6, q8, q9) %>%
  map(is.factor)

They're all factors...

survey_dat %>%
  select(q1, q2, q3, q5, q6, q8, q9) %>%
  map(levels)

... with the correct levels and level order. And voila, a 'factorized' dataset.
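If you would rather have the script stop when something is off than check the printed output by eye, the same verification can be written as assertions. This is a sketch on hypothetical toy data; the expected list is something you would write out yourself for your own columns:

```r
# Hypothetical toy data that has already been factorized
toy <- data.frame(
  q_a = factor(c("Yes", "No"), levels = c("No", "Yes")),
  q_b = factor(c("Yes", "Yes"), levels = c("No", "Yes"))
)

# Expected levels, in order, for each column that should be a factor
expected <- list(
  q_a = c("No", "Yes"),
  q_b = c("No", "Yes")
)

# Stops with an error if a column is not a factor or its levels differ
for (col in names(expected)) {
  stopifnot(
    is.factor(toy[[col]]),
    identical(levels(toy[[col]]), expected[[col]])
  )
}
```

Because identical() compares the level vectors element by element, a wrong level order fails the check just like a missing level does.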



tntp/tntpr documentation built on March 27, 2024, 6:26 p.m.