Tidying dataframes with tidyr

Overview

This section continues from manipulating dataframes with dplyr. Here we will work on some more basic operations that are handy for tidying your data. The tidyr package (part of the tidyverse) simplifies this process. Don't forget the dplyr verbs we already learned, as they will come in handy.

So what is tidy data? Tidy data refers to a dataframe that has been cleaned, munged, modified, etc. so that it is in an ideal state for data analysis (plotting, modeling, etc.). More specifically, this means that every column in your dataframe represents a variable and every row represents an observation. R for Data Science explains it in the following manner:

Each variable is in a column.
Each observation is a row.
Each value is a cell.

This is also referred to as long format (as opposed to wide format). Let's look at an example dataset that is not tidy. How many variables are there in this dataset?


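library(tidyverse)  # loads tidyr, dplyr, and ggplot2

# Example data from the ds4ling package
my_data <- ds4ling::test_scores_rm

# Objects used later in this section (each step is explained below)
my_data_long <- gather(my_data, test, score, -id, -spec)

my_data_tidy <- my_data_long %>%
  separate(., col = spec, into = c("group", "level"), sep = "_") %>%
  separate(., col = id, into = c("lang", "id"), sep = 4)

# Print the original, untidy data
my_data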


Our options are limited with the data in this format, but a few analyses are still possible. For example, we could create a scatter plot of test1 and test2:

my_data %>% 
  ggplot(., aes(x = test1, y = test2)) + 
    geom_point()

We could calculate Pearson's correlation coefficient:

cor(my_data$test1, my_data$test2)

Or we could compare the means using a paired samples t-test:

t.test(my_data$test1, my_data$test2, paired = TRUE)

Below we will walk through the key verbs that will allow us to tidy the data.

Main verbs

gather()

my_data_long <- gather(my_data, test, score, -id, -spec) 
my_data_long
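The call above can be tried on a small stand-in dataset. The sketch below assumes a dataframe with the same column layout as my_data (id, spec, test1, test2); the name toy_wide and all of its values are made up for illustration and are not taken from ds4ling. It also shows pivot_longer(), the newer tidyr (version 1.0.0 and later) interface for the same reshape.

```r
library(tidyr)

# Hypothetical stand-in for the wide data: one row per participant,
# one column per test (all values are invented)
toy_wide <- data.frame(
  id    = c("span01", "span02", "cata01", "cata02"),
  spec  = c("g1_lo", "g2_hi", "g1_lo", "g2_hi"),
  test1 = c(64, 70, 58, 81),
  test2 = c(69, 73, 66, 85),
  stringsAsFactors = FALSE
)

# gather(): stack test1 and test2 into a key column ("test") and a
# value column ("score"); id and spec are excluded with a minus sign
toy_long <- gather(toy_wide, key = test, value = score, -id, -spec)

# pivot_longer(): the same reshape with the newer interface
toy_long2 <- pivot_longer(toy_wide, cols = c(test1, test2),
                          names_to = "test", values_to = "score")

dim(toy_wide)  # 4 rows, 4 columns
dim(toy_long)  # 8 rows, 4 columns
```

Each participant now contributes one row per test, so the number of rows doubles while the two test columns collapse into test and score.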

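Now that the test labels are stored in a single column, test (a character vector by default, which ggplot2 treats as a discrete variable), we are able to plot the data differently.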

my_data_long %>% 
  ggplot(., aes(x = test, y = score)) + 
    geom_boxplot()
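Long data also pairs well with the dplyr verbs from the previous section. As a minimal sketch, assuming a long dataframe with test and score columns (toy_long and its values are hypothetical), a grouped summary gives one row of descriptives per test:

```r
library(dplyr)

# Hypothetical long-format data: three scores per test (invented values)
toy_long <- data.frame(
  test  = rep(c("test1", "test2"), each = 3),
  score = c(60, 70, 80, 65, 75, 85),
  stringsAsFactors = FALSE
)

# One row per test, with the mean and standard deviation of score
toy_summary <- toy_long %>%
  group_by(test) %>%
  summarize(mean_score = mean(score), sd_score = sd(score))

toy_summary
```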

separate()

my_data_long %>% 
  separate(., col = spec, into = c("group", "level"), sep = "_") %>%
  separate(., col = id, into = c("lang", "id"), sep = 4)
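The two calls above use sep in different ways: a character value is treated as a regular expression to split on, while a number is treated as a position. The sketch below illustrates both on a hypothetical two-row dataframe (toy_long and its values are invented, chosen only to mimic the id and spec formats implied by the code above).

```r
library(tidyr)

# Hypothetical long-format rows (invented values)
toy_long <- data.frame(
  id    = c("span01", "cata01"),
  spec  = c("g1_lo", "g1_lo"),
  test  = c("test1", "test1"),
  score = c(64, 58),
  stringsAsFactors = FALSE
)

# Character sep: split spec at the underscore into group and level
toy_step1 <- separate(toy_long, col = spec, into = c("group", "level"), sep = "_")

# Numeric sep: split id after the 4th character into lang and id
toy_sep <- separate(toy_step1, col = id, into = c("lang", "id"), sep = 4)

toy_sep
```

Note that the new id column stays a character vector ("01"), because separate() does not convert types unless convert = TRUE is set.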

spread()

my_data_tidy %>% 
  spread(., lang, score) 
my_data_tidy %>% 
  spread(., lang, score) %>% 
  ggplot(., aes(x = cata, y = span, color = group, shape = level)) + 
    geom_point()
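To see what spread() does to the shape of the data, here is a minimal sketch on a hypothetical dataframe with the same columns as my_data_tidy (toy_tidy and its values are invented). Each value of lang becomes its own column filled with the matching score. pivot_wider() is the newer tidyr interface for the same operation.

```r
library(tidyr)

# Hypothetical tidy data: one score per language per participant id
toy_tidy <- data.frame(
  lang  = c("span", "span", "cata", "cata"),
  id    = c("01", "02", "01", "02"),
  group = c("g1", "g2", "g1", "g2"),
  level = c("lo", "hi", "lo", "hi"),
  test  = "test1",
  score = c(64, 70, 58, 81),
  stringsAsFactors = FALSE
)

# spread(): the values of lang become column names, filled with score
toy_spread <- spread(toy_tidy, key = lang, value = score)

# pivot_wider(): the same reshape with the newer interface
toy_spread2 <- pivot_wider(toy_tidy, names_from = lang, values_from = score)

toy_spread
```

Rows that share the same id, group, level, and test are merged into one, which is what makes it possible to put cata on one axis and span on the other.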

unite()

my_data_tidy %>% 
  unite(., col = participant, lang, id, sep = "_", remove = FALSE) %>% 
  select(., -id)
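The remove argument controls whether the input columns survive the call. A minimal sketch on hypothetical data (toy_tidy and its values are invented) compares remove = FALSE, as used above, with the default remove = TRUE:

```r
library(tidyr)

# Hypothetical data with the two columns to be combined (invented values)
toy_tidy <- data.frame(
  lang  = c("span", "cata"),
  id    = c("01", "01"),
  score = c(64, 58),
  stringsAsFactors = FALSE
)

# remove = FALSE: participant is added and lang and id are kept
toy_keep <- unite(toy_tidy, col = participant, lang, id, sep = "_", remove = FALSE)

# Default (remove = TRUE): lang and id are dropped after combining
toy_drop <- unite(toy_tidy, col = participant, lang, id, sep = "_")

toy_keep
toy_drop
```

This is why the example above follows unite() with select(., -id): with remove = FALSE both input columns are kept, and only id is then dropped by hand.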

To summarize, we can use the tidyr package to tidy our data in a way that makes it ideal for plotting and modeling. The functions gather() and spread() are opposites. They can be used to convert our data from wide to long, and vice versa. The functions separate() and unite() are also complementary and can be used to split one column into two, or combine two columns into one.
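A short round trip can make the pairings concrete: spread() undoes gather(), and unite() undoes separate(). The sketch below uses a hypothetical dataframe (toy_wide and its values are invented), not the ds4ling data.

```r
library(tidyr)

# Hypothetical wide data (invented values)
toy_wide <- data.frame(
  id    = c("span01", "cata01"),
  spec  = c("g1_lo", "g1_lo"),
  test1 = c(64, 58),
  test2 = c(69, 66),
  stringsAsFactors = FALSE
)

# Wide -> long -> wide: spread() reverses gather()
toy_long <- gather(toy_wide, key = test, value = score, test1, test2)
toy_back <- spread(toy_long, key = test, value = score)

# One column -> two -> one: unite() reverses separate()
toy_split  <- separate(toy_wide, col = spec, into = c("group", "level"), sep = "_")
toy_joined <- unite(toy_split, col = spec, group, level, sep = "_")

toy_back
toy_joined
```

The row order of toy_back may differ from toy_wide, but the columns and values are the same.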

Continue to the next section to put all your new skills to work in a quiz.



jvcasillas/ds4ling documentation built on April 8, 2021, 10:15 p.m.