In jvcasillas/ds4ling: Datasets and Functions developed for SPAN658: Data Science for Linguists

Tidying dataframes with tidyr

Overview

This section is a continuation of manipulating dataframes with dplyr. In this section we will work on some more basic oprations that are handy for tidying your data. The tidyr package (part of the tidyverse) simplifies this process. Don't forget the dplyr verbs we already learned, as they will come in handy.

So what is tidy data? Tidy data refers to a dataframe that has been cleaned, munged, modified, etc. so that it is an ideal state for data analysis (plotting, modeling, etc.). More specificially, this means that every column in your dataframe represents a variable and every row represents an observation. This is explained in R for Data Science in the following manner:

Each variable is in a column.
Each observation is a row.
Each value is a cell.

This is also referred to as long format (as opposed to wide format). Let's look at an example dataset that is not tidy. How many variables are there in this dataset?

my_data <- ds4ling::test_scores_rm

my_data_long <- gather(my_data, test, score, -id, -spec) 

my_data_tidy <- my_data_long %>% 
  separate(., col = spec, into = c("group", "level"), sep = "_") %>%
  separate(., col = id, into = c("lang", "id"), sep = 4)

my_data

We cannot do much with the data in this format, but there are some options. For example, we could create a scatter plot of test1 and test2:

my_data %>% 
  ggplot(., aes(x = test1, y = test2)) + 
    geom_point()

We could calculate Pearson's correlation coefficient:

cor(my_data$test1, my_data$test2)

Or we could compare the means using a paired samples t-test:

t.test(my_data$test1, my_data$test2, paired = TRUE)

Below we will walk through the key verbs that will allow us to tidy the data.

Main verbs

`gather()`

takes columns, and gathers them into key-value pairs
this is one of the most important verbs for tidying data
you need to tell R what names you will give to the new categorical factor and the new continuous variable

my_data_long <- gather(my_data, test, score, -id, -spec) 
my_data_long

Now that test is a factor we are able to plot the data differently.

my_data_long %>% 
  ggplot(., aes(x = test, y = score)) + 
    geom_boxplot()

`separate()`

This verb is used to separate chars in a column
essentially you create two variables from one column
This is useful if you carefully planned your experimental outputs when you were collecting data (i.e. with you use the same number of chars for group names, common separators, etc.)

my_data_long %>% 
  separate(., col = spec, into = c("group", "level"), sep = "_") %>%
  separate(., col = id, into = c("lang", "id"), sep = 4)

`spread()`

This verb does the oposite of gather
essentially you spread the data from long to wide
this is useful if you already have tidy data and you need to make a scatter plot

my_data_tidy %>% 
  spread(., lang, score)

my_data_tidy %>% 
  spread(., lang, score) %>% 
  ggplot(., aes(x = cata, y = span, color = group, shape = level)) + 
    geom_point()

`unite()`

You will rarely use this verb, I think
Use it to create a single column from two columns
I don't know if I have ever needed to use it

my_data_tidy %>% 
  unite(., col = participant, lang, id, sep = "_", remove = FALSE) %>% 
  select(., -id)

To summarize, we can use the tidyr package to tidy our data in a way that makes is ideal for plotting and modeling. The functions gather() and separate() are opposites. They can be used to convert our data from wide to long, and vice versa. The functions spread() and unite() are also complementary and can be used to split one column into two, or combine two columns into one.

Continue to the next section to put all your new skills to work in a quiz.

jvcasillas/ds4ling documentation built on March 4, 2025, 11:18 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com