This section is a continuation of manipulating dataframes with dplyr.
In this section we will work on some more basic oprations that are handy for
tidying your data. The tidyr
package (part of the tidyverse
)
simplifies this process. Don't forget the dplyr
verbs we already learned,
as they will come in handy.
So what is tidy data? Tidy data refers to a dataframe that has been cleaned, munged, modified, etc. so that it is an ideal state for data analysis (plotting, modeling, etc.). More specificially, this means that every column in your dataframe represents a variable and every row represents an observation. This is explained in R for Data Science in the following manner:
Each variable is in a column.
Each observation is a row.
Each value is a cell.
This is also referred to as long format (as opposed to wide format). Let's look at an example dataset that is not tidy. How many variables are there in this dataset?
my_data <- ds4ling::test_scores_rm my_data_long <- gather(my_data, test, score, -id, -spec) my_data_tidy <- my_data_long %>% separate(., col = spec, into = c("group", "level"), sep = "_") %>% separate(., col = id, into = c("lang", "id"), sep = 4)
my_data
We cannot do much with the data in this format, but there are some options. For example, we could create a scatter plot of test1 and test2:
my_data %>% ggplot(., aes(x = test1, y = test2)) + geom_point()
We could calculate Pearson's correlation coefficient:
cor(my_data$test1, my_data$test2)
Or we could compare the means using a paired samples t-test:
t.test(my_data$test1, my_data$test2, paired = TRUE)
Below we will walk through the key verbs that will allow us to tidy the data.
gather()
my_data_long <- gather(my_data, test, score, -id, -spec) my_data_long
Now that test
is a factor we are able to plot the data differently.
my_data_long %>% ggplot(., aes(x = test, y = score)) + geom_boxplot()
separate()
my_data_long %>% separate(., col = spec, into = c("group", "level"), sep = "_") %>% separate(., col = id, into = c("lang", "id"), sep = 4)
spread()
my_data_tidy %>% spread(., lang, score)
my_data_tidy %>% spread(., lang, score) %>% ggplot(., aes(x = cata, y = span, color = group, shape = level)) + geom_point()
unite()
my_data_tidy %>% unite(., col = participant, lang, id, sep = "_", remove = FALSE) %>% select(., -id)
To summarize, we can use the tidyr
package to tidy our data in a way that
makes is ideal for plotting and modeling. The functions gather()
and
separate()
are opposites. They can be used to convert our data from wide
to long, and vice versa. The functions spread()
and unite()
are
also complementary and can be used to split one column into two, or combine two
columns into one.
Continue to the next section to put all your new skills to work in a quiz.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.