library(learnr) library(tidyverse) library(tidytext) library(intRo) data("messy_fruit") data("slavery") data("yoda_corpus") data("person_united") library(palmerpenguins) knitr::opts_chunk$set(echo = TRUE)
Tidy data is just a data frame where columns are variables, rows are observations and cells are values.
knitr::include_graphics("images/tidy-1.png")
The following data frame is not tidy.
messy_fruit
Look carefully at the data frame and answer the following question.
question( "Why do you think that data frame is not tidy?", answer("Cells do not contain values."), answer("Rows are variables"), answer("Not all columns are variables", correct = TRUE) )
You can think of the messy_fruit
data frame as wide.
It's "wide" because it has columns for each fruit, when instead the fruit should be the values of a single column.
messy_fruit
So we want to make the data frame longer: we want to move the fruit names to a column. In other words, we want to tidy the data frame so that it has three columns (instead of 4):
status
with the selling status (sold
or bought
).
fruit
with the fruit type (orange
, apple
, banana
).
count
with the count of fruit for that category.
We can use pivot_longer()
to make this wide data frame long.
messy_fruit %>% pivot_longer( orange:banana, names_to = "fruit", values_to = "count" )
The data frame is now tidy, because:
Each column is a variable (status
, fruit
, count
).
Each row is an observation.
Each cell has a value.
The data frame is longer because now it has 6 rows and 3 columns, while before it had 2 rows and 4 columns.
Do you see now why we say "wide" and "longer"?
Inspect the following data frame.
The first column lists the flag
of the slaver ships.
The other columns describe the number of enslaved people who were disembarked in different geographical regions.
slavery
This data frame is not tidy. Rather than having one column for each region, with counts in the cells, we want one column that tells us the region and another column with the counts.
To make this data frame tidy, we need to move the column names of the regions to a column named region
and the counts of people to a column called count
.
Try to do that yourself!
slavery <- slavery %>% pivot_longer( # Which columns should we pivot? ..., # Where should the column names go? ..., # Where should the values go? ... )
In some cases, a data frame might have one column with values from two or more variables.
separate()
lets you split the column into separate columns, while with unite()
you can merge two or more columns into one.
Let's see how it works.
Here's some data
person_united
This data frame has two columns, but the second column scores
contains personality scores of five different traits: openness
, agreeableness
, emotional_stability
, conscientiousness
, and extraversion
.
Let's separate the score
column into the individual scores.
person_sep <- person_united %>% separate(col = scores, into = c(...)) person_sep
c("openness", "agreeableness", "emotional_stability", "conscientiousness", "extraversion")
Great! Now, let's pivot the data frame so that it has, apart from the userid
column, two columns: trait
with the trait names, and score
with the trait score.
person_sep <- person_united %>% separate(col = scores, into = c("openness", "agreeableness", "emotional_stability", "conscientiousness", "extraversion"))
person_pivot <- person_sep %>% ... # Check it worked person_pivot
If pivoting worked out correctly, the following code should work and output a plot with each trait in different panel rows.
personality %>% ggplot(aes(value, fill = trait)) + geom_bar() + facet_grid(trait ~ .)
The opposite of separate()
is unite()
.
unite()
needs the name of the new column as a string and the names of the columns it has to unite.
person_united <- person_sep %>% unite("score", openness:extraversion) person_united
We've been using tabular data so far. However, sometimes we want to work with text data.
You can import text data in R and manipulate it with the tidytext package. Although tidytext is not part of the tidyverse collection, it has been designed to work well with it.
There isn't enough time to go through all of the ins and outs of tidytext, so we will only scratch the surface.
If you want to learn more, there's a whole book dedicated to it! Check it out here: https://www.tidytextmining.com.
We will use a corpus of dialogues from the Star Wars movies that involve the world's favourite green creature, Yoda.
In fact, since R works best with tabular data, this text corpus has been shaped as a table.
Check out what the yoda_corpus
looks like.
yoda_corpus
Let's do a simple sentiment analysis of Yoda dialogues.
The first thing we do is to "unnest" the text into individual "tokens", or, in other words, words.
We can achieve that with unnest_tokens()
from tidytext.
First, let's attach tidytext.
library(tidytext)
Now let's unnest the text
column.
unnest_tokens()
needs the name of the column to create and the column with the text to unnest.
yoda_tok <- yoda_corpus %>% unnest_tokens(word, text) yoda_tok
And now let's do some plotting! Should Yoda swallow a chill pill?
yoda_tok <- yoda_corpus %>% unnest_tokens(word, text)
yoda_tok %>% filter(character == "YODA") %>% # Never mind the following line for now. right_join(y = get_sentiments("bing")) %>% ggplot(aes(sentiment, fill = sentiment)) + geom_bar()
Good job!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.