library(learnr)
library(tidyverse)
library(tidytext)
library(intRo)
data("messy_fruit")
data("slavery")
data("yoda_corpus")
data("person_united")
library(palmerpenguins)
knitr::opts_chunk$set(echo = TRUE)

(Un)tidy data

Tidy data is just a data frame where columns are variables, rows are observations and cells are values.

knitr::include_graphics("images/tidy-1.png")

Messy fruit

The following data frame is not tidy.

messy_fruit

Look carefully at the data frame and answer the following question.

question(
  "Why do you think this data frame is not tidy?",
  answer("Cells do not contain values."),
  answer("Rows are variables."),
  answer("Not all columns are variables.", correct = TRUE)
)

Pivoting

You can think of the messy_fruit data frame as wide. It's "wide" because it has columns for each fruit, when instead the fruit should be the values of a single column.

messy_fruit

From wide to long

So we want to make the data frame longer: we want to move the fruit names to a column. In other words, we want to tidy the data frame so that it has three columns instead of four, including one column for the fruit names and one for the counts.

We can use pivot_longer() to make this wide data frame long.

messy_fruit %>%
  pivot_longer(
    orange:banana,
    names_to = "fruit",
    values_to = "count"
  )

The data frame is now tidy, because each column is a variable, each row is an observation and each cell is a value.

The data frame is longer because now it has 6 rows and 3 columns, while before it had 2 rows and 4 columns.

Do you see now why we say "wide" and "longer"?
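To make the change in shape concrete, here is a minimal sketch on a made-up data frame. The name `toy_fruit`, its `person` column and all the counts are assumptions for illustration only; they are not the contents of `messy_fruit`.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for a wide data frame: one column per fruit.
toy_fruit <- tribble(
  ~person, ~orange, ~apple, ~banana,
  "A",     3,       1,      0,
  "B",     2,       4,      5
)

# Move the fruit names into `fruit` and the cell values into `count`.
toy_long <- toy_fruit %>%
  pivot_longer(
    orange:banana,
    names_to = "fruit",
    values_to = "count"
  )

dim(toy_fruit) # 2 rows, 4 columns
dim(toy_long)  # 6 rows, 3 columns
```

Each of the 2 original rows contributes one row per fruit column, which is where the 6 rows come from.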

Give it a go!

Inspect the following data frame. The first column lists the flag of the slaver ships. The other columns describe the number of enslaved people who were disembarked in different geographical regions.

slavery

This data frame is not tidy. Rather than having one column for each region, with counts in the cells, we want one column that tells us the region and another column with the counts.

To make this data frame tidy, we need to move the column names of the regions to a column named region and the counts of people to a column called count.

Try to do that yourself!

slavery <- slavery %>%
  pivot_longer(
    # Which columns should we pivot?
    ...,
    # Where should the column names go?
    ...,
    # Where should the values go?
    ...
  )
**Hint:** Use `names_to = ...` and `values_to = ...`.

Separate and unite

In some cases, a data frame might have one column with values from two or more variables.

separate() lets you split the column into separate columns, while with unite() you can merge two or more columns into one.

Let's see how it works.
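As a warm-up, here is a minimal sketch of separate() on a made-up data frame. The name `toy_dates` and its columns are assumptions for illustration and are not part of the tutorial data. By default, separate() splits wherever it finds a non-alphanumeric character, so the hyphen is used as the separator here.

```r
library(tidyr)
library(tibble)

# Hypothetical data: `year_month` holds two variables in one column.
toy_dates <- tibble(
  id = 1:2,
  year_month = c("2021-12", "2022-01")
)

# Split the column into `year` and `month` (both stay character).
toy_sep <- toy_dates %>%
  separate(col = year_month, into = c("year", "month"))

toy_sep
```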

Separate

Here's some data.

person_united

This data frame has two columns, but the second column, scores, contains the personality scores for five different traits: openness, agreeableness, emotional_stability, conscientiousness, and extraversion.

Let's separate the scores column into the individual scores.

person_sep <- person_united %>%
  separate(col = scores, into = c(...))

person_sep
**Hint:** Use `into = c("openness", "agreeableness", "emotional_stability", "conscientiousness", "extraversion")`.

Great! Now, let's pivot the data frame so that it has, apart from the userid column, two columns: trait with the trait names, and score with the trait score.

person_sep <- person_united %>%
  separate(col = scores, into = c("openness", "agreeableness", "emotional_stability", "conscientiousness", "extraversion"))
person_pivot <- person_sep %>%
  ...

# Check it worked
person_pivot
**Hint:** You might want to use `pivot_longer()`.

If pivoting worked out correctly, the following code should work and output a plot with each trait in its own panel row.

person_pivot %>%
  ggplot(aes(score, fill = trait)) +
  geom_bar() +
  facet_grid(trait ~ .)

Unite

The opposite of separate() is unite(). unite() needs the name of the new column as a string and the names of the columns it has to unite.

person_united <- person_sep %>%
  unite("scores", openness:extraversion)

person_united
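The same idea on a made-up data frame, as a minimal sketch. The name `toy_parts` and its columns are assumptions for illustration. unite() joins values with an underscore by default, so `sep = "-"` is set explicitly here.

```r
library(tidyr)
library(tibble)

# Hypothetical data: two columns we want to merge into one.
toy_parts <- tibble(
  id = 1:2,
  year = c("2021", "2022"),
  month = c("12", "01")
)

# Merge `year` and `month` into `year_month`; the input columns are dropped.
toy_united <- toy_parts %>%
  unite("year_month", year:month, sep = "-")

toy_united
```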

Text data

We've been using tabular data so far. However, sometimes we want to work with text data.

You can import text data into R and manipulate it with the tidytext package. Although tidytext is not part of the tidyverse collection, it has been designed to work well with it.

Easy working with text is

There isn't enough time to go through all of the ins and outs of tidytext, so we will only scratch the surface.

If you want to learn more, there's a whole book dedicated to it! Check it out here: https://www.tidytextmining.com.

We will use a corpus of dialogues from the Star Wars movies that involve the world's favourite green creature, Yoda.

In fact, since R works best with tabular data, this text corpus has been shaped as a table.

Check out what the yoda_corpus looks like.

yoda_corpus

Revealed your opinion is

Let's do a simple sentiment analysis of Yoda dialogues.

The first thing we do is to "unnest" the text into individual "tokens", or, in other words, words.

We can achieve that with unnest_tokens() from tidytext.

First, let's attach tidytext.

library(tidytext)

Now let's unnest the text column. unnest_tokens() needs the name of the column to create and the column with the text to unnest.

yoda_tok <- yoda_corpus %>%
  unnest_tokens(word, text)

yoda_tok
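To see what unnesting does row by row, here is a minimal sketch on a made-up two-line corpus. The name `toy_corpus` and its contents are assumptions for illustration and are not taken from `yoda_corpus`. With its default settings, unnest_tokens() lowercases the words, strips punctuation, and keeps the other columns.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical corpus: one row per line of dialogue.
toy_corpus <- tibble(
  character = c("YODA", "LUKE"),
  text = c("Do or do not.", "I am trying!")
)

# One row per word; the `text` column is replaced by `word`.
toy_tok <- toy_corpus %>%
  unnest_tokens(word, text)

toy_tok
```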

And now let's do some plotting! Should Yoda swallow a chill pill?

yoda_tok <- yoda_corpus %>%
  unnest_tokens(word, text)
yoda_tok %>%
  filter(character == "YODA") %>%
  # Never mind the following line for now.
  inner_join(get_sentiments("bing"), by = "word") %>%
  ggplot(aes(sentiment, fill = sentiment)) +
  geom_bar()
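The join line attaches a sentiment label to each token by matching on the word column. Here is a minimal sketch with a made-up four-word lexicon. Both `toy_tok` and `toy_lexicon` are assumptions for illustration; the real bing lexicon is much larger, but it has the same two columns, word and sentiment. The sketch also shows how the type of join matters: an inner join keeps only the tokens found in the lexicon, while a right join also keeps lexicon words that never occur in the text.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and a hypothetical sentiment lexicon.
toy_tok <- tibble(word = c("fear", "leads", "to", "anger", "good"))
toy_lexicon <- tibble(
  word = c("fear", "anger", "good", "happy"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Keep only tokens that appear in the lexicon.
matched <- toy_tok %>%
  inner_join(toy_lexicon, by = "word")

# Also keeps "happy", which is in the lexicon but not in the text.
with_unused <- toy_tok %>%
  right_join(toy_lexicon, by = "word")

# Number of matched tokens per sentiment.
count(matched, sentiment)
```

Counting sentiments on the inner-join result reflects only words that were actually spoken, which is usually what a plot of sentiment frequencies is meant to show.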

You're done!

Good job!



intro-rstats/intRo documentation built on Dec. 20, 2021, 7:58 p.m.