$\$

SDS230::download_data("movies.Rdata")

SDS230::download_image("new_pipe.png")
library(latex2exp)

options(scipen=999)
knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: Data wrangling with dplyr

We can use the package dplyr to wrangle data. dplyr is part of the 'tidyverse' which is a collection of packages that operate on data frames.

The main functions in dplyr we will use are:

Let's explore how these functions work!

$\$

Part 1.1: Loading and viewing movie data

To explore the dplyr functions, we will use data that consists of random sample of movies that were scraped from the Rotten Tomatoes and IMDB websites. A code book describing the variables in this data frame and be found on this website.

Below is code that loads the data. You can use the View() function from the console to look at the data. The glimpse() function in dplyr is also useful for getting a sense of what is in a data frame.

How many cases and variables are in this data frame?

# loading the dplyr library 
library(dplyr)


# load the data
load("movies.Rdata")


# can only be run from the console
# View(movies)


# the dplyr glimpse() function gives a quick view of a data frame



# the number of rows and columns in the data frame

$\$

Part 1.2: Filtering for only specific rows

To look at a more homogeneous data set, let's reduce our data to only feature films. To get this subset of the data we can use dplyr's filter() function. We will filter variable title_type to only have the value of "Feature Film".

# what are the title unique types 



# let's get only the feature films



# let's look at the number of cases and variables in the films data frame

$\$

Part 1.3: Selecting only key variables

The data films data frame contains several variables we are less interesting in. We can reduce a data frame to just variables we are interested in using the select() function. Let's select the following variables from the data frame:

  1. title
  2. genre
  3. runtime
  4. mpaa_rating
  5. critics_score
  6. audience_score
# select the relevant variables





# look at the data frame size

$\$

Part 1.4: Adding new variables with mutate()

We can add new a variable to a data frame that is combinations of variables that already exist in the data frame using the mutate() function.

Let's add a variable that is the difference between the audience_score and the critics_score which will tell us which movies the audience enjoyed more than critics did.

# create a new variable that is the difference between the audience_score and the critics_score

$\$

Part 1.5: Sorting data frames with arrange()

We can arrange the rows in a data frame based on the values in a variable using the arrange() function.

Let's arrange the data in order from lowest to highest values based on the difference between the audience and critic scores.

We can also arrange data from highest to lowest using the desc() function.

# arrange the data from lowest to highest values of the audience preference





# arrange the data from highest values to the lowest values of the audience preference

$\$

Part 1.6: Summarizing data using the summarize() function

Let's examine if critics scores are higher than audience scores on average?


$\$

Part 1.7: Grouping data with group_by()

We can group data by the values in a particular (usually categorical) variable using the group_by() method. This method does nothing on it's own, but is useful in conjunction with other dplyr functions, particularly the summarize() and mutate() functions.

Let's summarize how much more audiences liked movies relative to critics separately for each genre of movie.


$\$

Part 1.8: The pipe operator

The pipe operator (|>) allows us to string together a series of commands. It allows us to place the first argument of a function outside the function. Since all dplyr functions take a data frame as their first argument, and return a data frame as their output, this allows us to string together a chain of dplyr functions (like a grammar).

Let's redo all our above analyses using the pipe operator.

# using the pipe operator











# putting everything together

$\$

Part 1.9: Practice

You now have learned a powerful set of tools to transform data in order to extract insights.

Please explore the data further using the dplyr functions to see if you can find additional interesting trends!



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.