library(learnr)
library(dplyr)
library(palmerpenguins)
library(magrittr)
tutorial_options(exercise.timelimit = 10)
knitr::opts_chunk$set(echo = FALSE)

Introduction

Welcome!

Welcome to our first tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course. The labs are designed to reinforce the material covered during the lectures by introducing you to hands-on applications. The practical nature of our class means that our labs will be data-centered. Throughout our class, we will get acquainted with multiple packages of the tidyverse. Some of you may already know it: the tidyverse is a collection of R packages that share an underlying design, syntax, and structure. They will definitely make your life easier!

Today, we will start with a brief introduction to data manipulation through the dplyr package.

In this tutorial, you will learn to:

This tutorial is partly based on R for Data Science, section 5.2, and Quantitative Politics with R, chapter 3.

What we will need today

We'll practice some wrangling in dplyr using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

You do not need to import the data to work through this tutorial - the data are already here waiting behind the scenes.

But if you do ever want to use the penguins data outside of this tutorial, they now exist in the palmerpenguins package in R.

If you are ready to begin, click on!

Data Structure

Tidy data

Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
knitr::include_graphics("images/tidy_data.png")

The penguins data set

The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:

knitr::include_graphics("images/penguins.png")

*Illustration by \@allisonhorst*

Let's explore the data set.

head() is a function that returns the first rows of a data frame (six by default). Write the R code required to explore the first observations of the penguins data set:

Notice that when you press 'Run', the output of the code is returned below it! So by pressing 'Run', you've run your first R code of the class!

head(penguins)
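
As a small sketch of how head() can be adjusted, the chunk below passes its optional n argument and then checks the overall size of the data set with dim(). It assumes the palmerpenguins package is installed; the object name first_three is only illustrative.

```r
library(palmerpenguins)

# head() takes an optional n argument; without it, 6 rows are returned
first_three <- head(penguins, n = 3)
print(first_three)

# dim() reports the number of rows and columns of the full data set
print(dim(penguins))
```

Asking for only a few rows is handy when a data frame has many columns and the default preview becomes hard to read.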

Manipulating with dplyr

What we will learn today

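In this tutorial, you'll learn and practice examples using some functions in dplyr to work with data. Those are select(), filter(), mutate(), group_by() with summarize(), and arrange(), plus the pipe operator %>% for chaining them together.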

Let's get to work.

select()

The first verb (function) we will use is select(). It lets us choose which columns of a data frame to keep. If you recall from our initial exploration of the data set, there were eight variables attached to every observation. Do you recall them? If not, no problem: you can use names() to retrieve the names of the variables in a data frame.

names(penguins)

Say we are only interested in the species, island, and year variables of these data. We can use the following syntax:

select(data, columns)

Activity The following code chunk would select the variables we need. Can you adapt it, so that we keep the body_mass_g and sex variables as well?

dplyr::select(penguins, species, island, year)
# you just need to type the names of the columns
dplyr::select(penguins, species, island, year, body_mass_g, sex)

To drop variables, put - before the variable name. For example, select(penguins, -year) drops the year column and keeps all the others.
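
The two uses of select(), keeping and dropping, can be sketched side by side. This is a minimal example that assumes dplyr and palmerpenguins are installed; the object names kept, dropped_one and dropped_two are hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Keep three columns by listing their names
kept <- select(penguins, species, island, year)

# Drop a single column with a minus sign; every other column stays
dropped_one <- select(penguins, -year)

# Drop several columns at once by wrapping them in c()
dropped_two <- select(penguins, -c(island, year))

names(kept)
names(dropped_one)
names(dropped_two)
```

Dropping tends to be more convenient when you want almost every column, while listing names is clearer when you only need a few.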

filter()

The second verb (function) we will employ is filter(). filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.

The more commonly used logical operators are:
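
As a hedged illustration, the sketch below runs through comparison and logical operators from base R that are often used inside filter(). It assumes dplyr and palmerpenguins are installed; the thresholds and object names are arbitrary choices made for the example.

```r
library(dplyr)
library(palmerpenguins)

# == tests equality, != tests inequality
not_2007 <- filter(penguins, year != 2007)

# >, >=, < and <= compare numbers
heavy <- filter(penguins, body_mass_g >= 5000)

# & needs both tests to pass, | needs at least one of them
heavy_or_chinstrap <- filter(penguins, body_mass_g >= 5000 | species == "Chinstrap")

# %in% checks membership in a set of values
two_islands <- filter(penguins, island %in% c("Dream", "Torgersen"))

# is.na() detects missing values and ! negates a test
known_sex <- filter(penguins, !is.na(sex))
```

Note that filter() drops rows where a test evaluates to NA, which is why rows with a missing body_mass_g do not appear in heavy.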

Say we are interested in retrieving the observations from the year 2007. We would do:

dplyr::filter(penguins, year == 2007)

Activity Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007? Remember that species holds text values, so the species name must be in quotes.

# you just need to use & and add a logical test for the species
dplyr::filter(penguins, year == 2007 & species == "Chinstrap")

The Pipe Operator: %>%

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. This will be one of your best friends in R.

Pipes are a powerful tool for clearly expressing a sequence of multiple operations. Let's think about baking for a second.

Activity We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and the logical operators we discussed?

only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console
penguins
penguins %>%
  dplyr::filter(year == 2009 & species == "Chinstrap") %>%
  dplyr::select(species, sex, year)
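
To make the link between the two styles concrete, here is a minimal sketch comparing a nested call with its piped equivalent. It relies on the fact that x %>% f(y) is evaluated as f(x, y). The object names nested and piped are only illustrative, and the chunk assumes dplyr and palmerpenguins are installed.

```r
library(dplyr)
library(palmerpenguins)

# Nested call: has to be read from the inside out
nested <- select(
  filter(penguins, year == 2009 & species == "Chinstrap"),
  species, sex, year
)

# Piped call: reads top to bottom, one step per line
piped <- penguins %>%
  filter(year == 2009 & species == "Chinstrap") %>%
  select(species, sex, year)

# Both approaches return the same data frame
identical(nested, piped)
```

The piped version also avoids the intermediate objects (only_2009, only_2009_chinstraps, and so on) that the step-by-step version creates.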

mutate()

mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. client who feels more comfortable assessing the weight of the penguins in pounds. We would use mutate() as follows:

```mutate(new_var_name = conditions)```
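
A minimal sketch of the pounds conversion could look like the chunk below. The column name body_mass_lbs and the constant grams_to_pounds are hypothetical names chosen for this example, and the chunk assumes dplyr and palmerpenguins are installed.

```r
library(dplyr)
library(palmerpenguins)

# One gram is roughly 0.00220462 pounds
grams_to_pounds <- 0.00220462

penguins_lbs <- penguins %>%
  # Create a new column from an existing one
  mutate(body_mass_lbs = body_mass_g * grams_to_pounds) %>%
  # Keep only the columns needed to check the result
  select(species, body_mass_g, body_mass_lbs)

head(penguins_lbs)
```

mutate() adds the new column at the end of the data frame and leaves the original body_mass_g column untouched.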

Activity Can you edit the previous code chunk to render a new variable body_mass_kg?

group_by() and summarize()

These two verbs, group_by() and summarize(), tend to go together. When combined, summarize() will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. For example:

Activity Can you get the weight of the lightest penguin of each species? You can use min(). What happens when, in addition to species, you also group by year with group_by(species, year)?

arrange()

The arrange() verb is pretty self-explanatory. arrange() orders the rows of a data frame by the values of selected columns, in ascending order by default. You can wrap a column in the desc() helper to arrange in descending order. The following chunk arranges the data frame based on the length of the penguins' bill. Your hint tab contains the code for the descending order alternative.

```arrange(variable_of_interest)```
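
As a hedged sketch, the chunk below sorts by bill length in both directions and then combines arrange() with the group_by() and summarize() pair from the previous section. The summary column name mean_body_mass_g is a hypothetical choice, and the chunk assumes dplyr and palmerpenguins are installed.

```r
library(dplyr)
library(palmerpenguins)

# Sort rows by bill length, shortest first (ascending is the default)
by_bill <- penguins %>%
  arrange(bill_length_mm)

# Wrap the column in desc() to sort from longest to shortest
by_bill_desc <- penguins %>%
  arrange(desc(bill_length_mm))

# One row per species with its mean body mass, heaviest species first
mean_mass <- penguins %>%
  group_by(species) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(mean_body_mass_g))

mean_mass
```

The na.rm = TRUE argument tells mean() to skip missing values; without it, a group containing an NA would return NA. arrange() places rows with missing values at the end in both sort directions.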

Activity Can you create a data frame arranged by body_mass_g of the penguins observed on the "Dream" island?

Quiz

Here are some questions for you. Note that there are multiple ways to reach the same answer:

seramirezruiz/hertiestats2 documentation built on April 7, 2023, 11:30 a.m.