library(learnr)
library(dplyr)
library(palmerpenguins)
library(magrittr)
tutorial_options(exercise.timelimit = 10)
knitr::opts_chunk$set(echo = FALSE)
Welcome to our first tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course. The labs are designed to reinforce the material covered during the lectures by introducing you to hands-on applications. Because the class is practical in nature, the labs will be data-centered. Throughout the course, we will get acquainted with multiple packages of the tidyverse. Some of you may already know it: the tidyverse is a collection of R packages that share an underlying design, syntax, and structure. They will definitely make your life easier!
Today, we will start with a brief introduction to data manipulation through the dplyr package. 
In this tutorial, you will learn to use dplyr verbs to solve your data manipulation challenges.

This tutorial is partly based on R for Data Science, section 5.2, and Quantitative Politics with R, chapter 3.
We'll practice some wrangling in dplyr using penguin size data recorded by Dr. Kristen Gorman and others on several islands in the Palmer Archipelago, Antarctica. The data were originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
You do not need to import the data to work through this tutorial - the data are already here waiting behind the scenes.
But if you do ever want to use the penguins data outside of this tutorial, they now exist in the palmerpenguins package in R.
If you are ready to begin, click on!
Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:

- each variable forms a column
- each observation forms a row
- each cell holds a single value
knitr::include_graphics("images/tidy_data.png")
penguins data set

The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
knitr::include_graphics("images/penguins.png")
*Illustration by \@allisonhorst*
head() is a function that returns the first few rows of a data frame (six by default). Write the R code required to explore the first observations of the penguins data set:
Notice that when you press 'Run', the output of the code is returned below it! So by pressing 'Run', you've run your first R code of the class!
head(penguins)
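head() also takes an optional n argument for when six rows are too many or too few. A minimal sketch, assuming the palmerpenguins package is installed; the object name first_three is only illustrative:

```r
library(palmerpenguins)

# Ask for the first three rows instead of the default six
first_three <- head(penguins, n = 3)
first_three
```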
dplyr

In this tutorial, you'll learn and practice examples using some functions in dplyr to work with data. Those are:

- select(): keep or exclude some columns
- filter(): keep rows that satisfy your conditions
- mutate(): add columns from existing data or edit existing columns
- group_by(): lets you define groups within your data set
- summarize(): get summary statistics
- arrange(): reorders the rows according to single or multiple variables

Let's get to work.
select()

The first verb (function) we will use is select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set, there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can use names() to retrieve the names of the variables in a data frame.
names(penguins)
Say we are only interested in the species, island, and year variables of these data. We can use the following syntax:
select(data, columns)
Activity The following code chunk would select the variables we need. Can you adapt it, so that we keep the body_mass_g and sex variables as well?
dplyr::select(penguins, species, island, year)
# you just need to type the names of the columns
dplyr::select(penguins, species, island, year, body_mass_g, sex)
To drop a variable, put - before its name. For example, select(penguins, -year) drops the year column.
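Keeping and dropping are two sides of the same call. The sketch below, which assumes dplyr and palmerpenguins are installed and uses illustrative object names, keeps three columns and then drops one of them again:

```r
library(dplyr)
library(palmerpenguins)

# Keep only the listed columns
kept <- dplyr::select(penguins, species, island, year)

# Drop a column by prefixing its name with a minus sign
without_year <- dplyr::select(kept, -year)

names(kept)
names(without_year)
```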
filter()

The second verb (function) we will employ is filter(). filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.
The more commonly used logical operators are:
- ==: Equal to
- !=: Not equal to
- >, >=: Greater than, greater than or equal to
- <, <=: Less than, less than or equal to
- &, |: And, or

Say we are interested in retrieving the observations from the year 2007. We would do:
dplyr::filter(penguins, year == 2007)
Activity Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007? (Remember that species holds text values, so the species name must be quoted.)

# you just need to utilize & and type the logical operator for the species
dplyr::filter(penguins, year == 2007 & species == "Chinstrap")
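Passing several tests separated by commas has the same effect as joining them with &, while | keeps rows that pass at least one test. A small sketch, assuming dplyr and palmerpenguins are installed; the object names are illustrative:

```r
library(dplyr)
library(palmerpenguins)

# Two ways to require both conditions at once
with_and   <- dplyr::filter(penguins, year == 2007 & species == "Chinstrap")
with_comma <- dplyr::filter(penguins, year == 2007, species == "Chinstrap")
identical(with_and, with_comma)

# | keeps rows that satisfy at least one of the conditions
either <- dplyr::filter(penguins, species == "Chinstrap" | species == "Gentoo")
unique(as.character(either$species))
```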
%>%

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. This will be one of your best friends in R.
Pipes are a powerful tool for clearly expressing a sequence of multiple operations. Let's think about baking for a second.
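In code terms, x %>% f(y) is read as f(x, y): the pipe hands whatever is on its left to the function on its right as the first argument. A minimal sketch, assuming dplyr and palmerpenguins are installed; the object names are illustrative:

```r
library(dplyr)
library(palmerpenguins)

# Nested call: read from the inside out
nested <- head(dplyr::select(penguins, species, year), n = 3)

# Piped call: read from top to bottom, one step per line
piped <- penguins %>%
  dplyr::select(species, year) %>%
  head(n = 3)

identical(nested, piped)
```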
 
Activity We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?
only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df # to print it in our console
penguins
penguins %>%
  dplyr::filter(year == 2009 & species == "Chinstrap") %>%
  dplyr::select(species, sex, year)
mutate()

mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a client from the U.S. who feels more comfortable assessing the weight of the penguins in pounds. We would use mutate() to add the weight in pounds as a new column.
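One way this conversion could look is sketched below. It assumes dplyr and palmerpenguins are installed; the column name body_mass_lb and the constant grams_per_pound are illustrative choices, not names taken from the tutorial. One pound is 453.59237 grams.

```r
library(dplyr)
library(palmerpenguins)

# Grams in one avoirdupois pound (illustrative constant name)
grams_per_pound <- 453.59237

# Add a new column computed from the existing body_mass_g column
penguins_lb <- penguins %>%
  dplyr::mutate(body_mass_lb = body_mass_g / grams_per_pound)

# Compare the original and converted columns side by side
penguins_lb %>%
  dplyr::select(species, body_mass_g, body_mass_lb) %>%
  head()
```

mutate() leaves penguins itself unchanged; the new column only persists because the result is assigned to penguins_lb.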