In seramirezruiz/hertiestats2: Stats II - Tutorials

library(learnr)
library(palmerpenguins)
library(magrittr)
library(dplyr)
library(readr)
prejudice_df <- readr::read_csv("data/prejudice.csv")
tutorial_options(exercise.timelimit = 10)
knitr::opts_chunk$set(echo = FALSE)

Introduction

Welcome!

Welcome to our second tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

The labs are designed to reinforce the material covered during the lectures by introducing you to hands-on applications.

The practical nature of our class means that our labs will be data-centered. Throughout our class, we will get acquainted with multiple packages of the tidyverse.

Though we expect that some of you may already know them, the tidyverse is a collection of R packages that share an underlying design, syntax, and structure. They will definitely make your life easier!!

Today, we will start with a brief introduction to data manipulation through the dplyr package.

In this tutorial, you will learn to:

identify the purpose of a set of dplyr verbs
write statements in tidy syntax
apply dplyr verbs to solve your data manipulation challenges

This tutorial is partly based on R for Data Science, section 5.2, and Quantitative Politics with R, chapter 3.

What we will need today

We'll practice some wrangling in dplyr using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

You do not need to import the data to work through this tutorial - the data are already here waiting behind the scenes.

But if you do ever want to use the penguins data outside of this tutorial, they now exist in the palmerpenguins package in R.

If you are ready to begin, click on!

Data Structure

Tidy data

Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

knitr::include_graphics("images/tidy_data.png")

The `penguins` data set

The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:

species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)
island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
bill_length_mm: a number denoting the length of the dorsal ridge of the penguin bill, also called the culmen (millimeters)
bill_depth_mm: a number denoting the depth of the penguin bill (millimeters)
flipper_length_mm: an integer denoting penguin flipper length (millimeters)
body_mass_g: an integer denoting penguin body mass (grams)
sex: a factor denoting penguin sex (female, male)
year: an integer denoting the year of the record

knitr::include_graphics("images/penguins.png")

Illustration by @allisonhorst

Let's explore the data set.

head() is a function that returns the first rows of a data frame (six by default). Write the R code required to explore the first observations of the penguins data set:

Notice that when you press 'Run', the output of the code is returned below it! So by pressing 'Run', you've run your first R code of the class!

head(penguins)

Manipulating with `dplyr`

What we will learn today

In this tutorial, you'll learn and practice examples using some functions in dplyr to work with data. Those are:

select(): keep or exclude some columns
filter(): keep rows that satisfy your conditions
mutate(): add columns from existing data or edit existing columns
group_by(): lets you define groups within your data set
summarize(): get summary statistics
arrange(): reorders the rows according to single or multiple variables

Let's get to work.

`select()`

The first verb (function) we will utilize is select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can utilize names() to retrieve the names of the variables in a data frame.

names(penguins)

Say we are only interested in the species, island, and year variables of these data. We can use the following syntax:

select(data, columns)

Activity The following code chunk would select the variables we need. Can you adapt it, so that we keep the body_mass_g and sex variables as well?

dplyr::select(penguins, species, island, year)

# you just need to type the names of the columns
dplyr::select(penguins, species, island, year, body_mass_g, sex)

To drop a variable, put - before its name. For example, select(penguins, -year) drops the year column and keeps everything else.
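
As a minimal sketch of dropping columns, assuming the dplyr and palmerpenguins packages are installed, the chunk below removes one column and then two columns at once. The object names no_year and no_year_sex are placeholders chosen for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# drop a single column by prefixing its name with a minus sign
no_year <- dplyr::select(penguins, -year)

# drop several columns at once by wrapping them in c()
no_year_sex <- dplyr::select(penguins, -c(year, sex))

# check which columns are left
names(no_year_sex)
```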

`filter()`

The second verb (function) we will employ is filter(). filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.

The most commonly used comparison and logical operators are:

==: Equal to
!=: Not equal to
>, >=: Greater than, greater than or equal to
<, <=: Less than, less than or equal to
&, |: And, or

Say we are interested in retrieving the observations from the year 2007. We would do:

dplyr::filter(penguins, year == 2007)

# you just need to utilize & and type the logical operator for the species
dplyr::filter(penguins, year == 2007 & species == "Chinstrap")

Activity Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007? (Remember that species values are text, so they need quotation marks.)
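
The other operators in the list work the same way inside filter(). The sketch below assumes dplyr and palmerpenguins are installed; the object names and the 5000 gram cut-off are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# | keeps rows where at least one test is TRUE
adelie_or_gentoo <- dplyr::filter(penguins, species == "Adelie" | species == "Gentoo")

# & keeps rows where both tests are TRUE
heavy_not_2007 <- dplyr::filter(penguins, body_mass_g >= 5000 & year != 2007)

nrow(adelie_or_gentoo)
nrow(heavy_not_2007)
```

Note that filter() drops rows where a test evaluates to NA, so penguins with a missing body_mass_g do not appear in the second result.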

The Pipe Operator: `%>%`

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. This will be one of your best friends in R.

Pipes are a powerful tool for clearly expressing a sequence of multiple operations. Let's think about baking for a second.
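
In code terms, the pipe takes the object on its left and passes it as the first argument of the function on its right, so x %>% f(y) is equivalent to f(x, y). A minimal sketch, assuming dplyr and palmerpenguins are installed (the object names nested and piped are illustrative):

```r
library(dplyr)
library(palmerpenguins)

# without the pipe: calls are nested and read from the inside out
nested <- head(dplyr::select(penguins, species, island), 3)

# with the pipe: read left to right, one step per line
piped <- penguins %>%
  dplyr::select(species, island) %>%
  head(3)

# both versions should produce the same data frame
identical(nested, piped)
```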

Activity We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?

only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console

penguins

penguins %>%
  dplyr::filter(year == 2009 & species == "Chinstrap") %>%
  dplyr::select(species, sex, year)

`mutate()`

mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. American client and they feel more comfortable assessing the weight of the penguins in pounds. We would utilize mutate() as such:

```mutate(new_var_name = conditions)```
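
A minimal sketch of this conversion, assuming dplyr and palmerpenguins are installed. The new column name body_mass_lbs and the constant grams_per_pound are choices made for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# 1 pound is about 453.592 grams
grams_per_pound <- 453.592

# add a new column computed from an existing one
penguins_lbs <- penguins %>%
  dplyr::mutate(body_mass_lbs = body_mass_g / grams_per_pound)

# compare the original and the converted column
penguins_lbs %>%
  dplyr::select(species, body_mass_g, body_mass_lbs) %>%
  head()
```
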
**Activity** *Can you adapt this approach to create a new variable `body_mass_kg`?*

## `group_by()` and `summarize()`

These two verbs, `group_by()` and `summarize()`, tend to go together. When combined, `summarize()` will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.

**Activity** *Can you get the weight of the lightest penguin of each species? You can use `min()`. What happens when, in addition to species, you also group by year with `group_by(species, year)`?*

## `arrange()`

The `arrange()` verb is pretty self-explanatory. `arrange()` orders the rows of a data frame by the values of selected columns in ascending order. You can wrap a column in `desc()` to arrange in descending order instead. For example, we can arrange the data frame by the length of the penguins' bill, in either direction.

```arrange(variable_of_interest)```
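
A minimal sketch of both orderings, assuming dplyr and palmerpenguins are installed. bill_length_mm is the bill length column in the palmerpenguins version of the data, and the object names are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# ascending order (the default): shortest bills first
shortest_first <- penguins %>%
  dplyr::arrange(bill_length_mm)

# descending order: wrap the column in desc()
longest_first <- penguins %>%
  dplyr::arrange(dplyr::desc(bill_length_mm))

head(shortest_first$bill_length_mm)
head(longest_first$bill_length_mm)
```

In both cases arrange() places rows with a missing bill_length_mm at the end.
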
**Activity** *Can you create a data frame, arranged by `body_mass_g`, of the penguins observed on "Dream" island?*

## The Potential Outcomes Framework

Let's revisit the example from the lecture once again. Say we are interested in assessing the premise of Allport's hypothesis about interpersonal contact being conducive to reducing intergroup prejudice. We are studying a set of ($n=8$) students assigned to a dorm room either with a person from their own ethnic group **(contact=0)** or with a person from a different group **(contact=1)**.

| Student (i) | Prejudice (C=0) | Prejudice (C=1) |
|:-----------:|:---------------:|:---------------:|
| 1 | 6 | 5 |
| 2 | 4 | 2 |
| 3 | 4 | 4 |
| 4 | 6 | 7 |
| 5 | 3 | 1 |
| 6 | 2 | 2 |
| 7 | 8 | 7 |
| 8 | 4 | 5 |

**The data are already pre-loaded in the `prejudice_df` object**

### Data set

Today we will work with the `prejudice_df`. The data frame contains the following four variables:

- `student_id`: numeric student identification
- `prej_0`: prejudice level under $Y_{0i}$ (Contact=0)
- `prej_1`: prejudice level under $Y_{1i}$ (Contact=1)
- `dorm_type`: binary for actual treatment state

## Treatment Effects

### Individual Treatment Effect (ITE)

We assume from the *potential outcomes framework* that each subject has a **potential outcome** under both treatment states. Let's take the first student in the list as an example.

Looking at the table above, in a reality where *Student 1* is assigned to an in-group dorm **(contact=0)**, their level of prejudice is *6*. In contrast, in a reality where *Student 1* is assigned to an out-group dorm **(contact=1)**, their level of prejudice is *5*. From this, we can gather the **individual treatment effect (ITE)** for Student 1.
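
The ITE for Student 1 compares the two potential outcomes in the first row of the table above. As a minimal sketch, the chunk below rebuilds the two potential-outcome columns from that table and subtracts them with `mutate()`. The object name `po_df` and the column name `ite` are chosen here for illustration; in the tutorial, the pre-loaded `prejudice_df` would take the place of `po_df`.

```r
library(dplyr)

# potential outcomes copied from the table above
po_df <- tibble::tibble(
  student_id = 1:8,
  prej_0 = c(6, 4, 4, 6, 3, 2, 8, 4),
  prej_1 = c(5, 2, 4, 7, 1, 2, 7, 5)
)

# outcome under contact minus outcome without contact, per student
po_df <- po_df %>%
  dplyr::mutate(ite = prej_1 - prej_0)

# the first row is Student 1
po_df %>%
  dplyr::filter(student_id == 1)
```

For Student 1 the `ite` column holds 5 - 6 = -1.
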
The **ITE** is equal to the value under treatment *(contact=1)* minus the value without treatment *(contact=0)*, or $ITE = y_{1i} - y_{0i}$. For Student 1:

$$ITE = 5 - 6 = -1$$

As it was put in Cunningham's book:

> The ITE is a “comparison of two states of the world” (Cunningham, 2021): individuals are exposed to contact, and not exposed to it.

Evidently, each subject can only be observed in one treatment state at any point in time in real life. This is known as the **fundamental problem** (Holland, 1986) of causal inference. **The Individual Treatment Effect (ITE) in reality is unattainable.** Still, it provides us with a conceptual foundation for causal estimation.

**Exercise:** *Our data are coming from a world with perfect information. In that sense, we have both potential outcomes `prej_0` and `prej_1`. Can you think of a way to calculate the* **ITE** *for the eight students with one of the `dplyr` verbs we learned earlier today?*

---

### Average Treatment Effect (ATE)

Normally, we are not interested in the estimates for individual subjects, but rather for a population. The **Average Treatment Effect (ATE)** is the difference in the average potential outcomes of the population.

$$ATE = E(Y_{1i}) - E(Y_{0i})$$

In other words, the **ATE** is the average **ITE** of all the subjects in the population. As you can see, **the ATE as defined in the formula is also not attainable**. Can you think of why?

**Exercise:** *Since our data are coming from a world with perfect information, can you think of a way to calculate the* **ATE** *for the eight students based on what we learned earlier today?*

---

### The Average Treatment Effect Among the Treated and Control (ATT) and (ATC)

The names for these two estimates are very self-explanatory. These two estimates are simply the average treatment effects conditional on the group subjects are assigned to.
The average treatment effect on the treated (**ATT**) is defined as the difference in the average potential outcomes for those subjects who were treated:

$$ATT = E(Y_{1i}-Y_{0i} | D_{i} = 1)$$

The average treatment effect on the untreated (**ATC**) is defined as the difference in the average potential outcomes for those subjects who were not treated:

$$ATC = E(Y_{1i}-Y_{0i} | D_{i} = 0)$$

**Exercise:** *Since our data are coming from a world with perfect information, can you think of a way to calculate the* **ATT** *and* **ATC** *for the eight students based on what we learned earlier today?*

*What do you think these treatment group differences tell us?*

---

### The Naive Average Treatment Effect (NATE)

So far, we have worked with perfect information. Still, we know that in reality we can only observe subjects in one treatment state. This is the information we **do** have: for each student, only the outcome that corresponds to their actual dorm assignment, stored in a new variable `observed_prej`.

The **Naive Average Treatment Effect (NATE)** is the calculation we can compute based on the observed outcomes.

$$NATE = E(Y_{1i}|D_{i}=1) - E(Y_{0i}|D_{i}=0)$$

\**reads in English as: "The expected average outcome under treatment for those treated minus the expected average outcome under control for those not treated"*

**Exercise:** *Can you think of a way to calculate the* **NATE** *for the eight students employing the new `observed_prej` variable?*

---

*Note.* The `ifelse()` function is a very handy tool to have. It allows us to generate conditional statements. The syntax is `ifelse(test, yes, no)`: where `test` is true the result takes the value of `yes`, otherwise the value of `no`.

*In the case of `observed_prej`, we ask* **R** *to create a new variable: if the subject is in an out-group dorm (contact=1), we take the prejudice value under treatment.
If that condition is not met, we take the prejudice value under control.*

## Bias

During the lecture, we met two sources of bias: baseline bias and differential treatment effect bias.

### Baseline bias

Baseline bias, also known as selection bias, is the difference in expected outcomes in the absence of treatment between the actual treatment and control groups. In other words, these are the underlying differences that individuals in either group start off with.

### Differential treatment effect bias

Differential treatment effect bias is the difference in returns to treatment (the treatment effect) between the treatment and control groups, multiplied by the share of the population in control. In other words, this type of bias relates to the dissimilarities stemming from the ways in which individuals in either group are affected differently by the treatment.

**Exercise:** *Since our data are coming from a world with perfect information, can you think of a way to explore the existence of* **baseline bias** *in our data?*

**Exercise:** *Since our data are coming from a world with perfect information, can you think of a way to explore the existence of* **differential treatment effect bias** *in our data?*

seramirezruiz/hertiestats2 documentation built on April 7, 2023, 11:30 a.m.
