Water Pollution and Cancer in China

Author: Brigitte Peter

< ignore

#library(restorepoint)
# facilitates error detection
#set.restore.point.options(display.restore.point=TRUE)

library(RTutor)
library(yaml)
#library(restorepoint)
setwd("~/Studium/Masterarbeit/RTutor")
ps.name = "WaterPollutionChina"; sol.file = paste0(ps.name,"_sol.Rmd")
libs = c("foreign", "ggmap", "dplyr", "ggplot2", "weights", "lmtest", "lfe", "stargazer", "regtools") # character vector of all packages you load in the problem set
name.rmd.chunks(sol.file) # set auto chunk names in this file
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL,libs=libs, stop.when.finished=FALSE, addons="quiz", var.txt.file="variables.txt")
show.shiny.ps(ps.name, load.sav=FALSE,  sample.solution=FALSE, is.solved=FALSE, catch.errors=TRUE, launch.browser=TRUE)
stop.without.error()

>

Welcome! You are about to begin this interactive problem set, which is part of my master thesis at Ulm University. Its topic is the analysis of the relationship between water quality in China's waterways and the occurrence of digestive cancers. The investigation is based on the paper "The Consequences of Industrialization: Evidence from Water Pollution and Digestive Cancers in China" by Avraham Ebenstein, which was published in the Review of Economics and Statistics in 2012. In the course of the problem set we will occasionally refer to it as the paper.

The paper, a supplemental appendix and data are available online at the following websites:

The current environmental situation in China is not promising. During the economic boom of the 1980s and 1990s, China's rivers and lakes suffered a vast deterioration of water quality. Today, 70 percent of China's river water is unfit for human use (cf. World Bank, 2006). Nevertheless, particularly in rural areas, a large part of the population still relies on well water as a source of drinking water. Concurrently, an increase in rural cancer rates has been observed. In this analysis, the focus lies on digestive cancers, since two-thirds of all cancer cases in China are digestive. Today, digestive tract cancers cause 11 percent of all fatalities in China, corresponding to roughly one million deaths annually (cf. World Health Organization, 2002).

China itself is particularly well suited for investigating the causality between contaminated water and digestive cancers for a variety of reasons. First, reliable data has been available since the late 1980s, even in rural areas, where untreated lake and river water is the primary source of drinking water. Secondly, mobility limitations due to government regulations prevent people from moving. Hence, the location of residents is unlikely to change, which means that they are exposed to the water of the same river basin throughout their lives. Thirdly, it is possible to analyze the impact of polluted water on cancer rates more precisely, since China is characterized by high digestive cancer rates, high contamination rates and regional variation in water quality (cf. World Bank, 2006).

In the literature, connections between water pollution and liver cancer (cf. Lin et al., 2000) as well as gastric cancer (cf. Morales-Suarez-Varela et al., 1995) have already been found. However, these investigations failed to establish a causal link. The aim of this problem set is to find a causal relation between water quality and digestive cancers in China. For this purpose, we will make use of two-sample t-tests, regression analysis, fixed effects and other methods. Finally, we will conduct a benefit-cost analysis in order to quantify the cost of pollution and the health consequences for China's population. At the same time, the problem set aims to deepen your knowledge of econometric theory and to improve your programming skills in R. The problem set is structured as follows.

Exercise Content

1. Overview of Data

1.1 Water Quality Criteria

1.2 Digestive Tract Cancer Death Rates

2. Comparison of the Variables' Mean Values

3. Regression Analysis

3.1 Correction of Standard Errors

3.2 Solving Endogeneity Problems

3.3 Additional Control Variables

4. Falsification Tests

5. Robustness Check

6. Benefit-Cost Analysis

7. Conclusion

8. References

All exercises can be solved independently of each other. However, it is recommended to work through the tasks in the given order. You will start with descriptive statistics, continue with econometric analysis (t-tests, regressions, falsification tests, robustness check) and conclude with a benefit-cost analysis.

This problem set is composed of theoretical principles, code chunks, info boxes and quizzes. In order to solve the tasks, you will enter R code into code chunks. After editing a chunk, press check in order to get feedback on whether your solution is correct. Every time you start a new exercise, you have to press edit first. If you have difficulties solving a task, you can press the hint button for further advice. For a sample solution, just press solution. The info boxes provide you with background information and additional comments or explain economic theory and R commands in more detail. Quizzes are included in order to test your newly acquired knowledge. For a selection of code chunks and quizzes that you solve successfully, you will receive awards.

Have fun solving the tasks and collecting awards!

Exercise 1 -- Overview of Data

We are interested in the relationship between water pollution and digestive cancers. The essential data is provided by different sources. We obtain information about the water quality in China's rivers and lakes from its national water monitoring system, which carries out annual measurements of chemicals for a set of 484 sites across China. The data on mortality rates and other population characteristics is based on the records of China's Disease Surveillance Point system (DSP). The readings are taken at 145 sites across the country, accounting for different levels of wealth and urbanization in order to form a nationally representative sample of China's population.

In both cases, the available data is a cross-section of group averages with different group sizes. The analysis relies on water quality readings from 2004 and mortality rates for a period of ten years (from 1991 to 2000). In order to match the two data sources, the water quality readings are aggregated and assigned to the 145 DSP sites. In the course of this problem set, we will additionally use data from other sources, which will be presented later.

Exercise 1 is divided into two parts. The first part will focus on the water quality readings whereas part two introduces the data for digestive cancer death rates.

Exercise 1.1 -- Water Quality Criteria

In this section, we will analyze the water quality and concentration of contaminants in China's rivers and lakes that are assigned to each DSP site.

Before starting with the investigation, we have to load the data. The data file we use in this exercise is called dsp_dat.dta; besides population characteristics and environmental indicators, it contains grades for water quality and the concentration of pollutants in China's main river systems. We will use the function read.dta() in order to load the data into R.

< info "Load Data with read.dta()"

The function read.dta() from the package foreign reads a Stata file of version 5-12 into an R data frame. Before loading the data file data.dta, you have to load the package with the library() command.

# Load the package
library(foreign)

# Read in the data
read.dta("data.dta")

In order to simplify later usage, the data frame can be stored in a variable. Here, we call this variable dat.

# Store the data in a variable called dat
dat = read.dta("data.dta")

For more information about the function read.dta() and other functions from the package foreign have a look at https://cran.r-project.org/web/packages/foreign/foreign.pdf.

>

Task: Load the data dsp_dat.dta and store it in the variable dat. To do so, use the command read.dta() from the foreign package but first load the package itself. After pressing check you will get feedback whether your solution is correct. If you do not know how to start, look at the info box above or press hint.

#< task_notest
# Load the package foreign
#>
library(foreign)

#< hint
display("Use the command library() and type the name of the package inside the brackets.")
#>

#< task_notest
# Read in the data set
#>
dat = read.dta("dsp_dat.dta")

#< hint
display("Type in code of the form dat=read.dta() and specify the name of the .dta file in the brackets. Remember to put it into quotation marks.")
#>

< award "Starter"

Good! You solved the first task of this problem set. Loading data into R is an important step without which an analysis would not be possible. Keep going and have fun collecting more awards!

>

After loading the data, we can now have a closer look at the data stored in dat.

Overview of Data

There are numerous ways in R to access data. We will use the commands head() and sample_n() in order to get familiar with the variables included in our data frame. These commands show the first rows and a random selection of rows of our data frame, respectively.

Task: We want to see the first rows of dat. The command is already given. Just press check here.

#< task
# Show the first rows of dat
head(dat)
#>

You can see six rows of our data frame where each row represents one of the 145 DSP sites. For every DSP site, the data frame stores 19 variables. Their names are displayed at the very top. The variable dsp_id provides a unique identifier for each DSP site. river is the name of the river basin the DSP site is located in. The variables long and lat stand for longitude and latitude and indicate the geographical position of each DSP site. pol is a dummy variable that is equal to $1$ if the DSP site is located along polluted rivers and $0$ if it is close to cleaner rivers. For each DSP site, readings for several chemicals and the overall water quality are recorded.

The data on water quality is provided by China's national monitoring system. The readings are reported for 484 geographic sites across China's river systems. These readings are assessed by the World Bank. A water grade is assigned to each of the monitoring sites using a six-point scale ranging from grades I and II (drinkable water) to grade VI (essentially useless water). In between, water that is classified as grade III is undrinkable but still not harmful to humans, whereas water rated with grade IV or V can pose a risk and is only appropriate for industrial and agricultural water supply, respectively. Each reading is taken for a set of chemicals: ammonia nitrogen (ammonia), biological oxygen demand (bod), volatile phenol (vol_phenol), lead, oils and permanganate. The corresponding variables are named after the chemicals.

In a second step, the water grades are aggregated across the river basin. For all 145 DSP sites, we obtain (average) water grades ranging from $1$ to $6$. To get a concrete impression, just think of school grades: the smaller the grade, the higher the water quality. The necessity of aggregation results from the fact that population characteristics (like the digestive cancer death rates stored in digestive, et cetera) are available for the 145 DSP sites only. The variable water_grade is simply the maximum of all recorded grades for each DSP site and represents the overall water quality at this site. If you want more information about the chemicals, have a look at the following info box.
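As a quick plausibility check (a sketch, not part of the original analysis; it assumes the data has been loaded into dat and that the chemical grade columns are numeric), we could compare water_grade with the row-wise maximum of the chemical grades using pmax():

```r
# Sketch: water_grade should equal the worst (maximum) chemical grade.
# Assumes dat has been loaded with read.dta() as shown above.
library(dplyr)

check = mutate(dat, max_grade = pmax(ammonia, bod, lead, oils,
                                     permanganate, vol_phenol))
# Compare the stored overall grade with the computed maximum
head(select(check, water_grade, max_grade))
```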

< info "Description of Chemicals"

The above-mentioned chemicals enter China's waterways through wastewater. Most contaminants are discharged by industrial firms, but the overuse of fertilizer may also affect the concentration of pollutants in river water. The chemicals and their dangers to humans are briefly described in the following.

>

Since the remaining seven variables are not relevant for this exercise, I will not explain them here. This will be done in Exercise 1.2. Another neat way to take a look at the data is the function sample_n() from the package dplyr. It allows you to draw a random sample of rows from your data set. The function call is sample_n(dat, size), where dat is the data frame and size specifies the number of rows that should be displayed. Check https://www.rdocumentation.org/packages/Momocs/versions/1.1.0/topics/sample_n for more information. The next task is optional.

Task: Use the function sample_n() in order to show a random sample of six rows of our data set dat. Press check afterwards.

#< task_notest
# Display six randomly chosen rows of dat
#>
sample_n(dat, 6)

#< hint
display("Call the function sample_n() with two arguments that are separated by a comma: the name of the data frame and the number of rows you want to display.")
#>

Now, you can see a random selection of rows of our data frame dat.

Analysis of the DSP Sites

As you can see in the data frame, the DSP sites are assigned to different river basins. Now, we want to find out the names of all river systems. We could just show the whole column river but the output vector would have a length of 145 and contain each name multiple times since every river system holds several DSP sites. Instead, we use the command unique() from base R, which removes entries that occur more than once. We achieve a neat and short R command by means of the $ syntax.

Task: Find out into which river systems China is divided. The code is already given. Just press check.

#< task
# Display all river names
unique(dat$river)
#>

The last line of the output indicates that river is a factor variable (or categorical variable) with ten different levels. Note that the vector contains the names of the nine main river systems of China plus one additional one, namely Systems-West. The latter includes rivers that flow into China from countries to its west.
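Since read.dta() converts Stata value labels to factors by default, the levels can also be inspected directly (a small optional sketch):

```r
# Sketch: inspect the factor variable river directly
levels(dat$river)    # names of the ten river systems
nlevels(dat$river)   # number of distinct levels, here 10
```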

One of the above-mentioned river systems is named after a river that is not only the longest in China but also the third longest in the world. If you know it, answer the following question. If not, you can make a guess.

< quiz "largest river system"

question: Which of the river systems is the largest in China? Please type in the name as it is displayed in the output of the previous task.
answer: Yangtze

>

Instead of giving the solution immediately, we will draw a map of China and visualize the DSP sites in different colors according to their affiliation with a river system. We achieve this using functions from the ggmap package.

< info "Spatial visualization by the ggmap package"

Basically, ggmap provides functions for the visualization of spatial data from online sources, such as Google Maps and Stamen Maps. It uses the layered grammar of graphics implementation of the package ggplot2, which we will get to know later. Kahle and Wickham (2013, pp. 144-151) provide a good overview of ggmap.

Drawing a map is basically achieved by the following two steps:

  1. Download a map as an image and transform it into a plottable format. This is done by means of the get_map() function. The main argument of this function is location. It can be either a longitude/latitude pair or a character string containing an address, a zip code or a proper name. Additionally, the argument zoom, which takes integer values from 3 to 20, specifies the map's level of detail.

  2. Create the plot using the function ggmap(). Since we want to add geographic sites to the map, we will additionally make use of some functions of the ggplot2 package that are already integrated into the ggmap package. The layered structure of ggplot2 makes the code easier to understand. Here, only the functions we will use for drawing the maps are discussed briefly. The function geom_point() creates a scatterplot using the input data specified by data and the variables on the x- and y-axes declared in aes(). The argument color influences the color of the data points, alpha makes the plotted points more transparent and size specifies the diameter of the points in millimeters. The function theme() is intended for modifying components of a theme, that is, it allows you to change visual elements that are not related to the data. Using the option element_blank() allows you to remove parts of the graphic. You will get a more detailed explanation of ggplot2 and its layered grammar at the end of this exercise and at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.

There are plenty of functions which I do not explain here because they are not needed in our problem set. However, you can take a look at https://cran.r-project.org/web/packages/ggmap/ggmap.pdf for more details.

>

Task: Draw a map of China with the 145 DSP sites colored according to the river system they are located in. The code is already given, so just press check here.

#< task_notest
# Load the package
library(ggmap)

# Download a map of China
map = get_map(location = 'China', zoom = 4)

# Generate a plot of the map of China together with the DSP sites which
# are colored according to the river basin they belong to
ggmap(map) + 
  geom_point(data=dat, aes(x=long, y=lat, color=river), alpha=0.5, size=4) +
  ggtitle("DSP Sites Colored in accordance to their River Systems") +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
#>

We see that the DSP sites are geographically widespread all over China. Each of those geographic points is assigned to one river basin. By means of the argument color, the DSP sites are colored according to the river system they are located in. That means DSP sites that belong to the same river system are identified by the same color. Hence, the map shows ten groups of differently colored sites. As you surely have expected, DSP sites with the same color are located in the same region since they belong to the same river basin.

Have a look at the map again and answer the following question.

< quiz "most geographic sites"

question: Which river basin has the most DSP sites? You do not have to count each of the points but rather estimate their frequency. Please type in the name as it is displayed in the legend of the map.
answer: Yangtze

>

We would expect that the longest river (with the largest river basin) has the most DSP sites of all river systems. We will check this by transforming our data frame dat. Our aim is to count the number of DSP sites for each river basin separately. We will first group the data set by river system and then compute the number of DSP sites per group. The dplyr package provides powerful functions for this, like group_by() and summarise(), of which we will make use now.

< info "Data Transformation with group_by() and summarise()"

The function group_by() from the package dplyr groups the rows of a data frame by one or more variables, so that subsequent operations are applied per group. It is used in combination with other functions from the same package. In our case, we need the function summarise(), which collapses multiple values into a single one. Used together with group_by(), it produces one row for every group. We will illustrate the use of the two functions with the following example.

Say, we want to receive the minimum overall water grade for each river basin, respectively. This can be achieved by the following code.

# Load the package
library(dplyr)

# Group the data frame by river
dat_group = group_by(dat, river)

# Compute the minimum of water_grade for each river system
summarise(dat_group, min = min(water_grade))

At first, we have to load the package dplyr. The group_by() command converts the data set dat into a grouped data set, on which the subsequent operations are applied per group. Here, each river system forms a group. The summarise() function applies the function min() to the variable water_grade of the grouped data set dat_group. A new variable is created, to which we can also assign a name; in our case, the column is named min. We obtain a tibble (a format for data frames) consisting of ten rows, each representing a river system with its minimum water grade.

At https://cran.r-project.org/web/packages/dplyr/dplyr.pdf you can get more detailed information about the above mentioned functions.

>

Recall that we are interested in the number of DSP sites per river system. We will proceed according to the description in the above info box. However, instead of typing min() in the summarise() command, we will make use of the function n(), which counts the number of elements in each group.

Task: After loading the package dplyr, group the data frame dat by the variable river using the function group_by(). Call the resulting data frame river_sites. Then, by means of the functions summarise() and n() calculate the number of DSP sites for each group. The newly created column vector should be named sites. The main code is already given. Remove the comments and replace the ??? by the correct code. For advice, have a look at the above info box. Finally, press check.

#< task_notest
# Load the package
# ???
#>
library(dplyr)

#< hint
display("Just write library(dplyr) in order to load the package dplyr.")
#>

#< task_notest
# Group the data frame by river
# river_sites = group_by(dat, ???)
#>
river_sites = group_by(dat, river)

#< hint
display("For the second command, replace the ??? by the grouping variable. In our case, it is called river.")
#>

#< task_notest
# Count the number of DSP sites in every river basin
# summarise(???, sites = n())
#>
summarise(river_sites, sites = n())

#< hint
display("The ??? have to be replaced by a data set. In order to apply the function summarise() to every group separately, you have to enter the name of the grouped data set river_sites. ")
#>

< award "dplyr Expert Level 1"

Great! You completed the first step to become an expert in the dplyr package usage! You successfully learned how to handle the functions group_by() and summarise().

>

Have a look at the output of the summarise() function. We get a tibble with ten rows and two columns showing the number of DSP sites for each river system. The number of DSP sites is not equally distributed. At one glance, we spot the maximum of $40$ sites, which belongs to the Yangtze river. Indeed, with its length of 6,300 kilometers, the Yangtze is the longest river in China and the third longest in the world (cf. http://www.worldatlas.com/articles/the-longest-rivers-in-china.html).
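As an aside (not required for the task), dplyr also offers the shortcut count(), which combines group_by() and summarise(n()) in a single call; note that the counting column is named n by default rather than sites:

```r
# Sketch: count() yields the same tibble of DSP sites per river basin
library(dplyr)
count(dat, river)
```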

Comparison of Water Grades

In general, each of China's river systems is polluted to a different extent, since their deterioration over the last decades was not a regionally uniform development. Instead, one can detect a north-south divide. Our next task is the examination of those differences. Before starting, answer the following quiz. If you do not have any clue, just make a guess.

< quiz "north-south divide"

question: What do you think? Which part of China is probably more polluted?
sc:
- North*
- South
success: Great, your answer is correct!
failure: Try again.

>

We assume that the northern parts of China are more polluted. This can be easily explained: the rainy season in the southern parts of China lasts three to five months longer than in the north. More rainfall results in more fresh water in the rivers, which dilutes the concentration of contaminants in a river system. We will check whether our expectation coincides with the data. In a first step, we visualize the DSP sites again but color them according to their water grade, again using the function ggmap().

Compared to the previous ggmap() call, some small changes are required. Of course, the argument of color is now water_grade, since we are interested in visualizing the DSP sites' water grades. Additionally, I added one layer by calling the function scale_color_gradientn(), which allows us to modify continuous color scales. Due to the argument colors=rainbow(4), the DSP sites and their water grades are illustrated using a continuous scale of four rainbow colors. This serves the purpose of highlighting the differences in water grades.

Task: Press check in order to get a map of China where the DSP sites are colored according to their water grade.

#< task_notest
# Generate a map of China and add DSP sites that
# are colored according to their water grade
ggmap(map) + 
  geom_point(data=dat, aes(x=long, y=lat, color=water_grade), alpha=0.5, size=4) +
  scale_color_gradientn(colors=rainbow(4)) +
  ggtitle("Visualization of Water Grades at the DSP Sites") +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
#>

Now, have a look at the map and answer the following question.

< quiz "water quality"

question: Which of the following statements is correct?
sc:
- The overall water quality seems to be lower in northern river systems.*
- The southern rivers are more polluted.
success: Great, your answer is correct!
failure: Try again.

>

If you examine the DSP sites carefully, you can observe the following. DSP sites with high water grades (that is, low water quality), which are colored in purple, are mainly found in the northern part of China. The only red points, indicating high water quality, are located in China's southern region. In general, you find more green and orange colored points in the south, standing for comparably high water quality. In the north, we mainly observe the colors blue and purple, which represent lower water quality, with one exception: in the northeast of China, with values of two to four, the water grades are comparably low. Overall, we find that the population living in the northern part of China is, on average, exposed to more heavily polluted river water than people living in the south.

In a next step, we will compute the average water quality and the average concentration of chemicals for every river system. In order to do so, we have to complete several steps. For each river system there are several DSP sites (stored in dat), each with a number of columns. For this task, we are only interested in the variable river and the variables storing grades for the pollutants and the overall water quality. In a first step, the column variables that are not needed are excluded from the data set. Then, as we already did before, we group the remaining data by river. Lastly, the grades of the quality criteria are averaged for each river system separately. After performing these steps, we obtain a data frame of ten rows, where each row represents one river system. Again, we will make use of functions from the package dplyr. To remove columns from a data frame, we need the function select().

< info "Select Column Variables with select()"

The function select() from the package dplyr allows us to keep only a subset of columns of a data frame. The following code shows an exemplary application.

# Generate a new data frame of the variables river and water_grade
dat_subset = select(dat, river, water_grade)

The data frame dat_subset now only consists of the columns river and water_grade, whereas all other variables of dat were omitted. If you want to keep almost all variables and eliminate only a few, it makes sense to type the following.

# Omit the variable water_grade
dat_subset2 = select(dat, -water_grade)

Adding a minus sign in front of the column name water_grade results in its omission, whereas all other variables of dat are stored in dat_subset2. You can also drop more than one variable by listing them (separated by commas).
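For instance, dropping two variables at once might look as follows (a sketch; the choice of a second dropped variable here is arbitrary):

```r
# Sketch: drop several columns by listing them with minus signs
dat_subset3 = select(dat, -water_grade, -pol)
```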

You get more information about select() at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

>

The data frame dat consists of 19 columns. We want to drop all variables except river, water_grade, the six chemicals and pol. Hence, we keep fewer variables than we omit.

Task: Generate a new data frame dsp_sub from dat, keeping the variable river, all grades for the pollutants and pol. Use the function select() from dplyr. The first command is already given but not yet correct. Detect the errors and correct the code. Thereafter, show six random rows of the generated data set. Press check afterwards.

#< task_notest
# Generate a subset of the data set dat
# Correct these lines of code
# dsp_sub = keep(dat, -river, -water_grade, -ammonia, -bod, -lead, -oils, 
#                -permanganate, -vol_phenol, -pol)
#>
dsp_sub = select(dat, river, water_grade, ammonia, bod, lead, oils,
                 permanganate, vol_phenol, pol)

#< hint
display("There are two types of errors in the given command:
        1. The name of the function is not correct. Use select() instead.
        2. The minus signs in front of the variables imply that they will be removed although we want to keep them. Just remove all minus signs.")
#>

#< task_notest
# Display a randomly chosen part of the generated data frame
#>
sample_n(dsp_sub, 6)

#< hint
display("Enter sample_n(dsp_sub, 6) in order to show six randomly chosen rows of dsp_sub.")
#>

< award "dplyr Expert Level 2"

Well done! You already master the dplyr functions to such an extent that you can detect errors in the code! In addition, you can deal with the function select().

>

The data set dsp_sub now only contains the variables that we require for our investigation. After grouping the data frame by the variable river, we will calculate the mean values of each of the column variables. Since we are now interested in more than one variable, we can no longer use summarise() as before. Instead, we apply the mean function to several columns at once using summarise_all().

< info "Apply Functions with summarise_all()"

The function summarise_all() from the package dplyr allows us to apply one (or more) functions to all non-grouping columns. It is often used in combination with group_by() from the same package. As an argument, the function receives funs(), a list of the functions that should be applied. We will only make use of the function mean() from base R (and weighted.mean() in a later exercise).

Say, we want to get a data frame where every row represents the grades for one region. Imagine that the region of every DSP site is stored in a variable called region. One possibility to receive this result is shown in the next code chunk.

# Group the data frame by region
dat_group = group_by(dat, region)

# Compute the mean values of each column variable for each region
summarise_all(dat_group, funs(mean))

At first, the data set is grouped by region. The summarise_all() function applies the function mean() to the grouped data set dat_group. We obtain a data frame where each row represents the average water grade and concentration of pollutants of one region.

At https://cran.r-project.org/web/packages/dplyr/dplyr.pdf you can get more detailed information about summarise_all() out of the dplyr package.

>

Using the functions of the dplyr package, we can now easily compute the average grades of the water pollutants for each river system separately. Note that we will not only group by river but additionally by pol. The reason for this is that pol is a factor variable, for which mean values cannot be computed. The code would still be executed, but R would produce warnings. To avoid this, we add pol as a grouping variable in the group_by() function.

Task: Compute the mean values of every contaminant for each river system, respectively. Use the functions group_by() and summarise_all() out of the package dplyr. The necessary data is stored in dsp_sub and the grouping variables are river and pol. Uncomment the lines and replace the ??? by the correct code. If you need further advice, check the latest info boxes. In order to get feedback if your code is correct, press check.

#< task_notest
# Group the data by the variables river and pol
# dsp_group = group_by(dsp_sub, ???, ???)
#>
dsp_group = group_by(dsp_sub, river, pol)

#< hint
display("The data set should be grouped by the variables river and pol. Insert both in the given order.")
#>

#< task_notest
# Calculate the mean values of every column variable of the grouped data set
# summarise_all(???, funs(mean))
#>
summarise_all(dsp_group, funs(mean))

#< hint
display("Replace the ??? by dsp_group.")
#>

You have now computed the arithmetic mean of each pollutant for every river system. Well done! As you have noticed, the functions from the dplyr package are very useful for working with large data sets. However, the calculation required a couple of lines of code, where every output had to be assigned to a new variable. To get rid of those intermediate results and to achieve more compact code, we can use the pipe operator %>%.

< info "%>% Operator"

The pipe operator %>% from the package dplyr pipes the output of one function to the input of the next function. Using this operator allows you to read the code from left to right and top to bottom instead of from the inside to the outside of nested function calls. Have a look at the code of the previous info box. You can rewrite the same code using the pipe operator in the following manner.

# Compute the average values for each column and region in one pipe
dat %>%
  group_by(region) %>%
  summarise_all(funs(mean))

Note that the functions are applied from top to bottom. Every code line (except for the last one) ends with %>%, meaning that the pipe continues. Instead of calling every function with an additional specification of the data set, we mention it only once in the first line. Here, the data set dat is first grouped by the variable region. Thereafter, the mean values for every column and each group are computed by the function summarise_all().

You can get more information about the %>% operator at http://genomicsclass.github.io/book/pages/dplyr_tutorial.html.

>

As a last step, we want to revise the code we wrote before and obtain a neater solution by means of the %>% operator. In addition, we will sort the river systems in decreasing order of their overall water quality water_grade. This is done by adding one more line to the pipe. For this, we need the function arrange() from the dplyr package.

< info "Reorder Rows by arrange()"

The function arrange() contained in the package dplyr is used to reorder the rows of a data frame. You can specify one or more variables by which the data frame should be ordered. The application of this function is similar to that of the function select() from the same package and is shown in the following example code.

# Sort dat by the values of the variable river
dat_ordered = arrange(dat, river)

Since river is a factor variable consisting of characters, the data frame dat is sorted alphabetically by the values of the variable river. Again, we can attach the function to a sequence of other dplyr functions using the %>% operator. Say, we want to order the mean values which we already computed in the previous info box by the grouping variable region. We just have to add one line of code, which is shown in the following code chunk.

dat %>%
  group_by(region) %>%
  summarise_all(funs(mean)) %>%
  arrange(region)

Note that we omit the data frame in the call of the function arrange() since it is already mentioned in the first line of the code block.

If the variable used for reordering is numeric, the data frame is ordered ascendingly by its values. Adding a minus sign in front of the variable name results in a data frame sorted in descending order.
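As a small self-contained sketch (the data frame here is made up for illustration), both the minus sign and the helper function desc() from dplyr produce a descending sort:

```r
library(dplyr)

# Toy data frame purely for illustration
grades = data.frame(river = c("A", "B", "C"),
                    grade = c(3.5, 4.8, 2.1))

arrange(grades, grade)        # ascending order of grade
arrange(grades, -grade)       # descending order via minus sign
arrange(grades, desc(grade))  # equivalent descending order via desc()
```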

Read more about the arrange() function at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

>

So far, you have learned many data manipulation techniques. Now you can apply your newly acquired knowledge in the next task.

Task: Transform the code of the previous code chunk using the pipe operator %>% and sort the data frame in decreasing order of overall water quality, i.e. ascending in water_grade. The resulting data set should be named dsp_mean. The code lines are copied and given in the comments. Afterwards, show the data frame by calling it. For further advice on handling the pipe operator or the function arrange() have a look at the last two info boxes or press hint. Finally, press check.

#< task_notest
# Modify the two commands of the previous chunk given here 
# dsp_group = group_by(dsp_sub, river, pol)
# summarise_all(dsp_group, funs(mean))
#>
dsp_mean = dsp_sub %>%
  group_by(river, pol) %>%
  summarise_all(funs(mean(., na.rm = TRUE))) %>%
  arrange(water_grade)

#< hint
display("The first line of the command should have the following form: dsp_mean = dsp_sub %>%.
        In the second line, use the command group_by().
        The third line should look as follows: summarise_all(funs(mean(., na.rm = TRUE))) %>%.
        Lastly, make use of the command arrange().
        Remember to terminate every line (except for the last one) with the operator %>%.")
#>

#< task_notest
# Display the generated data frame
#>
dsp_mean

#< hint
display("Just type in the name of the generated data set in order to show it.")
#>

< award "dplyr Expert Level 3"

You have reached Level 3 of the dplyr package! You are now able to transform data sets in (almost) every desired way. Apart from handling the functions summarise_all() and arrange(), you can create neat and clearly structured code using the %>% operator.

>

Take a look at the data frame dsp_mean. Recall that higher numbers reflect lower water quality and a greater concentration of contaminants. For some variables, the average values vary considerably across river systems. By way of illustration, we will visualize the overall water grade since it summarizes the current state of each river system. Before doing this, we will analyze Ebenstein's classification of the river systems.

Classification in Polluted and Cleaner Rivers

For the further analysis we will need the variable pol that is stored in dat and dsp_mean. Recall that pol indicates whether the corresponding river system is polluted. Rivers with pol==1 belong to the group of polluted rivers, whereas all other rivers constitute the group of cleaner rivers. Hence, Ebenstein divided the data for the 145 DSP sites into two groups. We want to find out which river systems are included in the group of polluted rivers. More precisely, we want to identify all river systems (stored in the variable river) for which pol is equal to $1$. For this task we will make use of the function filter() from dplyr.

< info "Select Rows by filter()"

The function filter() from the package dplyr allows us to take a subset of the rows of a data frame. It is called as filter(dat, cond), where cond is the condition by which the data frame dat should be filtered.

For instance, the data set dat contains the variable bod (variable for biological oxygen demand) taking values from $1$ to $6$. If you want to have a separate data frame dat_bod2 containing only the sites of the river systems with a biological oxygen demand of $2$, you can use the following command.

# Keep the rows where bod==2 only
dat_bod2 = filter(dat, bod == 2)

If cond consists of characters you have to use quotation marks, e.g. you write river == "Songhua" if you want to get a data frame containing only information about the Songhua river system.
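As a minimal sketch (the small data frame riv below is made up for illustration), filtering on a character or factor variable works the same way as with numeric conditions:

```r
library(dplyr)

# Toy data frame purely for illustration
riv = data.frame(river = c("Songhua", "Hai", "Songhua"),
                 bod   = c(2, 5, 3))

# Character conditions require quotation marks
songhua = filter(riv, river == "Songhua")
songhua  # keeps the two Songhua rows
```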

Visit https://cran.r-project.org/web/packages/dplyr/dplyr.pdf for more information about the function filter().

>

In order to apply the function filter(), edit the following task.

Task: Use the function filter() from the package dplyr in order to find all DSP sites where pol==1. The resulting data frame should be named pol_river. Press check afterwards.

#< task_notest
# Generate a data frame for polluted rivers only
#>
pol_river = filter(dat, pol==1)

#< hint
display("Your command should be of the following form: pol_river = filter(). In the brackets, indicate the data frame dat and, separated by a comma, add the condition pol==1.")
#>

If you want, you can solve the next task in order to answer the upcoming quiz. Completing it is optional, however.

Task: Show the entries of the variable river stored in pol_river in order to answer the next question. You can enter whatever you think is needed to get the answer. If you need further help, press hint. Finally, click on check.

#< task_notest
# You can enter arbitrary code here...
#> 

#< notest
unique(pol_river$river)
#>

#< hint
display("For example, you can use the function unique() and pass the variable river as argument. Remember to refer to the used data frame by the $ operator.")
#>

< quiz "polluted rivers"

question: Check all river systems that are classified as polluted rivers.
mc:
- Liao river*
- Yellow river*
- Yangtze river
- Huai river*
- Songhua river
- Hai river*
- Pearl river
success: Great, all answers are correct!
failure: Not all answers are correct. Try again.

>

The following four (of ten river systems) are assigned to the group of polluted river systems: Hai, Huai, Liao and Yellow rivers. For a better comparability, we will visualize the overall water quality of the river systems by a bar plot. A really useful package is ggplot2 which provides powerful tools for creating complex plots.

< info "Package ggplot2"

The package ggplot2 provides tools and functions for creating high-quality graphics. It is based on the layered grammar of graphics, whose components include, amongst others, scales, coordinate systems, faceting, aesthetics and geometries. The notation is simple since every layer is added by means of the + operator. You can read more about it in Teutonico (2015, Chapter 3).

We will shortly introduce the functions that we will need for our plot.

You can find more detailed descriptions and many more functions at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf.

>

By means of these tools, we are able to draw a bar plot in which we can easily compare the water grades of the river systems and additionally mark their affiliation to one of the groups.

Task: Let us have a look at the graph visualizing the grades of the overall water quality of China's river systems and their affiliation to the groups of polluted or cleaner water. The code is already given. Just press check here.

#< task_notest
# Load the package
library(ggplot2)

# Create the graphic
ggplot(data=dsp_mean, aes(x=river, y=water_grade, fill=pol)) +
  geom_col() +
  ggtitle("Average Water Grades by River System") +
  xlab("River System") + ylab("Average Overall Water Grade") + 
  coord_flip(ylim = c(0:6))
#>

At first glance, we see that all turquoise bars are longer than the red ones. The former represent the group of polluted rivers, which have the highest water grades and hence the worst water quality. The group of cleaner rivers is characterized by lower bars colored in red. In this context, answer the following question.

< quiz "border"

question: Which overall water grade separates the cleaner rivers from the polluted ones? Enter an integer value.
answer: 4

>

In order to find out which border the author used, you have to look at the differences between the heights of the turquoise and red bars. The key bars are the highest red and the lowest turquoise bar, that is, the ones representing the Fujian/Zhejiang and Hai river systems. Both values are close to $4$, which is the border Ebenstein chose.

Recall that we assumed that northern river systems are more polluted and hence have a higher water grade. We found that the rivers Hai, Huai, Liao and Yellow belong to the group of polluted rivers. In the last task of this exercise, we want to investigate if our expectation is true.

Task: Create a map of China displaying the locations of all DSP sites, colored according to whether they belong to the group of polluted or cleaner rivers. Just press check here.

#< task_notest
# Create a map of China and add all DSP sites, which are colored
# according to the group they belong to
ggmap(map) +
  geom_point(data=dat, aes(x=long, y=lat, color=pol), alpha=0.5, size=4) +
  ggtitle("Visualization of DSP Sites close to Polluted and Cleaner Rivers") +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
#>

As expected, the rivers in northern China with the highest average water grades are assigned to the group of polluted rivers. They are depicted in red. The turquoise points represent the DSP sites that are close to cleaner river basins. However, in the northeast of China the points are painted in turquoise as well. These DSP sites belong to the Songhua river system, which, despite its northern location, is characterized by a comparably low average water grade of $3.5$ (as you can see in the previous plot). We already observed this fact when coloring the DSP sites according to their water grades.

Exercise 1.2 -- Digestive Tract Cancer Death Rates

So far, we have turned our attention to the concentration of pollutants and the overall water contamination of China's main river systems. In this section, we want to introduce the second part of our available data. In order to investigate the relationship between China's water quality and digestive cancer death rates, we will need population characteristics and environmental indicators, which are available for the 145 DSP sites as well. After loading the data, they will be briefly introduced.

Task: Load the data dsp_dat.dta into R and name it dat. Thereafter, display six randomly chosen rows of dat using the function sample_n(). Since you started a new exercise, press edit initially.

#< task_notest
# Read in the data
#>
dat = read.dta("dsp_dat.dta")

#< hint
display("The command should be of the form dat = read.dta(). Pass the .dta file, wrapped in double quotes, as argument to the function read.dta().")
#>

#< task_notest
# Show a part of the data
#>
sample_n(dat, 6)

#< hint
display("Just type in sample_n(dat, ???) and replace the ??? by the correct integer value.")
#>

Since we already used the same data in Exercise 1.1 the first twelve variables are well-known. The remaining columns store average statistics about the population and environmental conditions at the sites. The variable educ measures the years of education of decedents with a minimum age of 20, farmer indicates the share of people who are employed in farming and urban specifies if the DSP site is situated in an urban area or not. air_pol is a measure for air quality and rainfall quantifies the monthly rainfall. The variable pop specifies the total population living at each DSP site.

The data stored in the variable digestive will be discussed in more detail. It specifies the digestive tract cancer death rate in China. Its calculation is based on approximately $500,000$ deaths that were recorded at DSP sites in the years from 1991 to 2000. The recorded deaths were transformed into death rates using population counts by age and sex in order to allow a comparison of mortality rates at sites with different age and sex distributions. At each DSP site, the death rates are standardized to the age and sex distribution of China's 2000 census. The values stored in digestive measure the number of fatalities caused by digestive cancers per $100,000$ people.
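To make the construction of such standardized rates concrete, the following minimal sketch computes a directly standardized death rate for a single hypothetical site with just two age groups. The paper's actual procedure uses finer age and sex cells; all numbers here are invented purely for illustration.

```r
# Hypothetical numbers purely for illustration, not taken from the paper's data
deaths    = c(young = 10, old = 90)        # recorded deaths by age group
site_pop  = c(young = 60000, old = 40000)  # site population by age group
census_sh = c(young = 0.7, old = 0.3)      # age shares in China's 2000 census

# Group-specific death rates per 100,000 people
rates = deaths / site_pop * 100000

# Directly standardized rate: weight the group rates by the census age shares
weighted.mean(rates, w = census_sh)  # roughly 79.2 deaths per 100,000
```

Standardizing in this way removes differences that are purely due to sites having older or younger populations, which is what makes the rates comparable across DSP sites.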

First, we want to provide an overview of the range of the values of digestive. We will create a histogram that displays the absolute frequency of the digestive cancer death rates. We make use of the powerful ggplot2 package again. The function geom_histogram() allows us to create a histogram of the variable x that is specified in the function aes(). We set the binwidth to $5$ and choose different colors for color and fill in order to distinguish the bars. By means of the function scale_x_continuous() the x-axis can be modified. In the following code chunk, the axis labeling and the axis ticks are adapted by the options name and breaks, respectively. A title is added using ggtitle(). Visit https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf for more information.

Task: Press check in order to receive a histogram of the values stored in digestive.

#< task_notest
# Generate a histogram of the digestive cancer death rate
ggplot(data=dat, aes(x=digestive)) + 
  geom_histogram(binwidth=5, color="grey", fill="#f8766d") +
  scale_x_continuous(name="Digestive Cancer Death Rates by DSP Site",
                     breaks=seq(0, 175, 25)) +
  ggtitle("Frequency Histogram of Digestive Cancer Death Rates")
#>

You can see that the values of digestive range from roughly $5$ to $175$. The values indicate the death rate for each DSP site. More precisely, they refer to the number of deaths per 100,000 people. The frequencies are plotted on the y-axis. Most of the death rates range from $25$ to $90$. The maximal death rate amounts to roughly $175$.

The values vary greatly depending on which DSP site is considered. One could imagine that, in this context as well, the location of the sites is responsible for differences in the cancer death rates. On these grounds, answer the following question. If you have no idea, you can make a guess.

< quiz "North-South divide 2"

question: What do you think? In which part of China are the digestive cancer death rates higher?
sc:
- North*
- South
success: Great, your answer is correct!
failure: Try again.

>

If there is a relationship between water quality and digestive cancers in China, we would expect the number of deaths caused by digestive cancers to be high in regions with high water grades. Since the water quality is poorer in northern areas (as investigated in the previous exercise), we expect the digestive cancer rates to be higher there. We can investigate this matter by visualizing the spatial distribution of digestive cancers. As in Exercise 1.1, we will make use of the ggmap package.

Task: Edit the following code in order to get a map of China where the 145 DSP sites (given by their longitude/latitude pairs) are colored according to the magnitude of their digestive cancer death rate, which is stored in the variable digestive. Uncomment the code and replace the ??? with the correct variables.

#< task_notest
# Download a map of China
map = get_map(location = 'China', zoom = 4)

# Create a map and add the DSP sites which are colored
# according to their digestive cancer rate
# ggmap(map) + 
#   geom_point(data=dat, aes(x=???, y=???, color=???), size=4) +
#   scale_color_gradientn(colors = rainbow(3)) +
#   ggtitle("Visualization of Digestive Cancer Death Rates at the DSP Sites") +
#    theme(axis.title=element_blank(),
#          axis.text=element_blank(),
#          axis.ticks=element_blank())
#>
ggmap(map) + 
  geom_point(data=dat, aes(x=long, y=lat, color=digestive), size=4) +
  scale_color_gradientn(colors = rainbow(3)) +
  ggtitle("Visualization of Digestive Cancer Death Rates at the DSP Sites") +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())

#< hint
display("The data plotted on the x- and y-axes should be the longitude and latitude of the DSP sites which are stored in long and lat. The argument of color should be digestive.")
#>

< award "Map Creator"

Congratulations! You know how to use the packages ggmap and ggplot2 in order to create maps and add colored geographic points.

>

The digestive cancer death rates are displayed on a color scale ranging from red and orange to green and blue. Lower rates are colored in red. Most of the red and orange points can be found in the southern region of China, suggesting that the cancer death rate is comparably small there. In the northern region we find more sites colored in dark green and blue. Hence, more fatalities due to digestive cancer are reported in the north of China. In the northeast, however, the colors of the DSP sites (orange to green) imply smaller death rates.

In order to facilitate the comparison of water grades with digestive cancers, we will modify the code in the previous code chunk. As you already know, Ebenstein classified the river systems in two groups according to their degree of contamination. We will add this classification (into polluted and cleaner) to our graphic.

Task: Press check in order to visualize the map of China together with the DSP sites, now representing both the corresponding digestive cancer death rate and the classification into polluted and cleaner rivers.

#< task_notest
# Create a map of all DSP sites depicting their digestive cancer rate and
# affiliation to one group of rivers
ggmap(map) + 
  geom_point(data=dat, aes(x=long, y=lat, color=digestive, shape=pol), size=4) +
  scale_color_gradientn(colors = rainbow(3)) +
  ggtitle("DSP Sites by Digestive Cancer Death Rates and Pollution Degree") +
  theme(axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank())
#>

Now, the DSP sites are not only displayed in different colors (depending on the magnitude of the death rate) but also with different shapes. The DSP sites located in a polluted river basin are depicted by triangles whereas the sites at cleaner rivers are represented by circles. The explanation can be found in the legend as well. The modification is a consequence of adding the argument shape=pol to the function aes().

We can see at a glance that the circles are mostly colored in the range from red to green. That means that at sites close to cleaner river systems the digestive cancer rate is comparably small. Concurrently, the triangles mainly show green and blue colors. Hence, areas with more polluted rivers exhibit higher death rates. Again, we recognize the north-south divide.

Our data suggests that the digestive cancer death rates are higher in areas with more polluted water than in regions with higher water quality. In the course of the problem set we will further investigate this relationship and its causality.

Exercise 2 -- Comparison of the Variables' Mean Values

In Exercise 1 we already got an overview of our data set. In this section, we want to deepen the analysis by examining the average values of the variables. We first load the data.

Task: Press check in order to load the data dsp_dat.dta into the variable dat and to show a part of it. Since you started a new exercise you have to press edit first.

#< task
# Read in the data
dat = read.dta("dsp_dat.dta")

# Show a random sample of dat
sample_n(dat, 6)
#>

Recall the structure of the data set dat. Besides the ID, river name, longitude and latitude, the data frame stores the measures for the overall water grade and the concentrations of pollutants. The quantity stored in the variable digestive is the digestive cancer death rate, educ quantifies the average years of education and farmer indicates the share of people employed in farming at each DSP site. An entry of the binary vector urban is equal to $1$ if the DSP site is located in an urban region and $0$ otherwise. pop quantifies the total population at each DSP site.

The variable air_pol measures the level of air pollution. The readings are provided by NASA's satellite imagery for the years from 2002 to 2007 and take values between $0$ and $1$. Higher numbers indicate more particulates in the air and thus worse air quality. All readings belonging to one river basin are averaged and assigned to each DSP site that lies in the corresponding river system.

As the name indicates, the variable rainfall measures the average monthly rainfall in millimeters in the river basin containing the corresponding DSP site. The data is based on the readings from 1961 to 1990 and provided by the Global Precipitation Climatology Center.

Average of Characteristics for Polluted and Cleaner Rivers

We want to analyze the data stored in dat more closely. In Exercise 1.1 we already briefly examined those rivers which the author classified as polluted. Now, we are interested in whether the rivers' characteristics differ between the two groups of polluted and cleaner rivers. What do you expect? You can just make a guess here.

< quiz "different means"

question: Do the above described variables differ significantly for polluted and cleaner rivers?
sc:
- yes*
- no
success: Great, your answer is correct!
failure: Try again.

>

To figure this out, we have to transform our data set. In order to examine whether the variables differ between the two groups of rivers, we require a quantity that we can use for their comparison. We will compute the mean values of each variable for DSP sites located close to polluted and cleaner river systems, respectively, and compare them in a second step. For the first part, we will make use of the powerful tools of the package dplyr.

We proceed in the following way. In order to get mean values for the two categories polluted and cleaner, we have to group our data set dat by the variable pol. Thereafter, we will exclude some of the variables since calculating mean values does not make sense for all of them. We already found out that river stores the names of the river systems. Since it is not meaningful to compute means of a factor variable, we have to remove river from our data set. Additionally, we omit the factor variable dsp_id. Also, long and lat are excluded from our data frame since calculations with geographic coordinates are not reasonable. We will omit the mentioned variables directly before computing the mean values using summarise_all().

So far, the argument of the function summarise_all() was mean(). Now, we will make use of the function weighted.mean().

< info "Compute Weighted Averages with weighted.mean()"

The function weighted.mean() computes the weighted arithmetic mean of a variable. The use of the function is demonstrated in the following code chunk.

# Compute the weighted mean of x
weighted_mean = weighted.mean(x, w=wgts)

The weighted arithmetic mean of the vector x is calculated using the weight vector wgts. The result is named weighted_mean.

Check out https://stat.ethz.ch/R-manual/R-devel/library/stats/html/weighted.mean.html for more detailed information.

>

The advantage of using the weighted mean is that it accounts for size differences in the population. In other words, we account for the fact that DSP sites with a larger population should have more impact on the statistics than sites with a smaller population. By comparison, the function mean() weights all observations equally.
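A tiny numeric sketch (with invented numbers) illustrates the difference:

```r
# Two hypothetical sites with very different populations
death_rate = c(50, 100)       # some statistic at each site
pop        = c(90000, 10000)  # population weights

mean(death_rate)                    # unweighted mean: 75
weighted.mean(death_rate, w = pop)  # weighted mean:   55
```

The large site dominates the weighted mean, pulling the result toward its own value of 50.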

Task: Press check in order to get a data frame that stores the mean values of every column variable for each category of river systems.

#< task
# Compute the average of each variable for both river categories
cat_mean = dat %>%
  group_by(pol) %>%
  select(-dsp_id, -river, -long, -lat) %>%
  summarise_all(funs(round(weighted.mean(., w=round(pop)),2))) %>%
  select(-pop)

# Display the generated data
cat_mean
#>

Note that the function round() is used twice. As the name suggests, it rounds values: the call round(x, n) rounds the vector x to n decimal places. If n is omitted, the vector x is rounded to whole numbers. With respect to our code, the vector of weights pop is rounded to whole numbers, whereas the result of the weighted mean computation is rounded to two decimal places. In this way, we were able to replicate columns 1 and 2 of Table 2 shown in the paper on page 192.
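A quick sketch of round() in isolation:

```r
x = c(3.14159, 2.71828)

round(x, 2)  # two decimal places: 3.14 2.72
round(x)     # whole numbers: 3 3
```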

Look at the last line of the dplyr sequence. There, we omit the variable pop since we are not interested in the weighted mean of the groups' population. The question arises why we did not add this particular column variable as argument to the first select command.

< quiz "variable pop"

question: Is there a good reason why not all of the mentioned variables are excluded from the data frame simultaneously before calculating the weighted mean values?
mc:
- Yes, since the program would issue an error.*
- No, the result would be the same if we omit all variables simultaneously before computing the weighted mean values.
- Yes, pop should not be excluded before the calculation of the weighted means.*
success: Great, all answers are correct!
failure: Not all answers are correct. Try again.

>

In the code chunk we calculated the average values for each column variable separately for each category of river systems. Every value is weighted by the population at the corresponding DSP site. This is the reason why the vector pop is necessary for the computation step. Its omission would cause an error.

Take a look at the mean values and compare the characteristics of polluted rivers to those of cleaner river systems. You see at first glance that most of them differ considerably. In particular, the digestive cancer death rate and the pollutants show large deviations. Moreover, except for the variable educ, the mean values for the polluted rivers are higher. This relationship is also intuitively understandable since we expect higher values for the pollutants to be linked to higher death rates. However, it is not sufficient to just observe some variation in the mean values. In fact, we will examine whether the values differ significantly between sites along polluted rivers and cleaner rivers using a two-sample Student's t-test, which was also applied by Ebenstein.

Data Preparation

Since the t-test functions we will use require the two samples as separate data sets, we have to split our data dat into one data frame for polluted rivers only and one for cleaner river systems. We already generated the data set pol_river in Exercise 1.1, which contains information only about the river systems classified as polluted. However, for this exercise, we will generate the data set again. We will proceed analogously with the cleaner river systems. We will remove the same variables as in the previous code chunk since they are not needed in the further analysis. Note that after filtering the data and dividing it into two smaller data sets depending on their affiliation to cleaner or polluted rivers, the variable pol becomes redundant as well.

Task: Press check in order to get a data frame with information about polluted rivers only and to display a part of it.

#< task
# Generate a data set for polluted rivers
pol_river = dat %>%
  filter(pol==1) %>%
  select(-dsp_id, -river, -long, -lat, -pol)

# Show a part of pol_river
sample_n(pol_river, 6)
#>

In a similar manner you will create the second data set in the next task.

Task: Generate a data frame cl_river that only consists of information about cleaner river systems. More precisely, the variable pol should be equal to $0$ for all rows. Thereafter, drop the variables dsp_id, river, long, lat and pol. Uncomment the line of code and replace the ??? with the correct functions. Proceed analogously to the previous code chunk, that is, make use of the %>% operator. Afterwards, show a random sample of six rows of the data set. Press check after you have finished.

#< task_notest
# Generate a data frame for cleaner river systems
# cl_river = ???
#>
cl_river = dat %>%
  filter(pol==0) %>%
  select(-dsp_id, -river, -long, -lat, -pol)

#< hint
display("The command is structured in three lines.
        The first line should be cl_river = dat %>%.
        In the second line, call the function filter() with the argument pol==0.
        At last, as arguments of the function select() specify the variables you want to drop, each preceded by a minus sign.")
#>

#< task_notest
# Display a random sample of cl_river
# ???
#>
sample_n(cl_river, 6)

#< hint
display("Make use of the command sample_n().")
#>

Performing Weighted Two-Sample T-Tests

We can now apply a two-sample t-test to the separate data sets pol_river and cl_river. The details are presented in Taeger and Kuhnt (2014, Chapter 2.2). In brief, the two-sample Student's t-test is used to test whether two population means differ by a certain value. The null hypothesis in our case is $$ H_0: \mu_{pol} - \mu_{cl} = 0 $$ where $\mu_{pol}$ and $\mu_{cl}$ are the mean values of an arbitrary statistic of pol_river and cl_river, respectively. In words, we want to test whether the mean values differ significantly from each other or not.

In order to apply the t-test, Ebenstein assumed that the populations' variances (and standard deviations) are unknown and equal. The first assumption, that the true standard deviations are unknown, is reasonable. They will be estimated using the available data. However, it is not obvious why the true standard deviations should be equal. If they are assumed to be unequal instead, a t-test using Welch's approximation should be performed. In this exercise, we will perform this so-called Welch test, which was developed by Welch (1947, pp. 28-35).

< info "Welch Test"

As described in Taeger and Kuhnt (2014, pp. 28-31), the Welch test allows us to test if two population means $\mu_1$ and $\mu_2$ differ by a value $d_0$. The null hypothesis is $$ H_0: \mu_1 - \mu_2 = d_0. $$ The test statistic of the Welch test is $$ T = \frac{(\overline{X_1} - \overline{X_2}) - d_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$ where $\overline{X_1}$ and $\overline{X_2}$ are the sample means, $s_1^2$ and $s_2^2$ the sample variances and $n_1$ and $n_2$ the sizes of both populations.

Under the null hypothesis, the test statistic $T$ is Student t-distributed with $$ \nu = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 \bigg/ \left( \frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1} \right) $$ degrees of freedom. Hence, for a significance level $\kappa$, $H_0$ can be rejected if $$ t < t_{\kappa/2, \nu} \ \text{ or } \ t > t_{1-\kappa/2, \nu} $$ holds for the observed value $t$ of the test statistic $T$. Note that $t_{\kappa, \nu}$ is the $\kappa$-quantile of the t-distribution with $\nu$ degrees of freedom.

Equivalently, it is possible to reject $H_0$ if the p-value is smaller than a reasonable significance level $\alpha$. According to Stock and Watson (2015, pp. 118-119), the p-value is the probability of obtaining a test statistic that is at least as extreme as our observed value provided that $H_0$ is true. Hence, it is defined by $P \left(|T| > |t| \ | H_0 \text{ is true} \right)$. For the Welch test, the p-value is computed by $p = 2 P \left(T \leq (-|t|) \right)$ (cf. Taeger and Kuhnt (2014), p. 29).
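As a sanity check, the formulas above can be reproduced in R. The following is a minimal sketch with simulated data (not the paper's data): we compute the Welch statistic, degrees of freedom and p-value by hand and compare them to R's built-in t.test(), which uses Welch's approximation by default.

```r
# Minimal sketch with simulated (hypothetical) data: compute the Welch test
# statistic, degrees of freedom and p-value by hand for d_0 = 0.
set.seed(1)
x1 = rnorm(30, mean = 5, sd = 2)
x2 = rnorm(40, mean = 6, sd = 3)

n1 = length(x1); n2 = length(x2)
s1 = var(x1);    s2 = var(x2)   # sample variances

# Test statistic T
t_stat = (mean(x1) - mean(x2)) / sqrt(s1/n1 + s2/n2)

# Welch-Satterthwaite degrees of freedom
nu = (s1/n1 + s2/n2)^2 / ((s1/n1)^2/(n1-1) + (s2/n2)^2/(n2-1))

# Two-sided p-value
p_val = 2 * pt(-abs(t_stat), df = nu)

# t.test() uses Welch's approximation by default (var.equal = FALSE)
# and should reproduce t_stat, nu and p_val
t.test(x1, x2)
```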

>

We still want to take differences in population size into consideration. That is why we will perform a weighted Welch test. The procedure is the same as presented in the info box on the Welch test; the only difference is that the sample means and variances are replaced by their weighted versions. In columns 1 and 2 of Table 2 on page 192, Ebenstein displays the sample means of some statistics weighted by the population at the corresponding DSP sites. We already replicated those for a selection of statistics and stored them in the data frame cat_mean. The third column of Table 2 in the paper shows the differences of the weighted mean values together with the results of the t-tests performed by Ebenstein.

Lacking a function that applies a weighted t-test to all columns of a data frame simultaneously, we will perform the test for two chosen characteristics: the digestive cancer death rate and the overall water quality. We will make use of the function wtd.t.test().

< info "Perform Weighted t-tests using wtd.t.test()"

The function wtd.t.test() from the package weights performs weighted one- and two-sample t-tests. We will only focus on the two-sample version using Welch's approximation. The commands you need are the following.

# Load the package
library(weights)

# Perform a weighted two-sample t-test using Welch's approximation
wtd.t.test(x=var_pop1, y=var_pop2, weight=pop1, weighty=pop2, samedata=FALSE)

The weighted t-test is performed for the vectors var_pop1 and var_pop2 from two populations using the weights pop1 and pop2, respectively. Adding the option samedata=FALSE indicates that the vectors var_pop1 and var_pop2 do not result from the same data.

You get more detailed information at https://cran.r-project.org/web/packages/weights/weights.pdf.

>

Task: Perform a weighted two-sample t-test for the digestive cancer death rate digestive stored in pol_river and cl_river, respectively, using Welch's approximation by means of the function wtd.t.test(). The variables should be weighted by the (rounded) population pop at each DSP site. The code is given. Just press check here.

#< task
# Load the package weights
library(weights)

# Perform a weighted t-test using Welch's approximation
wtd.t.test(x=pol_river$digestive, y=cl_river$digestive,
           weight=round(pol_river$pop), weighty=round(cl_river$pop),
           samedata=FALSE)
#>

Have a look at the output, which is basically divided into three parts. The first part shows the name of the performed test; in our case, it is the weighted two-sample t-test using Welch's approximation. From the coefficients section we will only consider the result of p.value. The p-value is of central importance for the interpretation of tests: we can reject the null hypothesis if the p-value is smaller than the significance level $\alpha$. Ebenstein focused on the following three values for $\alpha$: $0.01$, $0.05$ and $0.10$. Here, the p-value is smaller than $0.01$. Hence, we can reject the null hypothesis even at the 1% level. The last section shows, amongst others, the following statistics: Mean.x and Mean.y (the weighted mean values of digestive stored in the data sets x and y, respectively) and Difference (their difference).

< info "Comment on Inconsistent Significance Levels"

The value stored in Difference coincides with the value displayed in column 3 of Table 2 on page 192 of the paper. However, Ebenstein reported significance at the $5$% level whereas we found significance at an even smaller level ($\alpha=0.01$). The reason for this difference is an inconsistency in the paper: Ebenstein performed the t-test without considering weighted means, which results in a p-value of $0.0106$ (which is greater than $0.01$). We get the same result using the function t.test() in R. This lack of accuracy in the presentation of the results leads to the different p-values.

>

We will perform one more weighted t-test in order to find out if the average water quality differs significantly for polluted and cleaner rivers.

Task: Perform a weighted two-sample t-test for the overall water quality water_grade stored in pol_river and cl_river, respectively. As in the code chunk before, use Welch's approximation. The variables should be weighted by the rounded population pop at each DSP site. If you don't know how to start, look at the last info box and code chunk or just press hint. Press check afterwards.

#< task_notest
# Perform a weighted Welch test for water_grade
#>
wtd.t.test(x=pol_river$water_grade, y=cl_river$water_grade,
           weight=round(pol_river$pop), weighty=round(cl_river$pop),
           samedata=FALSE)

#< hint
display("The command should have the following form:
        wtd.t.test(x=???, y=???, weight=round(???), weighty=round(???), samedata=???)
        Replace the ??? by the correct variables. Follow the explanation of the task.")
#>

< award "Test Performer"

Excellent! You performed a weighted two-sample t-test using the function wtd.t.test().

>

In this particular test, the p-value is extremely small, with nine zeros to the right of the decimal point. Answer the following quiz in order to interpret this result.

< quiz "p-value"

question: Which of the following statements are true?
mc:
- The null hypothesis cannot be rejected at every reasonable significance level.
- The null hypothesis can be rejected at the 1% level.*
- The average water quality differs significantly for polluted and cleaner rivers using a significance level of 0.01.*
success: Great, all answers are correct!
failure: Not all answers are correct. Try again.

>

With a p-value considerably smaller than $0.01$ we can reject the null hypothesis even at a level of $\alpha = 1$%. In words, the average water quality in China's rivers is significantly different for DSP sites along polluted and cleaner rivers.

The t-tests of all other water contaminants except lead show similar results. The null hypotheses are rejected due to small p-values. That means that the average grades of the contaminants are significantly different for polluted and cleaner rivers. For lead, we get a p-value of roughly $0.37$ which is greater than any reasonable significance level. Hence, we cannot reject the null hypothesis.

For the remaining variables stored in the data set, only the null hypotheses for air_pol and rainfall can be rejected, at the $\alpha=0.10$ and $\alpha=0.01$ significance levels, respectively. We can conclude that population characteristics such as rates of urbanization and farming do not vary significantly for rivers with different water quality levels.

Summing up, we performed weighted two-sample t-tests using Welch's formula. That means we assumed that the standard deviations are not necessarily equal. Of course, it might be reasonable to additionally perform the classic weighted two-sample t-test without Welch's approximation, which assumes that the two populations have equal variance. We could compare the results and, if necessary, conduct further investigations. However, we will not deepen this analysis since it does not provide key information for our aim of establishing a causal link between digestive cancer death rates and water quality.
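To illustrate why the distinction can matter, here is a small unweighted sketch with simulated data (not our DSP data): when group sizes and variances differ, Welch's approximation and the equal-variance t-test can yield noticeably different p-values.

```r
# Unweighted sketch with simulated data: Welch vs. equal-variance t-test
set.seed(1)
x1 = rnorm(20, mean = 0,   sd = 1)   # small group, small variance
x2 = rnorm(80, mean = 0.5, sd = 3)   # large group, large variance

p_welch  = t.test(x1, x2, var.equal = FALSE)$p.value  # Welch's approximation
p_pooled = t.test(x1, x2, var.equal = TRUE)$p.value   # pooled variance

c(welch = p_welch, pooled = p_pooled)
```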

Exercise 3 -- Regression Analysis

In this chapter and its subsections, our aim is to find the causal effect of water quality on digestive cancers. More precisely, we want to find a satisfying answer to the following question: does the deterioration of water quality cause higher cancer rates in China? Over the last decades, China has experienced a decline in water quality caused by industrialization. Concurrently, China has faced increasing cancer rates, with digestive tract cancers alone causing one million deaths annually. The investigation will be based on regression techniques.

Introduction to Regression Theory

The relationship between water quality and cancer rates will be represented using a linear regression model. We will introduce the theory by means of the following multiple linear regression model

$$ y = X \beta + \varepsilon. $$

Our aim is to find good estimates of the vector $\beta$ in order to predict the response (or dependent or explained) variable $y$ by means of the $n \times (k+1)$ matrix $X$ consisting of a constant vector and $k$ control variables $x_1, \ldots, x_k$ (also known as independent or explanatory variables). The index $n$ indicates the number of data points, whereas $k$ is the number of independent variables. The estimation is better the closer the estimates are to the true values (cf. Kennedy (2008), pp. 4-13). Hence, we have to minimize the vector of disturbances $\varepsilon$. But how can we do that? There are plenty of possibilities to minimize the error terms. We will make use of the most popular method, that is the ordinary least squares (OLS) estimation. Then, the vector $\hat{\beta}$ is called the OLS estimator. Vectors with a ^ symbol on top denote estimators.

< info "Ordinary Least Squares Estimation"

Consider the following multiple regression model $$ y = X \beta + \varepsilon $$ where $y$ is the dependent variable, $X$ a matrix containing a constant vector and the $k$ independent variables $x_1, \ldots, x_k$, $\beta$ a vector of the true coefficients and $\varepsilon$ the vector of disturbances. The residuals are defined by $\hat{\varepsilon} = y - \hat{y}$, where $\hat{y} = X \hat{\beta}$ is the estimator of $y$. As described in Verbeek (2012, Chapter 2) and in Wooldridge (2016, Chapter 3), the OLS estimator $\hat{\beta}$ is determined to minimize the objective function $S(\hat{\beta})$ defined by $$ S(\hat{\beta}) = \sum_{i=1}^n \hat{\varepsilon}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $$ with $n$ indicating the number of available observations. Hence, in the ordinary least squares estimation the estimator $\hat{\beta}$ is chosen to minimize the sum of squared residuals. The solution of the minimization problem is given by $$ \hat{\beta} = (X'X)^{-1}X'y. $$ The proof can be found in Verbeek (2012, pp. 7-9).
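The closed-form solution can be verified directly in R. The following is a minimal sketch with simulated data (not our data set): we build the design matrix $X$, compute $(X'X)^{-1}X'y$ by hand and compare the result to the coefficients returned by lm().

```r
# Sketch with simulated data: OLS estimator (X'X)^{-1} X'y by hand
set.seed(1)
n  = 100
x1 = rnorm(n)
x2 = rnorm(n)
y  = 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X = cbind(1, x1, x2)                       # design matrix with constant column
beta_hat = drop(solve(t(X) %*% X) %*% t(X) %*% y)

# The by-hand estimator matches the coefficients from lm()
rbind(by_hand = beta_hat, lm = coef(lm(y ~ x1 + x2)))
```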

>

Assumptions of the Linear Regression Model

The OLS estimator is considered to be optimal if the following five assumptions hold. A more detailed description can be found in Kennedy (2008, pp. 40-42).

  1. The dependent variable $y$ can be obtained by a sum of independent variables and a disturbance term $\varepsilon$ where all variables are considered in their linear form, i.e. $y = X \beta + \varepsilon$.

  2. The expected value of the disturbance term is zero, i.e. $\mathbb{E}(\varepsilon)=0$.

  3. The disturbances have the same variance and are uncorrelated, i.e. $\mathbb{E}(\varepsilon\varepsilon')=\sigma^2 I$.

  4. The observations on the independent variables are fixed in repeated samples.

  5. There are more observations than independent variables, that is $rank(X)=k+1 \leq n$, where $k$ is the number of independent variables $x_1, \ldots, x_k$ stored in the matrix $X$ and $n$ is the number of observations. Additionally, there are no exact linear relationships between the independent variables.

Under assumptions 1 to 5 the OLS estimator is said to be optimal. But this statement is vague: what exactly does "optimal" mean here? The following theorem makes it precise.

Gauss-Markov Theorem

The Gauss-Markov theorem specifies that the fulfillment of those five assumptions causes the OLS estimator to be the best linear unbiased estimator (BLUE). Thus, the OLS estimator $\hat{\beta}$ satisfies the following conditions:

  1. It is linear, that is, $\hat{\beta}$ is a linear function of the observations of the dependent variable $y$.

  2. It is unbiased, that is, $\mathbb{E}(\hat{\beta}) = \beta$.

  3. It is best, that is, among all linear unbiased estimators, $\hat{\beta}$ has the smallest variance.

The theory can be found in Auer and Rottmann (2015, pp. 260-261) and Greene (2012, p. 100).

Simple Linear Regression

We start with a simple linear OLS model taking only one independent variable into account, i.e. it holds $k=1$ in the definition of the multiple linear regression model. We want to find out if there is an effect of the pollution in China's rivers and lakes on the digestive cancer death rate.

< quiz "regression formula"

question: What is the correct position of the variables in the regression formula?
sc:
- The measure of the cancer death rate is the independent variable and the measure of the water quality is the dependent variable.
- It is vice versa, that is the measure of the cancer death rate is the dependent variable whereas the variable water_grade stands on the right hand-side of the equation.*
success: Great, your answer is correct!
failure: Please try again.

>

We will regress a measure of the cancer death rate on a measure of the water quality in China's river systems. For the independent variable of our regression we choose the variable water_grade. We will estimate the effect of a change in the independent variable (which is initially only water_grade) on the variable storing the cancer death rates. Note that the death rates are measured in total numbers. However, determining the change in death rates in total numbers does not seem to be appropriate. Instead, we logarithmize the variable in order to allow a more sensible interpretation of the estimates.

One could argue that assumption 1 of the linear regression model is violated now since we allow the use of logarithmized variables. However, the estimation of a linear regression model is still possible with logarithmized variables: you just have to define the dependent variable to be the logarithmized variable (cf. Wooldridge (2016), pp. 37-39). For example, in the equation $y = X \beta + \varepsilon$ we can replace the dependent variable $y$ by $\tilde{y}$, which is defined as $\tilde{y} = log(y)$. The same is valid if a logarithmized explanatory variable is part of our equation.

In preparation, I already logarithmized the variable digestive and named it ln_digestive. It contains the logarithmic digestive cancer death rates for every DSP site. Hence, the first OLS model we consider is given by

$$ \text{ln\_digestive}_i = \beta_0 + \beta_1 \cdot \text{water\_grade}_i + \varepsilon_i, \ \ i \in \{1, \ldots, 145\} $$

where index $i$ indicates the i-th DSP site. You get an overview of the interpretation of different model types in the following info box.

< info "Interpretation of Regression Coefficients"

Amongst others, Wooldridge (2016, p. 39) distinguishes between four types of regression models. Depending on using logarithmic variables or not, each of the coefficients is interpreted differently. In each of the following regressions the response variable is on the left hand side of the regression equation, whereas $x$ and $log(x)$ respectively represent the control variable.

Regression Type | Regression Model | Interpretation of \(\beta_1\)
--- | --- | ---
Level-level | \(y = \beta_{0} + \beta_{1} \cdot x + \varepsilon\) | \(\Delta y = \beta_1 \Delta x\): if \(x\) is changed by one, we expect a change of \(y\) by \(\beta_1\).
Level-log | \(y = \beta_{0} + \beta_{1} \cdot log(x) + \varepsilon\) | \(\Delta y = (\beta_1/100) \% \Delta x\): if we change \(x\) by one percent, a change of \(y\) by \(\beta_1/100\) is expected.
Log-level | \(log(y) = \beta_{0} + \beta_{1} \cdot x + \varepsilon\) | \(\% \Delta y = (100 \cdot \beta_1) \Delta x\): if \(x\) is changed by one, we expect \(y\) to change by \(100 \cdot \beta_1\) percent.
Log-log | \(log(y) = \beta_{0} + \beta_{1} \cdot log(x) + \varepsilon\) | \(\% \Delta y = \beta_1 \% \Delta x\): if \(x\) is changed by one percent, \(y\) is expected to change by \(\beta_1\) percent.

For the sake of completeness, the table shows all cases of regression models. However, in this problem set we will only make use of the models where the response variable is logarithmic.
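A quick numerical check of the log-level rule, using a coefficient of $0.072$ (the estimate that appears later in this exercise): the exact percentage change implied by $\beta_1$ is $(e^{\beta_1}-1) \cdot 100$, which the rule of thumb $100 \cdot \beta_1$ approximates well for small coefficients.

```r
# Log-level interpretation: exact vs. approximate percentage change in y
# for a one-unit increase in x, given a coefficient b1
b1 = 0.072

exact_pct  = (exp(b1) - 1) * 100   # exact implied percentage change
approx_pct = 100 * b1              # rule of thumb from the table above

c(exact = exact_pct, approx = approx_pct)
```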

>

All variables that we need in Exercise 3 and its subtasks are stored in the file reg_dat.dta, which is based on the data set dsp_dat.dta of former exercises. Before we can start with our analysis we need to load the data.

Task: Read in the data reg_dat.dta and have a look at it. To do so, just press edit and check afterwards.

#< task
# Load the data into R
dat = read.dta("reg_dat.dta")

# Display a random part of dat
sample_n(dat, 6)
#>

Most of the variables are already known. Instead of air_pol the data set stores the logarithmized values ln_air_pol. Note that ln_air_pol stores negative values only. This results from logarithmizing the original values that range from $0$ to $1$. The variable income indicates the income level of the population at each DSP site and region is a factor variable for the part of China the DSP site is located in. province specifies the province the DSP sites belong to.

We now want to perform the regression in R. You certainly know the function lm() which is often used to fit linear models. In the brackets, you have to specify the regression formula. Its basic form is dependent_var ~ independent_var. If you have more than one control variable, you have to combine them by a + operator.

Task: Regress ln_digestive on water_grade using the function lm() and save the result in the variable ols. The variables are stored in the data set dat.

#< task_notest
# Perform an OLS regression of ln_digestive on water_grade
#>
ols = lm(ln_digestive ~ water_grade, data=dat)

#< hint
display("The structure of the command is ols = lm(??? ~ ???, data=dat). Specify the dependent variable on the left hand-side of ~ and the explanatory variable on its right hand-side.")
#>

< award "Regression Expert Level 1"

You performed the first linear regression using lm(). Congratulations!

>

A useful tool to display regression results is the function summary() from base R. Check https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html for further information.

Task: Just press check in order to have a look at the regression output.

#< task
# Display the regression results
summary(ols)
#>

< info "Regression Output of summary()"

The function summary() gives you a large set of information which has to be interpreted. As argument, the function receives results from a regression that was performed in R beforehand. I will give a short explanation for the case where the regression results were generated by lm(). Basically, its output can be divided into four parts: Call, Residuals, Coefficients and some statistics. Call just shows the formula that R used to fit the data. The Residuals section displays some summary statistics for the residuals of the fitted model.

The part headed Coefficients is of major interest for us since there you can find the estimates of the regression coefficients. The explanatory variables (and the intercept) are listed in the first column. For each of the independent variables, the subsequent four columns show the following statistics:

  1. Estimate: the estimated value of the regression coefficient.

  2. Std. Error: the standard error of the coefficient estimate.

  3. t value: the value of the t-statistic for testing whether the coefficient equals zero.

  4. Pr(>|t|): the corresponding p-value; the significance codes next to it indicate at which levels the coefficient is significant.

Concerning the statistics at the very end of the output we will only focus on the Multiple R-squared and its adjusted version. The $R^2$ or R squared is a measure for the proportion of the variation in the response variable that is explained by the explanatory variables. The adjusted R squared, often written as $\overline{R}^2$, corrects the R squared for degrees of freedom (cf. Wooldridge (2016), pp. 180-182).

>

Now, we can begin with the interpretation of our first regression. Generally, we will focus on the interpretation of the coefficients. However, a good interpretation of the intercept $\beta_0$ is not meaningful in our model. Often we will also report the R squared or adjusted R squared, respectively.

Have a look at the output of the command summary(ols). The estimate of $\beta_1$ is roughly $0.072$. This can be interpreted as follows: an increase in the water grade by one increases the digestive cancer death rate by $7.2$%. If you do not remember how to interpret log-level regressions, check the info box Interpretation of Regression Coefficients again. The standard error amounts to approximately $0.033$, which is almost half the size of the estimate. The p-value of water_grade is $0.033$. Hence, the variable is significant at a level of $5$%. Additionally, the R squared is $0.032$, that is, $3.2$% of the variance found in ln_digestive can be explained by the explanatory variable.

< info "Remark on Supplemental Regressions in the Paper"

In the paper, Ebenstein additionally performs regressions for subgroups of digestive cancer. Ebenstein explicitly analyzes the relationship between water grade in China's rivers and lakes and death rates for esophageal, liver and stomach cancers, respectively. Those findings largely support the results we obtain when considering digestive cancers in general. Due to clarity, we will not investigate the link of water quality and each of the subgroups but rather focus on digestive cancers in general.

>

In the next sections we will analyze if, in our scenario, assumptions of the linear regression model are violated. If this is the case, the model will be modified in order to detect the causal effect of water quality on digestive cancer rates.

Exercise 3.1 -- Correction of Standard Errors

In econometric analysis, assumption 3 of the linear regression model is often not fulfilled. Recall that we assumed the disturbances to be uncorrelated and to have the same variance. In this chapter, we will discuss if these assumptions are fulfilled. If this is not the case, we will modify our model.

Clustered Data

We can assume that our available data is composed of several groups (or so-called clusters). As described in Ullah and Giles (2017, Chapter 1), clustered data is characterized by independence across clusters and correlation within clusters. We can imagine that water quality in DSP sites close to each other is not independent from each other but has a similar value. Hence, we have geographical correlation. Of course, we have to control for this circumstance since ignoring it can lead to strongly underestimated standard errors and over-rejection using hypothesis tests (cf. Ullah and Giles (2017), pp. 2-3). To correct for this, we cluster our analysis at the province level. Consequently, DSP sites that are located in the same province belong to one cluster.

We want to find out how many clusters we will have in our further analysis. Before starting, we load the data.

Task: Load the data. Just press edit and check afterwards.

#< task
# Read in the data
dat = read.dta("reg_dat.dta")
#>

The information about each DSP site's affiliation to a province is stored in the variable province. Each province is characterized by a unique number. Use the next code chunk in order to find out the number of provinces. Performing the next task is optional, however.

Task: You can type in whatever you think is needed in order to answer the following quiz.

#< task_notest
# You can enter arbitrary code here...
#> 

#< notest
length(unique(dat$province))
#>

#< hint
display("For example, you get the solution by means of the functions length() and unique(). Both functions should be nested in the following form: length(unique(???)). Instead of the ??? you have to indicate the variable province.")
#>

< quiz "Number of Clusters"

question: How many clusters do we have in our data? Type in an integer.
answer: 31

>

We divide the data in $31$ different provinces, that is, we have $31$ clusters where each represents several DSP sites.

Correcting for Clustering using Cluster-Robust Standard Errors

Now, we perform a multiple linear regression that uses cluster-robust standard errors. If you are interested in the theory and formulas behind the concept, check the following info box.

< info "Cluster-Robust Standard Errors"

Assume, we have the model $y = X \beta + \varepsilon$ where y is the vector of dependent variables, $X$ a matrix of explanatory variables, $\beta$ the vector of true coefficients and $\varepsilon$ the residual vector. The variance matrix conditional on $X$ is given by $$ Var(\hat{\beta}) = (X^T X)^{-1} B (X^T X)^{-1} $$ with $B = X^T Var(\varepsilon|X) X$.

According to Cameron and Miller (2015, pp. 323-325), if the given data can be divided in $C \in \mathbb{N}$ clusters, the regression model can be written as $$ y_c = X_c \beta + \varepsilon_c $$ where $c = 1, \ldots, C$ indicates the cluster $c$. The submatrices $y_c$, $X_c$ and $\varepsilon_c$ store values for cluster $c$ only. Hence, we get an individual regression equation for every cluster. In this case, the variance of $\hat{\beta}$ is given by

$$ Var_{clu}(\hat{\beta}) = (X^T X)^{-1} \hat{B}_{clu} (X^T X)^{-1}. $$ Since the residuals are independent across clusters, the matrix $B$ of the general case simplifies to $\hat{B}_{clu}$, which can be estimated by $$ \hat{B}_{clu} = \sum_{c=1}^{C} X_c^T \hat{\varepsilon}_c \hat{\varepsilon}_c^T X_c. $$ The estimator of the residual vector for cluster $c$ is defined by $\hat{\varepsilon}_c = y_c - X_c \hat{\beta}$. The clustered standard errors of $\hat{\beta}$ are the square roots of the diagonal elements of $Var_{clu}(\hat{\beta})$. For more details, have a look at Cameron and Miller (2015, Chapter II.C).
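The sandwich formula can be translated into R almost literally. The following is a minimal sketch with simulated clustered data (not our DSP data); note that packages such as lfe additionally apply small-sample corrections, so their reported standard errors will differ slightly from this raw version.

```r
# Sketch with simulated data: cluster-robust standard errors by hand
set.seed(1)
C = 10; n_per = 20
cluster = rep(1:C, each = n_per)
x = rnorm(C * n_per)
# a cluster-level error component induces correlation within clusters
y = 1 + 0.5 * x + rnorm(C)[cluster] + rnorm(C * n_per)

X = cbind(1, x)
beta_hat = drop(solve(t(X) %*% X, t(X) %*% y))
e = y - drop(X %*% beta_hat)               # residual vector

# B_clu = sum over clusters of X_c' e_c e_c' X_c
B = matrix(0, ncol(X), ncol(X))
for (c in 1:C) {
  Xc = X[cluster == c, , drop = FALSE]
  ec = e[cluster == c]
  B  = B + t(Xc) %*% ec %*% t(ec) %*% Xc
}

bread = solve(t(X) %*% X)
V_clu = bread %*% B %*% bread              # sandwich estimator
sqrt(diag(V_clu))                          # clustered standard errors
```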

>

Instead of lm(), we will make use of the function felm() from the package lfe since it is a more powerful tool for performing linear regressions. One important functionality is the possibility of clustering.

< info "Clustered Standard Errors with felm()"

The function felm() out of the package lfe allows us to fit linear models. It can be used equivalently to the function lm() for performing standard linear regressions as follows.

# Load the data
library(lfe)

# Perform a standard linear regression
felm(y ~ x1 + x2, data=dat)

We regress the variable y on x1 and x2, which are stored in the data set dat. This is one of the simplest calls you can perform based on the function felm().

However, there are numberless features implemented for felm() that are more powerful. The expression y ~ x1 + x2 is one example for the argument formula in felm(formula, data), which is a symbolic description of the model that should be fitted. It is possible to extend the formula by adding several options separated by vertical bars. We can modify our first call of felm() in order to consider standard errors that are clustered by the variable clust_var.

# Perform a regression with clustered standard errors
felm(y ~ x1 + x2 | 0 | 0 | clust_var, data=dat)

The two 0 arguments are placeholders for other options that can be included. We will learn about them step by step in the course of the problem set. If you want to get more detailed information immediately, have a look at https://cran.r-project.org/web/packages/lfe/lfe.pdf.

Note that the values of our estimator $\hat{\beta}$ do not change when using clustered standard errors.

>

Let us perform our first linear regression with clustered standard errors. Recall that ln_digestive is the response variable and water_grade is the only explanatory variable. Our analysis should be clustered by the variable province.

Task: Use felm() in order to regress ln_digestive on water_grade with standard errors that are clustered at the province level. Store the result in reg_clust. If you do not know how the function works, just look at the info box above or press hint. Before, load the package lfe. Replace the ??? by the correct code and press check afterwards.

#< task_notest
# Load the package first
# ???
#>
library(lfe)

#< hint
display("In order to load the package, use the command library() and specify the name of the package inside the brackets.")
#>

#< task_notest
# Perform the regression using cluster-robust standard errors
# reg_clust = ???(ln_digestive ~ water_grade | 0 | 0 | ???, data=???)
#>
reg_clust = felm(ln_digestive ~ water_grade | 0 | 0 | province, data=dat)

#< hint
display("Replace the ??? by the correct code. The first ??? are placeholders for the function that should be used, the second are for the clustered standard errors and the last for the data frame that stores all required variables.")
#>

< award "Regression Expert Level 2"

Well done! You performed the first regression using felm() from the package lfe. You learned how to correct for clustering in the data, which is one of the most powerful applications of this function.

>

Now, we will display the results using the function summary().

Task: Display the regression results stored in reg_clust using the function summary(). Just press check here.

#< task
# Show the regression results
summary(reg_clust)
#>

The output structure received when using felm() for the regression differs slightly from the one we already discussed. Compared to the output of lm() we have two additional lines in the last section since we get two values for $R^2$ and the F-statistic: one for the projected model and one for the full model. This is the case because felm() is able to project factors out, a functionality we have not used yet. We will learn about this feature in the course of the problem set. Since we did not project any factors out here, both values are equal.

Due to clustering, the third column of the section Coefficients now shows the clustered standard errors. We will compare the regression results of reg_clust with the regression we already performed in Exercise 3 without considering clustered residuals.

Comparison of Original OLS and Cluster-Robust OLS

For a nice comparative analysis of the regression outputs we will use the function stargazer() from the package of the same name.

< info "Compare Regression Results using stargazer()"

By means of the function stargazer() from the package stargazer it is possible to present regression results from several models in one table. The basic call of the function is shown in the next code chunk.

# Load the package
library(stargazer)

# Compare the results of several regressions
stargazer(reg1, reg2, ...)

We get a table consisting of the regression results of two regressions reg1 and reg2. There are numberless options which can be specified in order to determine the structure of the table. They will be added in place of the dots. I will shortly present the ones we will make use of.

Check https://cran.r-project.org/web/packages/stargazer/stargazer.pdf for more information.

>

Before comparing the results from OLS and cluster-robust OLS using stargazer(), we have to perform OLS again. In Exercise 3.1 we already performed a simple linear regression without controlling for clustering. We made use of the function lm(). Now, we will perform the same regression again but use felm() instead. The results remain the same, however. Note that we can omit the placeholders in the brackets of the function felm() if we do not cluster our data.

Task: Press check in order to get a table showing the outputs of the same regression model estimated twice: once with default standard errors and once with standard errors clustered at the province level.

#< task
# Perform ordinary least squares without clustering
ols = felm(ln_digestive ~ water_grade, data=dat)

# Load the package
library(stargazer)

# Compare OLS and cluster-robust OLS
stargazer(ols, reg_clust,
          type="html",
          digits=3,
          keep="water_grade",
          report="vcsp*",
          omit.stat="ser",
          model.numbers=FALSE,
          column.labels=c("OLS", "Clustered"))
#>

Due to the option keep the table displays only statistics for the variable water_grade. The remaining variables are omitted; in the case of simple regression, this applies to the constant $\beta_0$ only. The variable name, coefficient estimate, standard error, p-value and significance level are shown due to the argument report="vcsp*". All values are rounded to three decimal digits. For the model statistics, the standard error of the regression is omitted because of the argument omit.stat.

As stated above, the values of the estimated coefficients $\hat{\beta}$ do not change when using standard errors that are clustered at the province level. Also the values of $R^2$ and its adjusted version remain the same. The standard errors of each coefficient estimate, displayed in brackets, do change, however. The same applies to the p-values, which are displayed below the standard errors. The reason is that we now account for the correlation of DSP sites belonging to the same province. As you can see, controlling for clustering results in a higher standard error and a higher p-value for water_grade. In the second model, with a p-value of $0.074$, water_grade is significant at the $10$% level.

Test for Heteroskedasticity

So far, we accounted for the correlation of residuals within clusters by means of cluster-robust standard errors. What if, additionally, the residuals do not all have the same variance? The technical term for this state is heteroskedasticity. In this case, our OLS estimator would still be unbiased and consistent but no longer efficient (cf. Auer and Rottmann (2015), Chapter 2.2). The opposite of heteroskedasticity is homoskedasticity, that is, all error terms have the same variance.

In addition to the analysis conducted in the paper, we want to test for heteroskedasticity. As Kleiber and Zeileis (2008, pp. 101-102) suggest, we make use of the Breusch-Pagan test (BP test) which has been developed by Breusch and Pagan (1979, pp. 1287-1294).

< info "Breusch-Pagan Test"

Assume the following linear model $$ y = \beta_0 + \beta_1 \cdot x_1 + \ldots + \beta_k \cdot x_k + \varepsilon \tag{1} $$ where $y$ is the vector of dependent variables, $x_1, \ldots, x_k$ the $k$ independent variables, $\beta_0, \beta_1, \ldots, \beta_k$ the true coefficients and $\varepsilon$ the vector of residuals. All assumptions of the linear regression model (see Exercise 3) should be fulfilled -- except for homoskedasticity, which is specified in assumption 3.

The null hypothesis of the BP test is that assumption 3 holds in addition, that is $$ H_0: \text{The residuals } \varepsilon_1, \ldots, \varepsilon_{145} \text{ are homoskedastic}, \tag{2} $$ which means that $\mathbb{E}(\varepsilon^2| x_1, \ldots, x_k)=\sigma^2 > 0$. Breusch and Pagan presume the following linear dependency structure of the error term: $$ \varepsilon^2 = \delta_0 + \delta_1 \cdot x_1 + \ldots + \delta_{k} \cdot x_k + v $$ where $\mathbb{E}(v|x_1, \ldots, x_k) = 0$. Hence, the null hypothesis of homoskedasticity given in $(2)$ is equivalent to $$ H_0: \delta_1 = \delta_2 = \ldots = \delta_{k} = 0. $$ As described in Hackl (2013, pp. 193-195) and Wooldridge (2016, pp. 250-252), the BP test can be performed in the following steps. At first, equation $(1)$ is estimated using OLS.

In step 2 of the procedure, the equation $$ \hat{\varepsilon}^2 = \delta_0 + \delta_1 \cdot x_1 + \ldots + \delta_{k} \cdot x_k + v' $$ is estimated using OLS. The squared residuals $\hat{\varepsilon}^2$ result from the OLS estimation of $(1)$. The test statistic for heteroskedasticity depends on the value of R squared from the regression in step 2. It will be called $R^2_\varepsilon$ in order to distinguish it from the R squared of regression model $(1)$.

In the last step, the test statistic is computed. We will make use of the LM statistic proposed by Koenker (1981, pp. 107-112), since it is applicable more generally: unlike the original version, it does not rely on normally distributed errors. His version is known as the studentized Breusch-Pagan test. The LM statistic is given by $$ LM = n \cdot R^2_\varepsilon $$ where $n$ is the number of observations. Under the null hypothesis, $LM$ approximately follows a $\chi^2_k$ distribution. In order to decide whether $H_0$ should be rejected or not, the p-value has to be computed using the $\chi^2_k$ distribution.
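The three steps can be sketched in R on simulated data. All names below (fit, aux, LM) are illustrative only and not part of the problem set's data.

```r
# Minimal sketch of the studentized BP test on simulated data
# (all objects here are illustrative, not part of the problem set)
set.seed(1)
n <- 145
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 1 + 0.5 * abs(x))  # heteroskedastic errors

fit <- lm(y ~ x)                    # step 1: estimate the original model
aux <- lm(residuals(fit)^2 ~ x)     # step 2: regress squared residuals on x
LM  <- n * summary(aux)$r.squared   # step 3: LM statistic
p   <- pchisq(LM, df = 1, lower.tail = FALSE)  # df = k = 1 regressor
```

The function bptest() used below performs this computation for us.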

>

In R, we perform this test using bptest() from the package lmtest. As argument a formula of type dependent_var ~ independent_var is required. Read more about the function at https://cran.r-project.org/web/packages/lmtest/lmtest.pdf.

Task: Press check in order to get the results of the Breusch-Pagan test.

#< task
# Load the package
library(lmtest)

# Perform the Breusch-Pagan test
bptest(ln_digestive ~ water_grade, data=dat)
#>

As output we get a character string indicating what type of test was performed, which data was used and -- the most important part -- the statistics of the test: the value of the test statistic (BP), the degrees of freedom (df) and the p-value (p-value). Consider the output in order to answer the following question.

< quiz "Breusch-Pagan Test"

question: Would you reject the null hypothesis at a reasonable significance level (5% or smaller)?
sc:
- yes*
- no
success: Great, your answer is correct!
failure: Please try again.

>

With a p-value of $0.034$ we can reject the null hypothesis at a significance level of $\alpha=5$%. Hence, we cannot assume that the residuals of our regression are homoskedastic and have to correct the standard errors for heteroskedasticity.

Correcting for Heteroskedasticity by Weighted Least Squares Estimation

We will make use of weighted least squares estimation (WLS) in order to eliminate heteroskedasticity. The theoretical background is presented in the following info box.

< info "Weighted Least Squares Estimation"

As usual, assume that we have a relationship of the form $$ y_i = \beta_0 + \beta_1 \cdot x_{i1} + \ldots + \beta_k \cdot x_{ik} + \varepsilon_i, \ \ \ i \in \{1, \ldots, n\} $$ where $y$ is the dependent variable vector, $x_1, \ldots, x_k$ the independent variables, $\beta_0, \beta_1, \ldots, \beta_k$ the true coefficients and $\varepsilon$ the vector of residuals. $n$ is the number of observations and $k$ the number of explanatory variables. Assume that the residuals are heteroskedastic, that is $$ Var(\varepsilon|x) = \sigma^2 h(x) $$ where $x$ denotes all independent variables and $h(x)$ determines the heteroskedasticity. Given a random drawing of the population, we get $\sigma_i^2 = Var(\varepsilon_i|x_i) = \sigma^2 h_i$ with $x_i$ indicating all explanatory variables for observation $i$. Dividing our regression equation by $\sqrt{h_i}$ results in $$ y_i^* = \beta_0 \cdot x_{i0}^* + \beta_1 \cdot x_{i1}^* + \ldots + \beta_k \cdot x_{ik}^* + \varepsilon_i^*, \ i \in \{1, \ldots, n\} $$ where the starred variables denote the original variables divided by $\sqrt{h_i}$ (with $x_{i0}=1$). Using this equation, we get homoskedastic error terms (see Wooldridge (2016, p. 255) for the proof).

The resulting estimator $\hat{\beta}$ is called the weighted least squares estimator and will differ from the original OLS estimator. It minimizes the weighted sum of squared residuals $$ \sum_{i=1}^n \frac{1}{h_i} \varepsilon_i^2 = \sum_{i=1}^n \frac{1}{h_i} (y_i - \hat{y}_i)^2. $$ Each residual is weighted individually and we get an efficient estimator for the true coefficient $\beta$. Note that ordinary least squares is a special case of WLS where each observation gets the same weight.
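The equivalence between weighting by $1/h_i$ and running OLS on the transformed variables can be illustrated with a small simulation; all names here (wls_a, wls_b, h) are illustrative only.

```r
# Sketch: WLS via weights equals OLS on variables divided by sqrt(h_i)
# (simulated data; all names are illustrative)
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
h <- x                                  # assume Var(eps|x) = sigma^2 * x
y <- 2 + 3 * x + rnorm(n, sd = sqrt(h))

# (a) let lm() weight each squared residual by 1/h_i
wls_a <- lm(y ~ x, weights = 1/h)

# (b) transform all variables (including the constant) by 1/sqrt(h_i)
#     and run OLS without an extra intercept
wls_b <- lm(I(y/sqrt(h)) ~ 0 + I(1/sqrt(h)) + I(x/sqrt(h)))

# both approaches yield the same coefficient estimates
all.equal(unname(coef(wls_a)), unname(coef(wls_b)))
```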

The theory for WLS estimation can be found in Wooldridge (2016, Chapter 8.4).

>

In order to eliminate heteroskedastic residuals in our case, we will use the transformed multiple regression model

$$ \sqrt{n_i} \cdot ln_digestive_i = \sqrt{n_i} \cdot \beta_0 + \sqrt{n_i} \cdot \beta_1 \cdot water_grade_i + \sqrt{n_i} \cdot \varepsilon_i $$

where $n_i$ is the population size of the i-th DSP site. We want to implement weighted least squares in R. The population of every DSP site is stored in the variable pop. Again, we make use of the function felm() contained in the package lfe.

< info "Weighted Linear Regressions with felm()"

By means of the function felm() from the package lfe it is possible to perform weighted least squares estimation. We just need to specify the variable by which the regression should be weighted. This is achieved by executing the following code.

# Perform a weighted linear regression using clustered standard errors
felm(y ~ x1 + x2 | 0 | 0 | clust_var, weights=dat$wgt_var, data=dat)

The argument weights takes a vector of weights wgt_var which will be used in the fitting process. Note that you have to specify the data set containing the vector of weights by prepending dat$ to the vector's name. If the option weights is not specified, ordinary least squares is applied.

Note that only the addition weights=dat$wgt_var distinguishes the code from the one in the info box Clustered Standard Errors with felm(). We could also replace clust_var by 0 if we did not want to consider cluster-robust standard errors. However, we will use clustered standard errors and apply weighted least squares simultaneously.

Have a look at https://cran.r-project.org/web/packages/lfe/lfe.pdf for more details.

>

Task: Regress ln_digestive on water quality using weighted least squares by means of the function felm(). The regression should be weighted by pop. Again, use standard errors that are clustered at the province level. All variables are contained in the data set dat. Store the result as reg_wgt and show the regression results. Just replace the ??? by the correct code since the code's main body is already given. If you need help, have a look at the latest info box or press hint. Finally, press check.

#< task_notest
# Perform the weighted regression using clustered standard errors
# reg_wgt = ???(ln_digestive ~ water_grade | 0 | 0 | ???, 
#               weights=???$???, data=dat)
#>
reg_wgt = felm(ln_digestive ~ water_grade | 0 | 0 | province,
               weights=dat$pop, data=dat)

#< hint
display("There are four placeholders that have to be replaced in the given command:
        The first is the function we often use for regressions.
        The second specifies the variable by which the regression should be clustered.
        The last two ??? stand for the variable which is used for weighting the regression. Note that the data frame has to be specified explicitly here.")
#>

#< task_notest
# Display the regression results
# ???
#>
summary(reg_wgt)

#< hint
display("Make use of the function summary() in order to show the regression results. Insert the name of the regression in the brackets.")
#>

< award "Regression Expert Level 3"

Excellent! You are able to perform weighted regressions using felm().

>

Work through the following quizzes in order to interpret the regression output. Complete the sentences of the first two quizzes. Note that more than one answer might be correct.

< quiz "Interpretation of the coefficient of water_grade"

question: An increase in the water grade by one unit (e.g. from grade III to IV)...
mc:
- ... leads to an increase of the digestive cancer rate by roughly 0.122%.
- ... results in an increase of the digestive cancer rate by approximately 12.2%.*
- ... increases the logarithm of the digestive cancer rate by approximately 0.122.*
success: Great, all answers are correct!
failure: Please try again.

>

< quiz "Interpretation of the significance of water_grade"

question: The estimated coefficient of the overall water quality is...
sc:
- ... significant at a 1% level.
- ... significant at a level of 5%.*
- ... not significant at a reasonable significance level.
success: Great, your answer is correct!
failure: Please try again.

>

< quiz "Interpretation of R squared"

question: Which of the following statements is true?
sc:
- 8.3% of the variance found in the digestive cancer rate can be explained by the measure for the overall water quality.*
- 8.3% of the variance found in the overall water quality can be explained by ln_digestive.
success: Great, your answer is correct!
failure: Please try again.

>

< award "Quiz Expert Level 1"

Fantastic! You successfully answered a couple of questions concerning regression results. You are an expert in interpreting estimated coefficients, standard errors and R squared statistics.

>

Increasing the water grade by one unit increases the digestive cancer death rate by roughly $12.2$%. Equivalently, we can say that the logarithm of the digestive cancer rate rises by $0.122$ if the water grade increases by one. The variable water_grade is significant at a level of $5$%. An $R^2$ of $0.083$ indicates that $8.3$% of the variance found in ln_digestive can be explained by water_grade.

Comparison of Original OLS, Cluster-Robust OLS and WLS

In order to investigate the differences, we compare the regression results that we obtained so far using the function stargazer(). Instead of the argument keep="water_grade" we use omit="Constant", which results in the same output since we considered only one explanatory variable so far.

Task: Press check in order to list the results from OLS, cluster-robust OLS and cluster-robust WLS side by side.

#< task
# Compare OLS, cluster-robust OLS and cluster-robust WLS
stargazer(ols, reg_clust, reg_wgt,
          type="html",
          digits=3,
          omit="Constant",
          report="vcsp*",
          omit.stat = "ser",
          model.numbers=FALSE,
          column.labels=c("OLS", "Clustered", "Weighted"))
#>

We discover the following differences between the regression results of the cluster-robust weighted least squares estimation and those of the OLS regression with and without clustered standard errors: The WLS estimator of water_grade is more than $1.5$ times the size of the OLS estimator, attributing a much larger impact of water quality on the digestive cancer death rate. Additionally, it is significant at a significance level of $5$% instead of $10$%. The standard error of $\hat{\beta}_1$ has increased. Using WLS estimation led to an increase in the $R^2$ from $3.2$% to $8.3$%.

However, be careful with the interpretation of simple regression models. Our regression model implies that we only have one variable that influences the digestive cancer death rate, and that is the overall water quality in China's rivers and lakes. That does not sound reasonable. So far, we did not check if there are other factors that influence digestive cancers and are correlated with water quality. If this is the case, water quality is said to be endogenous. The consequence is that we obtain biased estimators.

We want to briefly sum up the results obtained in this exercise. We found that assumption 3 of the linear regression model is not fulfilled since the available data is clustered and the residuals are heteroskedastic. We corrected for both by using cluster-robust standard errors and by weighting the regression. However, we did not check for other variables that are correlated with water quality. Thus, the next exercise will deal with the presence of endogeneity problems.

Exercise 3.2 -- Solving Endogeneity Problems

What happens if variables that are related to digestive cancers and correlated with water quality are not included in our regression model? As is known, all unobservable factors and all data that is not available are captured by the error term of the regression. If a relevant variable that is correlated with water quality is omitted, it will be part of the residual as well. Hence, water quality is correlated with the disturbance term, which violates assumption 2 of the linear regression model. Recall that assumption 2 refers to the zero expected value of the disturbance term. The problem of this correlation is known as endogeneity. As a result, the causal effect of water grade on digestive cancers cannot be identified correctly and the OLS estimates will be biased and inconsistent (cf. Kennedy (2008), Chapter 9). If all factors that are correlated with water quality were known, we would be able to eliminate endogeneity by adding them to our regression equation. As a result, the bias would vanish and we would get consistent estimators.

Revision: Simple Linear Regression

Assume that we estimate the simple regression model $$ ln_digestive_i = \beta_0 + \beta_1 \cdot water_grade_i + \varepsilon_i \tag{3} $$ where the standard errors are clustered by the province level. Recall that we already considered this model in the first part of Exercise 3.1. In the second part, we additionally weighted the regression by the DSP site's population size. For analysis purposes, we initially will not consider weights in this exercise. Later in this section, we will refer to equation $(3)$ as short model.

Before beginning with the investigation, we load the data and perform the regression of equation $(3)$.

Task: Press check in order to load the data and perform the ordinary least squares estimation with one explanatory variable water_grade considering clustered standard errors. Since you started a new exercise, press edit first.

#< task
# Load the data
dat = read.dta("reg_dat.dta")

# Perform a regression of digestive cancers on water grade using
# cluster-robust standard errors
reg_short = felm(ln_digestive ~ water_grade | 0 | 0 | province, data=dat)

# Show the regression results
summary(reg_short)
#>

With a value of $0.072$ and a p-value of $0.07$ the estimator $\hat{\beta}_1$ is significantly larger than zero (at a level of $10$%). Hence, ln_digestive and water_grade are correlated. Recall that we are mainly interested in the causal effect of water_grade on ln_digestive. By means of this regression we get a hint of the correlation between water quality and digestive cancer. However, it is not sufficient to conclude causality. If you want, you can read the following info box in order to distinguish both concepts.

< info "Correlation vs. Causality"

As explained in Panik (2012, Chapter 12.B) and Wright (1921, pp. 557-562), the distinction of the concepts of correlation and causality is an important issue. By means of correlation, linear relationships between two variables can be measured. This means that changing one variable involves a certain change in the other variable. In contrast, we talk about causality if one variable determines the outcome of another variable which means that the change in the second variable is a consequence of changing the first variable.

Besides the differences, both concepts are connected to each other. Correlation does not imply causality but causal linear relationships between two factors imply a non-zero correlation. More mathematically, correlation is not sufficient but necessary for the presence of linear causality. It is important not to confuse both concepts. One reason why correlation does not imply causality are spurious associations (cf. Panik (2012), pp. 288-289). At http://www.tylervigen.com/spurious-correlations the conflict between correlation and causality is presented by graphics in an entertaining manner.

Correlation can be estimated using regressions whereas proving causality is not a trivial task. It is not sufficient to run a regression. In fact, in order to get the true causal effect of water quality on digestive cancers, we would have to perform a controlled randomized experiment. The data, which should consist of subjects with similar properties, is separated into one treatment and one control group. Typically, the experimenter is interested in the effect of the treatment. Only the treatment group is manipulated from outside, that is, it receives the treatment holding all other factors constant. Causal effects of the treatment can thus be isolated by comparing the subjects of the manipulated treatment group with the control group (cf. Panik (2012), pp. 292-294). In our scenario, the treatment is water pollution. Its true causal effect on digestive cancers could be found by dumping chemicals and waste into one part of China's river systems. By comparison with the remaining untreated river systems that have similar characteristics, the causal effect of water pollution on the digestive cancer death rate could be measured. Of course, due to its disastrous ecological consequences and possible health threats, such an experiment will not be performed in reality. Instead, we run a regression and additionally scrutinize the result critically. Hence, careful reasoning is essential in order to decide whether a relationship is causal.

>

Considering the results of reg_short, a possible explanation for the positive relationship between water quality and cancer could be the following: higher water grades (meaning more polluted water) increase the digestive cancer death rate since in many regions of China unfiltered well or river water is the only source of drinking water. Its consumption may have negative effects on the medical condition of inhabitants and thus may cause digestive cancers.

As stated above, our model neglects other factors that are correlated with water quality, however. The resulting problem of endogeneity has to be addressed. Note that water_grade is said to be endogenous since it is correlated with the error term. In contrast, variables that are uncorrelated with the error term are called exogenous. The size of the resulting distortion can be quantified by the omitted variable bias, which will be discussed in the following.

Multiple Linear Regression and Omitted Variable Bias

Of course, endogeneity can be avoided if we add all variables to our regression equation that are correlated with water quality. However, in practice not all of those factors are known by the researcher. Nevertheless, we are able to reduce the bias by taking into account known variables that are correlated with water quality. Hence, we will transform our simple regression model to a multiple regression model. Then, we will analyze how the consideration of further variables changes the estimated impact of water_grade on ln_digestive and we will discuss if this effect is causal.

At first, we add a measure of air quality (variable air_pol) to our model, which is assumed to affect digestive cancers and to be correlated with water quality. We already know that the variable air_pol takes values between $0$ and $1$ where higher values indicate poorer air quality. We will consider its logarithm, which is stored in ln_air_pol. Note that we will initially introduce the theoretical background.

Assume that the multiple linear regression model given by $$ ln_digestive_i = \beta_0 + \beta_1 \cdot water_grade_i + \alpha \cdot ln_air_pol_i + u_i. \tag{4} $$ is our true model, which maps the causal effect of water quality on digestive cancers. Hence, this model satisfies all five assumptions of the linear regression model. We will refer to equation $(4)$ as long model.

In the short model $(3)$ we did not include ln_air_pol into our regression but estimated a so-called underspecified model (cf. Wooldridge (2016), p. 78). Hence, the error term $\varepsilon$ in $(3)$ captures the measure for air pollution, that is

$$ \varepsilon_i = \alpha \cdot ln_air_pol_i + u_i. $$ Now, water_grade is correlated with $\varepsilon$ since it contains ln_air_pol. Hence, in the case of estimating the short model, we get a biased estimator $\hat{\beta}_1$ with $$ Bias(\hat{\beta}_1) = \mathbb{E}(\hat{\beta}_1) - \beta_1 = \alpha \cdot \frac{Cov(water_grade,ln_air_pol)}{Var(water_grade)}. \tag{5} $$ The right-hand side of this equation is called the omitted variable bias. In practice, often the direction of the bias is of importance. It is obtained by $$ sgn(Bias(\hat{\beta}_1)) = sgn(\alpha) \cdot sgn(Corr(water_grade,ln_air_pol)) \tag{6} $$ which follows directly from $(5)$ since $Var(water_grade)$ is positive. Its computation is not straightforward since $\alpha$ is an unknown parameter. However, often it is possible to make an "educated guess about whether [ln_air_pol] and [ln_digestive] are positively or negatively correlated" (cf. Wooldridge (2016), p. 80). Since in our case ln_air_pol is known, we can estimate the sign of $\alpha$ by analyzing the partial effect of ln_air_pol on ln_digestive and by computing their correlation. Note that, in practice, it is possible that the correlation between two variables does not have the same direction as the partial effect of one variable on the other.

If $sgn(Bias(\hat{\beta}_1)) > 0$, we say that $\hat{\beta}_1$ has an upward bias, if it is smaller than $0$ it has a downward bias. The theory can be found in Wooldridge (2016, Chapter 3-3b) and in Angrist and Pischke (2009, Chapter 3.2.2).

Now that we have introduced the theory, we are able to investigate the following issue: How does the estimator of $\beta_1$ change when, instead of the true model given in $(4)$, the short model is estimated? Hence, we can measure the direction of the bias that occurs when omitting ln_air_pol from our model. We will proceed by calculating $(6)$ and conclude the effect of the omitted variable on the size of $\hat{\beta}_1$. At first, we have to find out the correlations between ln_air_pol and ln_digestive on the one hand and ln_air_pol and water_grade on the other. Think about it carefully before solving the next quizzes. Of course, you do not know the answer yet but you can make a guess.

< quiz "Correlation of ln_air_pol and ln_digestive"

question: Which sign does the correlation of ln_air_pol and ln_digestive have?
sc:
- The variables ln_air_pol and ln_digestive are positively correlated.*
- They have a negative correlation.
success: Great, your answer is correct!
failure: Please try again.

>

< quiz "Correlation of ln_air_pol and water_grade"

question: Complete the sentence. The variables ln_air_pol and water_grade are ... correlated.
sc:
- positively*
- negatively
success: Great, your answer is correct!
failure: Please try again.

>

Consider the relationship of ln_air_pol and ln_digestive first. A higher value of the first variable means a higher pollution of the air. This probably results in higher cancer death rates since we cannot imagine a setting where poor air improves health. Hence, the two variables are positively correlated. A similar explanation applies to the second pair of variables. Worse air quality (which implies higher values) will hardly result in higher water quality but rather advance the water's deterioration and cause higher water grades due to the exchange of pollutants between air and water. Hence, the correlation between ln_air_pol and water_grade is also positive.

We will check if our considerations were correct by computing the correlation of the mentioned variables in R using the command cor().

Task: Calculate the correlation of ln_air_pol and ln_digestive on the one hand and the correlation of ln_air_pol and water_grade on the other.

#< task_notest
# Compute the correlation of ln_air_pol and ln_digestive
#>
cor(dat$ln_air_pol, dat$ln_digestive)

#< hint
display("Your command should have the following structure: cor(dat$var1, dat$var2). Replace var1 and var2 by the correct variables. Note that you have to follow the order given in the task formulation.")
#>

#< task_notest
# Calculate the correlation of ln_air_pol and water_grade
#>
cor(dat$ln_air_pol, dat$water_grade)

#< hint
display("Your command should have the following structure: cor(dat$var1, dat$var2). Replace var1 and var2 by the correct variables. Note that you have to follow the order given in the task formulation.")
#>

Our expectations were correct since both values are positive. Consequently, by definition, the exclusion of ln_air_pol from our regression model led to an upward bias. What does this mean for the estimator $\hat{\beta}_1$?

< quiz "Size of water grades coefficient"

question: How does the coefficient of water grade change when considering a measure of air pollution as well?
sc:
- The coefficient of water grade increases.
- The coefficient decreases.*
success: Great, your answer is correct!
failure: Please try again.

>

< award "Quiz Expert Level 2"

Nice! You examined the topic of endogeneity and omitted variable bias thoroughly and solved quizzes on these subjects on your own.

>

We will verify this result by performing the extended linear regression given in $(4)$.

Task: Regress ln_digestive on water_grade and ln_air_pol using the felm() function. Cluster the regression at the province level. The result should be named reg_long. If you need further advice, press hint.

#< task_notest
# Estimate the coefficients of the long model
#>
reg_long = felm(ln_digestive ~ water_grade + ln_air_pol | 0 | 0 | province, data=dat)

#< hint
display("The command should look as follows: 
        reg_long = felm(??? ~ ??? + ln_air_pol | 0 | 0 | province, data=dat)
        The ??? should be replaced by the correct variables.")
#>

In the next code chunk we will display the current regression results and compare them to the OLS estimates of our simple regression reg_short in order to determine the change of coefficient sizes. Again, we will make use of the function stargazer(). Since we are not interested in the constant, we add the option omit="Constant" to the call of the function. Additionally, we only report the variable names, coefficient sizes and significance levels by means of report. Due to the argument omit.table.layout="s" the model statistics will be omitted.

Task: Press check in order to get a table displaying the regression results of reg_short and reg_long.

#< task
# List the regression results side by side
stargazer(reg_short, reg_long,
          type="html",
          digits=3,
          omit="Constant",
          report="vc*",
          omit.table.layout="s",
          model.numbers=FALSE,
          column.labels=c("Short Model", "Long Model"))
#>

You can see that $\hat{\beta}_1$ of the underspecified model has a value of approximately $0.072$. When considering the longer model the coefficient estimate of water_grade decreases to a value of roughly $0.066$. Hence, without adding ln_air_pol to our equation we overestimate water grade's effect which leads to an upward biased estimator $\hat{\beta}_1$.

Now that we computed the direction of the bias, we can estimate the omitted variable bias itself and verify if formula $(5)$ accounts for the difference between $\hat{\beta}$ from the short and long model. By definition, this should be the case. In the next code chunk, we will compute the "true" omitted variable bias. Note that we write true in quotes since it is still an estimator for the bias. We will refer to the deviation of both coefficient estimates as estimated bias.

In order to access the coefficients of a regression, we will make use of the function coef(). In the brackets, you specify the name of the fitted regression. In order to refer to a specific coefficient, you can append square brackets containing the index of the coefficient.
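As a brief illustration with R's built-in mtcars data (not part of our data set):

```r
# Illustration of coef() on a toy regression (built-in mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)      # named vector containing (Intercept) and wt
coef(fit)[2]   # second coefficient: the slope on wt
```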

Task: Calculate the "true" omitted variable bias using the formula given in $(5)$. Use the already computed variables alpha_hat, cov_water_air and var_water, which are components of the bias formula. Press check afterwards.

#< task
# Extract the partial effect of ln_air_pol on ln_digestive
alpha_hat = coef(reg_long)[3]

# Compute the covariance of water_grade and ln_air_pol
cov_water_air = cov(dat$water_grade, dat$ln_air_pol)

# Calculate the variance of water_grade
var_water = var(dat$water_grade)

# Compute the omitted variable bias using the right hand-side of formula (5)
# Enter your code here...
#>
alpha_hat*cov_water_air/var_water

#< hint
display("In order to receive the omitted variable bias, just multiply alpha_hat with cov_water_air and divide the result by var_water.")
#>

According to formula $(5)$, the bias is roughly $0.006086$. By definition, the formula of the omitted variable bias is an estimator for the difference of $\hat{\beta}_1$ from the short and the long model. We will check if this deviation is as large as the "true" bias.

Task: Compute the difference between the estimator $\hat{\beta}_1$ from the short and long model. Your calculation should be based on beta_hat and beta that are calculated in the beginning of this code chunk. Thereafter, press check.

#< task
# Extract the estimated coefficient of water_grade in the short model
beta_hat = coef(reg_short)[2]

# Extract the "true" coefficient of water_grade in the long model
beta = coef(reg_long)[2]

# Compute the estimated bias
# Enter your code here...
#>
beta_hat - beta

#< hint
display("In order to calculate the difference between the estimated beta and the \"true\" beta, you just have to subtract the latter from beta_hat.")
#>

The deviation of the estimated coefficients results in a value of $0.006086$ as well. Hence, we can confirm that the omitted variable bias formula measures the difference in the estimators $\hat{\beta}_1$ from the underspecified model and the true model.
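If you like, you can convince yourself that this identity is no coincidence: for unweighted OLS it holds exactly in any sample. The following sketch with simulated data (made-up variables x, z and y, not our cancer data) reproduces it:

```r
set.seed(1)
n = 200
x = rnorm(n)                 # included regressor
z = 0.5 * x + rnorm(n)       # omitted regressor, correlated with x
y = 1 + 2 * x + 3 * z + rnorm(n)

short = lm(y ~ x)            # underspecified model
long  = lm(y ~ x + z)        # "true" model

# Bias formula: alpha_hat * cov(x, z) / var(x)
bias_formula = coef(long)["z"] * cov(x, z) / var(x)

# Difference of the x coefficients in the short and long model
bias_diff = coef(short)["x"] - coef(long)["x"]

all.equal(unname(bias_formula), unname(bias_diff))  # TRUE
```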

So far, we assumed that the long model with the two explanatory variables water_grade and ln_air_pol is our true model. Of course, this does not reflect reality since there are additional factors that influence digestive cancers and are simultaneously correlated with water quality. In the next task, we will add those variables and extend our multiple regression. So far, we did not weight our regression by the DSP sites' population since the omitted variable bias formula is not applicable to weighted regressions. Now, besides considering additional variables, we use weights for the regression again.

We will add the following three control variables to our multiple linear regression model: urban, educ and farmer. Recall that for each DSP site the first variable indicates if it is located in an urban area or not. The variable educ measures the years of education of decedents over the age of 20 at the corresponding DSP site. For each DSP site, farmer indicates the share of people working in the farming sector.

Task: Press check in order to perform a multiple linear regression with all five control variables.

#< task
# Perform the multiple regression with five explanatory variables
reg_multiple = felm(ln_digestive ~ water_grade + ln_air_pol + urban + educ +
                      farmer | 0 | 0 | province, weights=dat$pop, data=dat)
#>

In order to draw a conclusion, we would like to compare the results of this multiple regression to the short model given in $(3)$. However, displaying both in one table is not reasonable since different methods were used for their estimation: the short model is estimated by OLS whereas WLS is applied to the multiple regression. Hence, we estimate the short model again, but this time with weights. Afterwards, the results can be compared by means of the function stargazer().

Task: Press check in order to perform a simple regression that is weighted by pop (and clustered by province). Additionally, a table of regression results will be displayed.

#< task
# Perform the weighted simple regression
reg_simple = felm(ln_digestive ~ water_grade | 0 | 0 | province,
                  weights=dat$pop, data=dat)

# Compare the results of the simple regression and the multiple regression
stargazer(reg_simple, reg_multiple,
          type="html",
          digits=3,
          omit="Constant",
          report="vcp*",
          omit.stat = "ser",
          model.numbers=FALSE,
          column.labels=c("Simple Regression", "Multiple Regression"))
#>

After including further relevant explanatory variables, we reduced the error that arose from violating assumption 2 of the linear regression model. Recall that assumption 2 states that the expected value of the residual given the explanatory variables is zero. We addressed the endogeneity problem and reduced the omitted variable bias. The control variable urban is significant at the $10$% level and water_grade and ln_air_pol are even significant at the $5$% level. Compared to the R squared obtained with fewer variables, the fit improved strongly: now $17.3$% of the variance of ln_digestive can be explained by the independent variables (compared to $8.3$% achieved by the simple regression).

Note that with more variables on the right-hand side of the regression equation, the interpretation of the individual coefficients $\beta_i$ changes. If you want to interpret the impact of one of the explanatory variables on the dependent variable, it is necessary to add that the remaining independent variables are held constant. This is called the ceteris paribus condition. You find a good explanation in Verbeek (2012, pp. 58-59). For example, the proper interpretation of water_grade in the multiple regression model is the following: increasing the water grade by one unit will increase the digestive cancer death rate by 10.2%, holding all other factors constant. However, for reasons of brevity, most papers omit this qualifier, a convention we will adopt in this problem set.
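A side note on the 10.2%: since the dependent variable is in logs, stating the effect of a one-unit change as $100 \cdot \beta$ percent is an approximation; the exact percentage change is $100 \cdot (e^{\beta} - 1)$. For coefficients of this size the difference is negligible, as this small calculation shows:

```r
beta = 0.102  # estimated coefficient of water_grade in the multiple regression

# Approximate percentage effect of a one-unit change: 100 * beta
approx_effect = 100 * beta

# Exact percentage effect: 100 * (exp(beta) - 1)
exact_effect = 100 * (exp(beta) - 1)

c(approximate = approx_effect, exact = exact_effect)  # 10.2 vs. about 10.7
```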

When comparing the results of the multiple regression to those of the simple linear regression, you probably noticed that the coefficient of water quality has not changed dramatically after including controls: $\hat{\beta}_1$ decreased from $0.122$ to $0.102$. This is not surprising since in Exercise 2 we already found out that the population characteristics have no strong correlation with water quality variation.

Exercise 3.3 -- Additional Control Variables

In our model, we additionally want to control for different income groups and for the region the DSP site is located in. The reason is that we want to account for variation in cancer due to income and regional differences. Since our available data is a cross-section of group averages with different group sizes, every DSP site is categorized into a unique income class and belongs to one geographic region of China. This means that different income classes and geographic regions result in different intercepts $\beta_0$ for the DSP sites. Not adding these factors to our regression equation would result in biased estimators. Instead, we capture this diversity by allowing separate intercepts for DSP sites depending on their income class and location. Formally, we will include a dummy variable for each income class and region category. A dummy variable is a variable that only takes on two values, $0$ and $1$, depending on the absence or presence of some effect (cf. Hackl (2013), pp. 154-156).
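To see how R expands a categorical variable into dummy variables, you can inspect the design matrix of a toy factor with model.matrix() (illustrative data, not our DSP sites):

```r
# A toy factor with three categories
region = factor(c("east", "middle", "west", "east", "west"))

# model.matrix() turns the factor into 0/1 dummy variables;
# the first level ("east") becomes the reference category
model.matrix(~ region)
```

Each row gets a $1$ in the column of its category; the reference category is absorbed into the intercept.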

At first, we have to load the data.

Task: Press check in order to load the data.

#< task
# Read in the data
dat = read.dta("reg_dat.dta")
#>

The variables that we want to include in our regression are income and region. Both are stored as factor variables in the data frame dat. We want to find out the levels of both variables. The simplest way to do so is to use the function levels() from base R. You just have to indicate the variable inside the parentheses of the function. For more details check https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html. Completing the next task is optional, however.

Task: Display the levels of the variables income and region. Use the function levels(). If you need further advice, press hint.

#< task_notest
# Find out the levels of the variable income
#>
levels(dat$income)

#< hint
display("You just have to specify the name of the variable (in the form data_frame$variable_name) inside the brackets of the function levels().")
#>

#< task_notest
# Display the levels of region
#>
levels(dat$region)

#< hint
display("Call the function levels() and pass the correct variable name as argument.")
#>

We can see that income has five levels: rural1, rural2, rural3, rural4 and urban. The first four categories are dummies for rural sites; the higher the number at the end of the dummy, the higher the income class of the corresponding DSP site. urban is a dummy variable for urban income effects. Additionally, the DSP sites are divided into three regions: east, middle and west.

Hence, we have to consider a total of eight dummy variables. Since the regression equation with five independent variables and dummies for income and region is quite long, I will show you a shorter version as an example. With the single explanatory variable water_grade and the region dummies, the regression is given by $$ ln\_digestive_i = \beta_1 \cdot water\_grade_i + \gamma_1 \cdot region\_east_i + \gamma_2 \cdot region\_west_i + u_i. $$

We make the following three observations:

The procedure is the same when we include the income dummies as well and consider the multiple regression model with five explanatory variables.

Note that these dummy variables are called fixed effects since, in our case, they are fixed within the income classes and the regions the sites are located in. With fixed effects, the model that we want to analyze is also called a fixed effects model. Technically, after adding the fixed effects to our regression, we apply OLS and obtain a fixed effects estimator $\hat{\beta}$ (cf. Wooldridge (2016), p. 438).

To perform the regression in R we will again use felm() from the package lfe.

< info "Fixed Effects Regressions with felm()"

The function felm() from the package lfe allows us to perform fixed effects regressions. The fixed effects have to be specified in the regression formula. The basic structure of the command is given by

# Perform a weighted regression with clustered standard errors and fixed effects
felm(y ~ x1 + x2 | fixed_eff | 0 | clust_var, weights=dat$wgt_var, data=dat)

where all fixed effects are specified in the fixed_eff part of the formula, separated by the + operator.

In this way, multiple group fixed effects are extracted from the original regression equation. This means that they are projected out since we are not primarily interested in their estimation. After that, the remaining coefficients are estimated using the ordinary least squares approach.

Note that adding the fixed effects directly to the formula y ~ x1 + x2 would result in the same estimates for the coefficients of the explanatory variables. However, if you have numerous dummy variables and a large data set, the runtime of the program increases considerably.

Have a look at https://cran.r-project.org/web/packages/lfe/lfe.pdf for more information about the package lfe.

>
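To illustrate the remark that felm() with projected-out fixed effects and plain OLS with explicit dummies yield the same slope estimates, consider the following sketch on simulated data (made-up names; assumes the package lfe is installed):

```r
library(lfe)

set.seed(1)
n = 100
group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
x = rnorm(n)
y = 2 * x + as.numeric(group) + rnorm(n)

# Fixed effects for group are projected out by felm()
fe_fit = felm(y ~ x | group)

# The same model with explicit dummy variables in lm()
lm_fit = lm(y ~ x + group)

# Both approaches yield the same coefficient estimate for x
c(felm = unname(coef(fe_fit)["x"]), lm = unname(coef(lm_fit)["x"]))
```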

After transforming income into dummy variables, the dummy variable for the category urban and the explanatory variable of the same name measure exactly the same effect. As explained in Auer and Rottmann (2015, Chapter 2.2), the result is perfect multicollinearity, that is, two variables on the right-hand side of the equation contain exactly the same information. In R, we would get an NA for the coefficient of the explanatory variable urban whereas the dummy variable for income in urban areas would store the correct estimate. However, since we project out the income fixed effects, we would not see these results. Adding income to the list of explanatory variables instead shifts the estimate to urban and the NA to the dummy variable for income in urban areas.

Task: Regress ln_digestive on the set of well-known explanatory variables using clustered standard errors and weighted least squares. Add income and region fixed effects. The former will be considered in the first part of the formula along with the independent variables. Press check in order to get the results.

#< task
# Perform a fixed effects regression
reg_fix = felm(ln_digestive ~ water_grade + ln_air_pol + urban + educ + farmer +
                 income | region | 0 | province, weights=dat$pop, data=dat)

# Show the results
summary(reg_fix)
#>

This regression equation is our final version. We want to interpret the results accurately. First, as stated above, we get no estimate for the intercept since its value is captured in the coefficients of the dummy variables. The variable water_grade is significant at the $5$% level (with a p-value of $0.03$). A deterioration of water quality by one grade increases the digestive cancer death rate by roughly $9.7$%. With a p-value of $0.07$, the variable measuring air pollution is significant at the $10$% level. Solve the following quiz in order to find out the size of ln_air_pol's impact on digestive cancers.

< quiz "Impact of ln_air_pol"

question: Complete the following sentence. An increase of the measure for air pollution by one percent...
sc:
- ... increases the logarithm of the digestive cancer rate by approximately 0.232.*
- ... results in an increase of the digestive cancer rate by roughly 23.2%.
success: Great, your answer is correct!
failure: Please try again.

>

< award "Regression Expert Level 4"

Great! You are able to perform a fixed effects regression by means of the function felm() and interpret regression results.

>

Equivalently, one might say that an increase of ln_air_pol by $1$% leads to an increase of the digestive cancer rate by $0.232$%. The variables farmer, educ and urban are highly insignificant, which makes an interpretation less useful. Since we did not project out the fixed effects for income for the reasons explained above, we can have a look at the estimates of the income dummy variables. Please answer the following question.

< quiz "Reference Category for Income"

question: Which of the income dummies is the reference category? Just type in ruralx (with the correct number from 1 to 4 instead of x) or urban.
answer: rural1

>

Since you can see estimates for the entries rural2, rural3, rural4 and urban, the reference category has to be rural1. Considering the $R^2$, $26.7$% of the variance of digestive cancers can be explained by our explanatory variables.

Visualizing the Impact of the Explanatory Variables on Digestive Cancer

Besides the theoretical investigation, we want to visually compare the magnitude of the explanatory variables' influence on the digestive cancer death rate. So far, this is not possible since every explanatory variable is measured in a different unit. A useful function for visualizing effect sizes is effectplot() from the package regtools. Before we can begin with the plotting, we have to modify our regression equation slightly. The reason is that effectplot() cannot handle multicollinearity. In order to enable a graphical comparison, we have to get rid of the perfect multicollinearity between urban and the dummy incomeurban. Using felm(), we achieve this by omitting urban from the list of explanatory variables and shifting income to the fixed effects part of the regression formula. You will do this in the following chunk.

Task: Modify the original fixed effects regression reg_fix in order to eliminate the perfect multicollinearity. To do this, remove urban from the regression and move income to the fixed effects part of the formula. Uncomment the code which shows the command for the original fixed effects regression and modify it. The regression should be named reg_fix_mod.

#< task
# Uncomment the lines and modify the original fixed effects regression
# reg_fix = felm(ln_digestive ~ water_grade + ln_air_pol + urban + educ + farmer +
#                  income | region | 0 | province, weights=dat$pop, data=dat)
#>
reg_fix_mod = felm(ln_digestive ~ water_grade + ln_air_pol + educ + farmer |
                   income + region | 0 | province, weights=dat$pop, data=dat)

#< hint
display("There are three things to change in the given code:
        1. The name of the regression should be reg_fix_mod.
        2. The first part of the formula now consists of only four explanatory variables since urban is removed and income is shifted.
        3. Apart from the variable region, the fixed effects part of the formula additionally exhibits the variable income.")
#>

Despite the modification, the analysis remains correct. In the next task we will present the results of both fixed effects regressions in one table using stargazer(). In order to enhance comparability, we display all explanatory variables, their standard errors and p-values in the table, whereas the income dummies are omitted. Additionally, the model statistics are shown at the end of the table. With the argument add.lines we add two lines which indicate whether income and region fixed effects are included in the corresponding regression.

Task: Press check in order to receive a table with the results from both fixed effects regressions.

#< task
stargazer(reg_fix, reg_fix_mod,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcsp*",
          omit.stat="ser",
          model.numbers=FALSE,
          column.labels=c("Original Fixed Effects", "Modified Fixed Effects"),
          add.lines = list(c("Income Fixed Effects", rep("Yes", 2)),
                           c("Region Fixed Effects", rep("Yes", 2))))
#>

As you can see, every listed number of the original regression coincides with the corresponding value of the modified regression. We can conclude that both commands perform the same regression. The only drawback is that we are not able to additionally compare the impact of urban on digestive cancer since this variable is projected out in reg_fix_mod. However, the function effectplot() is still very useful in order to get a better idea of the impact sizes of the remaining four control variables.

< info "Visualization of Regression Results with effectplot()"

The function effectplot() from the package regtools allows comparing effect sizes of explanatory variables in regressions. It is particularly useful if the control variables are measured in different units. The plot displays the magnitude of the impact of each independent variable on the dependent variable if the former changes, ceteris paribus, from its $10$%-quantile to its $90$%-quantile. Note that an $x$%-quantile divides the sorted data into two parts such that $x$% of the data are smaller than this value and $(100-x)$% are greater (cf. Kohler and Kreuter (2012), pp. 173-174).

The call of the function is just effectplot(reg_results), where reg_results are the results from a regression call by lm() or felm().

More information about effectplot() and additional arguments can be found at https://github.com/skranz/regtools/blob/master/man/effectplot.Rd.

>
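The quantiles that effectplot() relies on can be computed directly with base R's quantile() function, as this toy example shows:

```r
# A simple sorted data vector
x = 1:100

# The 10%-, 50%- and 90%-quantiles of x
quantile(x, probs = c(0.1, 0.5, 0.9))
#  10%  50%  90%
# 10.9 50.5 90.1
```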

Now, we have all information in order to generate the plot of effect sizes.

Task: Visualize the regression result of reg_fix_mod using the function effectplot(). Load the package first.

#< task
# Load the package
#>
library(regtools)

#< hint
display("Call the function library(regtools).")
#>

#< task
# Generate the plot
#>
effectplot(reg_fix_mod)

#< hint
display("You just have to specify the name of the regression as argument of the function effectplot().")
#>

The structure of the plot is as follows. The explanatory variables are displayed on the y-axis and their effects on the dependent variable are shown on the x-axis. Below each explanatory variable you can find three values which form the $10$%-, $50$%- and $90$%-quantiles of the variable, respectively. The bars are colored according to the sign of the corresponding coefficient estimate. As a small introduction, answer the following question.

< quiz "positive impact"

question: Which of the explanatory variables have a positive impact on digestive cancers?
mc:
- water_grade*
- ln_air_pol*
- educ
- farmer*
success: Great, all answers are correct!
failure: Not all answers are correct. Try again.

>

According to the legend, the turquoise bars have a positive effect on the dependent variable. Therefore, increasing the variables water_grade, ln_air_pol and farmer leads to an increase in digestive cancers. The only bar in red is the one belonging to educ, which has a negative impact on ln_digestive. Now, let us focus on the effect sizes. We will begin with the impact of water grade on digestive cancers. If water grade changes from the value of $2$, which is its $10$%-quantile, to the value of $6$ (its $90$%-quantile), the digestive cancer death rate increases by $39$%. For the interpretation of ln_air_pol, solve the following quiz.

< quiz "impact of ln_air_pol"

question: Check the correct statement.
sc:
- If ln_air_pol changes from -1.42 to -0.33, the digestive cancer rate decreases by 25%.
- The variable ln_air_pol increases by 25% if the digestive cancer rate changes from -1.42 to -0.33.
- The digestive cancer rate increases by 25% if ln_air_pol changes from -1.42 to -0.33.*
success: Great, your answer is correct!
failure: Try again.

>

< award "Effect Plotter"

Excellent! You are able to visualize the effect sizes of explanatory variables on the dependent variable using effectplot() and interpret the resulting plot correctly.

>

Compared to all other explanatory variables, the effect of water quality on digestive cancers is strong. As can be seen in the plot, ln_air_pol also influences the digestive cancer rate considerably. The effects of the variables educ and farmer on digestive cancers are comparably small. Since both coefficients are highly insignificant (see reg_fix or reg_fix_mod), their impact will not be analyzed in detail.

Exercise 4 -- Falsification Tests

In Exercise 3.3 we found a regression equation that maps the causal relationship between digestive tract cancer rates and overall water quality in China's rivers and lakes. We performed a weighted least squares estimation making use of additional control variables and fixed effects while controlling for clustered standard errors. The results that we achieved were promising. However, we should engage in further analysis in order to rule out that the obtained result can be explained by correlation of water_grade with further omitted variables. In that case, the variable for water quality would be endogenous, which causes the estimator $\hat{\beta}$ to be biased and inconsistent. In this exercise we want to test whether an omitted variable falsifies the size of water quality's estimator.

Before we can start with the investigation, we load the data frame test_dat.dta, which is a superset of the data set we used for our regressions in Exercise 3.

Task: Press check in order to load the data test_dat.dta into R and show a part of it. Press edit first.

#< task
# Read in the data
dat = read.dta("test_dat.dta")

# Display a random part of it
sample_n(dat, 6)
#>

Along with the variables we already used in the previous exercises, the data frame exhibits the following ones: ln_digestive_m and ln_digestive_f store the digestive cancer death rates separately for men and women. The variable ln_all indicates death rates at China's DSP sites in general, ln_cancer is the death rate for all kinds of cancer (not only digestive tract cancers as in ln_digestive) and ln_noncancer stores the deaths that are not caused by cancer. All five variables are in logs. The variable tapwater stores the share of households at the DSP site that have access to tapwater.

What if water quality does not influence digestive cancer directly (through drinking water) but through another variable, that we did not include in our regression equation? One could imagine that, for example, the correlation between digestive cancer and water quality derives from occupational exposure to carcinogens. We want to investigate if omitted factors can be responsible for the correlation between water quality and digestive cancer death rates.

Analysis by Sex

At first, we will examine whether the link between water_grade and digestive cancers differs by sex. This makes sense since professional fields typically differ for men and women, and the former are more likely to work in extreme conditions. We would expect that men are exposed to carcinogens to a greater extent than women. We will analyze this circumstance using OLS twice: first, we measure the impact of water quality on digestive cancers for men only, and second for women only. In other words, we will make use of the variables storing male and female digestive cancer rates in place of ln_digestive. Our basis is the multiple linear regression model already performed in Exercise 3.3.

Before performing the regression, answer the question below. To this end, assume that water quality's impact on digestive cancers results from its correlation with occupational exposure to carcinogens. What would you expect for the size of the estimated coefficients $\hat{\beta}_1$ after having performed the regressions separately for women and men?

< quiz "Impact of water_grade 3"

question: Which statement is true?
sc:
- The impact of water quality on digestive cancers is considerably higher for men than for women.*
- Water quality influences the occurrence of digestive cancers among women to a higher degree than among men.
success: Great, your answer is correct!
failure: Please try again.

>

If men are exposed to carcinogens to a greater degree than women, we would expect the correlation between water quality and digestive cancers to be higher for men, since water quality itself is assumed to be correlated with exposure to carcinogens. A higher correlation simultaneously means a greater estimated coefficient $\hat{\beta}_1$ for men.

Now, we will perform the two regressions successively.

Task: Perform a regression of the measure of male digestive cancer death rates on the well-known set of explanatory variables using the methods discussed in Exercise 3. Just uncomment the line and enter the correct variable name.

#< task
# Perform a regression for men only
# reg_m = felm(??? ~ water_grade + ln_air_pol + farmer + educ + urban +
#                   income | region | 0 | province, weights=dat$pop, data=dat)
#>
reg_m = felm(ln_digestive_m ~ water_grade + ln_air_pol + farmer + educ + urban +
               income | region | 0 | province, weights=dat$pop, data=dat)

#< hint
display("The ??? are placeholders for the dependent variable of the regression. In this case, it is the digestive cancer death rate for men which is stored in ln_digestive_m.")
#>

Well done! Just repeat the procedure for women in the next task.

Task: Regress the measure of the female digestive cancer death rate on the given set of explanatory variables.

#< task
# Replace the ??? by the correct variable name
# reg_f = felm(??? ~ water_grade + ln_air_pol + farmer + educ + urban +
#                   income | region | 0 | province, weights=dat$pop, data=dat)
#>
reg_f = felm(ln_digestive_f ~ water_grade + ln_air_pol + farmer + educ + urban +
               income | region | 0 | province, weights=dat$pop, data=dat)

#< hint
display("Insert the measure of the female digestive cancer death rate instead of the ???.")
#>

Now that we have performed both regressions we can easily compare them using stargazer().

Task: Press check in order to get a table of regression results for male and female digestive cancer death rates.

#< task
# Compare the regressions of the cancer rate for men and women
stargazer(reg_m, reg_f,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          omit.table.layout="s",
          model.numbers=FALSE,
          column.labels=c("Men", "Women"),
          add.lines = list(c("Region Fixed Effects", rep("Yes", 2)),
                           c("Income Fixed Effects", rep("Yes", 2))))
#>

Have a look at the table. Both estimates of water_grade are significant at the $5$% level since their p-values are smaller than $0.05$. A deterioration of water quality by one grade results in a $9.1$% increase in the digestive cancer death rate for men and an $11.1$% increase for women. The sizes of both estimated coefficients are similar. Consequently, we find a comparable impact of water quality on digestive cancers for men and women. The coefficient of the first regression is even smaller than that of the second, suggesting that men are not affected by carcinogens to a greater extent than women. We can conclude that there are no omitted factors such as occupational exposure to carcinogens that bias the estimated effect of water quality on digestive cancer.

Analysis by Tapwater Share

Similarly to the investigation so far, we now analyze whether the relationship between overall water quality and digestive cancers differs by the share of households having access to tapwater. To this end, we have to transform our data set first. Recall that the variable tapwater is a ratio indicating the share of households with access to tapwater. Thus, it takes on values between $0$ and $1$. We divide the whole data set dat into two distinct ones. The first data frame should represent areas with low tapwater shares whereas the second one should contain all DSP sites where the majority of households have access to tapwater. The border between low and high tapwater share is chosen to be the middle of the interval, that is $0.5$. That means sites with tapwater shares greater than $0.5$ belong to the group of DSP sites with high tapwater share, whereas values smaller than $0.5$ refer to sites with low tapwater share.

In a small task we want to check the number of DSP sites that belong to the groups with low and high tapwater share, respectively. Completing it is optional.

Task: Find out how many DSP sites belong to the group identified by low tapwater share and to the one with high tapwater share. You can enter whatever you think is needed to get the answers of the following quizzes.

#< task_notest
# You can enter arbitrary code here...
#> 

#< notest
sum(dat$tapwater<0.5)
#>

#< notest
sum(dat$tapwater>0.5)
#>

#< hint
display("For example, you receive a compact solution using the function sum(). As argument, specify the condition that tapwater should be smaller or greater than 0.5, respectively.")
#>

< quiz "Low tapwater share"

question: How many DSP sites have a tapwater share smaller than 0.5?
answer: 74

>

< quiz "High tapwater share"

question: At how many DSP sites do the majority of households have access to tapwater?
answer: 71

>

Both groups are represented by an almost equal number of DSP sites ($74$ sites with low tapwater share and $71$ with high tapwater share). Now, we will generate a separate data frame for each group. You will make use of the dplyr package again. For a little revision have a look at the info box.

< info "Data Manipulation with dplyr Functions"

Here you can find a summary of the most important dplyr functions that you already got to know in the course of the problem set.

The detailed explanation with code examples can be found in the info boxes in Exercise 1.1 and in Exercise 1.2.

>
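As a compact refresher, here is a minimal sketch of the most important dplyr verbs on a small made-up data frame (hypothetical names, not our DSP data):

```r
library(dplyr)

# A small example data frame
df = data.frame(site = c("A", "B", "C"), share = c(0.2, 0.6, 0.8))

# filter(): keep only the rows satisfying a condition
filter(df, share > 0.5)

# select(): keep only certain columns
select(df, site)

# mutate(): add or transform columns
mutate(df, share_pct = 100 * share)

# arrange(): sort the rows
arrange(df, desc(share))
```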

Task: Generate a data frame low_tap storing only the sites of dat for which the tapwater share is smaller than $0.5$. Similarly, assign the DSP sites with high tapwater share to a new data set high_tap. Use the correct function of the dplyr package. After each command, show a part of the data set using the command head().

#< task_notest
# Create a data set for DSP sites with low tapwater share and display the first rows
#>
low_tap = filter(dat, tapwater<0.5)
#< hint
display("Your command should be structured as follows: low_tap = filter(???). Replace the ??? by the data frame and the condition by which the data should be filtered.")
#>

head(low_tap)
#< task_notest
# Generate a data frame for sites with high tapwater share and show its beginning
#>
high_tap = filter(dat, tapwater>0.5)
#< hint
display("Your command should be structured as follows: high_tap = filter(???). Replace the ??? by the data frame and the condition by which the data should be filtered.")
#>

head(high_tap)

You can see that the first data frame consists of all rows where tapwater's values are smaller than $0.5$, whereas high_tap stores the DSP sites with higher tapwater shares. Now, we can begin with our analysis. In the following, we perform a regression for DSP sites with low tapwater share and one for sites with high tapwater share. Afterwards, we will discuss the results.

Task: Complete the code in order to get the correct regression results reg_low for DSP sites with low tapwater share only. Just remove the ??? and fill in the correct data sets. Repeat the procedure for DSP sites with high tapwater share and save the result in reg_high. Press check afterwards.

#< task_notest
# Perform a regression for DSP sites with low tapwater share 
# reg_low = felm(ln_digestive ~ water_grade + ln_air_pol + farmer + educ +
#                   urban + income | region | 0 | province,
#                   weights=???$pop, data=???)
#>
reg_low = felm(ln_digestive ~ water_grade + ln_air_pol + farmer + educ +
                 urban + income |region | 0 | province,
                 weights=low_tap$pop, data=low_tap)

#< hint
display("For reg_low, replace the ??? by the same data frame.")
#>

#< task_notest
# Perform a regression for high tapwater share regions
# reg_high = felm(ln_digestive ~ water_grade + ln_air_pol + farmer + educ +
#                   urban + income | region | 0 | province,
#                   weights=???$pop, data=???)
#>
reg_high = felm(ln_digestive ~ water_grade + ln_air_pol + farmer + educ +
                  urban + income | region | 0 | province,
                  weights=high_tap$pop, data=high_tap)

#< hint
display("For reg_high, you have to insert the same data frame twice instead of the ???.")
#>

Good job! To enhance comparability, we display the results in one table using stargazer(). As argument, we specify keep.stat instead of omit.stat since we only want to keep one model statistic, namely the number of observations; its abbreviation is n.

Task: Press check in order to get one table of regression results.

#< task
# Display the regression results for regions with
# low and high tapwater shares side by side
stargazer(reg_low, reg_high,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          keep.stat="n",
          model.numbers=FALSE,
          column.labels=c("Low Tapwater Share", "High Tapwater Share"),
          add.lines = list(c("Region Fixed Effects", rep("Yes", 2)),
                           c("Income Fixed Effects", rep("Yes", 2))))
#>

We find that a one-unit increase in the water grade raises the digestive cancer rate by $13.1$% at DSP sites with low tapwater share. The estimated coefficient of water quality in areas with access to tapwater is considerably lower. However, only the first estimate is significant (at a level of $5$%). These results indicate that water quality and digestive cancers are correlated in areas without tapwater, whereas this link does not exist for DSP sites with high tapwater share.
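
As a small aside (our addition, not taken from the paper): since the dependent variable is in logs, reading a coefficient $b$ as a $100 \cdot b$ percent change is only an approximation; the exact change is $100 \cdot (e^b - 1)$ percent. For the estimate above, both readings are close:

```r
# Approximate vs. exact percentage interpretation of a log-linear coefficient
b = 0.131                        # estimated coefficient of water_grade
100 * b                          # approximate reading: 13.1 percent
round(100 * (exp(b) - 1), 1)     # exact reading: 14 percent
```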

The explanation is not straightforward, but one could reason as follows. First, in areas without tapwater the population has to rely on river water as their drinking source; the significant estimate of $0.131$ shows that in these areas water quality affects the occurrence of digestive cancers. Second, in areas with tapwater the water is cleaner since tapwater is more likely to be treated; here, we find no significant relationship between water quality and digestive cancers. Both findings are consistent with drinking water being the channel that connects water quality and digestive cancers: worse drinking water quality favors disease in areas without tapwater, while in areas with tapwater the cleaner, treated water is unlikely to cause cancer, so no link appears.

So far, we investigated whether the link between water quality and digestive cancers differs by sex and by the share of households with access to tapwater. Both results suggest that environmental factors are responsible for the correlation between water quality and digestive cancers. However, what if the relationship is a result of higher death rates in general?

Analysis by Death Rates

Instead of being responsible for higher cancer death rates, bad water quality could cause fatalities resulting from diseases other than cancer. In this section, we compare the relationship between water quality and different death rates. As before, we perform regressions using the function felm(); we just have to modify the dependent variable. We will consider death rates in general (stored in ln_all), cancer death rates (ln_cancer) and noncancer death rates (ln_noncancer). For comparison purposes, we include the well-known regression of digestive cancer death rates stored in ln_digestive.

Task: Press check in order to perform the regressions of the above mentioned death rates.

#< task
# Regression of death rates in general
reg_all = felm(ln_all ~ water_grade + ln_air_pol + farmer + educ +
                 urban + income | region | 0 | province,
                 weights=dat$pop, data=dat)

# Regression of cancer death rates in general
reg_cancer = felm(ln_cancer ~ water_grade + ln_air_pol + farmer + educ +
                    urban + income | region | 0 | province,
                    weights=dat$pop, data=dat)

# Regression of digestive cancer death rates
reg_digestive = felm(ln_digestive ~ water_grade + ln_air_pol + farmer + educ +
                       urban + income | region | 0 | province,
                       weights=dat$pop, data=dat)

# Regression of deaths that are not caused by cancer
reg_noncancer = felm(ln_noncancer ~ water_grade + ln_air_pol + farmer + educ +
                       urban + income | region | 0 | province,
                       weights=dat$pop, data=dat)
#>

In order to compare all regressions easily, we show their results by means of the function stargazer().

Task: Use the function stargazer() in order to display the output of the four performed regressions in one table. The main body of the code is already given. Uncomment the lines and replace the missing options (indicated by ???) with the correct code. The table should only display the names, coefficient estimates, p-values and significance levels (by the * symbol) of all explanatory variables. Additionally, the standard error of regression should be omitted. Call hint() if you need advice.

#< task
# Replace the ??? by the correct arguments
# stargazer(reg_all, reg_cancer, reg_digestive, reg_noncancer,
#           type="html",
#           digits=3,
#           omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
#           report=???,
#           ???,
#           model.numbers=FALSE,
#           column.labels=c("All Causes", "Cancer", "Digestive Cancer", "Noncancer"),
#           add.lines = list(c("Region Fixed Effects", rep("Yes", 4)),
#                            c("Income Fixed Effects", rep("Yes", 4))))
#>
stargazer(reg_all, reg_cancer, reg_digestive, reg_noncancer,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          omit.table.layout="s",
          model.numbers=FALSE,
          column.labels=c("All Causes", "Cancer", "Digestive Cancer", "Noncancer"),
          add.lines = list(c("Region Fixed Effects", rep("Yes", 4)),
                           c("Income Fixed Effects", rep("Yes", 4))))

#< hint
display("The option report should denote a list of characters for variables, coefficients, p-values and significance stars. The last ??? should be replaced by the argument omit.table.layout and its correct option. Note that all options have to be wrapped in quotes.")
#>

< award "stargazer Expert"

Well done! You master the most important arguments of the function stargazer(). Hence, you are able to present regression results of several regressions in a clearly structured and neat form.

>

Have a look at the table and answer the following question.

< quiz "Correlation"

question: What can we say about the relationship between water quality and death rates? Check the correct statement. sc: - Water quality seems to be correlated with other causes of death than cancer. - We cannot find a link between water quality and deaths that are not caused by cancer.* success: Great, your answer is correct! failure: Try again.

>

With an estimate of $0.009$ and a p-value of $0.72$, water quality seems to be unrelated to causes of death other than cancer. Furthermore, regressing death rates in general on water quality also produces no significant result (the p-value is $0.37$). On the other hand, water quality is highly significant for cancer death rates and, as we already know, for digestive cancer death rates in particular. An increase of the water grade by one unit increases the cancer death rate by roughly $9.0$%. Besides, this estimated coefficient is close to the one we find for digestive cancers, which is $9.7$%. The small deviation is not surprising since digestive cancers represent about two-thirds of all cancers and the survival rate is low. Given these findings, we can conclude that water quality's correlation with digestive cancer does not result from higher death rates in general.

Smoking Rates and Dietary Patterns

Another possible explanation of the OLS result that we obtained in Exercise 3.3 could be an unobserved correlation of water quality with other potential risk factors for digestive cancer, such as smoking rates and dietary patterns. Ebenstein investigated whether water quality is related to smoking or diet by regressing water quality on smoking rates and dietary patterns, respectively. Since the data is not available, we cannot replicate these results; however, you can have a look at some statistics of the data in Table 5 on page 195 of the paper. According to Ebenstein, the regressions do not exhibit statistically significant relationships. Smoking rates are similar across all DSP sites and have no significant impact on the occurrence of digestive cancers. Dietary habits are a known risk factor for digestive cancers as well; however, Ebenstein's analysis indicates that diet is uncorrelated with water quality. Consequently, neither smoking patterns nor dietary habits are likely to bias the estimated effect of water quality on digestive cancer death rates.

Summing up, the analyses by sex, tapwater share and death rates did not reveal another factor that biases the causal effect of water quality on digestive cancers. Known potential risk factors for digestive cancers, like smoking and diet, are uncorrelated with water quality. We were thus able to rule out many candidate omitted variables in our regression model. However, what if there are unobservable variables that are responsible for the correlation of water quality and digestive cancers? We will discuss this issue in the next chapter.

Exercise 5 -- Robustness Check

In many empirical studies, robustness checks constitute an important part. Their aim is to examine if the regression model is robust to changes (cf. Lu and White (2014), pp. 194-195). There are different procedures that can be chosen. In his paper, Ebenstein performed an instrumental variable estimation. The results are summarized in Table 8 on page 197 of the paper. We will proceed analogously.

In Exercise 4 we already examined a few variables that might be correlated with water quality and affect digestive cancers. It is still possible that there are unobservable factors that influence the occurrence of digestive cancers and are correlated with water quality. Since they are unobserved, we cannot conduct a falsification test as in Exercise 4, however. Instead, we will perform an instrumental variable (IV) estimation. As Kennedy (2008, Chapter 9) explains, the IV estimation procedure yields a consistent estimator despite the presence of correlation between an explanatory variable and the disturbance term. Recall that if this is the case, the explanatory variable is said to be endogenous.

In order to introduce the theory, let us assume that we have the following multiple linear regression model $$ y = X \beta + \varepsilon $$ where $y$ is the response variable, $X$ a matrix containing a constant vector and the $k$ exogenous and endogenous explanatory variables $x_1, \ldots, x_k$, $\beta$ a vector of the true coefficients and $\varepsilon$ a vector of disturbances.

Assume that $x_1$ is endogenous. For the instrumental variable estimation, we need so-called instruments. Instruments are variables that are not part of the original model equation but have to fulfill specific conditions. The basic idea of IV estimation is to identify the part of variation in the endogenous variable that coincides with the variation in the instruments and to use only this part of variation for estimating its coefficient (cf. Wooldridge (2016), Chapter 15).

An observable variable $z$ is called instrumental variable for $x_1$ if it satisfies the following conditions:

  1. It is correlated with the endogenous variable $x_1$, that is, $Cov(z,x_1) \neq 0$ and

  2. z is uncorrelated with $\varepsilon$, that is, $Cov(z,\varepsilon)=0$.

The first condition is often called instrument relevance; it can be verified by regressing $x_1$ on $z$. Condition 2 is known as instrument exogeneity. Verifying the second condition is not straightforward: since $\varepsilon$ is unobserved, we can only argue for its plausibility. For each endogenous variable we need at least one instrumental variable (cf. Kennedy (2008), Chapter 9).

As can be found in Verbeek (2012, Chapter 5.5) the instrumental variable estimator of the model above is given by $$ \hat{\beta}_{IV}= \left( X^T Z \left(Z^T Z \right)^{-1}Z^T X \right)^{-1}X^T Z \left(Z^T Z \right)^{-1}Z^T y \tag{7} $$ where $Z = (I, X^{exo})$ is the matrix of all instrumental variables. $I$ refers to all instruments and $X^{exo}$ contains all exogenous explanatory variables. Note that all exogenous variables act as instruments for themselves. Accordingly, they are canceled out of the endogenous category.

In this problem set, we will not compute $\hat{\beta}_{IV}$ directly by formula $(7)$ but use the two-stage least squares approach that will be presented below.
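
Although we will rely on 2SLS, formula $(7)$ can be evaluated directly with a few lines of matrix algebra. The following sketch uses simulated data (not the problem set's data) with one endogenous regressor and one instrument; all variable names here are our own:

```r
# Illustration of formula (7) on simulated data
set.seed(1)
n  = 500
z  = rnorm(n)                     # instrument
u  = rnorm(n)                     # disturbance term
x1 = 0.8*z + 0.5*u + rnorm(n)     # endogenous: correlated with z and with u
y  = 1 + 2*x1 + u                 # true coefficient of x1 is 2

X = cbind(1, x1)                  # regressor matrix (constant + endogenous variable)
Z = cbind(1, z)                   # instrument matrix (the constant instruments itself)

# Projection on the column space of the instruments, then formula (7)
PZ = Z %*% solve(t(Z) %*% Z) %*% t(Z)
beta_iv = solve(t(X) %*% PZ %*% X) %*% t(X) %*% PZ %*% y
drop(beta_iv)                     # close to c(1, 2)

coef(lm(y ~ x1))                  # plain OLS is biased upwards here
```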

Instrumental Variables for Water Quality

In our setting, water quality is possibly endogenous. In order to apply IV estimation, we need instruments. The question now is: how do we find them? This task is not straightforward but requires some theoretical knowledge and experience in solving economic problems. There are also textbooks that suggest instruments for particular scenarios (for example Kennedy (2008, Chapter 9.2)). For our scenario, Ebenstein proposes the following two instruments: rainfall and distance from a river's headwaters. We will discuss their validity in this section. First, we have to load a new data frame.

Task: Read in the data frame iv_dat.dta and save it as dat. Display six randomly selected rows of the obtained data set. Just press edit and check afterwards.

#< task
# Load the data
dat = read.dta("iv_dat.dta")

# Show a part of dat
sample_n(dat, 6)
#>

The data set is a superset of dsp_dat.dta (loaded in Exercise 1 and Exercise 2) and contains a few variables of test_dat.dta (used in Exercise 4). Hence, except for the variable distance, the variables are already known. The variables rainfall and distance will be used as instrumental variables. Recall that the former measures the average monthly rainfall in millimeters in the river basin containing the corresponding DSP site. The variable distance specifies the distance from the river's source in thousands of kilometers.

Rainfall as Instrumental Variable

First, we investigate the variable rainfall. As already mentioned above, an instrumental variable has to meet two conditions: instrument relevance and instrument exogeneity.

1. Instrument Relevance of Rainfall

We start with the analysis of the instrument relevance condition since this property is easier to check. It says that the instrument has to have non-zero impact on the endogenous variable. What do you think is true?

< quiz "Impact of rainfall 1"

question: Does rainfall have a significant impact on water grade? If yes, which sign does the correlation have? Check the correct answer. sc: - Yes, rainfall influences the overall water grade positively. - Yes, rainfall is negatively correlated with water grade.* - No, there is no significant impact. success: Great, your answer is correct! failure: Try again.

>

More rainfall will probably improve the water quality in China's rivers and lakes. Hence, this means a decrease in the water grade since smaller numbers represent cleaner water. One can think of several channels through which rainfall influences water quality, the most direct being dilution: more rainfall increases river flow and dilutes the pollutants in the water.

Hence, we have strong evidence that the property of instrument relevance is fulfilled. We can investigate their relationship practically by regressing the endogenous variable on the instrument.

Task: Analyze the relationship between water_grade and rainfall by a weighted linear regression and use clustered standard errors. Afterwards, show the results. Uncomment the lines and replace the ??? by correct code.

#< task_notest
# Regress water_grade on rainfall
# reg_rain = felm(??? ~ ??? | 0 | 0 | province,
#                    weights=dat$pop, data=dat)
#>
reg_rain = felm(water_grade ~ rainfall | 0 | 0 | province,
                weights=dat$pop, data=dat)

#< hint
display("For the first command, place water_grade and rainfall on the correct sides of ~. The variable water_grade is the dependent variable and rainfall the independent one.")
#>

#< task_notest
# Show the regression result
# ???
#>
summary(reg_rain)

#< hint
display("In order to display the regression results, call summary(reg_rain).")
#>

After obtaining these results, what do you conclude?

< quiz "Impact of rainfall 2"

question: Does rainfall have a significant impact on water quality? sc: - yes* - no success: Great, your answer is correct! failure: Please try again.

>

In order to determine significance, have a look at the p-value of rainfall, which is $0.027$. Since it is smaller than $0.05$, the coefficient is significant at a level of $5$%. Moreover, its estimated value is different from zero. Therefore, we can conclude that rainfall is significantly related to water quality. Now turn your attention to the size of rainfall's impact. Which of the following statements are true?

< quiz "Coefficient of rainfall"

question: Complete the sentence. If rainfall increases by 100 millimeters,... mc: - ... the water grade increases by approximately 0.01 levels. - ... the water quality improves by 1.03 units.* - ... the water quality deteriorates by 0.01 units. - ... the water grade decreases by 1.03 levels.* success: Great, all answers are correct! failure: Not all answers are correct. Try again.

>

With a value of roughly $-0.0103$, rainfall is negatively related to the water grade. Increasing rainfall thus results in a decreasing water grade or, equivalently, in improving water quality. For the size of the impact, multiply the coefficient by $100$: an additional $100$ millimeters of rainfall lowers the water grade by $100 \cdot 0.0103 = 1.03$ units or, in other words, improves water quality by $1.03$ units.

The regression output allows us to conclude that rainfall has significant influence on water quality in China's rivers. Hence, condition 1 is fulfilled.

2. Instrument Exogeneity of Rainfall

We still have to check whether rainfall is uncorrelated with the disturbance term. The instrument exogeneity condition is more difficult to verify. Rainfall might be unsuitable as an instrument since it influences other factors like income and crop types, which could directly affect cancer rates; in that case, rainfall would be correlated with the error term. Ebenstein's strategy is to test whether rainfall influences digestive cancers exclusively through water quality. As in Exercise 4, he conducted an analysis by tapwater share, since areas with little access to tapwater are more likely to be harmed by polluted river water. We will proceed analogously. First, we generate the two data sets we will work with.

Task: Press check in order to get a separate data frame for DSP sites where less than 50% of households have access to tapwater (low_tap) and another data set for DSP sites with the majority of households using tapwater (high_tap).

#< task
# Data frame of DSP sites with low tapwater share
low_tap = filter(dat, tapwater<0.5)

# Data set of sites with high tapwater share
high_tap = filter(dat, tapwater>0.5)
#>

For each group of DSP sites, we will perform several linear regressions of logarithmized cause-specific death rates on rainfall, including the well-known set of control variables as well as income and region fixed effects. As before, we will weight the DSP sites by their population and cluster the standard errors by province. As dependent variables we consider cause-specific death rates: all causes, cancer, digestive cancer in particular, and noncancer. In the following chunk we investigate the relationship between these death rates and rainfall for DSP sites with low tapwater share.

Task: Press check in order to get the regressions of cause-specific death rates on monthly rainfall for DSP sites with low tapwater share.

#< task
# Regression of death rates in general
reg_low_all = felm(ln_all ~ rainfall + ln_air_pol + urban + educ +
                     farmer + income | region | 0 | province,
                     weights=low_tap$pop, data=low_tap)

# Regression of cancer death rates in general
reg_low_cancer = felm(ln_cancer ~ rainfall + ln_air_pol + urban + educ +
                        farmer + income | region | 0 | province,
                        weights=low_tap$pop, data=low_tap)

# Regression of digestive cancer death rates
reg_low_digestive = felm(ln_digestive ~ rainfall + ln_air_pol + urban + educ +
                           farmer + income | region | 0 | province,
                           weights=low_tap$pop, data=low_tap)

# Regression of deaths that are not caused by cancer
reg_low_noncancer = felm(ln_noncancer ~ rainfall + ln_air_pol + urban + educ +
                           farmer + income | region | 0 | province,
                           weights=low_tap$pop, data=low_tap)
#>

For reasons of comparability, we will use stargazer() in order to visualize all regression results in one table. In order to distinguish the results from the ones for DSP sites with high tapwater share, we add a caption for the dependent variable via the option dep.var.caption.

Task: Press check in order to get a table of coefficient estimates from regressions of each of the four cause-specific death rates using data for DSP sites with low tapwater share only.

#< task
# Compare the regression results of cause-specific death rates
# for sites with low tapwater share
stargazer(reg_low_all, reg_low_cancer, reg_low_digestive, reg_low_noncancer,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          keep.stat="n",
          model.numbers=FALSE,
          column.labels=c("All Causes", "Cancer", "Digestive Cancer", "Noncancer"),
          dep.var.caption="Low Tapwater Share",
          add.lines = list(c("Region Fixed Effects", rep("Yes", 4)),
                           c("Income Fixed Effects", rep("Yes", 4))))
#>

The number of observations in these regressions is $74$, reflecting the number of DSP sites with low tapwater share. For these areas, rainfall has significant coefficients only for the response variables ln_cancer and ln_digestive (with p-values $0.002$ and $0.005$). An increase of monthly rainfall by $1$ millimeter is associated with a $0.6$% lower cancer mortality and a $0.8$% lower digestive cancer death rate. We find no significant relationship between rainfall and other causes of death (the p-values are larger than $0.1$).

We perform the same regressions for DSP sites with access to tapwater and display them.

Task: We want to perform regressions of cause-specific death rates on monthly rainfall for regions with high tapwater share. The command is already given but the code of reg_high_all is not correct yet. Uncomment the lines and correct the error. Press check afterwards.

#< task
# Regression of death rates in general
# Correct the mistake in the code
# reg_high_all = felm(ln_all ~ rainfall + ln_air_pol + urban + educ + 
#                       farmer + income | region | 0 | province,
#                       weights=high_tap$pop, data=low_tap)
#>
reg_high_all = felm(ln_all ~ rainfall + ln_air_pol + urban + educ + 
                      farmer + income | region | 0 | province,
                      weights=high_tap$pop, data=high_tap)

#< hint
display("Pay attention to the data frames that are given in the command of reg_high_all. One specification is not correct.")
#>

#< task
# Regression of cancer death rates in general
reg_high_cancer = felm(ln_cancer ~ rainfall + ln_air_pol + urban + educ +
                         farmer + income | region | 0 | province,
                         weights=high_tap$pop, data=high_tap)

# Regression of digestive cancer death rates
reg_high_digestive = felm(ln_digestive ~ rainfall + ln_air_pol + urban + educ +
                            farmer + income | region | 0 | province,
                            weights=high_tap$pop, data=high_tap)

# Regression of deaths that are not caused by cancer
reg_high_noncancer = felm(ln_noncancer ~ rainfall + ln_air_pol + urban + educ +
                            farmer + income | region | 0 | province,
                            weights=high_tap$pop, data=high_tap)

#>

Task: Press check in order to receive the regression results for DSP sites with high tapwater share only.

#< task
# List the regression results of cause-specific death rates for DSPs with tapwater
stargazer(reg_high_all, reg_high_cancer, reg_high_digestive, reg_high_noncancer,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          keep.stat="n",
          model.numbers=FALSE,
          column.labels=c("All Causes", "Cancer", "Digestive Cancer", "Noncancer"),
          dep.var.caption="High Tapwater Share",          
          add.lines = list(c("Region Fixed Effects", rep("Yes", 4)),
                           c("Income Fixed Effects", rep("Yes", 4))))
#>

The regressions were performed for the $71$ DSP sites where the majority of households have access to tapwater. Considering the regression results, we find no link between rainfall and cause-specific death rates: with p-values greater than $0.34$, the estimates are clearly insignificant.

Summing up the regression results obtained for areas with and without access to tapwater, we find a significant link between rainfall and cancers (in particular digestive cancers) only in areas with low rates of tapwater use. A plausible reason is that rainfall affects cancers through drinking water, since only in areas without tapwater do residents rely on river water as their drinking source. Moreover, rainfall influences the concentration of pollutants in river water. Therefore, we find evidence that rainfall's impact on digestive cancers operates through water quality, not through other excluded factors. Consequently, the condition of instrument exogeneity is also met.

Distance from a River's Headwaters as Instrumental Variable

Beside rainfall, Ebenstein proposes the distance from the river's headwaters (stored in the variable distance) as an instrument. Recall that distance is measured in thousands of kilometers. By means of the flow length to the river's source we can infer whether a DSP site lies at a tributary stream or at a main river segment. Note that tributaries are typically characterized by shorter distances from the river's headwaters, whereas main streams are longer. We have to check if this candidate variable fulfills the conditions of instrument relevance and instrument exogeneity.

1. Instrument Relevance of Distance

At first, we will discuss if distance from the river's headwaters is correlated with water quality. Intuitively, what do you think is true?

< quiz "Impact of distance"

question: Does distance from the river's source have an impact on the overall water grade? If yes, which sign does the correlation have? Check the correct answer. sc: - Yes, distance influences the overall water grade positively. - Yes, distance is negatively correlated with water grade.* - No, there is no significant impact. success: Great, your answer is correct! failure: Try again.

>

Since the answer may be surprising, let us explain it. As noted above, sites with shorter distances to the river's source usually lie at tributaries. These smaller streams are characterized by slower flows and are typically more polluted since contaminants in the water are not transported away quickly. Hence, smaller distances imply worse water quality or, in other words, higher water grades. We can conclude that distance from the river's source is negatively correlated with the water grade. We will verify this idea by regressing water quality on distance.

Task: Perform a weighted linear regression of water_grade on distance using clustered standard errors.

#< task_notest
# Replace the ??? by the correct variables
# reg_distance = felm(??? ~ ??? | 0 | 0 | province, 
#                     weights=dat$pop, data=dat)
#>
reg_distance = felm(water_grade ~ distance | 0 | 0 | province,
                    weights=dat$pop, data=dat)

#< hint
display("In order to perform the regression, choose the correct positions for water_grade and distance in the formula.")
#>

#< task
# Show the regression result
# ???
#>
summary(reg_distance) 

#< hint
display("Type in summary(reg_distance) in order to display the regression results.")
#>

< quiz "Impact of distance 2"

question: Does distance to the river's headwaters have a significant impact on water quality? sc: - yes* - no success: Great, your answer is correct! failure: Please try again.

>

The p-value of distance's coefficient estimate is $0.007$. Hence, distance to the river's source is significant even at a level of $1$%. Together with the fact that the coefficient is non-zero, we can say that distance to the river's headwaters is significantly related to water quality. Have a look at the size of the estimated coefficient and answer the question. Fill in the gap with the correct answer; just type in an integer value without commas or points.

< quiz "Coefficient of distance"

question: Each additional ... kilometers reduces the water grade by approximately 0.21 units. answer: 1000

>

The estimate of distance's coefficient is roughly $-0.21$. Since distance is measured in thousands of kilometers, water grade decreases by $0.21$ units for each additional $1,000$ kilometers.

Our results allow us to assume a significant correlation of distance to the river's headwaters and water quality. Hence, condition 1 is satisfied.

2. Instrument Exogeneity of Distance

It remains to show that distance to the river's headwaters fulfills the instrument exogeneity condition. Is there some factor that is correlated with distance and simultaneously influences digestive cancers? It is hard to imagine that the distance from a river's source influences the occurrence of digestive cancers through any variable other than water quality. In the case of distance, a precise investigation proves to be difficult; we have to rely on our reasoning and assume that the second condition is met as well.

Since we found valid instruments, we are able to perform an instrumental variable estimation.

Instrumental Variable Estimation

Instead of computing the IV estimator directly by the formula given in $(7)$, we make use of an alternative method which is called two-stage least squares (2SLS).

Again, we will begin with the theory behind the concept. Let $X = (X^{exo}, X^{endo})$ be the matrix of all explanatory variables, with $X^{exo}$ representing all exogenous variables and $X^{endo}$ storing all endogenous variables. $Z= (I, X^{exo})$ is the matrix of all instruments. Note that besides the instruments stored in $I$, all exogenous variables act as instruments for themselves. In the presence of endogeneity, the IV estimate of $\beta$ in the linear multiple regression model $y = X \beta + \varepsilon$ can be obtained by the following two stages:

  1. Using OLS, regress the matrix of all explanatory variables $X$ on the set of instrumental variables $Z$. Mathematically, we get $$X = Z \gamma + v$$ where $\gamma$ is the vector of coefficients and $v$ the residual vector. Consequently, the predicted values $\hat{X}$ of $X$ are given by $$\hat{X} = Z \left(Z^T Z \right)^{-1} Z^T X.$$

  2. Perform the regression of the original model but use the matrix of predicted values $\hat{X}$ obtained in the first stage of the procedure instead of the original matrix $X$. The regression equation is $$y = \hat{X} \beta + u.$$ Then, the IV estimator $\hat{\beta}_{IV}$ is given by $$ \hat{\beta}_{IV}= \left(\hat{X}^T \hat{X} \right)^{-1} \hat{X}^T y. \tag{8} $$

Note that we obtain the IV estimation formula for $\hat{\beta}_{IV}$ shown in $(7)$ if we insert $\hat{X}$ into $(8)$. The explanation of 2SLS and more details can be found in Kennedy (2008, Chapter 9) and Verbeek (2012, Chapter 5.5).
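The two stages above can also be written out directly in matrix algebra. The following is a minimal sketch on simulated data (not part of the problem set's data; all variable names and numbers are invented for illustration), showing that formula $(8)$ recovers the true slope even when the regressor is endogenous:

```r
# Minimal 2SLS sketch on simulated data: x is endogenous because it shares
# the unobserved shock u with y; z is a valid instrument.
set.seed(1)
n <- 500
z <- rnorm(n)                       # instrument
u <- rnorm(n)                       # unobserved confounder
x <- 0.8 * z + u + rnorm(n)         # endogenous regressor
y <- 2 + 1.5 * x + u + rnorm(n)     # outcome; true slope is 1.5

X <- cbind(1, x)                    # explanatory variables (incl. intercept)
Z <- cbind(1, z)                    # instruments (incl. intercept)

# Stage 1: X_hat = Z (Z'Z)^{-1} Z' X
X_hat <- Z %*% solve(crossprod(Z)) %*% crossprod(Z, X)

# Stage 2: beta_IV = (X_hat' X_hat)^{-1} X_hat' y, i.e. formula (8)
beta_iv <- drop(solve(crossprod(X_hat)) %*% crossprod(X_hat, y))
beta_iv                             # close to c(2, 1.5)
```

Comparing `beta_iv` to `coef(lm(y ~ x))` on the same data illustrates the bias that plain OLS would suffer here.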

Now that we have the background information, we will begin with the first stage of the two-stage procedure.

Task: Perform the first stage of 2SLS, that is, regress the endogenous variable on all instruments and exogenous variables. Afterwards, show the regression results. Uncomment the lines and replace the ??? by the correct code. Thereafter, press check.

#< task_notest
# Perform the first stage of 2SLS
# first_stage = felm(??? ~ ??? + distance + ln_air_pol + urban + educ + 
#                        farmer + income | region | 0 | province,
#                        weights=dat$pop, data=dat)
#>
first_stage = felm(water_grade ~ rainfall + distance + ln_air_pol + urban + educ +
                    farmer + income | region | 0 | province,
                    weights=dat$pop, data=dat)

#< hint
display("For the first command, the ??? should be replaced by the variables rainfall and water_grade. Arrange them in the correct order.")
#>


#< task_notest
# Show the regression result
# ???
#>
summary(first_stage)

#< hint
display("Make use of the function summary() in order to show the regression results.")
#>

The first stage of the 2SLS procedure allows us to examine the relationship between water grade and monthly rainfall on the one hand and between water quality and distance from a river's headwaters on the other. Both instruments are highly significant since their p-values are considerably smaller than $0.01$. Rainfall's coefficient estimate is roughly $-0.016$, which implies that an additional $100$ millimeters of rainfall lower the water grade by $1.6$ levels. The estimate of distance's coefficient can be interpreted as follows: each $1,000$ kilometers of river flow lowers the water grade by approximately $0.40$ units.

Instead of performing the second stage explicitly, we will use a predefined R function. In fact, you should avoid the manual computation since the standard errors and test statistics it yields are not valid. The reason is that the residual $u$ in the second stage includes the error term $v$ of the first stage, whereas the standard errors should only involve the variance of the residual $\varepsilon$ of our original model (cf. Wooldridge (2016), pp. 475-477). In order to perform 2SLS in R, we will make use of the function felm() from the package lfe.
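To see why the manual route is dangerous, consider this small simulated sketch (illustrative only, not our data): the manual second stage delivers a fine point estimate, but its residuals are of the $u$ type rather than the $\varepsilon$ type, so default standard errors would be computed from the wrong quantity.

```r
# Manual 2SLS on simulated data: correct point estimate, invalid residuals.
set.seed(1)
n <- 500
z <- rnorm(n)                     # instrument
u <- rnorm(n)                     # unobserved confounder
x <- 0.8 * z + u + rnorm(n)       # endogenous regressor
y <- 2 + 1.5 * x + u + rnorm(n)   # outcome; true slope is 1.5

x_hat  <- fitted(lm(x ~ z))       # first stage
stage2 <- lm(y ~ x_hat)           # manual second stage

b <- coef(stage2)
resid_wrong <- resid(stage2)        # u-type residual w.r.t. x_hat
resid_right <- y - b[1] - b[2] * x  # epsilon-type residual w.r.t. x
c(slope = unname(b[2]), sd_wrong = sd(resid_wrong), sd_right = sd(resid_right))
```

The slope is close to the true value, but the two residual standard deviations differ markedly, which is exactly why summary(stage2) would report misleading standard errors.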

< info "Instrumental Variable Estimation with felm()"

By means of the function felm() from the package lfe it is possible to perform instrumental variable estimation. The IV formula iv_est has to be specified in the third part of the formula, that is after the fixed effects and before the variable by which the observations should be clustered. You can see the structure of the command in the following code chunk.

# Perform a 2SLS estimation including fixed effects,
# clustered standard errors and weights
felm(y ~ x1 + x2 | fixed_eff | iv_est | clust_var, weights=wgt_var, data=dat)

The option iv_est should be of the form (endo_var ~ instr1 + instr2), where you have to specify the endogenous variable endo_var on the left hand side and the instruments on the right hand-side of the tilde symbol. In this example, we have two instruments instr1 and instr2 for the endogenous variable. Note that the whole argument must be placed inside brackets. The exogenous variables from the first and second part of the formula are included automatically in the first stage regression.

Via felm() it is possible to perform an IV estimation with fixed effects, clustered standard errors and weights with one command only. If you do not want to consider all options, you can insert a 0, which acts as placeholder.

The details are specified at https://cran.r-project.org/web/packages/lfe/lfe.pdf.

>

Now, we are able to regress the logarithmized digestive cancer rate on the predicted water grades from the first stage of 2SLS using the well-known independent variables, fixed effects, clustered standard errors and weights.

Task: Perform the two-stage least squares regression of the logarithmized digestive cancer rate on water grade using the instruments rainfall and distance. The main body of the code is already given, so just replace the ??? by the correct code. Press check afterwards.

#< task_notest
# Perform the 2SLS estimation
# reg_2sls = felm(??? ~ ln_air_pol + urban + educ + farmer + income |
#                  region | (??? ~ rainfall + ???) | province,
#                  weights=dat$pop, data=dat)
#>
reg_2sls = felm(ln_digestive ~ ln_air_pol + urban + educ + farmer + income |
                  region | (water_grade ~ rainfall + distance) | province,
                  weights=dat$pop, data=dat)

#< hint
display("The given code exhibits three placeholders. The first is reserved for the dependent variable which is the digestive cancer death rate. The remaining ones are part of the formula (??? ~ rainfall + ???). On the left hand-side of ~ the endogenous variable should be indicated whereas the right hand-side is occupied by the instruments.")
#>

#< task
# Show the regression result
# ???
#>
summary(reg_2sls)

#< hint
display("Type in summary(reg_2sls) in order to show the regression results.")
#>

< award "Regression Expert Level 5"

Congratulations! You are an expert in handling the felm() function. Now, you are able to perform two-stage least squares estimation in R.

>

Note that we excluded the variable for water quality from the first part of the regression. Instead, we specified it in the third part where the endogenous variables and instruments have to be indicated separately.

The regression output differs slightly from the previous ones. The table displays NA for the estimate of urban, which is also the reason for warnings that R might produce. This happens due to perfect multicollinearity: the variable urban and the dummy variable incomeurban contain the same information, so only one coefficient can be estimated, and the estimated coefficient of urban can be read off incomeurban as approximately $-0.213$. However, we should not over-interpret this estimate since it is not significant. Another difference is that we now obtain the estimated coefficient of the predicted water grade from the first stage, which is stored in water_grade(fit). With a p-value of $0.02$, the estimate is significant at the $5$% level. Increasing the water grade by one level results in an increase of the digestive cancer rate by $22.1$%. None of the control variables is significant at a reasonable significance level (except for two income dummies).

All in all, the 2SLS estimation results confirm that a causal relationship between water quality and digestive cancers exists. However, in order to assess the quality of IV estimation in our scenario, we will compare these results to the ones we already received using OLS.

Comparison of 2SLS and OLS

In the next task, we will perform OLS again in order to get a nice presentation of both regression results by means of the stargazer package.

Task: Perform a weighted least squares regression of ln_digestive on all control variables using income and region fixed effects. Additionally, cluster the standard errors by province. After that, both regression results will be displayed in one table. The code is already given, so just press check here.

#< task
# Perform the multiple linear regression
reg_ols = felm(ln_digestive ~ water_grade + ln_air_pol + urban + educ + farmer +
                 income | region | 0 | province,
                 weights=dat$pop, data=dat)

# Compare IV and OLS
stargazer(reg_2sls, reg_ols,
          type="html",
          digits=3,
          omit=c("incomerural2", "incomerural3", "incomerural4", "incomeurban"),
          report="vcp*",
          omit.stat="ser",
          model.numbers=FALSE,
          column.labels=c("IV", "OLS"),
          add.lines = list(c("Region Fixed Effects", rep("Yes", 2)),
                           c("Income Fixed Effects", rep("Yes", 2))))
#>

The comparison of the coefficients of water_grade and water_grade(fit) reveals substantial differences in size: the 2SLS estimate is almost $2.5$ times larger than the OLS estimate. One possible reason is that OLS understates the true effect of water quality on digestive cancers, for instance due to measurement errors in water_grade.
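The measurement-error story can be illustrated with a quick simulation (purely illustrative; all variable names and numbers here are invented): classical measurement error in the regressor biases OLS toward zero, while an instrument correlated with the true value recovers the slope.

```r
# Attenuation bias sketch: OLS on a mismeasured regressor vs. IV.
set.seed(1)
n <- 2000
z      <- rnorm(n)                      # instrument (think: rainfall)
x_true <- 0.7 * z + rnorm(n)            # true water grade (unobserved)
x_obs  <- x_true + rnorm(n)             # water grade measured with error
y      <- 1 + 0.5 * x_true + rnorm(n)   # outcome; true slope is 0.5

b_ols <- unname(coef(lm(y ~ x_obs))[2])                  # attenuated
b_iv  <- unname(coef(lm(y ~ fitted(lm(x_obs ~ z))))[2])  # ~ consistent
round(c(OLS = b_ols, IV = b_iv), 2)
```

In this setup OLS lands well below $0.5$ while the IV estimate is close to it, mirroring the gap we observe between our OLS and 2SLS coefficients.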

For the benefit-cost analysis, however, we will rely on the OLS estimates instead of the considerably larger 2SLS results. Since the former are more conservative, the true health benefit of cleanup will not be overstated, which is important for effective policy decisions. Nevertheless, the 2SLS result supports our finding that water quality directly influences digestive cancer rates. The benefit-cost analysis in the next exercise will be based on the OLS results. Hence, we suppose that a deterioration of river water by one grade increases the digestive tract cancer death rate by $9.7$%.

Exercise 6 -- Benefit-Cost Analysis

Digestive cancer is one of the most frequent causes of death in China, claiming the lives of nearly one million people every year. Hence, reducing the incidence of digestive cancers might have a great impact on life expectancy and health in general. Our conservative estimate indicates that improving the water quality by one grade results in a $9.7$% lower digestive cancer death rate. With an average of $980,000$ deaths per year due to digestive cancers, this corresponds to approximately $95,000$ averted deaths per year (cf. World Health Organization, 2002). This means that improving the water quality in China's waterways will result in considerably fewer fatalities. For policy decisions, it is important to know the cost of averting a death through cleanup.

In this section, we will estimate the cost of improving the water quality in China's rivers and lakes, which concurrently means the prevention of deaths. Since 1982, China has operated a nationwide system of levies for firms that discharge untreated wastewater into waterways. However, since local administrators depend primarily on the industrial sector for tax revenue, the enforcement of the levy system is limited. Nevertheless, by 1998, \$$4.9$ billion had been collected in levies, resulting in reduced chemical dumping and higher firm spending on wastewater treatment (cf. Wang and Wheeler, 2005). In the following, we want to estimate the effect of raising the wastewater levy rates in China. Therefore, we will investigate the following four relationships:

  1. digestive cancer death rates and water quality,
  2. water quality and industrial dumping,
  3. industrial dumping and levy rates,
  4. firm spending on cleanup and levy rates.

The four pairs are connected in the following way: Digestive cancer rates are influenced by the water quality of China's waterways which in turn depends on the level of industrial dumping. The level of cleanup of the discharged wastewater is influenced by the level of the levy rate. Additionally, firm spending on cleanup is controlled by levy rates as well. Estimating the size of each relationship and combining them will give us the cost of averting a death.

Each of those relationships can be estimated using regressions. Since we lack the data for investigating the second relationship and we already had a lot of practice in previous exercises, we will not perform these regressions. Instead, you will guess the sign of the explanatory variable's coefficient of each regression, respectively. After that, the estimates are revealed.

So far, we got an idea about the links between digestive cancers, water grades, industrial dumping, industrial cleanup and levy rates. In a first step, we want to compute the decrease in the number of deaths per year when raising the levy rates.

Number of Averted Deaths per Year

For simplicity, we introduce some notation. For each site $i$, let $DeathRate_i$ be the digestive cancer death rate, $WaterGrade_i$ the water grade measured on a six-point scale and $Dumping_i$ the tonnage of industrial dumping. The levy rate applied to dumping is denoted by $LevyRate_i$. The variable $TotalDeaths_i$ stores the number of deaths due to digestive cancers. Note that, for the sake of clarity, we refer to the data as being observed at site $i$ although the sample sizes differ for every pair of variables.

The change in total deaths with respect to a change in the levy rate can be computed using elasticities. It is given by $$ \frac{ \partial \ TotalDeaths_i}{\partial \ LevyRate_i} = \frac{ \partial \ ln \ DeathRate_i}{\partial \ ln \ LevyRate_i} \cdot \frac{TotalDeaths_i}{LevyRate_i} \ . \tag{9} $$

Its stepwise derivation is shown in the following info box.

< info "Derivation of Elasticity of Total Deaths to Levy Rate"

As described in Wooldridge (2016, pp. 636-639), the elasticity of $y$ with respect to $x$ is defined as $$ \frac{ \partial \ ln \ y}{\partial \ ln \ x} = \frac{ \frac{1}{y} \cdot \partial \ y} {\frac{1}{x} \cdot \partial \ x} $$ which follows from the derivative of the logarithm. In words, the elasticity of $y$ with respect to $x$ is the percentage change in $y$ when increasing $x$ by one percent.

In our scenario, we want to calculate the elasticity of total deaths with respect to the levy rate. Following the definition shown above, we obtain $$ \frac{ \partial \ ln \ TotalDeaths_i}{\partial \ ln \ LevyRate_i} = \frac{\frac{1}{TotalDeaths_i} \cdot \partial \ TotalDeaths_i} {\frac{1}{LevyRate_i} \cdot \partial \ LevyRate_i} \ . $$ Rearranging this expression yields

$$ \frac{ \partial \ TotalDeaths_i}{\partial \ LevyRate_i} = \frac{ \partial \ ln \ TotalDeaths_i}{\partial \ ln \ LevyRate_i} \cdot \frac{TotalDeaths_i}{LevyRate_i} $$

which coincides with $(9)$ except for the numerator of the first term on the right-hand side. For the last computation step, we need the following definition. The total number of deaths $TotalDeaths$ is the death rate multiplied by the total population; in our case, the death rate indicates the number of deaths per 100,000 people. Hence, $TotalDeaths$ is given by $$ TotalDeaths = DeathRate \ \cdot \ \left( \frac{N}{100,000} \right) $$ where $N$ is the total population. When taking logarithms and differentiating, the constant factor $N/100,000$ drops out. Consequently, we can substitute $TotalDeaths_i$ by $DeathRate_i$ and get the equation

$$ \frac{ \partial \ TotalDeaths_i}{\partial \ LevyRate_i} = \frac{ \partial \ ln \ DeathRate_i}{\partial \ ln \ LevyRate_i} \cdot \frac{TotalDeaths_i}{LevyRate_i} $$ which is the same formula as stated in $(9)$.

>
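As a quick numeric check of the definition above, consider a purely hypothetical death rate and an assumed total population of $1.3$ billion (both numbers are illustrative, not from our data):

```r
# TotalDeaths = DeathRate * (N / 100,000); all inputs here are assumptions.
death_rate   <- 75                 # hypothetical deaths per 100,000 people
N            <- 1.3e9              # assumed total population
total_deaths <- death_rate * (N / 1e5)
total_deaths                       # 975,000 -- the order of magnitude of
                                   # the ~980,000 digestive cancer deaths
```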

By means of the chain rule, it is possible to express the elasticity of total deaths with respect to the tax rate as product of partial derivatives. We get the following equation.

$$ \frac{ \partial \ TotalDeaths_i}{\partial \ LevyRate_i} = \frac{ \partial \ ln \ DeathRate_i}{\partial \ WaterGrade_i} \cdot \frac{ \partial \ WaterGrade_i}{\partial \ ln \ Dumping_i} \cdot \frac{ \partial \ ln \ Dumping_i}{\partial \ ln \ LevyRate_i} \cdot \frac{TotalDeaths_i}{LevyRate_i} \tag{10} $$

In order to get the number of averted deaths when increasing levy rates, we have to compute the four terms on the right hand-side of the equation. The first three terms will be estimated using regressions.

We begin with the first pair, that is digestive cancer death rates and water quality of China's rivers and lakes. Recall that the death rate specifies the number of deaths per $100,000$ people and water quality is measured on a six point scale, where higher grades indicate worse water quality. As already known, the elasticity of the digestive cancer death rate to water grade can be estimated as follows.

$$ ln(DeathRate_i) = \beta_0 + \beta_1 \cdot WaterGrade_i + \beta_2 \cdot X_i $$

where $X_i$ are demographic characteristics for each DSP site $i \in \{1, \ldots, 145\}$. The vector $X_i$ stores a measure for air quality, a dummy for whether the site is urban, the average education of decedents and the share of people working in the farming sector. Additionally, region and income fixed effects were included. Every DSP site is weighted by its population and the standard errors are clustered at the province level. This relationship is already well-known since we estimated its coefficients in Exercise 3.3. We found out that $\beta_1$ is equal to $0.097$ and is significant at a $5$% level.

We continue with the second pair, that is the water quality of China's rivers and lakes and industrial dumping. Industrial dumping is measured by the total untreated wastewater in tons that is discharged into China's waterways. The second term of $(10)$ is estimated by regressing water grade on the logarithm of industrial dumping, monthly rainfall $R_i$ in millimeters and the total size of the river basin $S_i$ in billions of square kilometers. Furthermore, the standard errors are clustered at the province level. The regression equation is given by

$$ WaterGrade_i = \gamma_0 + \gamma_1 \cdot ln(Dumping_i) + \gamma_2 \cdot R_i + \gamma_3 \cdot S_i. $$

< quiz "Coefficient of industrial dumping"

question: After performing the regression, which sign will the coefficient of industrial dumping (\gamma_1) probably have?
sc:
- Positive sign*
- Negative sign
success: Great, your answer is correct!
failure: Try again.

>

The explanation is straightforward. If firms discharge more untreated wastewater into China's waterways, the water quality deteriorates and the water grade increases. Hence, we expect the coefficient of industrial dumping to be positive.

Now, focus your attention on the relationship between industrial dumping and levy rates. Here, $Dumping_i$ indicates the tonnage of industrial cleanup. The data are given for every province and year over the period from 1992 to 2002. Year and province fixed effects are included, and the sites are weighted by their population. Additionally, the standard errors are clustered at the province level. We regress the logarithm of industrial dumping on the logarithmized levy rates as follows.

$$ ln(Dumping_i) = \delta_0 + \delta_1 \cdot ln(LevyRate_i) $$

< quiz "Coefficient of levy rate"

question: Complete the sentence. Raising the levy rates...
sc:
- ... decreases the tonnage of industrial cleanup.
- ... increases the cleanup of industrial wastewater.*
success: Great, your answer is correct!
failure: Try again.

>

When levy rates are raised considerably, firms suffer higher expenses if they continue with dumping untreated wastewater into China's rivers and lakes. In order to reduce costs, firms invest more financial resources into cleaning wastewater. The consequence is an increasing tonnage of cleanup. That is why we expect $\delta_1$, the coefficient of levy rate, to be positive.

So far, we discussed the signs of the first three terms on the right-hand side of $(10)$, which we named $\beta_1$, $\gamma_1$ and $\delta_1$. The size of the coefficients will be needed for the estimation of the number of averted deaths. For the sake of clarity, the estimated coefficients are displayed in the following table.

|               | Deaths w.r.t. Water Grade | Water Grade w.r.t. Dumping | Dumping w.r.t. Levy Rate |
|---------------|---------------------------|----------------------------|--------------------------|
| Elasticity    | \( \frac{ \partial \ ln \ DeathRate_i}{\partial \ WaterGrade_i} \) | \( \frac{ \partial \ WaterGrade_i}{\partial \ ln \ Dumping_i} \) | \( \frac{ \partial \ ln \ Dumping_i}{\partial \ ln \ LevyRate_i} \) |
| Overall Grade | \( \beta_1 = 0.097 \)** | \( \gamma_1 = 0.220 \)*** | \( \delta_1 = 0.815 \)*** |

As we expected, all of the estimates have a positive sign. We will briefly interpret these findings from right to left. Raising China's levy rates by $100$% increases the cleanup of wastewater by $81.5$% (or equivalently, reduces dumping of untreated wastewater by $81.5$%). Via $\gamma_1$, this reduction in dumping lowers the water grade of China's waterways by approximately $0.220 \cdot 0.815 = 0.18$ units, which corresponds to an improvement in water quality and hence a decreasing digestive cancer death rate.

By now, we have estimated all elasticities. The last term of $(10)$ is just the number of annual deaths due to digestive cancer diseases. According to the World Health Organization, in 2002 this number amounted to approximately 980,000.

Now, it is your turn to compute the number of averted deaths per year as a result of an increase of the levy rate by $100$%. Just use the next code chunk as calculator in order to answer the following question.

Task: Compute the number of averted deaths when increasing the levy rates by $100$%. Use all estimates exactly as indicated in this exercise, that is, decimal numbers with three decimal places. Before starting, you have to press edit.

#< task_notest
# Calculate the number of averted deaths
#>
0.097*0.220*0.815*980000

#< hint
display("Multiply all three elasticities given in the table by the number of annual decedents in China.")
#>

< quiz "Number of Averted Deaths"

question: How many deaths can be averted each year after doubling the levy rates? Round the estimate to whole numbers.
answer: 17044

>

< award "Calculator Expert"

Excellent! Beside programming in R you know how to use R as a calculator as well.

>

According to $(10)$, we have to multiply $\beta_1$, $\gamma_1$ and $\delta_1$ by the number of total deaths per year in order to get the number of averted deaths. We obtain an estimate of roughly $17,000$. Hence, raising the levy rates by a full $100$% decreases the total number of deaths by $17,000$.

So far, we have calculated the benefit of raising levy rates. In the next section, we want to determine the additional compliance costs caused by higher levy rates. Finally, we will compute the additional cost of averting one death.

Cost per Averted Death

In order to estimate the cost for every life that can be saved, we consider the fourth pair of relationships, that is, levy rates and firm spending on cleanup. Let $TotalCost_i$ be the yearly spending on wastewater treatment of all firms in an arbitrary province of China. Using elasticities again, the change in total costs with respect to a change in the levy rate is given by

$$ \frac{ \partial \ TotalCost_i}{\partial \ LevyRate_i} = \frac{ \partial \ ln \ TotalCost_i}{\partial \ ln \ LevyRate_i} \cdot \frac{TotalCost_i}{LevyRate_i} \ . \tag{11} $$

Estimating the first term on the right-hand side of the equation can be done by investigating the relationship between total cleanup costs and levy rates. Firm spending on cleanup and levy rates were observed for every province and year from 1992 to 2002. Note that the index $i$ consequently denotes the costs and levy rates of one of China's provinces per year. Province and year fixed effects are included. Additionally, the standard errors are clustered at the province level and the sites are weighted by their population. The elasticity of the total cleanup costs with respect to the levy rates is estimated by the regression

$$ ln(TotalCost_i) = \epsilon_0 + \epsilon_1 \cdot ln(LevyRate_i) $$

How do you expect the levy rates to influence firm spending on cleanup?

< quiz "Coefficient of levy rate 2"

question: Higher levy rates result in...
sc:
- ... lower expenditures for cleanup.
- ... higher spending on wastewater cleanup.*
success: Great, your answer is correct!
failure: Try again.

>

Of course, raising the wastewater levy rates implies higher expenditures for firms if they continue discharging untreated wastewater. In order to avoid these costs, firms are likely to spend more money on cleaning wastewater. Hence, higher levy rates result in higher spending on cleanup. Performing the regression reveals that $\epsilon_1=0.137$, significant at the $10$% level. Hence, China's firms increase their spending on wastewater treatment by $13.7$% if the levy rates are doubled. The second term on the right-hand side of $(11)$ is just the sum of firm spending on wastewater treatment. For this analysis, we rely on data collected in 2001. In that year, spending on wastewater treatment amounted to \$$3.7$ billion (cf. China's Environmental Yearbook, 2001).

How large is the increase in compliance costs after raising the levy rates by $100$%? Use the next code chunk in order to answer the question.

Task: Compute the additional costs that firms incur when the levy rates are doubled. Use all estimates exactly as indicated in this exercise, that is, decimal numbers with three decimal places.

#< task_notest
# Calculate the additional compliance costs for firms
#>
0.137*3700000000

#< hint
display("In order to receive the additional compliance costs for firms, multiply the elasticity of total costs on the levy rate by the spending on wastewater treatment.")
#>

< quiz "Increase in Compliance Costs"

question: Complete the sentence. Increasing the levy rates by 100% results in additional compliance costs of ... dollars.
answer: 506900000

>

Our desired result is obtained by multiplying $\epsilon_1$ by the total costs. Hence, raising the levy rates by $100$% will increase firm spending on wastewater treatment by $13.7$%, from \$$3.7$ billion to roughly \$$4.2$ billion, implying additional compliance costs of \$$506,900,000$. As we have already seen in the previous section, this investment saves approximately $17,044$ lives.

Recall that we are interested in the cost for one averted death. You now have enough information for its computation. Use the following code chunk as calculator.

Task: Compute the cost per averted death. Make use of the rounded numbers that were mentioned in the text.

#< task_notest
# Calculate the cost per averted death
#>
506900000/17044

#< hint
display("The cost per averted death can be calculated as the additional compliance costs divided by the number of averted deaths.")
#>

< quiz "Cost per averted Death"

question: How much money should be invested by China's firms in order to save one life? Round the estimate to whole numbers.
answer: 29741

>

From a previous code chunk we already know that an additional expense of \$$506,900,000$ saves approximately $17,044$ lives. If we divide the total compliance costs per year by the number of saved lives, we obtain a value of \$$29,741$. In words, the cost of averting one death amounts to roughly \$$30,000$.

In order to judge the effect of the levy increase correctly, we have to compare this number to the estimated statistical value of a human life in China. Of course, specifying this number is not straightforward. Surveys conducted in China by the World Bank reveal that participants value one statistical life at \$$175,000$ on average (cf. Krupnick et al., 2006 and World Bank, 2007).

< quiz "Justification of Higher Expense"

question: Comparing the cost per averted death to the estimated statistical value of a life in China, is the size of the additional compliance cost justified?
sc:
- yes*
- no
success: Great, your answer is correct!
failure: Try again.

>

The cost of averting a death is considerably smaller than the estimated statistical value of a life of a Chinese citizen. We can conclude that the cost of saving a life through wastewater cleanup is justified due to the benefit in improved health conditions.

Even if the true cost per averted death is much higher than the estimated \$$30,000$, raising the levy rates would still be justified due to the positive impact on citizens' health. Furthermore, for several reasons, the true benefit of raising the levy rates may even be underestimated. First, we used the conservative OLS estimate of $9.7$% instead of the 2SLS estimate of $22.1$%; hence, the impact of improving China's water quality may be understated. Second, improving the water quality of China's waterways may not only help fight digestive cancers but also reduce the occurrence of many other diseases. Third, the cost of years of serious illness preceding death is not included in the preceding calculations. Lastly, the reduction in digestive cancer occurrences will save an increasing number of lives in the future, since China's population is aging and growing due to currently declining infant mortality rates.

Exercise 7 -- Conclusion

The aim of this interactive problem set was to find out whether the deterioration of China's water quality is causally related to the increase in digestive tract cancer rates. We started with descriptive statistics and graphical analysis in order to introduce the data, which are provided for 145 DSP sites. By means of two-sample t-tests we found that digestive cancer death rates and water quality criteria differ significantly between polluted and cleaner river systems. We began with a simple regression model involving China's digestive cancer death rates and water grades only and estimated the coefficients using ordinary least squares. Thereafter, we corrected the standard errors for clustering and heteroskedasticity. Step by step, we reduced the endogeneity problem arising from omitted variable bias by including additional control variables in our regression equation. Including region and income fixed effects reduced the bias further. Altogether, we found that water grade has a significant positive impact on digestive cancers.

We conducted falsification tests in order to rule out that water quality's impact on digestive cancers derives from unobservable factors that are correlated with water quality. The robustness check via instrumental variable estimation confirms the significant result we obtained using ordinary least squares. However, the coefficient estimated by two-stage least squares is $2.5$ times larger than the OLS estimate, suggesting that 2SLS may overstate the true causal effect of water quality on digestive cancers. Relying on the more conservative OLS results, we assume that a deterioration of water quality by one grade increases the digestive cancer death rate by $9.7$%.

In the last part of the problem set we estimated the cost of wastewater treatment and its associated advantages in public health and life expectancy. Doubling the current levy rates for wastewater dumping is associated with an additional compliance cost of \$$500$ million per year for firms, which results in $17,000$ averted deaths annually. This implies a cost of \$$30,000$ per averted death. Compared to the estimated value of a statistical life of \$$175,000$ in China, the cost is low. Additionally, improved water quality also prevents the occurrence of other diseases than digestive cancers and enhances quality of life. Hence, higher expenditures for wastewater treatment are justified due to the large human health benefits.

Finally, you can display all awards that you collected in the course of this problem set. Just press edit and check afterwards. The maximum number of achievable awards is $16$.

#< task
awards()
#>

Exercise 8 -- References

Bibliography

R Packages


