user.name = '' # set to your user name

library(RTutor)
check.problem.set('RTutorEnvironmentalRegulations', ps.dir, ps.file, user.name=user.name, reset=FALSE)

# Run the Addin 'Check Problemset' to save and check your solution

Who Benefits from Environmental Regulations?

Welcome to this problem set, which is part of my master's thesis at the University of Ulm. In this problem set we discuss how environmental policy affects different groups of society. In particular, we examine who benefits from the improvement in air quality that is a direct consequence of the Clean Air Act Amendments of 1990 (CAAA). There is a corresponding empirical analysis in the article "Who Benefits from Environmental Regulation? Evidence from the Clean Air Act Amendments", written by Antonio Bento, Matthew Freedman and Corey Long, to which we refer as BFL (2014) throughout this problem set. This article was published in 2014 in the Review of Economics and Statistics. The corresponding Stata code and data set can be found on the Review of Economics and Statistics Dataverse homepage.

Exercise Content

The traditional literature usually claims that environmental policies are regressive (Banzhaf (2011), Fullerton (2011), Bento (2013)). This is in part because the costs of these policies tend to be higher for lower-income households: they spend a higher share of their income on energy goods and are often employed in energy-related industries. In addition, households with higher incomes are more likely to be homeowners and therefore benefit particularly from the increase in house values caused by an improvement in air quality. Despite these arguments, the article "Who Benefits from Environmental Regulation? Evidence from the Clean Air Act Amendments" aims to show that the benefits of the 1990 CAAA are progressive. BFL (2014) use geographically disaggregated data and exploit an instrumental variable approach.

This problem set will lead you through this article and reproduce its results in R. In addition, we will question the findings of BFL (2014). In particular, we will discuss their assumption that the results obtained for a special subgroup of the population can be applied to the whole population.

This problem set is therefore structured as follows:

1. Overview

1.1 Overview of the CAAA

1.2 Overview of the data set

2. Factors causing specific trends in the data set

3. OLS: A first attempt to analyze the question

4. A further approach: The IV regression

4.1 OLS versus IV

4.2 Two-Stage Least Squares

5. The effects of the 1990 CAAA-induced air quality improvements

6. Robustness checks

7. Sorting

8. Distributional implications and related literature

9. Conclusion

10. References

11. Appendix

You do not need to solve the exercises in the given order, but it is recommended to do so because it makes it easier to follow the economic story of this problem set. Moreover, later exercises build on knowledge from earlier ones. Within one tab you have to solve the tasks in the given order, apart from the ones that are explicitly excluded with a note (like all quizzes that you will find and some additional code blocks).

Exercise 1 -- Overview

In this problem set we analyse the effects of a policy-induced reduction in pollution on different groups of society. In particular, we focus on the reduction caused by the Clean Air Act Amendments of 1990. Therefore we first present the historical development of the Clean Air Act. In doing so we explain the environmental regulations associated with it and how they have affected pollution levels. Afterwards we introduce the data set which forms the basis of the analysis in this problem set.

Exercise 1.1 -- Overview of the CAAA

In 1963 the Clean Air Act introduced the first regulations concerning air pollution control. It established a federal program within the U.S. Public Health Service and authorized research into techniques for monitoring and controlling air pollution. A few years later, in 1970, this original Clean Air Act of 1963 was extended. As part of this extension a nationwide network of monitors was installed to measure the total suspended particulates (TSP) in the air. This network allowed the U.S. Environmental Protection Agency (EPA) to monitor the National Ambient Air Quality Standards (NAAQS), which were also passed by these amendments in 1970. The NAAQS include two different types of regulations. On the one hand there are the regulations that set primary standards. They are supposed to protect people's health, especially the health of vulnerable groups, e.g. asthmatics or children. On the other hand the secondary standards are supposed to ensure public welfare by protecting animals, vegetation or buildings. To make sure that the majority of the population is affected by these new regulations, the EPA requires that monitors be located in densely populated areas. At that time the regulations did not distinguish particulates by their diameter. Besides subsidizing states that tackle the problem of ozone depletion and establishing new automobile gasoline regulations, the 1990 CAAA began to specifically regulate particulates smaller than 10 micrometers. These particulates are designated as PM$_{10}$. Because of their small diameter they are considered extremely harmful (U.S. EPA (2005)). In the info box below you can find a more detailed description of these particulates. (U.S. EPA (2015))

info("Particulate Matter") # Run this line (Strg-Enter) to show info

The standards concerning the PM$_{10}$ concentration are monitored by the nationwide network of monitors installed in 1970. In 1990 the EPA also determined that if even a single monitor within a county exceeds these standards, the county is designated as a non-attainment county. As a consequence it has to present a plan for how to reduce pollution in order to fulfill the NAAQS. If the pollution values continue to exceed the standards, or the presented plan is not followed, the EPA can impose sanctions on the county. For example, it can withhold funds that were intended for an expansion of the infrastructure or impose additional emission requirements. (National Archives and Records Administration (2005)) The attainment status of each monitor is assigned according to the so-called EPA rule: If in year t the annual PM$_{10}$ concentration is greater than 50 $\mu g/m^{3}$ or the 24-hour concentration surpasses 150 $\mu g/m^{3}$, then the monitor is classified as a non-attainment monitor in year t+1. (BFL (2014))
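
To make this rule concrete, it can be written as a simple logical condition. The following sketch uses hypothetical concentration values, not values from the problem set's data:

# Hypothetical monitor readings in year t (in micrograms per cubic meter)
annual_pm10 = 52    # annual average PM10 concentration
daily_pm10 = 140    # highest 24-hour PM10 concentration

# EPA's rule: the monitor is classified as non-attainment in year t+1
# if either standard is exceeded in year t
non_attainment_next_year = (annual_pm10 > 50) | (daily_pm10 > 150)
non_attainment_next_year  # TRUE, because the annual standard is violated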

These regulations of the 1990 CAAA affect the behavior of local regulators in a considerable way. In counties that include several monitors, local regulators focus on reducing the PM$_{10}$ concentration around monitors that are in danger of exceeding the thresholds because, as described above, these monitors put the whole county at risk of being designated as a non-attainment county. The EPA, the South Coast Air Quality Management District and other researchers such as Auffhammer et al. (2009) confirm that regulators focus on areas around non-attainment monitors. By taking more aggressive action there, local regulators want to minimize the expected future costs for the whole county. In doing so they try to enforce the air quality standards through policies that lead to geographically uneven reductions in PM$_{10}$. For example, they conduct additional inspections especially in the "dirty" areas. The geographical variation of the local regulators' behavior within their counties and its consequences will be illustrated in Exercise 2. Afterwards we exploit this geographical heterogeneity to estimate the causal effects of a 1990 CAAA-induced PM$_{10}$ reduction on different groups of society. (BFL (2014))

If you are interested in additional information about the Clean Air Act (1970) and its amendments click here.

Exercise 1.2 -- Overview of the data set

Now, in order to establish a starting point for the following analysis, we want to take a look at the observations which we will use to estimate the effects of a policy-induced reduction in pollution on different groups of society. In particular, we consider different areas and their respective characteristics. The corresponding data set is named BFL.dta. So the first step is to read in this data set. As it is a Stata file, we have to use the command read.dta() from the foreign package. For more information about this function take a look at the info box below.

info("read.dta()") # Run this line (Strg-Enter) to show info

Before you start entering your code, you need to press the edit button. This must be done in the first exercise of every chapter and after every optional exercise that you skipped.

Task: Use the command read.dta() to read in the downloaded data set BFL.dta. Store it in the variable dat. If you need help with how to use read.dta(), check the info box above. If you need further advice, click the hint button, which contains more detailed information. If the hint does not help you, you can always access the solution with the solution button. Here you just need to remove the # in front of the code and replace the dots with the right commands. Then click the check button to run the command.

# ...=read.dta("...")

The data set includes 1827 observations. Here it is sufficient to look at only the first ones listed in the data set. In R this selection can be performed with the function head().

Task: Take a look at the first observations of the data set. To do this just press check.

head(dat)

Notice that if you move your mouse over the header of a column, you will get additional information describing what this column stands for. In general you always have the possibility to look up these descriptions in this problem set: you just have to press data, which will take you to the Data Explorer section. If you press Description in the Data Explorer, you will get more detailed information about all variables in the data set.

Looking at these examples from the data set, you see that each row represents one specific area. These areas were selected because they are located within a radius of twenty miles around a monitor which fulfills special requirements. These requirements will be explained later. The different columns contain the values of the characteristics of the corresponding area.

! addonquizVariable description

Air quality data

As explained before, the aim of this problem set is to find the effect of PM$_{10}$ reductions induced by the 1990 CAAA on different groups of society. So the variables of major importance are those representing the PM$_{10}$ concentrations. In the data set these variables are named pol_90 and pol_dif. pol_90 is the average PM$_{10}$ concentration in the year 1990 and pol_dif the PM$_{10}$ change between 1990 and 2000. These values are adapted from the Air Quality Standards database (2016). For each monitor this database includes the average PM$_{10}$ concentration of one year, the coordinates of the location and several more measures. We adopt the procedure of BFL (2014) and consider only those monitors of the database that fulfill the requirements of timing and reliability. That's why there are counties that can't be associated with a monitor and therefore aren't considered in our data set. A detailed description of these requirements can be found in the info box below.

info("Requirements of timing and reliability") # Run this line (Strg-Enter) to show info

Because of these requirements the sample of monitors is reduced from 3080 in the database to 375, located in only 230 counties. But these counties have a relatively high population density and thus contain one-third of the total U.S. population (BFL (2014)). Despite this reduction of our sample, BFL (2014) claim that the observed changes in pollution are still consistent with other works which rely on remarkably larger samples. In the next task we want to show that this really is the case. To compute a comparable value for the decline in the average PM$_{10}$ concentration we have to divide the average difference in pollution between 1990 and 2000 by the average pollution in 1990. The corresponding variables are pol_dif and pol_90. Thereby we have to consider all areas stored in dat.

Task: Compute the decline in the average PM$_{10}$ concentration, as described above. Use the mean() function in R to compute the average of a variable. If you don't know how to apply this function, click the hint button.

# Enter your command here

! addonquizChange in pollution

The value of approximately (-0.21) indicates that the average concentration of PM$_{10}$ declined by 21 % in the 1990s. This is consistent with the findings of Auffhammer et al. (2009), which is an example of a work that relies on a much larger sample of monitors. Therefore we can use our reduced sample without a loss of conclusiveness.
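
For reference, the calculation behind this number fits into a single line (a sketch assuming dat has been read in as above):

# Average PM10 change between 1990 and 2000, relative to the average 1990 level
mean(dat$pol_dif) / mean(dat$pol_90)   # approximately -0.21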

The characteristics of the areas

As you could already see in this overview, our data set also includes a lot of measured demographic and housing characteristics for every area. They are taken from the GeoLytics Neighborhood Change Database (2010). As was the case with the air quality, most of these characteristics are represented by two variables: on the one hand the static value from 1990 and on the other hand the difference between the values in 1990 and 2000. For example, the variable that represents the median family income in 1990 is median_family_income_90, while median_family_income_dif represents the difference in the median family income between 1990 and 2000. As a reminder, if you are interested in a more detailed description of all variables in the data set, press Description in the Data Explorer.

To become familiar with these variables representing the socio-economic characteristics of an area we want to have a closer look at some examples. Therefore we pick the two variables for the median income in an area, which we already mentioned above, and try to interpret them by carrying out a quiz. The select() function from dplyr allows us to select specific columns of a data set. For more information check the info box below. In addition we again use the head() function to consider only the first observations in our data set.

info("select()") # Run this line (Strg-Enter) to show info

Task: Take a look at the two variables representing the median family income in an area: median_family_income_90 and median_family_income_dif. Consider only the first areas stored in dat. To do this use head(). The required command is already entered. You just have to press check. Note that the corresponding county code of an area also is considered.

head(select(dat, county_code, median_family_income_90, median_family_income_dif))

To check whether you get along with the variable designations, let's try to calculate the percentage increase in the median family income between 1990 and 2000 relative to the median family income in 1990, namely for the area which is located in county 55. To do this you have to divide the change in the median family income between 1990 and 2000 by the median income in 1990. The corresponding values can be read from the output of the exercise above.

Use the below code chunk to run the calculation and then answer the following quiz.

Task: You can enter whatever you think is needed to solve the quiz here.

# Enter your command here

! addonquizIncome increase

We have already pointed out several times that the distance of an area to the next monitor, and therefore the location of the area, is quite important. To consider these aspects BFL (2014) matched each monitor which satisfies the requirements to an area. This means that not every area in the data set necessarily includes a monitor. After they did this, they calculated the distance between an area in the data set and the next area containing a monitor. This distance is represented by the variable ring in the data set. If the distance between an area i from the data set and the next area containing a monitor is between zero and one mile, the variable ring for area i has the value one. If the distance is between one and three miles, ring has a value of three, and so on. In the end the variable ring can take the values 1, 3, 5, 10 and 20. To facilitate the use of ring in the further course of this problem set, we simply say that the value of this variable represents the distance of an area to the next monitor.
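
To illustrate how such a mapping from distances to ring values works, here is a small sketch with hypothetical distances (the variable ring itself is already contained in the data set, so you never have to construct it yourself):

# Hypothetical distances (in miles) from an area to the next area containing a monitor
distance = c(0.4, 2.7, 8.1, 15.0)

# Map each distance to its ring value: (0,1] -> 1, (1,3] -> 3, (3,5] -> 5, (5,10] -> 10, (10,20] -> 20
cut(distance, breaks = c(0, 1, 3, 5, 10, 20), labels = c(1, 3, 5, 10, 20))
# Result: 1 3 10 20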

Task: Press the check button to have a look at the different manifestations of the variable ring in the data set.

distinct(select(dat, ring))

The command distinct() from dplyr, which is wrapped around the select() command, prints only unique rows. We need it here because the same entries appear several times.

! addonquizRings

Exercise 2 -- Factors causing specific trends in the data set

So far we know how the data set is built and which variables it includes. In this chapter we try to identify specific trends in these variables which we can exploit in the further course of this problem set to examine the causal effects of PM$_{10}$ reductions on different groups of society. When we discussed the development of the CAAA in Exercise 1 we have already mentioned two potential factors which could cause such trends. The first one is the distance of an area to the closest monitor and the second one is the attainment status.

Note that in this chapter we discuss the features of the data set. That's why the statements we make here only refer to the observations from the data and don't represent causal effects.

The distance of an area to the next monitor

In Exercise 1 we became acquainted with the EPA's requirement that monitors have to be located in specific areas, namely in areas with a high population density. These areas have specific socio-economic characteristics. This means the distance of an area to the next monitor gives you information about its characteristics. As we learned, this distance is represented by the variable ring. So we can use this variable to examine whether there is a correlation between the variables representing the pollution or the socio-economic characteristics and the distance of an area to the next monitor. To do this you should remember that the higher the value for ring, the larger the distance of an area to the next monitor.

To investigate these correlations we have to read in the data set BFL.dta again.

Task: Read in the data set BFL.dta. Use the read.dta() function as you did in Exercise 1. Store the data set in the variable dat.

# Enter your command here

To get a first hint of whether there could be variation across space in our data set, we create a separate plot for each manifestation of the variable ring. In doing so we plot pol_dif on the y-axis and median_family_income_90 on the x-axis. For this we use the ggplot() command from the package ggplot2. For more information check the following info box.

info("ggplot") # Run this line (Strg-Enter) to show info

Task: Plot the median income median_family_income_90 of an area on the x-axis and the PM$_{10}$ reduction pol_dif of an area on the y-axis. Thereby you should create a separate graph for each group of areas, clustered by the manifestation of the variable ring. Press check to see the plot.

ggplot(data=dat,aes(x=median_family_income_90,y=pol_dif)) + geom_point() + facet_wrap(~ring)

The plots suggest that at least the median income is correlated with the distance to the next monitor. To be exact, it seems to be the case that the median income in an area increases with an increasing distance to the next monitor. To illustrate this trend we apply the pirateplot() function from the package yarrr. If you are interested in more details about this function, click here.

Task: This exercise is optional. You should edit it if you are interested in a graphical illustration of how the median family income varies with the different manifestations of the variable ring. Just run the following code to display the graph.

pirateplot(formula = median_family_income_90 ~ ring ,
           data = dat,
           main = "Pirateplot Family Income",
           xlab = "ring",
           ylab = "median family income")

To verify the presumption that the socio-economic variables could be correlated with the distance of an area to the next monitor, we select some more variables representing the neighborhood characteristics. Then we compute the median of these variables for each of the different groups of areas. Therefore we use a combination of the summarise() and group_by() commands. If you are interested in a detailed description of how to use and combine these functions, check the info box below.

info("group_by() and summarise()") # Run this line (Strg-Enter) to show info

In particular we compute the median income, the median house price, the median rent, the median share of the houses owned by the inhabitants and the median unemployment rate for each of the different groups of areas which are divided according to their manifestation of the variable ring.

Task: Use a combination of the summarise() and group_by() commands to calculate the median of the socio-economic characteristics, mentioned above, for each group of areas. The relevant values are stored in the following variables: median_family_income_90, median_house_value_90, median_rent_90, owner_occupied_units_90 and share_unemployed_90. They are all included in dat. Store the respective results in the variables median_income, median_house_value, median_rent, median_owned_houses and unemployment_rate. As this is a quite extensive command you just have to press check to see the results.

dat %>%
  group_by(ring) %>%
  summarise(median_income = median(median_family_income_90),
            median_house_value = median(median_house_value_90),
            median_rent = median(median_rent_90),
            median_owned_houses = median(owner_occupied_units_90),
            unemployment_rate = median(share_unemployed_90)
            )

! addonquizSummarise

When we compare these values for the different groups of areas, we see that with an increasing value for ring the median income, the median house price, the median rent, and the median share of the owner-occupied houses increase. At the same time the median unemployment rate decreases. As we learned in Exercise 1 a small value for ring can be equated with a small distance of an area to the next monitor. So we can state that in our sample the population in areas located near a monitor seems to be poorer than the population in areas located further away. Therefore our presumption that there is a systematic variation across space for these socio-economic variables in our data set is confirmed.

In the next step we still have to consider the variables that are associated with the air quality. According to the plot above there doesn't seem to be a remarkable correlation between the reduction in pollution and the distance to the next monitor. To be sure we should have a closer look at this relationship, too. Therefore we use the variable representing the pollution in 1990 and the variable representing the reduction in pollution between 1990 and 2000 and compute the respective median for each group of areas, just as we did before when we examined the socio-economic variables.

Task: Use a combination of the summarise() and group_by() commands to calculate the median of pol_90 and pol_dif for each group of areas. The respective results should be stored in the variables median_pol_90 and median_pol_dif. To get the results delete the #'s, fill in the gaps and then press the check button. If you don't know how to fill in the gaps have a look at the info box "group_by() and summarise()" or press hint.

#  dat %>%
#    group_by(ring) %>%
#    summarise(... = ...,
#              ... = ...
#              )
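
If you want to double-check your solution, one way to fill in the gaps could look like the following sketch (the variable names follow the task description above):

dat %>%
  group_by(ring) %>%
  summarise(median_pol_90 = median(pol_90),
            median_pol_dif = median(pol_dif)
            )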

Looking at the results for the pollution in 1990, we can't identify any specific trend related to the distance of an area to the next monitor. The values for the reduction in pollution seem to increase slightly with higher values for ring, but this pattern is not clear-cut. This means that for these variables we can't detect a correlation between the respective values and the distance to the next monitor in our data.

The attainment status of an area

In Exercise 1 we also learned that the attainment designation by the EPA nudges the local regulators to treat areas in different ways, even if they are located in the same county. So another factor which could cause a systematic variation in the values for the PM$_{10}$ reduction is the attainment status of an area. Previous studies examining the enforcement of the 1990 CAAA also document that the EPA's attainment and non-attainment designations influence the behavior of the local regulators and therefore have a notable effect on the pollution levels of the counties (Henderson (1996), Nadeau (1997), Becker and Henderson (2000)). So let's check whether this holds for our data set as well. To do this we have to cluster the areas into the following attainment groups.

(BFL (2014))

So according to the three definitions above, in order to determine the attainment status of an area we have to consider the attainment status of the next monitor and the attainment status of the county. The specific requirements which a monitor or a county must fulfill to obtain the attainment status were presented in Exercise 1. Both the attainment status of the monitors and the attainment status of the relevant counties were observed by BFL (2014) and are represented by the variables cnty_stat and mntr_stat. These variables are dummies and have a value of zero if the corresponding county or monitor is in attainment. Otherwise they have a value of one. Using these two variables BFL (2014) clustered the areas into the different attainment groups, as explained above. After they had done this, they calculated the median PM$_{10}$ concentration for each of these groups, for every year between 1990 and 2000. The corresponding results are stored in the data set pol_dif.dta.

Task: Read in the data set pol_dif.dta as you already did with the data set BFL.dta. Store it in the variable dat1.

# Enter your command here

Before we examine if the reduction in pollution is correlated with the attainment status, let's see if you get along with this new data set and the attainment designations.

Task: Show every possible combination of the two variables cnty_stat and mntr_stat, that is, each possible attainment status of an area as explained above. To do this press check. Remember: the distinct() command ensures that only unique entries of a data frame are printed. For a detailed description of the select() command, have a look at the corresponding info box in Exercise 1.

distinct(select(dat1, cnty_stat, mntr_stat, attain))

! addonquizAttainment status

Now let's illustrate the PM$_{10}$ reduction within those three different groups of areas between 1990 and 2000. Therefore we use the ggplot() command again. If you can't remember this command, look at the beginning of this chapter. There you can find a corresponding info box.

Task: Plot the median PM$_{10}$ concentration on the y-axis and the year on the x-axis. In the data set pol_dif.dta the variable representing the median concentration of PM$_{10}$ is called pm. The years are stored in year. You should consider the different attainment groups. The respective affiliation is represented by the variable attain. To show this plot delete the #'s, fill in the gaps with the three variables mentioned in this task and then run the code.

# ggplot(data=dat1,aes(x=...,y=..., color=...)) + geom_line()

In this plot we see that the three attainment groups clearly differ with regard to the PM$_{10}$ reductions between 1990 and 2000. In particular, the blue line shows the largest reduction in PM$_{10}$, around 15.4 $\mu g/m^{3}$. This is almost twice as much as for the two other groups of areas. The blue line represents the reduction in areas which were designated as NonAttainment. So in our data set it seems to be the case that the areas with the highest pollution in 1990 experience the largest PM$_{10}$ reduction between 1990 and 2000.

Conclusion

In summary we can say that there are specific trends in the data set, in particular for the values of the socio-economic variables and for the values of the PM$_{10}$ reductions. Regarding the socio-economic characteristics we found that people living in areas next to a monitor seem to be poorer than those living in areas located further away from the next monitor. Furthermore we observed that the values representing the reduction in PM$_{10}$ reach their maximum in areas designated as NonAttainment. This means that the areas with the highest pollution in 1990 seem to experience the largest reductions in PM$_{10}$ between 1990 and 2000.

In the following exercises, where we analyze the causal effect of PM$_{10}$ reductions on the population, we will exploit these findings, especially the finding that the different groups of society in our data set can be represented by the variable ring.

Exercise 3 -- OLS: A first attempt to analyze the question

Until now we have done some descriptive statistics and therefore should have gotten an idea of the data. Now it is time to deal with the main issue of this problem set. In particular we want to detect the effects of an improvement in air quality induced by the 1990 CAAA. To do this we adopt the approach of BFL (2014) and use a linear regression model with house prices as the dependent and the PM$_{10}$ concentration in the air as the independent variable. This means we measure the CAAA's benefits by the capitalization of pollution reductions into house prices, whereby only owner-occupied houses are taken into account. So in contrast to previous works, which examine the effects on different subgroups like homeowners and renters (Grainger (2012)), this approach examines the different effects of a PM$_{10}$ reduction within one such subgroup, namely the homeowners.

To get a first impression of the relationship explained above, we run an OLS regression examining the effect of the PM$_{10}$ concentration in 1990 on the house prices in 1990. The respective values for these factors are stored in the variables pol_90 and median_house_value_90 of the data set BFL.dta. To run the regression we could use the lm() function from base R, which you might know, but instead we want to use the felm() function from the lfe package. We do so because we can run all regressions which we need in later exercises with this one function. To see how you can run linear regressions with felm(), check the info box below.

info("Linear regressions with felm()") # Run this line (Strg-Enter) to show info

To run the regression we have to load the data set BFL.dta again.

Task: Read in the data set BFL.dta and store it into dat.

# Enter your command here

Task: Run a regression using median_house_value_90 as dependent variable and pol_90 as independent variable. Store the results in reg. Since this is your first regression you just have to remove # and fill in the gaps before you press check. If you can't remember the structure of the command, look at the info box above.

# reg = felm(... ~ ..., data=dat)

To show summary statistics of regressions we will make use of the function stargazer() from the stargazer package. In the next task you will see what this function looks like.

Task: To show the summary statistics of reg just run the following code.
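
The pre-filled code chunk is not reproduced here. A minimal call could look like the following sketch; the chunk in the problem set may pass additional formatting options to stargazer():

library(stargazer)
stargazer(reg, type = "text")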


Let's interpret our first regression results:

By the definition of median_house_value_90, which is the median house price of an area in 1990, and pol_90, which is the PM$_{10}$ concentration in 1990, we know that the PM$_{10}$ concentration is measured in $\mu g/m^{3}$ and the house prices in U.S. dollars. So the regression tells us that if the concentration of PM$_{10}$ increases by one $\mu g/m^{3}$, the house prices will increase by about 175.20 dollars. In contrast to these results, you would actually expect a negative sign for the coefficient of pol_90, meaning that better air quality generally leads to higher house prices. As you may have noticed, stargazer() prints, in addition to the results, the p-values of the regression coefficients. Furthermore it prints one star if the result is significant at the 5 % level, two stars if it is significant at the 1 % level and three stars if it is significant at the 0.1 % level. Most econometricians say that a result is significant if its significance level is below 5 %, meaning that the corresponding p-value is smaller than 0.05. The p-value of pol_90 in reg is about 0.873 and has no star attached, so this result is not significant at standard levels. In the end this means that we should revise our OLS approach and check whether it really is reasonable or whether we have to make some adjustments.

To justify the use of such a linear regression model for the purposes of inference or prediction, there are five principal assumptions:

A1: The dependent variable can be written as a linear function of a specific set of independent variables, plus a disturbance term.
A2: The conditional expectation of the disturbance term is zero, no matter which values of the independent variables we observe ($E[\varepsilon_i \mid X_i] = 0$).
A3: Disturbances have uniform variance and are uncorrelated ($Var(\varepsilon_i) = \sigma^2 \; \forall i$ and $Cov(\varepsilon_i , \varepsilon_j) = 0 \; \forall i \neq j$).
A4: Observations on independent variables can be considered fixed in repeated samples.
A5: There is no exact linear relationship between independent variables and there are more observations than independent variables.

(Kennedy (2008))

Regarding our regression model, applied above, we can assume that it fulfills A1, A4 and A5 at least in some sense. In order to make sure that the other two assumptions are also realized we will add different specifications to our previous approach in the following exercises.

First-difference approach

It is obvious that there are a lot of factors that influence air quality and house prices which we haven't taken into account so far. Leaving them out of the regression skews our results for the effect of the PM$_{10}$ concentration on house prices. One possibility to account for the factors which didn't change during the period of data collection is the so-called first-difference approach. It takes observable and unobservable time-invariant influences into account. To apply this first-difference approach in our regression we use the difference between the logarithmized median house prices in 1990 and 2000, represented by the variable ln_median_house_value_dif, and the difference between the PM$_{10}$ concentrations in 1990 and 2000, represented by pol_dif, instead of the actual values from 1990. This means that from now on we examine the effect of PM$_{10}$ reductions on changes in house prices. The use of the logarithmized values for the difference in house prices will become clear when we interpret the results.
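
To see why differencing removes time-invariant influences, consider a simple two-period sketch in which $a_i$ collects all time-invariant characteristics of area i:

$$ p_{i,t}=\theta PM_{i,t}+a_i+\varepsilon_{i,t}, \qquad t \in \{1990, 2000\} $$

$$ \triangle(p_i)=p_{i,2000}-p_{i,1990}=\theta\triangle(PM_i)+\triangle(\varepsilon_i) $$

The area-specific term $a_i$ cancels out, so all time-invariant factors, whether observed or not, no longer distort the estimate of $\theta$.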

Task: Run a regression using ln_median_house_value_dif as dependent variable and pol_dif as independent variable. Use the felm() command like before and store the results in reg1.

# Enter you command here

As seen above the stargazer() function has a lot of options. To keep the code simple we wrote a function reg.summary() for you. Just pass your regression object/objects to this function.

Task: Give a summary of the regressions reg1 and reg with the reg.summary() command. If you need help you can always click on the hint button.

# Enter your command here

The results of these two regressions differ a lot. The most important difference for us is that our coefficient of interest is now negative. So using the first-difference approach the results fulfill our expectations because they indicate a negative relationship between pollution and house prices. Furthermore the p-value of the coefficient of interest decreases from 0.873 in reg to 0.546 in reg1. Thus it is still not significant, but it gets closer.

As mentioned above, here we use the difference between the logarithmized median house prices in 1990 and 2000. Therefore we can interpret the effect on changes in house prices in a more compact way. In a regression where the dependent variable is logarithmized and the independent variable is not, $\beta_1$ is interpreted as follows: A change in the independent variable of one unit leads to a change in the dependent variable of 100*$\beta_1$ per cent. Given a value of (-0.00058) for pol_dif, an increase in the PM$_{10}$ concentration by one $\mu g/m^{3}$ leads to a decrease in house prices of about 0.058 %.
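
Strictly speaking, the 100*$\beta_1$ rule is only an approximation, which works well for coefficients close to zero:

$$ \%\triangle = 100 \cdot (e^{\beta_1}-1) \approx 100 \cdot \beta_1 $$

With $\beta_1 = -0.00058$ both expressions give about -0.058 %, so the approximation is harmless here.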

! addonquizRegression output 1

Ultimately, by using the first-difference approach and therefore considering time-invariant factors in our regression, we clearly improve the results of our model.

Control variables

So far we have learned that by using the first-difference approach we can consider all time-invariant factors which influence the house prices and the air quality. But you can imagine that there might also be a lot of time-variant factors that we haven't taken into account so far and which therefore could still distort the results of our model. You can take them into account by including so called control variables.

info("Control variables") # Run this line (Strg-Enter) to show info

In our case these control variables are all housing and neighborhood characteristics that could be observed for the different areas and are therefore included in the data set BFL.dta. Remember that if you are interested in more detailed information about these variables, press Description in the Data Explorer. Beyond these descriptions there is one special control variable that we should explain separately, namely the variable factor. If house price trends across regions are correlated with patterns of improvements in air quality, this could bias our estimates of the effects of a PM$_{10}$ reduction on house prices. To address this issue, following BFL (2014), we include the local home price index of Freddie Mac, usually known as CMHPI, as a control variable. If you are interested in more information about this index, you should have a look at Stephens et al. (1995). And if you are interested in specific measures of this index, click here. We take this index into account by including the variable factor in the vector of controls. Thus our estimates reflect the effects of PM$_{10}$ reductions on house price changes beyond those that would be expected given regional price trends.

As we have to include one additional control variable for each time-variant factor, this approach only controls for observable characteristics of areas. In the following exercise you can have a look at all employed control variables. Note that we have to apply the first-difference approach to all of them, too.

Task: To define a vector which is called my.controls and includes all control variables, press the check button. This vector is stored in another file which is linked with this problem set. By doing this we don't have to define it again in the further course of this problem set.

my.controls = c("total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

As you can see, we have to include quite a lot of control variables in our regression, so the felm() command becomes rather long. That's why we first want to create a regression formula which includes all relevant variables and store it in a variable which can then be passed to felm().

Task: To merge all control variables with the core part of our regression and to create an appropriate regression formula which can be passed to felm(), press check.


info("A detailed description of the procedure creating the regression formula") # Run this line (Strg-Enter) to show info

Task: Perform a regression as described above. Instead of explicitly referring to one dependent variable and several independent variables you just have to pass the variable form to the felm() command. Store the result in reg2. If you need help you can press the hint button.

# Enter your command here

In the previous regressions we saw that the approach of reg1 clearly outperforms the one of reg. So here it is sufficient to compare the results of reg2 to those of reg1 in order to examine whether the inclusion of the controls improves our model.

Task: Give a summary of the regressions reg2 and reg1 with the reg.summary() command. If you need help you can always click on the hint button.

# Enter your command here

How have the results changed by considering this bunch of control variables? Now the coefficient of pol_dif has a value of (-0.00233). This means a reduction in PM$_{10}$ by one $\mu g/m^{3}$ increases house prices by about 0.23 %. So the effect here is almost four times larger than in the previous regression.

In contrast to reg1, the coefficient of pol_dif in reg2 has a p-value of 0.00005 and therefore is significant at the 0.1 % level. In addition, the coefficient of determination $R^2$, which is also returned by reg.summary(), clearly exceeds the one of reg1. If the meaning of the $R^2$ is not clear to you, check the info box below.

info("Coefficient of Determination") # Run this line (Strg-Enter) to show info

So applying the control variables improves the significance level and also adds a lot of explanatory power to the regression. That's why we can say that including a bunch of controls, and thereby also taking time-variant factors into account, really seems to improve our regression model again.

Clustered standard errors

One additional concern is that there could be heteroskedasticity in our model. In general, heteroskedasticity occurs when the variance of the unobservable error, conditional on the independent variables, is not constant. This could apply here because the characteristics of an area are related to the characteristics of other areas, especially if they are located in the same county. Heteroskedasticity causes biased and inconsistent standard errors and therefore violates A3 of our principal assumptions which justify the use of a linear regression model. For a more detailed description of heteroskedasticity have a look at Williams (2015).

To address this problem we apply clustered standard errors. In our case it makes sense to cluster the areas by their affiliation to a county. The appropriate grouping variable in our data frame BFL.dta is called fips2. This variable represents the FIPS county code which uniquely identifies counties and county-equivalent areas in the USA (United States Census Bureau (2010)). This means areas with the same FIPS code belong to the same county or county-equivalent area.

Task: Take a look at the variables state_code, county_code and fips2 which represents the FIPS code for each area. They are all included in dat. Use the select() command. If you can't remember the structure of this command, have a look at the info box in Exercise 1.

# Enter your command here

You can see that the first digits of the FIPS code represent the state code and the last three digits the code of the county where the area is located. So using the variable fips2 we can identify all counties and states that include an area from our data set. You can look up all FIPS codes and the associated states and counties in the State FIPS Code Listing (2016).
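
Assuming fips2 is stored as a number, the two components can be separated with integer division and the modulo operator. A small sketch using one of the FIPS codes that appear in the next task:

fips_example = 6073      # example FIPS code
fips_example %/% 1000    # state code: 6
fips_example %% 1000     # county code: 73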

To illustrate the information we get from the FIPS code we first filter all areas with a specific FIPS code from our data set. Afterwards we take these specific FIPS codes and tag both, the corresponding state and the corresponding county in Google Maps. To select areas from the main data set we can use the filter() function of the dplyr package. If you are not familiar with it, take a look at the info box below.

info("filter()") # Run this line (Strg-Enter) to show info

Task: Use filter() to generate a data set that only contains the areas with the FIPS code 1069, 6073 or 36031. Because we want to filter areas with different FIPS codes we have to use | between the different conditions. As this is the first time we apply the filter command you just have to uncomment the code and fill in the gaps with the different FIPS codes. Then you can press the check button.

# filter(dat, fips2 == ... | fips2 == ... | fips2 == ...)

! addonquizFips Code

Task: This exercise is optional. If you are interested in viewing the corresponding states and counties represented by the FIPS codes from above in Google Maps, press check. Otherwise you can go directly to the next exercise. Note that you can click on an icon on the map to get information about the corresponding state and county. The number in brackets is the FIPS code. The red markers stand for counties that are in danger of exceeding the standards, the green markers for counties where the standards are fulfilled. This information comes from the Green Book of the United States Environmental Protection Agency (2016).

area.map()

A regression with clustered standard errors can also be run in R with the felm() function from the lfe package. The procedure is explained in the info box below.

info("felm()") # Run this line (Strg-Enter) to show info

To use the feature of the felm() command which considers clustered standard errors we first have to define a cluster variable. In this problem set we call this cluster variable fips.

Task: To define the cluster variable, extract the values for fips2 from dat and store them in the variable fips.

# Enter your command here

Task: Perform a regression similar to reg2 with standard errors clustered by fips. Store the results in reg3. Therefore you can use the regression formula again, which we stored in form before. If you do not remember how to do this, just take a look at previous exercises or press hint.

# Enter your command here

As we already determined, reg2 is clearly preferred over reg and reg1. So here it is sufficient to compare the summaries of reg2 and reg3 in order to examine whether the use of clustered standard errors improves our model.

Task: Print a summary of the regressions reg2 and reg3. Use the reg.summary() command.

# Enter your command here

The coefficient of pol_dif stays the same in both regressions, just like the $R^2$. In contrast, the p-value in reg3 has increased from 0.00005 in reg2 to about 0.04. So now the results are significant at the 5 % level, whereas in reg2 the level of significance was 0.1 %. Due to this change in the level of significance our concern about heteroskedasticity seems to be confirmed. This means that despite the lower significance level you should prefer reg3, which accounts for heteroskedasticity and within-county correlation by using clustered standard errors and therefore addresses A3 of our principal assumptions.

Conclusion

Before we move on to the next chapter let us recall what we have learned so far: To estimate the effect of the pollution reductions induced by the 1990 CAAA with a linear regression model, it is quite important to consider all factors that influence the PM$_{10}$ concentration and also independently affect the house prices. In order to do this we became acquainted with two approaches which we can apply to our model, namely the first-difference approach and the control variables. Applying these two approaches we include at least all relevant factors that are time-invariant or observable in our model.

So in the end we get the following regression formula:

$$ \triangle(p_i)=\theta\triangle(PM_i)+\beta\triangle(X_i)+\varepsilon_i $$

where

$p_i$ is the natural log of the median housing value in area i,

$PM_i$ is the concentration of PM$_{10}$ in area i and

$X_i$ is a vector that includes housing and neighborhood characteristics of area i. These are the so-called control variables.

(BFL 2014)

Furthermore we found that, because areas within the same county have similar characteristics, we have to cluster the standard errors at the county level. The problem is that despite the first-difference approach and the bunch of control variables there are probably still factors which are correlated with a PM$_{10}$ reduction and also independently affect the house prices, but aren't included in our model yet. If this is the case, there is endogeneity and A2 of our principal assumptions isn't fulfilled (Kennedy (2008)). One approach that addresses this problem is the Instrumental Variable regression. In the following exercises we will consider this approach as well and contrast it with the OLS regression from this chapter.

Exercise 4 -- A further approach: The IV regression

So far we have learned that despite our first-difference approach and the bunch of controls which we include in our model, there could still be factors that are correlated with $PM_i$ and also independently affect $p_i$, but which aren't considered in our model from Exercise 3. This applies especially to immeasurable changes in the characteristics of locations, e.g. changes in the local infrastructure. When such factors are neglected, they end up in the disturbance term of our model, whereby the expected value of this error term isn't zero and A2 of our principal assumptions, which justify the use of a linear regression model, is violated. This problem is called endogeneity. As already mentioned, one approach to also take into account the influence of factors which are unobservable and time-variant, and thereby tackle the problem of endogeneity, is the so-called Instrumental Variable regression, usually known as IV regression.

As the name indicates, the basis of such an IV regression is an instrumental variable. Thus the most important step is to find an appropriate variable.

An instrumental variable has to fulfill two conditions.

  1. It has to be partially correlated with the endogenous variable once the other exogenous variables have been netted out.

  2. It must not be correlated with the error term $\varepsilon$ (both conditions are stated formally after the citation below).

Per endogenous variable you need at least one instrument that itself isn't already included in the OLS regression.

(Wooldridge (2010))
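
In the notation of our regression model from Exercise 3, and writing $Z_i$ for an instrument, the two conditions can be stated compactly as follows:

$$ \text{(1) Relevance: } Cov(Z_i, \triangle(PM_i) \mid \triangle(X_i)) \neq 0 $$

$$ \text{(2) Exogeneity: } Cov(Z_i, \varepsilon_i) = 0 $$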

To identify an appropriate IV strategy, BFL (2014) follow recent works (e.g. Gamper-Rabindran et al. (2011)) and exploit the findings from Exercise 2, where we observed the within-county variation in pollution reductions, which is in part driven by the efforts of the local regulators to reduce pollution especially around dirtier monitors. In Exercise 1 we assumed that this behavior of the local regulators is caused by the EPA's non-attainment designation. That's why BFL (2014) use the attainment status of the monitor located next to an area and the attainment status of the county to which the area belongs as instruments for localized pollution reductions. In particular they use the ratio of years that the monitor (county) is out of attainment to the number of years for which there is a record during the time span 1992-1997. By using this ratio BFL (2014) want to take the heterogeneity in the persistence of the non-attainment status into account. In doing so they consider the severity of the violation, which causes different extents of air quality improvements. Because not all of the monitors have reliable data for all years, the denominator of the monitor instrument varies between one and six, while the denominator of the county instrument is always six.
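
Written out, the monitor instrument of area i is the following ratio (the county instrument is defined analogously, with a constant denominator of six):

$$ \text{mntr\_instr}_i = \frac{\text{number of years in 1992-1997 in which the monitor is out of attainment}}{\text{number of years in 1992-1997 for which the monitor has a record}} $$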

Regarding the two conditions the instruments have to fulfill, we first have to examine whether the non-attainment status really affects the reduction in PM$_{10}$. As the figure in Exercise 2 suggests, and as we will show more rigorously in Exercise 4.2, this condition clearly holds.

In contrast to the first condition, the fulfillment of the second condition can never be conclusively shown. But at least to reduce the doubts, we include a number of control variables in our regression model and will run robustness checks in Exercise 6. In doing so we minimize the unobserved factors in the error term. Thus the probability that the instruments could be correlated with the disturbance decreases.

In summary we can assume that the monitor non-attainment status and the county non-attainment status fulfill both conditions and therefore seem to be appropriate instruments. To what extent these instruments really are capable of dealing with the problem of endogeneity and how they affect the results of the estimation will be discussed in the next chapter. In the data set the values of these instruments are stored in mntr_instr and cnty_instr.

Exercise 4.1 -- OLS versus IV

In the following we want to have a closer look at the procedure of the OLS and the IV regression and therefore want to explain the differences in their estimations.

The following graphs should help you to understand the relationships between the different factors which matter in our analysis. Nodes correspond to observed (orange) or unobserved (grey) variables or groups of variables. The solid arrows represent assumed causal relationships that we explicitly take into account in our regression model. Dotted arrows represent assumed causal relationships that we don't explicitly model in a regression, e.g. because not all variables are observed.

So let's start with recapitulating what we learned about the OLS approach.

Remember that in this problem set we want to estimate the effect of changes in PM$_{10}$, induced by the Clean Air Act Amendments of 1990, on changes in house prices. The left branch in the graph describes the assumption we made in Exercise 1, namely that the non-attainment designation introduced by the 1990 CAAA nudges the local regulators and induces them to target especially areas near dirtier monitors for cleanup. Thereby they create a within-county variation in pollution reductions. So by examining the effect of the actually observed pollution reductions on changes in house prices, we want to estimate the effects of the regulations introduced by the 1990 CAAA.

Of course you can imagine that there are considerably more factors which have an effect on changes in house prices. As long as they don't affect the changes in pollution as well, they don't skew our results and don't have to be included in our model. But this also means that you have to include, if possible, all factors that influence changes in pollution and independently affect the changes in house prices in your model. To consider at least those factors that are time-invariant or observable, our OLS approach includes a bunch of controls and applies the first-difference approach. However, as explained several times so far, it is reasonable to assume that there are also time-variant factors which are correlated with the PM$_{10}$ concentration and independently affect house prices, but can't be observed and therefore are not taken into account by OLS. One possible example is an expansion of the local transportation infrastructure. In the graph this factor is represented by the node "Exogenous infrastructure changes". Let's assume that the infrastructure in a specific area is enlarged by an additional federal highway. Then it is obvious that the connection between the towns within this area is improved, whereby the transportation costs for companies decrease. This leads new companies to settle within the range of these towns, and so the economic development of the surrounding area benefits, for example through a decreasing unemployment rate. Therefore the level of prosperity is promoted. And if the wealth in an area increases, you can generally assume that house prices increase as well. Simultaneously the vitalization of the towns and the better infrastructure cause a considerable increase in traffic. And as we learned in Exercise 1, more motor vehicles cause a higher PM$_{10}$ concentration. Thus the expansion of the infrastructure leads to both an increase in the PM$_{10}$ concentration and, independently of that, an increase in house prices.

! addonquizOmitted variables

Even though OLS doesn't consider the unobservable and time-variant factors, like the enlargement of the infrastructure, the variation in the observed values for the PM$_{10}$ reduction and for the changes in house prices is driven by these exogenous infrastructure changes. Thus the estimated coefficient of pol_dif in the OLS regression also captures the effect of exogenous infrastructure changes on house prices, and not only the effect of the PM$_{10}$ reduction due to regulatory measures. One says it is biased. According to Bound, Jaeger and Baker (1990) you can predict the sign of the bias caused by an omitted variable. It depends on the correlation between the omitted variable and the regressor and on the relationship between the omitted variable and the dependent variable of the regression model. This means that on the one hand we have to analyse the correlation between the PM$_{10}$ concentration in an area and an enlargement of the infrastructure. On the other hand we have to think about the relationship between an enlargement of the infrastructure and the house prices in an area. The correlation between the PM$_{10}$ concentration and an enlargement of the infrastructure was already explained and we can state that it's positive. Following BFL (2014) we expect the relationship between an enlargement of the infrastructure and house prices to be positive as well. So in the end we can make the assumption that the coefficient of our endogenous variable pol_dif is biased upwards.
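To make the direction of this bias explicit, we can add the standard omitted variable bias formula (textbook notation, not taken from BFL (2014)). Suppose the true model is $\triangle(p_i)=\beta\triangle(PM_i)+\delta z_i+\epsilon_i$, where $z_i$ stands for the omitted exogenous infrastructure changes, but we estimate the model without $z_i$. Then

$$ \text{plim}\ \hat{\beta}_{OLS}=\beta+\delta\frac{Cov(\triangle(PM_i),z_i)}{Var(\triangle(PM_i))} $$

With $\delta>0$ (infrastructure expansions raise house prices) and $Cov(\triangle(PM_i),z_i)>0$ (they also raise the PM$_{10}$ concentration), the second term is positive. Since the true effect $\beta$ is assumed to be negative, the OLS coefficient is pushed upwards, i.e. towards zero.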

! addonquizThe bias

In the regression we performed in Exercise 3 the coefficient of pol_dif has a negative sign. Following the argumentation above, that this coefficient is biased upwards and therefore towards zero, we can state that OLS tends to underestimate the effect of PM$_{10}$ reductions induced by the 1990 CAAA on changes in house prices, because it doesn't consider the influence of unobservable and time-variant factors like changes in the infrastructure.

In contrast to OLS the IV approach tries to exclude the variation in the variable of interest, which is caused by time-variant and unobservable factors.

The special thing about this approach is that you first regress the values of the endogenous variable on the detected instruments and then, in the second stage, when you estimate the crucial relationship, you use only the variation in the endogenous variable that is explained by the instruments. In our case this means that we first regress the values of pol_dif on the non-attainment status and then use the predicted values of this first regression, instead of the actually observed ones, to estimate the effect of reductions in PM$_{10}$ on changes in house prices. By doing this we consider only the variation in PM$_{10}$ that is caused by the EPA's non-attainment designations and try to exclude the unobserved part of the variation which we assume to be caused by exogenous infrastructure changes. Therefore we aim at getting a coefficient for pol_dif which represents only the effect due to regulatory measures. In order to ensure that this works, we have to require that the non-attainment status captures only the effects of the 1990 CAAA on the air quality. This means you assume that the decision-making process of the local regulators to establish an additional highway or a new power plant is independent of the regulations of the Clean Air Act. In contrast to BFL (2014) we question this assumption because, as already explained, these exogenous factors have a strong impact on the air quality, so it would be quite conceivable that local regulators decide against establishing a new highway if it would endanger the attainment status of their area.

Nevertheless it is obvious that there is a problem of endogeneity in our model. Although the IV approach, using the non-attainment status of the county and the monitor as instruments, is not a perfect solution, we assume that its results should represent the effect of PM$_{10}$ reductions induced by the CAAA on changes in house prices better than OLS. That's the case because the IV regression excludes at least the part of the variation in the observed PM$_{10}$ values which is due to exogenous infrastructure changes that are implemented independently of their influence on the attainment status of an area. So in the next chapter, when we come back to the main question of this problem set, namely to estimate the effects of reductions in PM$_{10}$ induced by the 1990 CAAA on changes in house prices, we will apply the IV approach instead of OLS.
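To make the two-stage idea more concrete, the following minimal sketch (our own illustration, not part of BFL's code) carries out both stages with plain lm() and without any control variables. Note that the second-stage standard errors obtained this way are not valid IV standard errors, which is one reason to use dedicated commands such as felm() or ivreg() in the next exercises.

library(foreign)    # provides read.dta()
dat = read.dta("BFL.dta")

# Stage 1: regress the endogenous variable on the instruments (controls omitted for brevity)
stage1 = lm(pol_dif ~ mntr_instr + cnty_instr, data = dat)
dat$pol_dif_hat = predict(stage1, newdata = dat)

# Stage 2: replace the observed PM10 changes by their predicted values
stage2 = lm(ln_median_house_value_dif ~ pol_dif_hat, data = dat)
summary(stage2)   # the reported standard errors are not the correct IV standard errors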

Exercise 4.2 -- Two-Stage Least Squares

In order to illustrate the instrumental variable approach we now run an IV regression by applying the so-called Two-Stage Least Squares method. As in the chapters before, we first need to load the data.

Task: To load the data set BFL.dta and to store it into the variable dat, press edit and check afterwards.

dat = read.dta("BFL.dta")

First stage

As explained in Exercise 4.1, in the first stage of the IV regression you use the instruments to calculate "new" values for the endogenous variable. To do this we run a regression where we use the PM$_{10}$ reduction as dependent and the instruments as independent variables. Thereby we again have to consider the control variables, just like in the previous chapters. This means that when you run an IV regression, in addition to the detected instrumental variables, you also have to include all exogenous independent variables as instruments.

So according to BFL (2014) the first stage of the IV regression is as follows:

$$ \triangle(PM_i)=\varphi(N_i)+\Pi\triangle(X_i)+\mu_i $$

where $N_i$ is equal to the ratio of non-attainment years during the time span 1992 to 1997.

In Exercise 3 you became acquainted with the procedure to consider the whole set of controls in a regression formula. To keep the code brief we summarized this procedure and wrote a function that automatically merges all relevant control variables with a given core formula. As a result it returns a regression formula that can be passed to the felm() command. This function is called mf(). If you are interested in a detailed description of how this function works and how you can apply it, click on the info button below.

! start_note "mf()"

Task: You don't have to edit this chunk. It should just show you the structure of mf().


To use this function you have to pass a string indicating the core part of your formula and a vector that includes all control variables as strings. The core part of your formula should include the dependent variable and the independent variable of interest.
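As a rough idea of what such a helper could look like, the hypothetical sketch below (called mf_sketch here; the actual mf() shipped with this problem set may differ) simply pastes the controls onto the core formula and converts the result into a formula object.

# Hypothetical sketch of a formula helper similar to mf()
mf_sketch = function(core, controls) {
  as.formula(paste(core, "+", paste(controls, collapse = " + ")))
}

# Example usage with a made-up set of controls
mf_sketch("pol_dif ~ mntr_instr + cnty_instr",
          controls = c("share_black_dif", "pop_dense_dif"))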

Task: If you are interested in an example how to apply mf(), press check.


For more examples try to solve the following exercises in this problem set. In doing so you should become acquainted with this function.

! end_note

By applying mf() we can create the regression formula which should be used to examine the relationship between the non-attainment status and the PM$_{10}$ reduction. Because the control variables in this chapter are the same as in Exercise 3, when we ran the OLS regression, we don't have to define them again.

Task: Use mf() to create a regression formula that includes pol_dif as dependent variable and mntr_instr, cnty_instr plus the control variables as independent variables. Therefore you have to pass the core part of the regression as a string and the vector of controls which is called my.controls to the function mf(). The core part of the regression formula should include the dependent variable and the independent variable of interest. As this is the first time we apply the function mf() the command is presented below. So you just have to press check here. Nevertheless you should try to understand the command because we will need it several more times in this problem set. To do this you can also have a look at the info box above.

form = mf("pol_dif ~ mntr_instr + cnty_instr", controls=my.controls)

After we created the regression formula we now can examine the relationship between the instruments and the PM$_{10}$ reduction.

Task: Use felm() and the variable form to run the first stage of the IV regression. Consider the data included in dat. To consider the clustered standard errors you have to define the cluster variable fips before you run the regression. Store the results of the regression in the variable FirstStage. If you don't remember how to apply felm(), you should go through Exercise 3 again.

# Enter your command here

Task: Display the results of the first stage which are stored in FirstStage. Therefore use the function reg.summary().

# Enter your command here

! addonquizRegression output 2

Looking at the results of the first stage, we see that the coefficient of mntr_instr is (-11.80) and the coefficient of cnty_instr is (-2.59). With p-values smaller than 0.001 both coefficients are highly significant. This clearly confirms the choice of our instruments. According to Staiger and Stock (1997) we don't have to worry about weak instruments if the F-statistic is greater than ten. Furthermore the results imply that areas which are located near non-attainment monitors and which are part of a non-attainment county experience the largest drop in PM$_{10}$. That's consistent with our finding in Exercise 2 that the areas with the highest pollution in 1990 experience the largest drop in PM$_{10}$.

To clarify the interpretation of the coefficients we go through an example. Then you can try to apply what you have learned and edit a quiz.

We know that the coefficient of mntr_instr is about (-11.80). So areas assigned to a monitor which is always out of attainment experience a decline of 11.80 $\mu g/m^{3}$ in PM$_{10}$ relative to areas that are assigned to a monitor always in attainment. Note that this holds only for areas located in the same county, because then cnty_instr can be kept constant.
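Because mntr_instr is the ratio of non-attainment years, intermediate values scale the effect linearly. A small side calculation (our own arithmetic based on the coefficient above) illustrates this:

# Predicted additional PM10 decline (in µg/m³) relative to an always-attaining monitor,
# for a monitor that was out of attainment in 3 of the 6 years (ratio = 0.5)
coef_mntr = -11.80
coef_mntr * 0.5   # roughly -5.9 µg/m³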

Use the code chunk below to answer the following quiz.

Task: You can enter whatever you think is needed to solve the quiz here.

# Enter your command here

! addonquizRegression output 3

In this chapter we could see that by running the first stage of the IV regression we simultaneously examine the first condition for our instruments. In R there is a special function which returns the results of the IV regression and in addition to that the significance level of the instruments. If you are interested in this function and in a description of how the inclusion of weak instruments affects the IV regression, check the box below.

! start_note "Weak-Instrument test"

The Weak-Instrument test in R runs an F-test with the null hypothesis that the instruments don't have a significant influence on the endogenous variable (Stock and Yogo (2001)). The inclusion of a weak instrument which doesn't affect the endogenous variable leads to a biased and inconsistent estimator, just as it would be the case with an OLS regression (Bound, Jaeger and Baker (1990)). According to Staiger and Stock (1997) we don't have to worry about weak instruments if the F-statistic is greater than ten, i.e. if the instruments are highly significant.
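As a minimal sketch of the idea behind this F-test (our own illustration; the set of controls is reduced to a single variable for brevity and the lmtest package is assumed to be available), one can compare a first stage with and without the instruments:

library(foreign)   # read.dta()
library(lmtest)    # waldtest()

dat = read.dta("BFL.dta")
vars = c("pol_dif", "mntr_instr", "cnty_instr", "ln_med_fam_income_dif")
d = na.omit(dat[, vars])

fs_full       = lm(pol_dif ~ mntr_instr + cnty_instr + ln_med_fam_income_dif, data = d)
fs_restricted = lm(pol_dif ~ ln_med_fam_income_dif, data = d)

# F-test on the excluded instruments; a value above ten is the usual rule of thumb
waldtest(fs_restricted, fs_full)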

To take a look at the results of this Weak-Instrument test we use the diagnostics option of the summary() method for ivreg(). For additional information about ivreg() have a look at Exercise 6.

Task: Just press check to get the results of the Weak-Instrument test. You can find them at the end of the output.

    dat = read.dta("BentoFreedmanLang_RESTAT_Main.dta")

    my.controls = c("share_black_dif_80", "pop_dif_80", "total_housing_units_dif_80", "ln_avg_fam_income_dif_80", "total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

    my.instr = c("share_black_dif_80", "pop_dif_80", "total_housing_units_dif_80", "ln_avg_fam_income_dif_80", "mntr_instr", "cnty_instr", "total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

    form = mf.iv("ln_median_house_value_dif ~ pol_dif",controls=my.controls,instr=my.instr)

    summary(ivreg(form, data=dat), diagnostics = TRUE)

! addonquizWeak-Instrument test

According to an F-statistic greater than 114, we can clearly reject the null hypothesis which states that the instruments don't have a significant influence on the endogenous variable. Thus the Weak-Instrument test confirms our finding that the first condition for the instruments holds.

! end_note

Second stage

Now in the second stage of the IV regression we use the predicted values for the PM$_{10}$ reduction from the first stage to run the regression from Exercise 3 again, where we examined the effect of improvements in the air quality on changes in house prices.

Task: If you are interested in a comparison between the predicted values for the reduction in PM$_{10}$ which were estimated in the first stage and the original values which actually were observed, press check. If not, you can skip this exercise and go straight to the next one.

pol_dif = dat$pol_dif
pol_dif_hat = fitted(FirstStage)
comparison = data.frame(pol_dif, pol_dif_hat)
names(comparison)[2] = "pol_dif_hat"
comparison

For an illustration to what extent predicted values differ from the actual observed ones, check the box below.

! start_note "The composition of the variance"

The following exercises should illustrate the difference between the variance of the observed values for the PM$_{10}$ reduction and the variance of the predicted values. To understand the difference it is essential to know the residuals of the regression.

Task: To calculate the residuals of the first stage regression, press edit and then check.

dat = read.dta("BFL.dta")

 my.controls = c("total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

form = mf("pol_dif ~ mntr_instr + cnty_instr",controls=my.controls)

fips = dat$fips2
FirstStage = felm(form, clustervar=fips, data=dat)

residuals = residuals(FirstStage)

Task: To compare the variance of the observed values with the variance of the predicted values and the variance of the residuals press check. The predicted values and the residuals were calculated in previous tasks of this chapter. The results are stored in the data set BFL.dta, just like the observed values.

dat = read.dta("BFL.dta")

var(dat$pol_dif)

var(dat$pol_dif_hat)

var(dat$residuals)

In general the variance of the actually observed values consists of an explained and an unexplained part. This applies here too: the variance of the predicted values and the variance of the residuals add up to the variance of the actually observed values. As described in Exercise 4.1, in the second stage of the IV regression you consider only the explained part of the variation. In our case this means we use only the variation in pol_dif_hat to estimate the effect of PM$_{10}$ reductions induced by the 1990 CAAA on changes in house prices.
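If you want to verify this decomposition yourself, the following small check (assuming, as described above, that pol_dif_hat and residuals are stored in BFL.dta) compares the two quantities; they should agree up to rounding, because fitted values and residuals of a least squares fit are uncorrelated in the sample.

# Explained plus unexplained variance should equal the total variance (up to rounding)
var(dat$pol_dif_hat) + var(dat$residuals)
var(dat$pol_dif)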

! end_note

As already indicated the formula we use here should look quite similar to the one we used at the end of Exercise 3. The only difference is that we use pol_dif_hat instead of pol_dif as independent variable of interest.

Task: Create a regression formula that represents the second stage of the IV Regression. Use ln_median_house_value_dif as dependent and pol_dif_hat plus the control variables as independent variables. To do this you can apply the function mf() as you already did it in the first stage of the IV regression. Store the formula in the variable form.

# Enter your command here

By using this formula we can run the second stage of the IV regression now. As a result we should get a coefficient of interest that captures only the effect of the PM$_{10}$ reduction on changes in house prices, which is due to regulatory measures.

Task: Apply felm() to run the second stage of the IV regression. Therefore use the formula stored in form and the observations stored in dat. Please remember the clustered standard errors. Therefore define the cluster variable fips before you run the regression. Store the results in the variable SecondStage.

# Enter your command here

Task: Show the outcome of the second stage. Use the reg.summary() command.

# Enter your command here

Using the instrumental variable strategy we get a value of (-0.00777) for the coefficient of interest. The result in Exercise 3 was about (-0.002326). Additionally the significance level here improves from 5 % to 1 %. These results confirm our expectation that there are omitted factors in our OLS approach and that therefore in Exercise 3 we underestimate the effect of reductions in PM$_{10}$ on changes in house prices.

In this problem set we ran this Two-Stage Least Squares Method just to clarify the procedure of an IV approach. If you are interested in additional information about this method, we recommend Stock and Watson (2007). In R there are special commands that run the two stages of an IV regression together. These commands are often more practical. In the following we will apply them to examine the benefits of PM$_{10}$ reductions induced by the 1990 CAAA for the different groups of society.

Exercise 5 -- The effects of the 1990 CAAA-induced air quality improvements

In Exercise 4.1 and Exercise 4.2 we came to the conclusion that the IV estimation should represent the effect of air quality improvements induced by the 1990 CAAA on house prices better than OLS. That's why in this chapter we will run several IV regressions with different data sets, aiming at estimating the effects of the 1990 CAAA-induced improvements in the air quality on different groups of society.

Reduced Form

But before we apply the non-attainment status as an instrument to run the IV regression which examines the actual main question of this problem set, we want to estimate the direct effect of the non-attainment status on the different groups of society. This means we run a regression with the non-attainment status as independent and the changes in house prices as dependent variable, whereby we still consider the control variables. This approach is called the reduced form.

$$ \triangle(p_i)=\gamma(N_i)+\Omega\triangle(X_i)+\upsilon_i $$

To examine the effects on different groups of society we exploit the finding from Exercise 2 that people living near a monitor tend to be poorer than people living further away. We do this by clustering all areas included in BFL.dta according to their value of ring. The exact meaning of ring was presented in Exercise 1. Using the different partial data sets which are generated by this cluster process, we run several regressions. In doing so each regression examines the effects for a different group of society. Specifically we run each regression twice and consider the two groups of areas which are located between zero and one mile and between five and ten miles away from the nearest monitor, respectively. Comparing these two quite different groups of areas should yield particularly informative results.

To cluster the areas included in BFL.dta we can use the filter() function again.

Task: To load the data set BFL.dta press edit and check afterwards.

dat = read.dta("BFL.dta")

Task: Filter all areas that have a value of one for ring and store them in mile1. The same should be done with the areas that have a value of ten for ring. Store these results in the variable mile10. To do this use the filter() function, with which we already became acquainted in Exercise 3. Afterwards have a look at mile1 and mile10. Most of the code is presented. You just have to uncomment the code and add the two filter() commands. Then you can press the check button.

# mile1 = filter(...)
# mile10 = filter(...)
# mile1
# mile10

The results of this filter process are stored in the partial data sets mile1.dta and mile10.dta. So in the following chapters, when we have to consider the different groups of society again, we simply need to read in these partial data sets.

As already announced, in this chapter we want to examine the effects of the instruments, which are represented by the variables mntr_instr and cnty_instr, on changes in house prices. This means we have to create a regression formula which we can use to examine the relationship between the instruments and ln_median_house_value_dif.

Task: Define the regression formula explained above. This means you should include ln_median_house_value_dif as dependent variable and mntr_instr, cnty_instr plus the control variables as independent variables. To do this use the function mf(). Store the result into form. If you don't know how to do this have a look at Exercise 4.2 or press hint().

# Enter your command here

Because we include the instruments here as independent variables and examine their direct effect on changes in house prices, we don't run an IV regression yet. So we can use the felm() command in the same form as in the previous chapters.

Task: Run a regression that examines the reduced form. Consider only the areas stored in mile1. To do this you have to pass the regression formula stored in form to the felm() command. Remember that you should consider clustered standard errors. To do this you have to define the cluster variable fips first. If you don't know how to do this have a look at Exercise 3. Store the results of the regression in the variable reduced1.

# Enter your command here

Task: Run the same regression as above. But now consider the areas included in mile10. Store the results in reduced10. Note that when you use a new data set you have to adapt your regression command and the cluster variable.

# Enter your command here

Task: Display the summary of the two regressions. To do this use the function reg.summary()

# Enter your command here

Regarding the results for the areas located next to a monitor, we detect values of 0.07 for the coefficient of mntr_instr and 0.06 for the coefficient of cnty_instr, whereby at least the coefficient of cnty_instr is almost significant. For the areas located five to ten miles away from the nearest monitor, the coefficient of mntr_instr is around 0 and the coefficient of cnty_instr is 0.03, both clearly not significant. So we can effectively state that the non-attainment status matters only for the areas located next to a monitor. Otherwise we would expect to see a stronger relationship between the non-attainment status and changes in house prices also for areas that are located more than five miles away from the nearest monitor. This is consistent with the statement made in prior chapters that reductions in pollution take place especially in areas located near a monitor.

The IV regression

As announced, in this chapter we try to estimate the effects of the 1990 CAAA-induced improvements in the air quality on different groups of society. Therefore we exploit our instrumental variable strategy which we defined in the previous chapters. In contrast to Exercise 4.2 we don't run the two stages of the IV regression separately but apply a special function in R which compresses this procedure. To consider the different groups of society we apply the approach explained above and run two regressions, each with another data set. These corresponding data sets were already created in this chapter and are stored in mile1 and mile10.

In R there are several commands with the ability to run an IV regression. One of them is the felm() command, that you should already know from previous exercises. If you don't know how you apply it to run an IV regression, check the info box below.

info("IV regression with felm()") # Run this line (Strg-Enter) to show info

Even though we apply felm() here to run an IV regression we still have to consider the control variables in our regression formula. So to keep the regression command brief we have to use mf() again.

Task: Define a regression formula with ln_median_house_value_dif as dependent and pol_dif plus the control variables as independent variables. Use the function mf(). Store the formula in the variable form. Remember that you have already applied mf() in this exercise.

# Enter your command here

By using felm(), all exogenous regressors are automatically included as instruments.

Task: Run an IV regression that estimates the effect of PM$_{10}$ reductions on changes in house prices. Consider only the areas included in mile1. To do this you have to pass the regression formula stored in the variable form, the instruments mntr_instr and cnty_instr and the cluster variable fips, which you have to define before, to the felm() command. As this is the first time we use felm() to run an IV regression, you just have to uncomment the code and fill in the gaps. Afterwards press check. If you don't know how to fill the gaps, have a look at the info box above or press hint().

#fips = ...
#ivreg1 = felm(..., iv=list(pol_dif ~ ...),..., ...)

Task: Run the same regression as above, but now consider the areas that are included in mile10. Store the results into ivreg10.

# Enter your command here

Task: To compare the results for the different groups of society show the summary of both regressions. To do this use reg.summary().

# Enter your command here

! addonquizRegression output 4

Regarding the results for the areas located next to a monitor we get a value of (-0.0133) for the coefficient which represents the effect of PM$_{10}$ reductions on changes in house prices. Considering that the dependent variable enters in logs, we can state that a decrease of one unit in PM$_{10}$ leads to an increase of about 1.33% in house prices. This result is significant.
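Strictly speaking the coefficient gives log points; for a coefficient of this size the exact percentage change is almost identical, as the following small side calculation (our own, not from BFL (2014)) shows.

# Exact percentage change in house prices implied by a one-unit PM10 reduction
beta = -0.0133
(exp(-beta) - 1) * 100   # about 1.34 %, very close to the 1.33 log points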

According to BFL (2014) the implied elasticity of house prices with respect to reductions in PM$_{10}$ is about (-0.6). This value is remarkable because similar articles, like Chay and Greenstone (2005), estimated an elasticity that is nearly half the size. If you are interested in an explanation of how to calculate the elasticity in a regression, click on the info box below.

info("Elasticity") # Run this line (Strg-Enter) to show info

For the areas included in mile10 and therefore located further away from the nearest monitor the coefficient of interest is only (-0.0044) and is not significant. This implies that as the distance of an area to the nearest monitor increases, the influence of the PM$_{10}$ reduction on changes in house prices clearly becomes smaller. In Exercise 2 we learned that the smaller the distance of an area to the nearest monitor, the poorer its population is. Consequently we can say that a PM$_{10}$ reduction benefits the poorer part of the population to a larger extent. Furthermore, because of the reduced form, we know that the PM$_{10}$ reduction especially occurs in areas that are located close to a monitor. So poorer people do not only benefit more from a one-unit reduction in PM$_{10}$ but also experience a higher PM$_{10}$ reduction.

info("Possible problems interpreting these results") # Run this line (Strg-Enter) to show info

Now it is important to remind you that these results, indicating progressive benefits, only hold for a specific subgroup of the population, namely the homeowners. In Exercise 8 we will discuss if you can apply these findings also to the whole population.

But before we do this we have to test if these results here are valid at all. This will be the content of the next two exercises.

Exercise 6 -- Robustness checks

In the last chapter we found that the benefits which are associated with 1990 CAAA-induced changes in the PM$_{10}$ concentration seem to be progressive for homeowners. To check if these results are really valid we will now deal with some so-called robustness checks.

Robustness checks examine how the "core" regression coefficient estimates behave when the regression specification is modified, for example by adding or removing regressors or by adapting the data set. If the coefficients of interest in these checks are plausible and similar to those in our "core" regression, this is commonly interpreted as evidence of structural validity (Lu and White (2014)). Furthermore, by showing that the error term does not contain specific omitted factors, these robustness checks reduce the probability that our instruments are correlated with the disturbance and can therefore diminish the doubts that the second condition for our instruments is fulfilled.

Particularly, we run three different robustness checks. They involve the consideration of the socio-economic trends in the areas before 1990, an alternative instrument definition and a different set of monitors. Because we want to compare the results of these adjusted regressions to the ones of the "core" regression, we consider the same clustered data sets as in Exercise 5. This means we use mile1.dta and mile10.dta. As we learned in Exercise 5 they include only those areas that are located between zero and one mile respectively between five and ten miles away from the next monitor.

Task: Read in the data sets mile1.dta and mile10.dta. Store them into the variables mile1 and mile10.

# Enter your command here

Earlier trends

The first robustness check examines to what extent pre-treatment trends in neighborhood conditions affect the results. If the areas which experienced a large reduction in PM$_{10}$ were already on an upward trajectory before 1990, so that part of the improvements might have occurred even without the CAAA in 1990, then we might wrongly attribute improvements that are actually due to these pre-existing trends to the PM$_{10}$ reductions.

This test is performed by taking additional control variables into account. They represent the changes in an area between 1980 and 1990 with regard to the logarithmized average family income, the share of black people, the population density and the number of housing units. Because these pre-1990 trends could not be observed for all areas included in the data set, we lose almost 20 % of the observations.

Task: Have a look at the variables which we mentioned above and which represent the trends before 1990. In the data set they are named ln_avg_fam_income_dif_80, total_housing_units_dif_80, pop_dif_80 and share_black_dif_80. Use the select() command with which you already became acquainted in Exercise 2. Consider the observations which are stored in mile1. To do this you need to remove #, fill in the gaps and then press check.

# select(mile1, ..., ..., ..., ...)

As a clarification: this test again examines the effect of pollution reductions on house prices, just like the "core" regression in Exercise 5. The only difference is in the vector which includes the control variables. This time it includes the additional variables from above which represent the trends in the areas before the treatment in 1990.

Task: In this chapter the vector of controls differs from the one in our "core" regression. That's why we have to define it here again. To do this run the presented code. Note that this time the vector includes the additional variables ln_avg_fam_income_dif_80, total_housing_units_dif_80, pop_dif_80 and share_black_dif_80.

my.controls_80 = c("share_black_dif_80", "pop_dif_80", "total_housing_units_dif_80", "ln_avg_fam_income_dif_80", "total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

Due to the missing data on the variables which represent the trends before 1990 we cannot apply felm() here to run an IV regression. Instead, in this robustness check we use the ivreg() command. If you are not familiar with this function, click on the info box below.

info("ivreg()") # Run this line (Strg-Enter) to show info

Because the structure of the ivreg() command differs from that of felm() we have to create another kind of regression formula. The major difference is that, in addition to the string which indicates the core part of the formula and the vector of control variables, you also have to pass a vector which contains all instruments. As explained in the info box "ivreg()" this vector of instruments has to include the defined instruments and all the exogenous regressors. To shorten this process as well, we wrote another function. It is called mf.iv(). For a detailed presentation of this function check the box below.

! start_note "mf.iv()"

Task: You don't have to edit this chunk. It should just show you the structure of mf.iv().


The application of this function is quite similar to that of mf(). The only difference is that, in addition to the string which indicates the core part of the formula and the vector of control variables, you also have to pass a vector including all instruments. As already explained in the info box "ivreg()" this vector of instruments has to include the defined instruments and all the exogenous regressors.
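As a rough idea of what such a helper could look like, the hypothetical sketch below (called mf.iv_sketch here; the actual mf.iv() shipped with this problem set may differ) pastes together the two-part formula y ~ regressors | instruments that ivreg() expects.

# Hypothetical sketch of a formula helper similar to mf.iv()
mf.iv_sketch = function(core, controls, instr) {
  rhs = paste(controls, collapse = " + ")
  ivs = paste(instr, collapse = " + ")
  as.formula(paste(core, "+", rhs, "|", ivs))
}

# Example usage with made-up vectors
mf.iv_sketch("ln_median_house_value_dif ~ pol_dif",
             controls = c("pop_dense_dif"),
             instr    = c("mntr_instr", "cnty_instr", "pop_dense_dif"))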

Task: If you are interested in an example how mf.iv() works, press check.


For more examples try to solve the following exercises. In doing so you should become acquainted with this function.

! end_note

As described above and in the info box we first have to define an additional vector which includes our detected instruments plus all exogenous regressors before we can apply mf.iv().

Task: Run the presented code to create the vector my.instr.

my.instr = c("mntr_instr", "cnty_instr", "share_black_dif_80", "pop_dif_80", "total_housing_units_dif_80", "ln_avg_fam_income_dif_80", "total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

Task: Define the appropriate regression formula for this robustness check. To do this you have to pass the core part of the regression formula, the vector my.controls_80 and the vector my.instr to mf.iv(). As this is the first time we apply mf.iv() you just have to delete #, fill the gap with the core part of the regression and then press the check button.

# form = mf.iv("...",controls=my.controls_80,instr=my.instr)

After we created the regression formula in a way that it can be passed to ivreg() we can run the robustness check now.

Task: Use the ivreg() command to run the robustness check. Consider only those areas that are included in mile1. Store the results in the variable EarlierTrends1. To apply ivreg() you have to pass the regression formula stored in form and the relevant data frame.

# EarlierTrends1 = ivreg(..., data=...)

Task: Run the same regression as above, but now consider the areas that are included in mile10. Store the results in the variable EarlierTrends10.

# Enter your command here

Task: Show the summary of both regressions. To do this use the function reg.summary.

# Enter your command here

For the areas located within a radius of one mile around a monitor the coefficient of pol_dif is (-0.0156), significant at the 5 % level. Regarding the areas located further away from the next monitor the coefficient amounts to only (-0.0028). Furthermore this coefficient is not significant at all. These results are quite similar to the ones of our "core" regression in Exercise 5 which means that the inclusion of trends before 1990 doesn't cause a remarkable change in our estimation.

New instrument

Next, we experiment with alternative measures of non-attainment. That means we replace the instruments mntr_instr and cnty_instr, which represent the ratio of non-attainment years during the time span 1992 to 1997, by a new one. It is called mntr_cont_instr_91 and equals max(0, annual PM$_{10}$ concentration in 1991 - 50). So the new instrument indicates to what extent the corresponding monitor of an area exceeded the threshold of 50 $\mu g/m^{3}$ for the annual PM$_{10}$ concentration in 1991 and therefore should be a predictor of future changes in the air quality. (BFL (2014))
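In R such a variable could be constructed with pmax(); the monitor readings below are made up for illustration only, because in the data set the instrument is already provided as mntr_cont_instr_91.

# Illustrative construction of the alternative instrument:
# how far the annual PM10 concentration in 1991 exceeds the threshold of 50 µg/m³
pm10_1991 = c(38, 52, 61, 47)   # hypothetical monitor readings
pmax(0, pm10_1991 - 50)         # returns 0, 2, 11, 0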

Unfortunately there are some areas in the data set for which the PM$_{10}$ concentration in 1991 could not be observed, so there is a missing data problem again. That's why we have to use ivreg() in the same way as in the first robustness check. Moreover, as explained above, we adjust the choice of our instruments and therefore have to redefine the vector of instruments before we can apply mf.iv().

Task: Run the presented code to create the vector my.instr_91. Note that it includes the variable mntr_cont_instr_91 instead of mntr_instr and cnty_instr.

my.instr_91 = c("mntr_cont_instr_91", "total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

In this robustness check we do not consider trends before 1990. This means that, to consider the control variables, we can use the predefined vector my.controls again, just like we did in the previous chapters.

Task: Define the appropriate regression formula for this robustness check and store it into form. Use the function mf.iv(). If you don't remember how to do this, look at the first robustness check.

# Enter your command here

After we adjusted our regression formula and therefore consider the new instrument we can run the robustness check now.

Task: Run the regression that considers the new instrument. Use ivreg(). Take only the areas into account which are included in mile1. Store the results in the variable NewInstrument1. To solve this chunk remember the first robustness check. The commands are very similar.

# Enter your command here

Task: Run the same regression as above. But now consider the areas included in mile10. Store the results in the variable NewInstrument10.

# Enter your command here

Task: Show the summary of both regressions. Use the reg.summary() command.

# Enter your command here

Regarding the results for the areas located zero to one mile away from a monitor, we see that the coefficient of pol_dif has a value of (-0.0021). This value is obviously smaller in absolute value than the one in our "core" regression. In addition to that, according to a p-value which is even higher than 0.5, it is not significant. In contrast to that, the results for the areas which are located five to ten miles away from the nearest monitor are qualitatively similar to the ones of the "core" regression. In particular they indicate a coefficient of (-0.00479) for pol_dif and a p-value of about 0.24.

Following BFL (2014) we can say that the results for the areas with other values of ring, which we didn't examine in particular, are also quite similar to the results of the "core" regression. This means that, with the exception of the areas in the tightest ring, we detect the same decreasing influence of the PM$_{10}$ reduction on house prices with increasing distance, as with the "core" regression.

More monitors

In the last robustness check we examine how the restrictions on the set of monitors, which were presented in Exercise 1, affect the results of our "core" regression. Therefore we reduce the requirements a monitor has to fulfill to take part in the measurement. That means in this test we also include monitors which were denoted as unreliable by the EPA. With these additional monitors, the system of rings around each monitor can cover the whole area of the United States more precisely. So using more monitors leads to an increase in the areas that can be observed. In the end the number of observations grows by about 60 %. (BFL 2014)

The clustered data sets, which include the additional observations as well, are named as moremonitors1.dta and moremonitors10.dta. For a detailed description of the cluster process have a look at Exercise 5.

Task: Read in the new partial data sets that include a higher number of observations: moremonitors1.dta and moremonitors10.dta. Store them into the variables moremonitors1 and moremonitors10.

# Enter your command here

Here we do not have problems with missing data, as was the case with the two previous robustness checks. That's why we can use felm() again to run the IV regression.

Task: Define the appropriate regression formula for this robustness check. Store the result in form. To do this use the function mf(). Because the only difference compared to the "core" regression is in the considered data sets we can use the predefined vector my.controls here again.

# Enter your command here

Note that because we apply the felm() command in this robustness check we can use clustered standard errors here again.

Task: Run an IV regression that examines the relationship between changes in house prices and a reduction in pollution. To do this you have to pass the regression formula stored in the variable form, the instruments mntr_instr and cnty_instr and the cluster variable fips, which you have to define before, to the felm() command. Take only the areas into account that are included in moremonitors1. Store the results in MoreMonitors1. If you don't remember how to do this, look at Exercise 5 or press hint().

# Enter your command here

Task: Run the same regression as above. But now consider the observations that are included in moremonitors10. Store these results in the variable MoreMonitors10.

# Enter your command here

Task: Show the summary of both regressions. Use the reg.summary() command.

# Enter your command here

! addonquizRegression output 5

The summary of the regression which considers the areas located next to a monitor shows that the coefficient of pol_dif is (-0.0090). It is significant at the 5 % level. This means a reduction in PM$_{10}$ by one unit leads to an increase in house prices by nearly 1 %. For the areas that are located five to ten miles away from the nearest monitor the value of the coefficient of pol_dif is obviously smaller in absolute value. To be precise it is (-0.00533). In addition to that the p-value is higher than 0.5, which means that this coefficient is clearly not significant.

So in the end the results are again quite similar to those from the "core" regression, despite the additional data. Remaining differences can be explained by an increased level of noise in the readings, which is due to the additional, unreliable monitors.

Conclusion

To summarize this chapter we can say that the results of all three robustness checks do not differ dramatically from our results in Exercise 5. So we can state that the results of our "core" IV regression seem to be valid. As described in the introduction of this chapter, robustness checks also support the assumption that the second condition for the instruments is fulfilled, if the coefficient of interest in these tests doesn't differ from the one in the "core" regression. As already said, this is the case here. Thus we can continue to assume that the second condition for our instruments holds.

BFL (2014) present the results of some additional robustness checks in their online appendix. These robustness checks consider the following adjustments:

- Including only areas with boundaries that do not change between 1990 and 2000
- Instead of using partial areas when rings overlap, restricting the observations to whole areas
- Including region fixed effects
- Exploiting information about the elevation of an area
- Excluding California from the data set because it includes many monitors exceeding the thresholds for the PM$_{10}$ concentration

(BFL (2014) online appendix)

If you are interested in these additional tests, click here. All of these tests show similar results as the ones presented in this problem set and therefore also indicate that the results of the IV regression in Exercise 5 are valid. That's why it's not necessary to present them here separately.

Exercise 7 -- Sorting

One additional concern in interpreting our results from Exercise 5 is that households may relocate in response to changes in the air quality. On the one hand the households could have sorted before 1990, such that those with the greatest distaste for pollution lived in the areas which were initially the cleanest. On the other hand they also could have sorted in response to the changes in pollution during the 90s. If this was the case, it would be quite problematic to use our results from Exercise 5 for evaluating the distributional implications of the PM$_{10}$ reduction induced by the 1990 CAAA. (BFL (2014))

To get a first impression about this concern we select some neighborhood characteristics of the areas, compute the median change during the 90s and compare it to the median of the absolute values in 1990. This calculation is applied to the different groups of areas, which are again divided by their value of the variable ring. In particular we consider the following characteristics of the areas: the population density, the number of owner-occupied housing units, the share of people living in the same house as five years ago and the total number of housing units.

The data set which includes the respective values for all areas is BFL.dta.

Task: To load the data set BFL.dta and to compute the median of pop_dense_dif, pop_dense_90, share_units_occupied_dif, owner_occupied_units_90, share_same_house_dif, share_same_house_90, total_housing_units_dif and total_housing_units_90 for the different groups of areas, divided by their value of the variable ring, press edit and check afterwards. The results are stored in the variables change_pop_density, pop_density, change_owned_units, owned_units, change_share_same_house, share_same_house, change_total_units and total_units. Here we use a combination of the summarise() and group_by() commands, as we already did in Exercise 2. There you can find a corresponding info box.

dat = read.dta("BFL.dta")

dat %>%
   group_by(ring) %>%
   summarise(change_pop_density = median(pop_dense_dif),
             pop_density = median(pop_dense_90),
             change_owned_units = median(share_units_occupied_dif),
             owned_units = median(owner_occupied_units_90),
             change_share_same_house = median(share_same_house_dif),
             share_same_house = median(share_same_house_90),
             change_total_units = median(total_housing_units_dif),
             total_units = median(total_housing_units_90))

Remember that if you press Description in the Data Explorer, you will get more detailed information about the variables which we use in this exercise.

Task: You can enter whatever you think is needed to solve the following quiz.

# Enter your command here

! addonquizPopulation density

The results of this calculation indicate that the socio-economic characteristics of all areas did not change dramatically between 1990 and 2000, relative to the absolute values in 1990. This holds especially for those areas that are located next to a monitor. So the data suggest relatively little sorting in response to 1990 CAAA-induced changes in air quality.

Following BFL (2014) we run some additional tests to examine whether there were systematic changes in the households residing in the affected areas. To be exact we regress the change in the share of people living in the same house as five years ago, the change in the population density, the change in the total number of households and the change in the owner-occupied units each on the PM$_{10}$ reduction. In doing so we use the same IV strategy as in Exercise 5. If there were re-sorting, you would expect to see differential rates of turnover, especially in the areas which experience particularly large reductions in PM$_{10}$. So in order to check this we use the clustered data sets mile1.dta and mile10.dta again, which include the areas that are located between zero and one mile respectively between five and ten miles away from a monitor.

Task: To load the data sets mile1.dta and mile10.dta and to store them into the variables mile1 and mile10 run the presented code.

mile1 = read.dta("mile1.dta")
mile10 = read.dta("mile10.dta")

As announced, in this chapter we replace the dependent variable of our regression model by the variables from the data set at which we already had a look before. In the previous regressions these variables were included in the vector of controls. If you regress on a specific variable, this variable can't be a covariate at the same time. So in each of the following analyses we first have to redefine the vector of controls, meaning that we exclude the respective factor which we now use as dependent variable.
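One compact way to do this is to drop the respective variable from the existing vector of controls with setdiff(), as sketched below; the chunks in this chapter simply present the full redefined vector instead.

# Remove the variable that now serves as dependent variable from the controls
my.controls_samehouse = setdiff(my.controls, "share_same_house_dif")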

Nevertheless we continue to use felm() to run the different IV regressions. If you can't remember how to apply this function, have a look at Exercise 5.

Same house

Let's start with the regression which examines the impact of PM$_{10}$ reductions on the change in the share of people living in the same house as five years ago. The variable which represents the change in the share of people living in the same house as five years ago is share_same_house_dif. Before we can run the regression with share_same_house_dif as dependent variable, we have to make sure that it is not included in the vector of control variables.

Task: To redefine the vector of control variables my.controls just press check.

my.controls = c("pop_dense_dif", "total_housing_units_dif", "share_occ_own_dif", "share_units_occupied_dif", "share_black_dif", "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif", "share_female_hhhead_dif", "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif", "share_kitchen_none_dif", "share_plumbing_full_dif", "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif", "share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif")

Task: Use mf() to define the regression formula with share_same_house_dif as dependent variable and pol_dif plus the control variables as independent variables. A detailed description of how to apply mf() can be found in Exercise 4.2. To illustrate in what way the regression formulas in this chapter differ, we name each variable which stores such a formula after the considered dependent variable. In this case you should store the regression formula in the variable SameHouse.

# Enter your command here

Using this formula we can run a regression that examines the effect of PM$_{10}$ reductions on the change in the share of people living in the same house as five years ago.

Task: Run the regression which is explained above. To do so, pass the regression formula stored in the variable SameHouse, the instruments mntr_instr and cnty_instr, and the cluster variable fips, which you have to define first, to the felm() command. Consider only the areas that belong to mile1. Store the results in the variable SameHouse1. If you don't know how to do this, have a look at Exercise 5 or press hint().

# Enter your command here

Task: Run the same regression as above, but now take into account those areas which are included in mile10. Store the results in the variable SameHouse10.

# Enter your command here

Task: To show the summary of both regressions use reg.summary().

# Enter your command here

To clarify the output of the regressions you can answer a quiz here.

! addonquizRegression output 6

When regarding these results, the first thing we should note is that the p-values amount to 0.89 and 0.12, respectively. So in both cases we can't reject the corresponding null hypothesis and therefore can't detect a significant influence of a PM$_{10}$ reduction on the change in the share of people living in the same house as five years ago.

This suggests that areas which are located next to a monitor, and therefore see a larger reduction in PM$_{10}$, did not experience particularly large changes in the fraction of households that moved within the past five years.

Population density

Now we want to estimate the effect of PM$_{10}$ reductions on changes in the population density of an area. The variable representing the change in the population density is pop_dense_dif. To run a regression with pop_dense_dif as dependent variable, we first have to ensure again that this variable is not included in the vector of controls.

Because the procedure of adapting the controls and creating the regression formula is almost the same for each analysis in this chapter, we combine these two steps in the following exercises.

Task: To adapt the vector of control variables and to create an appropriate regression formula press check. This regression formula is stored in the variable PopDense.

my.controls = c("share_same_house_dif", "total_housing_units_dif", "share_occ_own_dif", "share_units_occupied_dif", "share_black_dif", "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif", "share_female_hhhead_dif", "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif", "share_kitchen_none_dif", "share_plumbing_full_dif", "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif", "share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif")

PopDense = mf("pop_dense_dif ~ pol_dif",controls=my.controls)

! addonquizControl variables

Task: Run a regression that examines the relationship between pop_dense_dif and pol_dif. Consider only the areas that belong to mile1. Store the results in the variable PopDense1. To get a hint on how to do this, look at the previous regression of this chapter; the command differs only in the regression formula you have to pass. Remember to define the cluster variable first.

# Enter your command here

Task: Run the same regression as above. But this time consider the areas that are included in mile10. Store the results in the variable PopDense10.

# Enter your command here

Task: Show the summary of both regressions. To do this use the reg.summary() command.

# Enter your command here

Running these regressions, we get p-values of 0.62 and 0.88, respectively, for the two coefficients which represent the effect of PM$_{10}$ reductions on changes in the population density. So these coefficients are also not significant. Hence we can state that areas experiencing a larger policy-induced reduction in PM$_{10}$ didn't see larger changes in the population density.

Total housing units

The next regression explores the effect of PM$_{10}$ reductions on the change in the total number of housing units in an area. In the data sets this characteristic is represented by the variable total_housing_units_dif. So in this case, before we run the regression, we have to exclude total_housing_units_dif from the vector of control variables.

Task: To adapt the vector of control variables and to create the appropriate regression formula press the check button. The regression formula is stored in TotalUnits.

my.controls = c("share_same_house_dif", "pop_dense_dif", "share_occ_own_dif", "share_units_occupied_dif", "share_black_dif", "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif", "share_female_hhhead_dif", "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif", "share_kitchen_none_dif", "share_plumbing_full_dif", "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif", "share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif")

TotalUnits = mf("total_housing_units_dif ~ pol_dif",controls=my.controls)

Task: Run a regression that examines the relationship between total_housing_units_dif and pol_dif. Consider only the areas that belong to mile1. Store the results in the variable TotalUnits1.

# Enter your command here

Task: Run the same regression as above. This time consider the areas that are included in mile10. Store the results in the variable TotalUnits10.

# Enter your command here

Task: To show the results of the two regressions use reg.summary().

# Enter your command here

! addonquizEffects on the total number of housing units

Here we get p-values of 0.67 and 0.76 for our two coefficients of interest. So these results are not significant either. Consequently, areas experiencing a larger policy-induced reduction in PM$_{10}$ didn't see particularly large changes in the total number of housing units.

Owner-occupied units

To complete our analysis we consider the impact of PM$_{10}$ reductions on the change in the share of owner-occupied units. The change in the share of owner-occupied units is represented by the variable share_occ_own_dif in the data set. That's why this time share_occ_own_dif has to be excluded from the vector of control variables.

Task: Run the presented code to adapt the control variables and to create the appropriate regression formula. In this case the regression formula is stored in the variable OwnerUnits.

my.controls = c("share_same_house_dif", "total_housing_units_dif", "pop_dense_dif", "share_units_occupied_dif", "share_black_dif", "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif", "share_female_hhhead_dif", "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif", "share_kitchen_none_dif", "share_plumbing_full_dif", "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif", "share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif")

OwnerUnits = mf("share_occ_own_dif ~ pol_dif",controls=my.controls)

As this is the last time in this problem set that we run our two regressions to estimate the effects of an air quality improvement on different groups of areas, try to solve the following tasks without any support.

Task: Run a regression that estimates the effects of a PM$_{10}$ reduction on the share of owner-occupied units in an area. Consider only the areas that are included in mile1. Store the results in the variable OwnerUnits1.

# Enter your command here

Task: Run the same regression as above. But now consider the areas included in mile10. Store the results in the variable OwnerUnits10.

# Enter your command here

Task: Show the summary of both regressions. To do this use reg.summary().

# Enter your command here

With p-values of 0.41 and 0.58, the results are again not significant. Therefore we can state that areas which experience a larger policy-induced reduction in PM$_{10}$ also didn't see larger changes in the share of owner-occupied units.

Conclusion

As explained in the introduction of this chapter, if there was re-sorting, we would expect to see different results for the areas which experience a larger reduction in PM$_{10}$ when we examine the effects of a PM$_{10}$ reduction on different neighborhood characteristics. This means that the results for the areas located next to a monitor should show a significantly higher value for the coefficient of pol_dif.

None of our four tests yields a significant effect of reductions in PM$_{10}$. So the four characteristics don't seem to be affected by a change in pollution. That's why, in line with BFL (2014), we find no evidence that areas with a large policy-induced reduction in PM$_{10}$ differ in their turnover rates and can therefore state that the sorting responses to PM$_{10}$ reductions weren't large.

Exercise 8 -- Distributional implications and related literature

After we verified that our results from Exercise 5 are valid, we can now use them to discuss the distributional implications of the 1990 CAAA-induced improvements in air quality. In Exercise 5 we found that the benefits of a PM$_{10}$ reduction caused by the CAAA in 1990 are larger for poorer people. Furthermore we learned that such reductions in PM$_{10}$ especially take place in areas that are inhabited by lower income households. These two results suggest that the benefits of improvements in air quality are progressive. In contrast to that, previous studies on the distributional impacts of environmental policy typically found that the benefits were regressive (Banzhaf (2011), Fullerton (2011), Bento (2013)).

One reason for these different findings could be that, unlike the previous literature, we focus on a specific subgroup of the population in our analysis, namely the homeowners. We are aware of the fact that the population with the lowest income usually doesn't own houses but pays rents instead, which is why our approach using house prices as dependent variable doesn't take them into account. Thus we also ran a regression which estimates the effect of PM$_{10}$ reductions on changes in rents; you can find it in the appendix of this problem set. In this regression the coefficients are neither remarkable nor significant. BFL (2014) argue that these outcomes imply that rents aren't affected by reductions in air pollution. Given this, they claim that, if anything, they tend to understate the progressivity of the program's benefits, for example because landlords don't raise rents in response to improvements in air quality, allowing renters to appropriate most of the gains. However, this interpretation contradicts the theoretical expectation that an increase in house prices induced by pollution reductions should go hand in hand with an increase in rents, and we think it is quite debatable to base a conclusion on results which are not significant. In contrast to BFL (2014), we therefore limit ourselves in this problem set to the statement that the progressivity of the benefits only applies to that part of the population that owns a house. Another approach to actually consider the whole population would be to look for further measures that capture especially the effects on the poorer part of the population, as for example Banzhaf (2011), Bento (2013) or Fullerton (2011) do. In addition to figures like land or house prices, these studies include measures for the effects on the labor market in the energy industry (Banzhaf (2011)), for the prices of carbon-intensive products (Fullerton (2011)) or for further socio-economic characteristics (Bento (2013)).

! addonquizConclusion

As mentioned above, previous works in this field often used different figures to measure the effects of an improvement in air quality. That's why it is quite difficult to compare the absolute values of our regression coefficients to the ones of other articles. To make different works comparable there is a commonly used measure in the literature (Fullerton (2011)): the Marginal Willingness To Pay (MWTP). In our case it represents the annual dollar amount a household would pay for a one-unit reduction in PM$_{10}$. To calculate the MWTP you have to transform house prices into annual expenditure. To do so, BFL (2014) assume an interest rate of eight per cent and a 30-year mortgage. In this problem set we don't execute this calculation on our own but adopt the results of BFL (2014). If you are interested in the procedure of how to calculate the MWTP, we recommend Rosen (1974). Because the calculation of the MWTP is based on the results of our IV regression, the values show the same variation across space as the results in Exercise 5. To be precise, the MWTP for the areas located zero to one mile away from a monitor is 129 dollars and decreases to 51 dollars for the areas located five to ten miles away from the next monitor. The value for the areas next to a monitor is quite similar to the results of other works, for example to the ones of Bayer et al. (2009), Lang (2012) or Bajari et al. (2012). This suggests that by exploiting our IV strategy we are doing at least something right.
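
To illustrate only the annualization step: under the stated assumptions of an 8% interest rate and a 30-year mortgage, a one-time increase in the house value can be converted into an equivalent annual payment with the standard annuity formula. This is just a sketch of the mechanics; the MWTP figures above are taken directly from BFL (2014) and are not recomputed here.

# Annuity factor that converts a present value into equal annual payments
r = 0.08                                  # assumed annual interest rate
n = 30                                    # assumed mortgage length in years
annuity_factor = r / (1 - (1 + r)^(-n))   # approximately 0.0888
annuity_factor

# For example, a 1,000 dollar increase in the house value corresponds to roughly 89 dollars per year
1000 * annuity_factor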

Finally we want to have a closer look at the distribution of income across all areas and thereby analyse whether a progressive or a regressive distribution of the benefits leads to a higher overall welfare. If you are not interested in this additional analysis, you can also skip the last exercise of this chapter and go straight to the conclusion of this problem set.

Task: Press edit and check afterwards to have a look at the distribution of income. Note that in this case we estimate the income density with an Epanechnikov kernel.

dat = read.dta("BFL.dta")

# Kernel density estimate of the 1990 median family income (Epanechnikov kernel)
density_income = density(dat$median_family_income_90, kernel = "epanechnikov")
plot(density_income, xlab="1990 Median Income", main="Distribution of the income")

! addonquizIncome distribution 1

! addonquizIncome distribution 2

This density plot is right-skewed and therefore indicates that there are quite a lot of people with a relatively low income and only a few people with a really high income of over 100,000 dollars. Because progressive benefits imply that poorer people benefit more than richer people, the increase in the overall welfare, induced by improvements in the air quality, should be higher in this case.
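
As a quick plausibility check of this right skew (purely illustrative and not required for the problem set), you can compare the mean to the median of the income variable loaded above: for a right-skewed distribution the mean typically lies above the median.

# For a right-skewed distribution the mean typically exceeds the median
mean(dat$median_family_income_90, na.rm=TRUE)
median(dat$median_family_income_90, na.rm=TRUE)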

Exercise 9 -- Conclusion

Before we conclude the analysis, we want to emphasize that, in contrast to BFL (2014), we do not claim to consider the full effects of the 1990 CAAA, but take only those regulations into account which affect the PM$_{10}$ concentration in the air.

Using geographically disaggregated data and running an instrumental variable regression, we examine the distribution of the benefits associated with the PM$_{10}$ reductions induced by the 1990 Clean Air Act Amendments. In particular, the CAAA created incentives for local regulators to focus their actions on the dirtiest areas, which led to geographically uneven reductions in pollution. By exploiting this knowledge and using house price appreciation as a measure of welfare, we find that the benefits of the 1990 CAAA-induced improvements in air quality are progressive for the subgroup of homeowners.

As our approach measures the benefits of the 1990 CAAA by the capitalization of air quality improvements into house values, there is one issue left that could make our results vulnerable: a complete welfare analysis would also take costs into account, but costs are not captured in house prices. Robinson (1985) illustrated that in 1970 the costs of pollution abatement, relative to income, were about twice as large for households in the lowest quintile of the income distribution as for households in the highest quintile. In our analysis we found that the coefficient which measures the effects of a PM$_{10}$ reduction is more than twice as high for people in the lowest income quintile compared to those in the highest income quintile. So if the costs of air pollution abatement had a similar distribution in 1990 as in 1970, the total costs of the CAAA would at least have to surpass its benefits for the policy to be regressive. The U.S. EPA (2011) determined that the benefits of the CAAA exceed the costs by a factor of about 30. So all in all, despite the exclusion of the costs, we can say that, at least for the people who own their home, the benefits of a reduction in pollution induced by the 1990 CAAA are progressive.

If you want to see the entirety of the awards that you collected during this problem set, just press edit and check afterwards in the code block below.

awards()

Exercise 10 -- References

Bibliography

R and Packages in R

Exercise 11 -- Appendix

IV Results for renters

At this point we examine the effect of PM$_{10}$ reductions induced by the 1990 CAAA on rents. In doing so we exploit the same IV approach as in Exercise 5, where we estimated the effect of PM$_{10}$ reductions on changes in house prices. This means that, in addition to the first-difference approach and the control variables, we apply mntr_instr and cnty_instr as instruments and cluster the standard errors at the county level. As was the case with all other IV regressions, we only analyze the effects for the groups of areas that are located zero to one mile or five to ten miles away from the next monitor. We do this by considering the data sets mile1.dta and mile10.dta.

Task: To run the regressions which estimate the effects of PM$_{10}$ reductions induced by the 1990 CAAA on rents only for the areas located zero to one respectively five to ten miles away from the next monitor and to have a look at the results, press edit and check afterwards.

mile1 = read.dta("mile1.dta")
mile10 = read.dta("mile10.dta")

my.controls = c("total_housing_units_dif", "share_units_occupied_dif", "share_occ_own_dif", "share_black_dif",  "share_latino_dif", "share_kids_dif", "share_over_65_dif", "share_foreign_born_dif",  "share_female_hhhead_dif", "share_same_house_dif",  "share_unemployed_dif", "share_manuf_empl_dif", "share_poor_dif", "share_public_assistance_dif", "ln_med_fam_income_dif", "share_edu_less_hs_dif", "share_edu_16plus_dif", "share_heat_coal_dif", "share_heat_wood_dif",  "share_kitchen_none_dif", "share_plumbing_full_dif", "pop_dense_dif",  "share_bdrm2_own_dif", "share_bdrm3_own_dif", "share_bdrm4_own_dif", "share_bdrm5_own_dif", "share_single_unit_d_own_dif", "share_single_unit_a_own_dif", "share_mobile_home_own_dif", "share_blt_5_10_own_dif", "share_blt_10_20_own_dif", "share_blt_20_30_own_dif", "share_blt_30_40_own_dif", "share_blt_40_50_own_dif", "share_blt_50plus_own_dif", "share_bdrm2_rnt_dif", "share_bdrm3_rnt_dif", "share_bdrm4_rnt_dif", "share_bdrm5_rnt_dif", "share_single_unit_d_rnt_dif", "share_single_unit_a_rnt_dif", "share_mobile_home_rnt_dif", "share_blt_5_10_rnt_dif", "share_blt_10_20_rnt_dif", "share_blt_20_30_rnt_dif", "share_blt_30_40_rnt_dif" ,"share_blt_40_50_rnt_dif", "share_blt_50plus_rnt_dif", "factor")

form = mf("ln_med_rent_dif ~ pol_dif", controls=my.controls)

fips1 = mile1$fips2
rents1 = felm(form, iv=list(pol_dif ~ mntr_instr + cnty_instr), clustervar=fips1, data=mile1)

fips10 = mile10$fips2
rents10 = felm(form, iv=list(pol_dif ~ mntr_instr + cnty_instr), clustervar=fips10, data=mile10)

reg.summary(rents1, rents10)

As you can see, the coefficients which represent the effect of PM$_{10}$ reductions on changes in rents are 0.0043 for the areas included in mile1 and -0.0023 for the areas included in mile10. But due to p-values of around 0.14 and 0.21, respectively, these results are not significantly different from zero.


