Effects of Building Codes on Energy Consumption

Author: Lisa Eilts

< ignore

library(restorepoint)
# facilitates error detection
# set.restore.point.options(display.restore.point=TRUE)

library(RTutor)
library(yaml)
#library(restorepoint)
setwd("C:/Users/admin/Desktop/MAfinal")
ps.name = "BuildingCodes"; sol.file = paste0(ps.name,"_sol.Rmd")
libs = c("foreign","dplyr","ggplot2", "tidyr","gridExtra", "scales", "lfe", "texreg", "regtools") # character vector of all packages you load in the problem set
name.rmd.chunks(sol.file) # set auto chunk names in this file
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL,libs=libs, stop.when.finished=FALSE,
          addons = "quiz", var.txt.file = "variables.txt")
show.shiny.ps(ps.name, load.sav=FALSE,  sample.solution=TRUE, is.solved=FALSE, catch.errors=TRUE,
              launch.browser=TRUE)
stop.without.error()

>

Outline

Hello! Welcome to this interactive RTutor problem set, which is part of my master thesis at the University of Ulm. Great that you found it. Working through this problem set lets you practise R programming. Additionally, you get an insight into whether building energy codes are an effective way to save energy in practice, and you learn some facts about a few econometric models. After a short introduction you can start solving the problem set right away.

First, you will get some information about the content of this problem set. Second, you will learn how to work through it, and third, you will see its structure.

Paper

The problem of greenhouse gas emissions and climate change is omnipresent. Barack Obama put it this way in 2015: "Climate change is no longer some far-off problem; it is happening here, it is happening now", see The White House, Office of the Press Secretary (2015).

Hence, policymakers introduce many laws and regulations to increase energy efficiency and to reduce greenhouse gas emissions. An important field is the building sector because it consumes a lot of energy. More precisely, the direct greenhouse gas emissions of the residential and commercial sector accounted for roughly 12 percent of the total greenhouse gas emissions in the U.S. in 2014 (according to U.S. EPA, n.d.). Note that indirect emissions from electricity consumption by residences, for example due to air-conditioning systems, have to be added to this number. One policy measure to increase energy efficiency and thus to reduce emissions is the implementation of building energy codes. In Florida, energy codes were first introduced in 1978. But do these building energy codes actually reduce the energy consumption of residences in Florida?

An answer to this question -- using the example of building energy codes in Florida -- is given by Grant D. Jacobsen and Matthew J. Kotchen in their paper "Are Building Codes Effective at Saving Energy? Evidence from Residential Billing Data in Florida" (2013). You can find this paper at http://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00243#.WO3sjvnyjIU. However, downloading it requires appropriate access rights. Alternatively, you can download the working paper version of this article from http://www.nber.org/papers/w16194. Furthermore, the Stata code is available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/20533. Throughout this problem set we will refer to this article as the paper.

The two authors of the article collected monthly utility billing data on electricity and natural gas consumption of households in Gainesville, a city in the northern part of Florida. In the year 2002, the Florida Building Commission implemented a further tightening of the existing building energy code. To be more precise, the energy code states a minimum standard for the overall energy efficiency of a newly built residence. This minimum standard is represented by a baseline home: newly constructed buildings have to be at least as efficient as the baseline home. The characteristics of the baseline home -- like the emissivity of windows or the efficiency of the air-conditioning system -- are defined by the energy code, and these characteristics have been subject to ever tighter standards over the years. The billing data of the article cover the years 2004 to 2006 and thus were collected after the code change. However, the considered residences were built between 1999 and 2005. Hence, we can divide our data set into two groups -- the first group containing the residences built before the code change in 2002 and the second group covering the buildings constructed after the code change.

The aim of the paper is to compare the two groups in order to determine the influence of the code change on the actual energy consumption of these residences. For this purpose, the authors Jacobsen and Kotchen used different empirical strategies such as two-sample t-tests and several regression analyses. Further, they analysed the effects of the code change on hot and cold days by using a fixed effects regression.

In this problem set we will replicate the authors' results in R and we will additionally get familiar with these empirical methods.

Problem set

This short section provides the basic knowledge you need to solve the problem set. In general it is not necessary to solve the exercises in the given order, but it is recommended because later exercises build on the knowledge (e.g. of commands) from earlier ones. Within an exercise, however, you have to solve the tasks in the given order. Only tasks with an explicit remark can be skipped. Every time you start a new exercise you have to click on the button edit. Afterwards, you can enter your code in the chunk. To check whether your solution is right, just click on the button check. If you have difficulties while solving a task you can press hint to get some advice. To see a sample solution, just click on the solution button.

Exercise content

Exercise 1: Overview of the data and graphical analysis of the variables
Exercise 1.1: Residential energy consumption and the variable EYB
Exercise 1.2: Housing characteristics
Exercise 1.3: Weather variables

Exercise 2: Building code change in more detail
Exercise 2.1: Comparison of the mean values
Exercise 2.2: Statistical analysis by using a two sample t-test

Exercise 3: Introduction to regression analysis
Exercise 3.1: Regression theory and implementation in R
Exercise 3.2: Simple regression model based on our data

Exercise 4: Regression analysis: pre- and post-code-change comparisons
Exercise 4.1: Estimating annual differences in energy consumption
Exercise 4.2: Robustness check of regression model 1
Exercise 4.3: Effect of the code change on different months
Exercise 4.4: Possible limitations

Exercise 5: Effects of the code change for hot and cold days
Exercise 5.1: Theory of fixed effects regression and difference-in-differences estimation
Exercise 5.2: Estimating differences towards variability in weather

Exercise 6: Conclusion

Exercise 7: References

Exercise 1 -- Overview of the data and graphical analysis of the variables

This problem set analyses the effects of building codes on energy consumption -- more precisely, the effects on the electricity and natural gas consumption of households in Gainesville, Florida. We want to know if the code change affects the energy consumption of these residences and actually reduces it.

This exercise will introduce you to the topic and mainly provide an insight into the collected data. We use monthly billing data for several households in Gainesville. In exercises 1.1 to 1.3 we will analyse the data in more detail and calculate some important values to get more information about the residences and their energy consumption. Moreover, we will draw some plots to visualise the different variables.

Now, let's start with the analysis.

To be able to take a look at the data we first have to load the data set. To do so we need the function read.dta() because the original data is available as a Stata file with the extension ".dta". Write the name of the file you want to load inside the brackets of the function read.dta() and put it in quotation marks. Additionally, we assign the data frame to a specific name, e.g. dat, which enables us to work with it later on. In total, the command to load a Stata data set into R has the following form: dat = read.dta("???"). The function read.dta() is part of the package foreign. Therefore, we first have to load this package with the command library(foreign).

You can get more information about the package foreign at https://cran.r-project.org/web/packages/foreign/foreign.pdf.

Task: Use the function read.dta() to load the data set which is called billingdata.dta. Before you do this you have to load the package foreign with the command library(). Enter your commands and press check. If you need help click on the button hint.

#< task
# load the package foreign
# enter your code here...
#>
library(foreign)
#< hint
display("You want to load the package foreign. Use the command library() and type the name of the package inside. Then, press check afterwards.")
#>

Now, the required package is loaded and you can read in the data set billingdata.dta. Assign it to the name dat. This data frame contains the most important columns of the original data frame.

#< task
# enter your code here...
#>
dat = read.dta("billingdata.dta")

#< hint
display("Your command should have the following form: dat = read.dta(\"???\"). Replace the ??? by the name of the data set.")
#>

< award "Sucessfully started"

Great! You have solved your first task correctly. Additionally, you have learned how to load a data set with the command read.dta().

While going on and solving the problem set you can win some more awards. To see an overview of all your received awards enter the command awards() in the code chunk. Then, press the button run chunk.

>

Now, we can take a closer look at the loaded data. This step is really important because you have to be familiar with the data before doing further analyses and calculations. First, take a look at the column and row names of dat and make sure you understand their meaning. Then, analyse some values and look for anomalies. While working with data you always have to be careful because it might contain errors or missing values. Additionally, you should visualise some of the data for a better illustration (according to Kennedy (2008), pp. 363-364).
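For example, a quick and purely optional way to spot implausible values or missing entries (this sketch is not part of the original tasks) is the function summary(), which reports the range, quartiles and number of NA values for every column:

# optional: compact overview of all columns of dat
summary(dat)

# optional: number of missing values per column
colSums(is.na(dat))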

In the following section and in the next three exercises you learn a bit about helpful commands to get more familiar with the data. Additionally, we will plot some of the data for a better understanding.

Get familiar with the data

In our data set dat each row represents one observation. Thus, the total number of rows is equal to the number of observations.

Task: Find out the total number of observations of our data set dat. One single command is enough here. Press hint if you need some advice.

#< task
# enter your code here...
#>
nrow(dat)
#< hint
display("You only have to count the number of rows of dat. Therefore, use the function nrow(). Put the name of the data set inside.")
#>

The number of observations is that large because we have monthly billing data from $2004$ to $2006$ for $2239$ different residences.

More interesting are the different columns of our data set. In the following code box we get to know how the different columns are named and implicitly how many columns there are in our data frame dat.

Task: Use the command colnames() to return all names of the columns of the data frame dat. Press check afterwards.

#< task
# enter your code here...
#>
colnames(dat)

If you want to get some more information about the data frame you can use the command head(). It returns the column names together with the first six rows of the data frame or matrix. Another possibility is to click on the button data.

You can solve the following task optionally.

#< task
# enter your command here
#>
head(dat)

You can see that there are $17$ different columns. Most of the column names are self-explanatory. However, we want to take a closer look at these variables now. Therefore, we will divide them into four groups: the variables representing the energy consumption (elec and gas), the variable EYB, the housing characteristics and the weather variables.

We will analyse the different groups in separate exercises. Exercise 1.1 addresses the first two groups, namely the variables representing the energy consumption and the variable EYB. Exercise 1.2 deals with the housing characteristics and in exercise 1.3 we will analyse the weather variables.

Exercise 1 refers to pages 36 to 38 and 42 and to table 1 of the paper.

Exercise 1.1 -- Residential energy consumption and the variable EYB

This exercise deals with three really important variables of our data frame. First, we will take a closer look at the two variables elec and gas which represent the energy consumption of the residences. Then, we will analyse the variable EYB which stands for effective year built. Last, we will use these variables to plot some graphics.

But first of all, we have to load the data.

Because you just started a new exercise you now have to click on the edit button to be able to solve the next task.

Task: Load the data set dat_energy.dta and name it dat.

This data frame is a small part of the original data frame. It contains only the variables we need for this exercise. To be more precise, the data frame dat_energy.dta consists of five columns: home_id, year, elec, gas and EYB. We need the first two variables for data transformation and we want to analyse the last three variables.

We already loaded the package foreign in the previous exercise. Thus, we don't have to repeat this step in this and in the following exercises. Enter your commands to load the data frame dat_energy.dta and press check afterwards.

#< task
# read in the data set dat_energy.dta and assign it to the name dat
# enter your code here...
#>
dat = read.dta("dat_energy.dta")

#< hint
display("Your command should have the following form: dat = read.dta(\"???\"). Replace the ??? by the name of the data set.")
#>

Now, show a part of the data frame. Just press check.

#< task
head(dat)
#>

At the end of this exercise we will visualise the data, and to do so we have to transform the data frame. Hence, it may be helpful to look at this output again later.

Electricity and natural gas consumption

The two variables elec and gas are the most important variables in this data frame because they allow us to compare the energy efficiency of residences built before and after the code change.

To emphasize the connection between the building energy code and the actual energy consumption you have to know that the energy code establishes a minimum standard for energy efficiency in the areas of space heating, space cooling and water heating. Hence, the two main targets are heating and cooling. Furthermore, the primary energy source for heating is natural gas, while central air-conditioning systems -- and thus cooling -- run on electricity (according to the paper, p. 35).

Now, take a closer look at both variables. It may be helpful to look at the following info box first. We will need these functions in the next tasks.

< info "Useful mathematical functions"

Here is a list of some useful mathematical R functions.

x = c(2,8,4,7) # creates a vector x which contains the numbers 2,8,4 and 7
x 

min(c(2,8,4,7)) # calculates the minimum of the vector, i.e. of x
max(x) # calculates the maximum of x
range(x) # calculates the minimum and maximum of x and returns a vector with both values 
mean(x) # returns the mean of the elements of vector x

>

Task: Choose the column elec of the data frame dat. Use the \$-operator and write dat$elec to select this column. Then, compute the minimum and the maximum of the electricity consumption with the command range().

#< task
# enter your code here...
#>
range(dat$elec)
#< hint
display("Put the name of the selected coloum in the brackets of the command range(). But without quotation marks. You get some additional advice in the above info box.")
#>

The result is a vector which contains two values: on the left you see the minimum and on the right the maximum of the monthly electricity consumption in kWh across all our data.

It is also possible to compute the minimum and the maximum separately.

Task: First, compute the maximum of the natural gas consumption and then the minimum. Similar to the task before use the \$-operator. However, you have to select the column gas now.

#< task
# enter your code here...
#>
max(dat$gas)
min(dat$gas)
#< hint
display("Please compute the maximum first. Use the commands max() and min() and take a look at the task above or the info box. It works similar like with the function range().")
#>

The two values represent the largest and smallest natural gas consumption of one household compared to all other households in our data set. The gas consumption is measured per month and in therms. In the following info box you can find some information about this unit.

< info "Therms"

The unit therm is a measure for heat energy. In the U.S. it is used to measure the natural gas consumption of residences. One therm is equal to $100,000$ British Thermal Units (BTU), where one BTU "is the heat required to raise the temperature of one pound of water by one degree Fahrenheit", compare U.S. EIA (2016).

In Germany we are more familiar with the unit kWh for measuring energy consumption. Hence, we will convert therms into kWh: one therm corresponds to roughly $29.3$ kWh, see ConvertUnits.com (2016).

You can find further information at http://mapawatt.com/2010/02/17/what-therm.
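As a small illustration (the value of $50$ therms below is made up and not taken from our data), you can perform this conversion directly in R:

# convert a monthly gas consumption of 50 therms into kWh
therms = 50
therms * 29.3   # roughly 1465 kWh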

>

When we compare the results for the electricity and gas consumption we can see that the maximum consumption of electricity is a lot higher than that of natural gas. A reason for this is the climate in Gainesville or, more generally, the climate in Florida, which can be described as follows: there are many more hot days than cold days. Therefore, households in Florida typically use more electric energy for cooling than gas for heating. In exercise 1.3 you will see evidence for this.

Now, we will turn to another important variable -- the effective year built (EYB).

Effective year built (EYB)

The variable EYB is actually a housing characteristic but it is of primary interest for us and thus we consider it separately.

EYB allows us to determine whether a residence was built under the regulations of the old energy code or under the new one, established in the year $2002$. To be more precise, the variable EYB states the year in which the construction of a house was completed, i.e. the year in which the final inspection took place -- not the year of some extensive remodeling (according to the paper, p. 37).

Hence, with the help of the variable EYB we can now distinguish between pre- and post-code-change residences and divide our data set into these two groups. Pre-code-change residences are houses with an EYB of 2001 or earlier; in the following we will refer to them as pre-code residences. Furthermore, we assign residences to the post-code-change regime if their EYB is 2003 or later; analogously, we will refer to them as post-code residences in the remainder of this problem set. Note that we drop houses with an EYB of $2002$ because it is the year of the code change and therefore the code regime of these residences cannot be determined uniquely.

In the following part of this exercise we will separate our data frame into the two parts of pre- and post-code residences. Therefore, we need the function filter(). You can find more details about this function in the following info box.

< info "Function filter()"

The function filter() is contained in the package dplyr. This function returns all rows of a data frame that fulfill a certain condition. Hence, we have to specify the data frame and we have to state the condition inside the brackets. Separate both by a comma.

In the following example we use the data frame dat. We want to generate a subset of this data frame containing only the billing data of the year $2002$. (The variable is called year.) The resulting data frame should be denoted by dat_02.

# load the package
library(dplyr)

# use the function filter
dat_02 = filter(dat, year == 2002)

Instead of $==$ you can for example use $<$ or $>=$ to generate different subsets of the data frame.

If you want to have more information about this function you can look at https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/filter.

>

Task: Divide the data set dat into two parts. One part should include all pre-code residences and the other part all residences built after the code change. Use the variable EYB for separation. Assign the new data sets with the names pre_code and post_code. The function filter() from the package dplyr will simplify your work. Enter your commands and then press check. If you need some advice press hint.

#< task
# load the package
library(dplyr)

# create a data frame of all residences built before the code change in the year 2002
# enter your code here...
#>
pre_code = filter(dat, EYB <= 2001)

#< hint
display("Your commands should have the following form: pre_code = filter(???,??? <= ???). To state the condition use the variable EYB and the correct year. If you need further information how to use the function filter(). Check the info box above.")
#>

#< task
# create a data frame of all post-code residences (built in the year 2003 or later)
# enter your code here...
#>
post_code = filter(dat, EYB >= 2003)
#< hint
display("Your commands should have the following form: post_code = filter(???,??? >= ???). To state the condition use the variable EYB and the correct year. If you need further information how to use the function filter(). Check the info box above.")
#>

< award "dplyr-user level 1"

Great! You are able to use the function filter() out of the dplyr package. This package is really helpful for preparing data sets. Thus, keep this package in mind! It will really simplify your work. In the following exercises you will get to know some other functions of this package.

>

The following task can be solved optionally.

Task: You can test your results of the above task for correctness. One possibility is to count the number of residences in the data set dat and then do the same for both smaller data sets. The first number should equal the sum of the latter two. The commands are already given. The function length() counts the number of elements of the vector given in the brackets. The command unique() removes all duplicate elements. We need the function unique() because we have a lot of billing data for each individual residence and we only want to count each household once. Click on the button check to see if the results are correct.

#< task
# total number of residences 
total_number = length(unique(dat$home_id))
total_number

# first, compute the number of residences built before the code change
pre_number=length(unique(pre_code$home_id))
pre_number

# second, compute the number of post-code residences
post_number=length(unique(post_code$home_id))
post_number

# third, calculate the sum of pre_number and post_number.
sum(pre_number,post_number)
#>

If the sum corresponds to the total number of $2239$ residences then you have split the data frame correctly. The results also show that the group of pre-code residences is larger than the group of residences built after the code change: the pre-code data frame contains $1293$ households and the post-code group includes $946$ different residences.

We will use the two data frames pre_code and post_code for a visualisation of our data. We will compare the energy consumption of the before and after code change residences in the following part. We also use the two data frames in exercise 2.1 and 2.2.

Visualisation of the data -- energy consumption of pre- and post-code residences

Our main aim is to analyse if the building energy codes of Florida help to save energy. Therefore, we compare the energy consumption of pre-code residences with the energy consumption of post-code buildings. A visual comparison is helpful to observe the differences of both groups and additionally to get more familiar with the data.

This section is divided into two parts. In part a) we will visualise the electricity consumption by means of two different graphics. First, we focus on the energy consumption of the individual households and thus draw a scatterplot of the data. A scatterplot enables us to visually analyse the pairwise relationship of two variables (according to Maindonald and Braun (2007), pp. 50-52, Everitt and Hothorn (2011), pp. 26-28). In this exercise we additionally distinguish between the two different code regimes by colouring the points red and blue. Second, we draw a histogram of the electricity consumption. A histogram graphically represents the frequency distribution of the data. Hence, we can get a rough idea of the distribution of our data (according to Maindonald and Braun (2007), pp. 44-46, Kohler and Kreuter (2012), pp. 185-187). Then, in part b) we proceed in the same way but this time considering the natural gas consumption.

Data preparation

First of all, we have to prepare the data; several transformations are necessary. In a first step, we reduce the data frame: we only keep data of the year $2006$ and summarise the monthly values of the energy consumption to one single value per residence. We perform these transformations to proceed as in exercise 2.1 and to reduce the number of data points so that the plots remain readable. Hence, we reduce the number of observations from $66471$ to $2239$, which is the total number of residences.

In a second step, we add two new columns which indicate if a residence was built before the code change or after the code change. These columns will be called pre and post. An entry of column pre equals $1$ if the residence was built before the code change and it is $0$ if the residence was built after the code change. The entries of the column post behave vice versa. We determine these two columns with a condition we already used two tasks earlier. We use the function as.numeric() to transform the logical values TRUE and FALSE into $1$ and $0$, respectively.

You can find more information about the function as.numeric() at https://stat.ethz.ch/R-manual/R-devel/library/base/html/numeric.html.
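As a small illustration (the vector EYB_example below is made up and not part of our data), as.numeric() converts logical values as follows:

EYB_example = c(1999, 2001, 2004)
EYB_example <= 2001              # returns TRUE TRUE FALSE
as.numeric(EYB_example <= 2001)  # returns 1 1 0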

You can find more details about the functions which we will use for data preparation and transformation in the info box below. Additionally, you can find links there to later exercises where these functions are explained and used again.

< info "Futher useful functions for data transformation"

If you like to read more about functions of the package dplyr you can take a look at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

>

If you haven't solved the last task you have to click on the edit button to be able to solve the next task.

Click on the button check to perform the first two steps of data transformation. Additionally, you will see a small part of the resulting data frame. If you like to look at the whole data frame dat_new press on the button data.

#< task
# 1. step: reduce and average the data 
dat = dat %>%
  filter(year == 2006) %>%
  group_by(home_id)  %>%
  summarise_each(funs(mean))

# 2. step: add new columns
dat_new = mutate(dat, pre = as.numeric(EYB <= 2001), post = as.numeric(EYB >= 2003))

# show a part of the transformed data frame
head(dat_new)
#>

In the third step of data preparation we want to add a column which assigns every residence to the code regime in which it was built. To be more precise, we add two columns: code_regime and value. The entries of the column code_regime will be called pre and post. Additionally, all entries of the column value have to be equal to $1$, which indicates that the residence belongs to the corresponding code regime pre or post. Thus, all rows whose value is $0$ have to be removed. We get this result by using the functions gather() and filter().

The function gather() out of the package tidyr collapses several columns into key-value pairs such that each row represents one observation. You can find more information about this package and its functions at https://cran.r-project.org/web/packages/tidyr/tidyr.pdf and in the info box in exercise 2.1.
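As a small toy example (the data frame toy below is made up and not taken from billingdata.dta), the combination of gather() and filter() works like this:

# load the package
library(tidyr)

# a tiny data frame with the indicator columns pre and post
toy = data.frame(home_id = c(1, 2), pre = c(1, 0), post = c(0, 1))

# collapse the columns pre and post into the key-value pair code_regime/value
toy_long = gather(toy, key = code_regime, value = value, pre, post)
toy_long

# keep only the rows that indicate the actual code regime of each residence
filter(toy_long, value == 1)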

Press check to perform the third step of data transformations. It will also show you the first and last few lines of our prepared data frame which will be denoted by dat_plot.

#< task
# load the package 
library(tidyr)

# 3. step: add the column code_regime with entries pre and post and the 
#          column value with entries equal to 1
dat_plot = dat_new %>%
  gather(key = code_regime, value = value, pre, post) %>%
  filter(value == 1)

# show the first few lines of the data frame
head(dat_plot)
# show the last few lines of the data frame
tail(dat_plot)
#>

The data preparation is finished. We created a smaller data frame which assigns every residence to its code regime pre or post. Now, we can start with plotting the data. In the graphics you will see why we created the column code_regime. We will need it to distinguish between the two groups of residences.

a) Electricity consumption

In this section we will visualise the residential electricity consumption. We will first draw a scatterplot and then a histogram of the data. We will especially compare the consumption of pre-code residences with the consumption of residences which were built after the code change.

Task: Draw a scatterplot of the electricity consumption of the individual residences. Hence, plot the values of the electricity consumption on the y-axis and the identification number of the residences on the x-axis. Additionally, use different colours for the pre-code and post-code residences to compare their consumption levels.

To get a nice output we use the function ggplot() of the package ggplot2. The function ggplot() creates the basic level of the graphic. The final plot is constructed incrementally by using the $+$ operator. With this operator we can add further elements to the graphic like histograms, lines or axis titles. If you haven't used ggplot() before, it is a little bit tricky. Therefore, you don't have to enter the commands of the following task by yourself. However, take a look at the given code and try to understand the commands.

The following paragraph explains the code of the code chunk below.

Before we can start with plotting we have to load the package ggplot2. In the first step we create the basic level with the command ggplot(). Inside the brackets you have to specify the data (dat_plot) and the axis by means of the command aes(). On the x-axis we will plot the identification number of the residences which is called home_id and the y-axis should cover the electricity consumption. Thus, we need the variable elec. We also add a title to the graphic with the command ggtitle("..."). We assign these settings to the variable s. The second step is used to draw the data points and add them to the basic level. Therefore, we write s + and then use the command geom_point() to create the scatterplot. To get data points with different colours for each code regime we write aes(colour=code_regime). Thus, the electricity consumption of pre-code residences is blue and the electricity consumption of post-code buildings is drawn in red. We also change the titles of the x- and y-axis with the commands xlab("...") and ylab("..."), respectively. Additionally, we use the function theme() to specify the sizes of the different text elements of the graphic. For example, the command axis.text = element_text(size=12) modifies the size of the labels of the axes. We save all previous changes in the variable s1. In the third step we call the variable s1 to plot it.

You can find further information about the package ggplot2 and its functions at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf and https://www.rstudio.com/wp-content/uploads/2015/06/ggplot2-german.pdf. Additionally, you get more details about the function geom_point() at http://docs.ggplot2.org/0.9.3.1/geom_point.html and about the command theme() and its arguments at http://docs.ggplot2.org/current/theme.html.

Now, just press check to show the plot.

#< task_notest
# load the package
library(ggplot2)

# 1. Create the basic level and add a title. Assign it to variable s.
s = ggplot(data=dat_plot, aes(x=home_id, y=elec)) + 
  ggtitle("scatterplot of residential electricity consumption") 

# 2. Add the data points with different colours for each code regime. Also add titles
# for the two axes and modify the sizes of the text elements. Assign it to variable s1.
s1 = s + geom_point(aes(colour=code_regime)) + 
  xlab("identification number of the residence") + 
  ylab("averaged monthly electricity consumption in kWh") +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17), legend.text = element_text(size=13),
        legend.title = element_text(size=14))

# 3. Call the variable to plot it.
s1
#>

We find that most data points are located in the lower part of the graphic. To be more exact, most residences consume less than $2000$ kWh of electricity per month. Keep in mind that we consider averaged consumption values. Furthermore, only nine of the $2239$ data points lie at or above the $4000$ kWh consumption level. Now, we take a closer look at the two groups. The data points of the pre-code residences are coloured blue and the points of the post-code buildings are coloured red. Recall that the number of pre-code residences is greater than the number of post-code residences; hence, there are more blue points than red ones. However, after comparing the locations of the differently coloured points we find that there are only a few red points in the upper part of the graphic -- especially above the level of $3000$ kWh of monthly electricity consumption. Hence, we can conclude that post-code residences are less likely to show a very high average electricity consumption than pre-code buildings.

Now, we will visualise the residential electricity consumption with another graphic. We will draw a histogram.

Task: Draw a histogram of the electricity consumption and use different colours for the pre-code and post-code residences. To be more precise, the resulting graphic will contain two histograms -- one for each code regime. Hence, we can easily compare both histograms.

Again, we will use some functions out of the package ggplot2. The structure of the commands is the same as in the previous task, but we use some different functions to generate a histogram. In the following paragraph you will find a short explanation of the commands which we will use for plotting.

In the first step we create the basic level with the command ggplot(). Inside the brackets of this function we specify the data frame and the variable of the x-axis. This time we will plot the electricity consumption elec on the x-axis; the counts are generated on the y-axis. We additionally define the colour of the bins (fill="..."): the pre- and post-code residences should be drawn in different colours. We also add a title to the graphic and then assign these settings to the variable p. In the second step we draw the histogram with the command geom_histogram() and add it to the basic level. We also specify four features of the histogram. First, we define the width of the bins (binwidth=...). Second, we change the colour of the borders of the bins (colour="..."). Third, we specify the position of the two histograms; we choose position="identity" to get overlying histograms. Last, we set alpha=0.5 to get semi-transparent colours. In the second step we also add manually chosen colours with the command scale_fill_manual(values=c("#FF0000", "#00CCFF")). The first colour is red (representing post-code buildings) and the second is blue (representing pre-code buildings). Additionally, we change the title of the x-axis and modify the sizes of the text elements with the function theme(). We save all previous changes in the variable p1. In the third step we only call the variable p1 to plot it.

You can get further information about the function geom_histogram() at http://docs.ggplot2.org/0.9.3.1/geom_histogram.html. Furthermore, you can find a chart of all possible colours along with their notation at http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/.

Press check to show the plot.

#< task_notest
# 1. Create the basic level with different colours for the code regime and add a title. 
# Assign it to the variable p.
p = ggplot(data=dat_plot, aes(x=elec, fill=code_regime)) + 
  ggtitle("histogram of residential electricity consumption")

# 2. Add the histograms, a different title for the x-axis and specify the two colours.
# Additionally, modify the sizes of the text elements. Assign it to the variable p1.
p1 = p + geom_histogram(binwidth=100, colour="black", position="identity", alpha=0.5) +
  # add manually chosen colours: red and blue
  scale_fill_manual(values=c("#FF0000", "#00CCFF")) +
  xlab("averaged monthly electricity consumption in kWh") +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17), legend.text = element_text(size=13),
        legend.title = element_text(size=14))

# 3. Call the variable to plot it.
p1 

#>

We see how the average monthly electricity consumption is distributed and can draw some conclusions about the distribution of the data. In general, we find two mostly overlapping histograms, and thus you see not only the colours red and blue but also a mixture of the two.

Now, we will describe and interpret this graphic. First, we consider the shape of the two histograms and find that they look similar to each other. However, the histogram of the pre-code group is mostly higher. One reason is the different group size: there are roughly $350$ fewer residences in the post-code group. Additionally, we can see that the data is not symmetric but positively skewed because it has a long tail to the right. Hence, our data can't be normally distributed (according to Maindonald and Braun (2007), p. 58). You can read further details about interpreting histograms at http://www.itl.nist.gov/div898/handbook/eda/section3/histogra.htm.

Let's come back to our data. We find that most households consume between $500$ and $2000$ kWh of electricity per month; the highest frequency lies between $900$ and $1000$ kWh. Additionally, we see that high consumption values in the tail are more frequent among the pre-code residences than in the post-code group.
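If you would like a quick numerical hint of the skewness (this is only an optional sketch, not part of the paper's analysis), you can compare the mean and the median of the electricity consumption per code regime; for positively skewed data the mean lies above the median:

# optional: compare mean and median per code regime
dat_plot %>%
  group_by(code_regime) %>%
  summarise(mean_elec = mean(elec), median_elec = median(elec))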

Finally, we want to show the scatterplot and the histogram together in one graphic to make the comparison of both plots easier. To get such an output we will use the function grid.arrange() out of the package gridExtra. You only have to specify the names of the plots inside the brackets of the function grid.arrange(). You can find further information at https://cran.r-project.org/web/packages/gridExtra/gridExtra.pdf.

At first, load the package gridExtra with the command library(). Then, use the function grid.arrange(). The scatterplot was called s1 and the histogram was denoted by p1. Click on the button check to show the graph.

#< task_notest
# load the package
library(gridExtra)

# plot the graphics below each other
grid.arrange(s1,p1)
#>

Now it is straightforward to compare both plots. Generally, the scatterplot and the histogram show the same results: most households consume less than $2000$ kWh of electricity per month, and higher consumption levels are less frequent among post-code residences. Hence, we get a rough idea of the success of the building codes in reducing electricity consumption. We will examine this result with statistical methods in later exercises.

b) Natural gas consumption

Now, we consider the natural gas consumption of the Gainesville residences. We proceed similarly to part a): we first draw a scatterplot and then a histogram. Finally, we compare and analyse both graphs.

Let's draw the scatterplot of the residential natural gas consumption. We use the same commands as in part a). However, we have to use the variable gas now.

The code is already given so just press check. However, we do not plot the graph yet. Hence, there will be no output. We will show the scatterplot together with the histogram after the next task.

#< task_notest
# load the package
library(ggplot2)

# 1. Create the basic level and add a title. Assign it to the variable t.
t = ggplot(data=dat_plot, aes(x=home_id, y=gas)) + 
  ggtitle("scatterplot of residential natural gas consumption") 

# 2. Add the data points with different colours for each code regime. Also add titles 
# for the two axes and modify the sizes of the text elements. Assign it to the variable t1.
t1 = t + geom_point(aes(colour=code_regime)) + 
  xlab("identification number of the residence") + 
  ylab("averaged monthly natural gas consumption in therms") +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17), legend.text = element_text(size=13),
        legend.title = element_text(size=14))
#>

Task: Plot a histogram of the natural gas consumption and use different colours for each code regime.

Some parts of the code are already given. Replace the ??? in the code with correct commands and then uncomment these lines. Finally, press check. Again we will not show the plot now but in the following task.

#< task_notest
# 1. Create the basic level with different colours for the code regime and add a title. 
# Assign it to the variable q.

# q = ggplot(data=dat_plot, aes(x=???, fill=???)) + 
# ggtitle("histogram of residential natural gas consumption") 
#>
q = ggplot(data=dat_plot, aes(x=gas, fill=code_regime)) + 
  ggtitle("histogram of residential natural gas consumption") 

#< task_notest
# 2. Add the histograms, the title of the x-axis and specify the two colours. Additionally,
# modify the sizes of the text elements and then assign it to the variable q1.

# q1 = q + ???_???(binwidth=4, colour="black", position="identity", alpha=0.5) +
#  # add manually chosen colours: red and blue
#  scale_fill_manual(values=c("#FF0000", "#00CCFF")) +
#  ???("averaged monthly natural gas consumption in therms") +
#  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
#        plot.title = element_text(size=17), legend.text = element_text(size=13),
#        legend.title = element_text(size=14))
#>
q1 = q + geom_histogram(binwidth=4, colour="black", position="identity", alpha=0.5) +
  # add manually chosen colours: red and blue
  scale_fill_manual(values=c("#FF0000", "#00CCFF")) +
  xlab("averaged monthly natural gas consumption in therms") +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17), legend.text = element_text(size=13),
        legend.title = element_text(size=14))

< award "Plot histograms with ggplot2"

Great! You are able to draw nice histograms by using the functions of the package ggplot2. Visualising the data is really helpful for getting more familiar with it. In particular, a histogram shows the distribution of the data.

>

Task: Draw the scatterplot t1 and the histogram q1 together in one graph. Use the function grid.arrange(). Enter your commands and then press check.

#< task_notest
# enter your code here...
#>
grid.arrange(t1,q1)

#< hint
display("Your command should have the following form: grid.arrange(plot1,plot2).")
#>

The two histograms are again mostly overlapping and thus these regions are plotted in a mixed colour.

In both plots we find that a large number of data points is located in the lower part of the scatterplot and the left part of the histogram, respectively.

Now, let's take a closer look at the scatterplot. Most data points lie below an average gas consumption of $50$ therms per month. Furthermore, a large number of data points indicate a gas consumption of zero therms per month. Let us consider the two groups of pre- and post-code residences in more detail now. When we compare the location of the data points above the consumption level of $75$ therms per month we find...

< quiz "interpretation of scatterplot"

question: Complete the previous sentence by marking the correct answer.
sc:
- ... the same number of blue and red data points.
- ... more data points of post-code residences than of pre-code buildings.
- ... twice as many data points of pre-code residences as of post-code residences.*

success: Great, your answer is correct!
failure: Not answered correctly. Try again.

>


Comparing the histograms, we find that their shapes are similar. However, the data does not seem to be normally distributed because the histograms are not symmetric but positively skewed with a long tail to the right (according to Maindonald and Braun (2007), p. 58). Noticeable is the very high bin near the origin, which is even larger for the post-code residences. We also see that most households consume less than $50$ therms of gas per month; this holds true for both histograms. However, the histogram of the post-code residences puts much more weight on the smaller consumption levels -- precisely between $0$ and $12$ therms per month. In general, at higher consumption levels the pre-code residences have a higher frequency, although there are also a few post-code residences with a very high gas consumption.

To sum up, the scatterplot and the histogram show the same results. Most residences of the post-code regime have a very low average gas consumption, and only a few residences have a really high gas consumption. Hence, the code change seems to cause a decrease in the natural gas consumption of the residences.

Recalling the results for the electricity consumption, we find that the code change has a decreasing effect on both the electricity and the natural gas consumption of the residences. We will verify these results with empirical methods in the following exercises.

Exercise 1.1 refers to pages 36 to 38 of the paper and to table 1.

Exercise 1.2 -- Housing characteristics

In the previous exercise we took a closer look at the three variables elec, gas and EYB of our original billingdata.dta data frame. This exercise deals with further important variables: several housing characteristics.

First, we will consider the characteristics of one arbitrarily selected residence. Then, we will plot some graphs of the housing attributes of all residences in our data frame.

First of all, we have to load the necessary data set.

We have just started a new exercise. Hence, you have to click on the edit button to be able to solve the next task.

Task: Load the data frame dat_hc.dta. This data frame just contains information about the housing characteristics (hc) of the residences. Therefore, this data set only has nine different columns.

The commands are already given. Just press check. If you like to take a look at the whole data frame press data.

#< task
# load the data set dat_hc.dta and assign it to dat_hc
dat_hc = read.dta("dat_hc.dta")
#>

Let's show the first few rows of the data frame dat_hc by clicking on the check button.

#< task
head(dat_hc)
#>

The different columns of the data frame dat_hc represent different characteristics of a residence. In our data set we have information about the following attributes: zip code (zip), square footage (sqfeet), number of bathrooms (baths), number of bedrooms (beds), number of stories (stories), central air-conditioning (centralair) and roof type (shingled).

Most of the variable names are self-explanatory. The last two characteristics, centralair and shingled, are indicator variables and only take the values $1$ and $0$, meaning "yes" and "no", respectively (according to Stock and Watson (2007), p. 158). Another interpretation of the values $1$ and $0$ is the presence and absence of some feature, respectively (according to Sen and Srivastava (1990), p. 83). Particularly referring to these two housing characteristics we can conclude the following:

$centralair == 1$ means that the residence has a central air-conditioning system and $centralair == 0$ means that it does not. Furthermore, $shingled == 0$ means that the house does not have a shingled roof and if $shingled == 1$ the converse holds true.

Characteristics of an arbitrary residence

In the following part you will learn how to select particular housing characteristics from the data set. Furthermore, we want to get more details about one particular residence.

Let us choose an arbitrary household. We take the residence with identification number (home_id) $128196$. To choose this household we use the function filter() out of the dplyr package, which we already used in some previous tasks. We will save this household together with its characteristics in the data frame example_home.

Furthermore, we are interested in the number of bathrooms and bedrooms of this particular residence. We will use the function select() of the dplyr package to choose these specific columns of the data frame. The command select(dat,col1,col2) keeps the two columns with the names col1 and col2 of the data frame dat. We will select the two columns baths and beds of the data frame example_home.

If you want further information about the function select() you can take a look at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

Press check to solve the following task: we will generate the data frame example_home for a single residence and then select the columns baths and beds of this data frame.

#< task
# generate a data frame which contains information about the residence 128196
example_home = filter(dat_hc, home_id == 128196)

# determine the number of bathrooms and bedrooms
select(example_home,baths,beds)
#>

We find that the residence with identification number $128196$ has three bathrooms and four bedrooms.

We want to get more information about this residence.

Task: Show further characteristics of the residence $128196$, namely zip code (zip), square footage (sqfeet) and the number of stories (stories).

You should answer the questions of the quiz below. You can use the code box to enter some command that will help you to find the correct answers. Press hint if you need some advice.

#< task_notest
# enter your code here...
#>
# one possible solution is
select(example_home,zip,sqfeet,stories)

#< hint
display("The following command will produce all necessary results: select(example_home,zip,sqfeet,stories).")
#>

< quiz "characterisitcs of example home"

parts:
  - question: 1. How large is this residence (size in square feet)?
    answer: 2602
    roundto: 1

success: Great, all answers are correct!
failure: Not all answers are correct. Try again.

>


You now know more about the characteristics of one arbitrarily chosen residence. Next, we will expand our analysis to the entire data frame dat_hc. We will focus on the variables describing the size of the houses.

Characteristics describing the size of residences

There are two relevant columns in the data set: sqfeet and logfeet. The variable sqfeet measures the size of a residence in square feet, and logfeet is its natural logarithm. We will need the variable logfeet later, in the regression models of exercise 4.
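As a quick optional check (a sketch only, not a task of the problem set), you can verify that logfeet is indeed the natural logarithm of sqfeet:

# compare the stored values with the recomputed natural logarithm
head(dat_hc$logfeet)
head(log(dat_hc$sqfeet))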

Now, we want to modify the data frame and add a new column sqmetre. This variable should also measure the size of a house, but in square metres instead of square feet. A reason for this modification is that square metres are the common unit for measuring living space in Germany, so we have a better feeling for the proportions.

Task: Add a new column sqmetre to the data frame dat_hc. Use the function mutate() out of the dplyr package. Name the expanded data frame dat_sqm. We can transform the variable sqfeet into the variable sqmetre by multiplying sqfeet by $0.092903$. We additionally round the numbers to two decimal places.

You get further information about the function mutate() at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf and you can find more details about the transformation from square feet into square metre at http://www.asknumbers.com/square-feet-to-square-meter.aspx.

Press check to perform the calculation. If you haven't solved the previous task you have to press edit first.

#< task
# add a new column called sqmetre
dat_sqm = mutate(dat_hc, sqmetre = round(sqfeet*0.092903, digits=2))
#>

Now, let's take a look at these three variables. To better compare them we only select the important columns of the data frame dat_sqm and save the reduced data frame as dat_size.

#< task
# select the columns home_id, sqfeet, sqmetre and logfeet
dat_size = select(dat_sqm, home_id, sqfeet, sqmetre, logfeet)
# show a part of the data frame
head(dat_size)
#>

Task: Calculate the size of the smallest building and the size of the largest residence in square metres. Use the data frame dat_size. One command is enough here. Enter your command and then press check.

#< task
# enter your code here...
#>
range(dat_size$sqmetre)
#< hint
display("Your command should have the following form: range(dat_name$column_name). Insert the correct data frame and column.")
#>

You see that the smallest residence is about $73$ square metres in size. In comparison, the largest residence is roughly $9.5$ times as large.

Visualisation of the data -- housing characteristics: beds, baths and centralair

Now, we want to draw some plots to illustrate the different housing characteristics and thus get a deeper understanding of these variables. We consider the entire data frame dat_hc. At the end of this section you should answer some questions regarding the housing characteristics; the following graphs will provide the answers.

First, we want to plot the number of bedrooms.

Task: Draw a bar graph of the column beds of the data frame dat_hc. Use the functions ggplot() and geom_bar() of the package ggplot2.

You can find further information about these two functions at https://www.rstudio.com/wp-content/uploads/2015/06/ggplot2-german.pdf and http://docs.ggplot2.org/0.9.3.1/geom_bar.html.

Press check to draw the graph.

#< task_notest
# 1. Create the basic level by using the data frame dat_hc. The variable beds should be 
# plotted on the x-axis. 
p = ggplot(dat_hc, aes(beds))

# 2. Add the bar graph and the title of the x-axis. 
p1 = p + geom_bar(fill= "green", colour="black") + xlab("number of bedrooms")

# 3. Plot it.
p1
#>

Task: Plot a bar graph of the number of bathrooms. Replace the ??? in the code with correct commands and then uncomment these lines.

#< task_notest
# 1. Create the basic level by using the data frame dat_hc. The variable baths should be 
# plotted on the x-axis. 
#q = ggplot(???, aes(???))
#>
q = ggplot(dat_hc, aes(baths))

#< task_notest
# 2. Add the bar graph and the title of the x-axis. 
#q1 = q + ???(fill= "blue", colour="black") + ???("number of bathrooms")
#>
q1 = q + geom_bar(fill= "blue", colour="black") + xlab("number of bathrooms")

#< task_notest
# 3. Plot it.
# ???
#>
q1

Task: Draw a bar graph of the number of residences with central air-conditioning. The variable is called centralair and takes the values $0$ and $1$, where $1$ represents residences with central air-conditioning and $0$ residences without.

Enter your commands and press check afterwards. Click on the hint button if you need further advice.

#< task_notest
# 1. Create the basic level by using the data frame dat_hc. The variable centralair 
# should be plotted on the x-axis. Assign it to r. 
# enter your code here...
#>
r = ggplot(dat_hc, aes(centralair))
#< hint
display("Your first command should look like: r = ggplot(data_name, aes(var_name)).")
#>

#< task_notest
# 2. Add the bar graph and the title "central air conditioning" to the x-axis . The 
# bars of the graph should be red and the borders black. Assign it to r1. 
# enter your code here...
#>
r1 = r + geom_bar(fill= "red", colour="black") + xlab("central air conditioning")
#< hint
display("Your second command should have the following form: 
        r1 = r + geom_bar(fill= \"???\", colour=\"???\") + xlab(\"title\").")
#>

#< task_notest
# 3. Plot it.
# enter your code here...
#>
r1
#< hint
display("Write the name of the variable to plot it.")
#>

< award "Plot bar graphs with ggplot2"

Great! You can use the function geom_bar() out of the ggplot2 package to visualise the data.

>

Now, answer the following three questions.

< quiz "housing characteristics"

parts:
  - question: 1. What's the most frequent number of bedrooms?
    answer: 3
    roundto: 1

success: Great, all answers are correct!
failure: Not all answers correct. Try again.

>


We find that central air-conditioning is really important for households in Gainesville.

Central air-conditioning

Let us now look closely at the relationship of residential energy consumption and central air-conditioning. Generally, air-conditioning is the main end use of residential electricity consumption. To be more precise, households which are located in the South Atlantic region of the U.S. use $21.2\%$ of the electricity for air-conditioning (according to U.S. EIA (2009), Table HC1.10). In contrast to this, the main end uses of residential natural gas consumption are space and water heating. To be more precise, South Atlantic residences use $6.0\%$ and $5.5\%$ of their total natural gas consumption for space heating and water heating, respectively (refer to U.S. EIA (2009), Table HC1.10).

Keep these numbers and relationships in mind because we will need them in exercises 4 and 5.

To sum up, this exercise dealt with the variables which describe the characteristics of the residences. In the following exercise we will turn to the weather variables.

Exercise 1.2 refers to pages 36 and 38 and to table 1 of the paper.

Exercise 1.3 -- Weather variables

This exercise will provide detailed information about the two weather variables: the average heating degree days (AHDD) and the average cooling degree days (ACDD). We consider these variables in more detail because the energy used for space heating and cooling depends on the weather and the temperature outside the residence.

To analyse this relationship, weather data are required. Hence, the authors of the paper collected weather data from the National Climatic Data Center. They selected one weather station which is located at the airport of Gainesville. These data had to be prepared: the authors merged them with the monthly billing data and then calculated the two weather variables average heating degree days (AHDD) and average cooling degree days (ACDD). These two variables are provided in the data frame with the column names HDD (representing AHDD) and CDD (standing for ACDD). In the info box below you can get some information about the calculation of these variables.

In this exercise we will not replicate these preparation steps because the original weather data is not available. However, the prepared data set is provided and thus we can analyse the weather variables. If you like detailed information about the preparation steps you may take a look at pages 37 and 38 of the paper.

< info "Calculation of AHDD and ACDD"

We want to calculate the two variables AHDD and ACDD. Several steps are necessary. First, we declare a reference point. The standard reference point is $65^\circ~F$ (Fahrenheit) which refers to $18.3^\circ~C$. Second, we have to calculate the average daily temperature. Then, we can determine the AHDD and ACDD by considering the difference between the average temperature and the reference point.

To sum up, we consider the average temperature of a single day to calculate the AHDD or ACDD of this day. Each degree by which the average temperature exceeds the reference point counts as one cooling degree day, and each degree by which it falls below the reference point counts as one heating degree day.
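As a small illustrative calculation (with two hypothetical average daily temperatures, not values from the data set): a day with an average temperature of $80^\circ~F$ contributes $80 - 65 = 15$ cooling degree days and $0$ heating degree days, whereas a day with an average temperature of $58^\circ~F$ contributes $0$ cooling degree days and $65 - 58 = 7$ heating degree days. In R this calculation could look as follows:

ref  = 65              # reference point in degrees Fahrenheit
temp = c(80, 58)       # two hypothetical average daily temperatures
pmax(temp - ref, 0)    # cooling degree days: 15 and 0
pmax(ref - temp, 0)    # heating degree days: 0 and 7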

reference: paper, p. 37, U.S. EPA (2014) and U.S. Department of Energy (2013)

>

Before we can start our analysis we have to load the data. We load the data frame billingdata.dta which we already used in exercise 1.

First of all, click on the button edit to be able to start with the first task. Then, press check. All commands are given because you already loaded a data frame several times before.

#< task
# load the data set billingdata.dta and assign it to dat
dat = read.dta("billingdata.dta")
#>

To get some basic information about the weather variables let us compute some summary statistics.

Some summary statistics of the weather variables

Therefore, we need some functions out of the dplyr package. You already got to know the select() command. There is another useful function called summarise() which summarises multiple values into a single value. In our case, we want to summarise whole columns to one value each. You can get further information about this function at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

In the following task we use the two commands select() and summarise() one after the other. First, we select the columns CDD and HDD from the data set dat. Then, we calculate the maximum of each column. We want to use the ${\%>\%}$ operator to combine both commands in an easily readable structure. You find more information about this operator in the info box below.

< info "Pipe operator"

Here you can find some further information on how to use the ${\%>\%}$ operator in combination with dplyr functions. The ${\%>\%}$ operator is provided by the magrittr package. However, this operator is imported by dplyr, so you don't have to load magrittr additionally.

The idea of this operator is to avoid nested function calls, which have to be read from the inside to the outside, and instead allow the functions to be read from left to right. This works because the output of the function on the left hand side is used as input of the function on the right hand side, compare Irizarry and Love (2013). Therefore, you don't have to specify the data frame inside the functions, e.g. don't write select(dat_name, col_names) but dat_name %>% select(col_names).
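For example, the following two commands produce the same result (using the data frame dat and its columns CDD and HDD which we loaded above); the piped version reads from left to right:

# nested version: read from the inside to the outside
head(select(dat, CDD, HDD))

# piped version: read from left to right
dat %>%
  select(CDD, HDD) %>%
  head()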

A simple rule for writing: first, specify the data frame and then write the functions one after the other. Connect each function with the ${\%>\%}$ operator. To get a clear structure, write only one command and ${\%>\%}$ operator per code line and then start a new line. This makes your code easier to read and understand.

You can find further information about this operator at http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html.

>

Calculate the maximum of the variables ACDD and AHDD. Press check to see the solution.

#< task
# use the %>%-operator to combine the functions select()  and summarise()
dat %>%
  select(CDD,HDD) %>%
  summarise(max(CDD),max(HDD))
#>


The two values indicate the largest differences between the average daily temperature and the reference temperature. The first value shows that the average temperature in Gainesville exceeds $65^\circ~F$ by roughly $17.6^\circ~F$ at most. The second value denotes that the average temperature in Gainesville does not fall below the reference point by more than $11.8^\circ~F$.

The results also show that the maximum of the ACDD is higher than the maximum of the AHDD. The reason for this result is that Florida is a state with a mild and sunny climate. For example, the statewide annual average temperature of Florida in the year $2002$ was $71.4^\circ~F$ with the lowest monthly average temperature of $57.5^\circ~F$ in December, according to Florida Climate Center (2016). You can find further climate data for Gainesville at http://www.usclimatedata.com/climate/gainesville/florida/united-states/usfl0163/2016/1.

Task: Select the columns CDD and HDD of the data frame dat and then compute the mean of the average cooling degree days (CDD) and the average heating degree days (HDD). Use the ${\%>\%}$ operator. Enter your commands and then press check.

#< task
# enter your code here...
#>
dat %>%
  select(CDD,HDD) %>%
  summarise(mean(CDD),mean(HDD))

#< hint
display("Your command should have the same structure like in the code box above. You only have to use another function to calculate the mean. You also may take a look at the first info box in exercise 1.1.")
#>

< award "dplyr-user level 2"

Great! You can easily deal with the functions select() and summarise() out of the dplyr package. Additionally, you know how to use the pipe operator. Hence, you are now able to handle a data frame with a lot of data and produce some summary statistics.

>

The results show that the average of the ACDD is higher than the mean value of the AHDD which can also be explained by the climate situation in Florida.

Visualisation of the data -- weather variables: AHDD and ACDD

In the previous tasks you got some overview of the weather variables. Now, we want to visualise both weather variables AHDD and ACDD to see how they behave over several years. More precisely, we want to plot the average cooling degree days and the average heating degree days by month and year over the time interval from $2004$ to $2006$.

Therefore, we have to prepare the data.

Data preparation

We already did similar preparations in the previous tasks. We need the columns CDD, HDD, month and year of the data set dat. We use the command select() to keep them. Second, we use the group_by() function. This function groups together the rows of a data frame which have the same value of a specified variable. Last, we use the command summarise() to calculate the mean values of the variables CDD and HDD. We additionally assign the resulting values to ACDD and AHDD.

Hence, we get monthly averages of the weather variables over a three-year period from the years $2004$ to $2006$. Summing up, there are $36$ mean values for each of the two variables.

Task: You only have to press check to create a new data frame dat_weather and show the result. We use this data frame in the following task to plot the weather variables by month and year. If you want to see the whole data frame of dat_weather press the button data.

#< task
# create a subset of the data frame dat and assign it to dat_weather
dat_weather = dat %>%
  # keep the four columns month, year, CDD and HDD 
  select(month,year,CDD,HDD) %>%
  # group the data with the same year and month together
  group_by(year,month) %>%
  # calculate the mean value of the columns CDD and HDD
  summarise(ACDD = mean(CDD), AHDD = mean(HDD))

# show a part of the data frame
head(dat_weather) 
#>

The resulting data frame dat_weather contains important values. We will use the values of the columns ACDD and AHDD directly for plotting. They will be drawn on the y-axis.

However, the values which we want to plot on the x-axis are still missing. In our graph the x-axis will represent the time. Therefore, we have to do a further preparation step. The aim is to add a new column to the data frame dat_weather which contains our time data. More precisely, the new column will be called date and it will cover the monthly dates from January $2004$ to December $2006$. We use the function mutate() to add this new column. Before we can do this we have to remove the existing grouping of the data with the command ungroup(). We create the time sequence with the command seq(). We have to define the initial date (from=...), the end date (to=...) and the time interval (by=...) between these two dates. The last step of the preparation is to save the expanded data frame as plot_weather so that we can use it for plotting the weather variables.

You can find more information about the function ungroup() at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf. More details about the command seq() are available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.Date.html and if you would like to know more about the different date formats in R, you can take a look at http://www.statmethods.net/input/dates.html.

Now, we perform the final step of data preparation and add a column with time data to the data frame dat_weather.

#< task
# add a date column to the data frame dat_weather
plot_weather = dat_weather %>%
  # cancel the grouping operation
  ungroup() %>%
  # create a time sequence and save it as new column with the name date
  mutate(date=seq(from=as.Date("2004-01-01"), to=as.Date("2006-12-01"), by="month"))

# show a part of the data frame
head(plot_weather)
#>


We will use the data frame plot_weather to plot the monthly averages of the weather variables on the y-axis against the monthly time period from $2004$ to $2006$.

Before plotting the weather data let's answer the questions of the following quiz.

< quiz "plot weather variables"

question: What do you expect? Two answers are correct.
mc:
  - ACDD peaks in the summer months.*
  - ACDD is high during the winter months.
  - AHDD peaks in summer.
  - AHDD is high during the winter months.*

success: Great, all answers are correct! failure: Not all answers correct. Try again.

>


Now, we plot the weather variables and you will see if your expectations hold true.

We use the functions ggplot() and geom_line() of the package ggplot2 to get a nice line graph. The code is already given and it may look more complicated than the previous ones. You don't have to understand all commands in detail. The functions are just used to get a nice plot with differently coloured lines, a legend and an appropriate description of the x-axis.

You can get more details about the command scale_x_date() and its arguments at http://docs.ggplot2.org/0.9.3.1/scale_date.html and http://www.statmethods.net/input/dates.html. If you are interested in the functions of step three, take a look at http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/ and http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/#change-size-of-tick-text-axis.text.x.

Press check to create the plot.

#< task_notest
# load the package scales which is necessary for the second step
library(scales)

# 1. Create the basic level and add the titles of the y-axis and the x-axis. Assign it to 
# the variable w. 
w = ggplot(plot_weather, aes(x=date)) + ylab("Degree Days") + xlab("Time")

# 2. Add the lines representing the variables ACDD and AHDD. Use different colours. 
# Additionally, adapt the description of the x-axis.
w1 = w + geom_line(aes(y=ACDD, col="ACDD")) + geom_line(aes(y=AHDD, col = "AHDD")) +
  scale_x_date(labels=date_format("%m/%Y"), breaks = date_breaks("2 months"))  

# 3. Add a legend and rotate the text of the x-axis. Additionally, modify the sizes of the
# text elements. Assign it to w2.
w2 = w1 + scale_colour_manual(name="",values=c("ACDD" = "blue","AHDD" = "red")) +
  theme(axis.text.x=element_text(angle=40, vjust=0.7), axis.text = element_text(size=12), 
        axis.title = element_text(size=15), legend.text = element_text(size=13))

# 4. Call the variable to plot it.
w2
#>

This graphic is based on figure 5 of the paper.

In general, the plot shows the variability of the weather. When you take a closer look at the graph you can see three main differences between the two lines. First of all, the peaks of ACDD are higher than the ones of AHDD. We have already observed this fact in previous tasks where we calculated the mean values and the maximums of both weather variables. Second, the peaks of ACDD are broader -- almost twice as broad -- than the ones of AHDD. This indicates that heating is less important in Florida. The third difference concerns the seasons of the year. ACDD peaks in the summer months and is close to zero during the winter. AHDD behaves exactly the opposite way, as you might have already expected.

Summing up, we see a high variability of the weather variables. These changes are the main reason for fluctuations in residential energy consumption. Keep this relationship in mind as we will come back to the weather variables and their influence on the energy consumption in exercise 5.

Exercise 1.3 refers to pages 37, 38 and 42 and to table 1 of the paper.

Exercise 2 -- Building code change in more detail

In this exercise we will analyse in more detail the differences between the residences which were built before the code change and the ones which were constructed after it. We will take a look at the energy consumption and the housing characteristics of these residences.

First of all, you get some further information about the building code and its changes.

Most U.S. states have state-wide building energy codes which regulate the energy efficiency of newly built residences. The aim of these building codes is a reduction in the energy consumption of these buildings and consequently also a decrease in the carbon dioxide emissions.

Now, we will focus on the building code which has been established in Florida. This building code states a minimum standard for the overall energy efficiency of a newly built residence compared to a baseline home. That means, certain components of the new building are allowed to use more energy than the baseline home but in total the new house has to be more efficient. The building code determines the characteristics of the baseline home. Hence, a tightening of the building code leads to a more energy efficient baseline home. We will look closely at the code change established in March $2002$ and at its effects. The Florida Building Commission implemented three major and several smaller changes to the building code. The most important change was the reduction of the Solar Heat Gain Coefficient from $0.61$ to $0.4$. This coefficient is "the amount of solar heat that actually enters the window compared to the amount that strikes it on the outside" (see http://www.energygauge.com/FlaRes/new_code.htm).

You can get further information about the building code change directly from the paper. You may take a look at pages 34 and 36. Additionally, you can find more details at the following two links: http://www.energygauge.com/FlaRes/new_code.htm and https://energy.gov/eere/buildings/building-energy-codes-program.

In the following exercises we will analyse whether the building code change in the year $2002$ caused a reduction in actual energy consumption. Additionally, we will look at the effects of the code change on the housing characteristics. In more detail, in exercise 2.1 we calculate the mean values of the energy consumption and of the housing attributes separately for the residences built before and after the code change in order to compare them. We will also plot these values to get a rough idea of the effects of the code change. In exercise 2.2 we will perform a two sample t-test to support our findings of exercise 2.1.

This exercise refers to pages 36 and 38 and to table 2 of the paper.

Exercise 2.1 -- Comparison of the mean values

In this exercise we will get a rough idea of the effects of the code change. Therefore, we calculate the mean values of the energy consumption and of the housing attributes separately for the pre- and post-code residences.

Before calculating these values we have to prepare the data in several steps. One important step is to divide the data frame into two parts: one containing the data for all pre-code residences and the other containing the data for all residences built after the code change. Then, we will calculate the mean values and compare them. We will also draw some plots for a better visualisation of the differences.

First of all, we have to load the data. In this exercise we use the data frame codechange.dta. This data frame is a part of the original data frame. It contains $12$ different columns. These variables are elec, gas and six housing characteristics. We want to compare the mean values of these eight variables. The remaining four variables of this data frame are used to prepare and separate the data frame.

Because you have started a new exercise you have to click on the edit button to be able to enter your code in the code box.

Task: Load the data frame codechange.dta and assign it to dat2. Remember the last exercise how to do this. If you need help you can press hint. Additionally, print the first few lines of the data frame by using the command head().

#< task
# load the data frame codechange.dta  
# enter your code here...  
#>
dat2 = read.dta("codechange.dta")

#< hint
display("To read in the data your command should have the following form: 
        dat2 = read.dta(\"???\"). Replace the ??? by the name of the data set you want to load.")
#>

#< task
# show a part of the data frame dat2 
# enter your code here...  
#>
head(dat2)

#< hint
display("Enter head(???). Replace the ??? with the name of the data frame.") 
#>


Now, we can work with the data frame dat2. This data frame has $12$ columns containing variables which describe the energy consumption and some housing characteristics. Additionally, two columns characterize the code regime, i.e. whether the residences were built under the rules of the old building code or under the regulations of the new, more stringent one.

We want to divide this data frame into two subsets -- one containing the pre-code buildings and the other the post-code buildings. After that we will calculate some mean values and then compare both groups.

First of all, we have to perform several steps in order to prepare the data.

Data preparation

First, the authors Jacobsen and Kotchen only used billing data of the year $2006$ to determine the differences between pre- and post-code residences. Hence, we will also only consider data of $2006$ to obtain the same results as in the paper. To perform this step in R we use the function filter() out of the dplyr package and choose the condition year == 2006. Second, the data frame should contain only one observation of each variable per house. This means the number of rows of this data frame should be $2239$. Every household is characterized by an identification number. Hence, we use the command group_by(home_id). Furthermore, we need a function that reduces the number of observations separately for each column. To be more precise, we need a function that pools the monthly values together into a single value, separately for each column. This function is called summarise_each(). Inside the brackets you have to define the functions you want to apply to all columns. In our case, we will use the function mean and thus write funs(mean).

In previous tasks we used the function summarise() for pooling values together. However, in this task we need the function summarise_each() because we have a large number of columns and it would be tedious to write out each and every column name. Hence, when using the function summarise_each() we don't have to explicitly define the different columns because it applies the pooling function to all columns automatically. If you would like to read more about the function summarise_each() you can take a look at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

Task: Prepare the data frame dat2 as described above and assign it to dat_house. You have to replace the ??? by the correct commands. Then, uncomment the lines and check your solution.

#< task
# replace the ??? with the correct commands and uncomment the code

# dat_house = ??? %>% 
#   ???(year == 2006) %>%
#   group_by(home_id) %>%
#   summarise_each(funs(mean))
#>

dat_house = dat2 %>% 
  filter(year == 2006) %>%
  group_by(home_id)  %>%
  summarise_each(funs(mean))

The following task can be solved optionally.

Task: You may like to take a look at the prepared data frame. You can use the command head() to show the first few lines of dat_house. The function as.data.frame() converts the data set dat_house into a data frame. The latter function is used so that all columns are printed in the output.

#< task
# enter your code here...
#>
head(as.data.frame(dat_house))

Because we are interested in the differences between pre- and post-code buildings we have to divide the prepared data frame dat_house into two groups.

Variable code_change

To separate the data frame we use the variable code_change. This variable represents the building code change and thus it is directly connected to the variable EYB. The variable code_change is an indicator variable, i.e. it takes only the values $0$ and $1$. In more detail: code_change equals $0$ for residences built before the code change and $1$ for residences built after the code change.

Now, we separate the data frame into two groups by using the variable code_change. You already performed similar computations in exercise 1.1. Thus, you only have to press check.

If you haven't solved the previous task you first have to click on the edit button.

#< task
# form a data frame which contains all residences built before the code change
pre_code  = filter(dat_house, code_change == 0)

# form a data frame which contains all houses built after the code change
post_code  = filter(dat_house, code_change == 1)
#>

You can optionally take a look at the data frames pre_code and post_code. Press check if you would like to see the first few lines of the data frames. Otherwise continue with the next task.

#< task
# show parts of the data frame pre_code and post_code
head(as.data.frame(pre_code))
head(as.data.frame(post_code))
#>

We have to do another step of preparation before we can start with calculating the mean values. We are not interested in all variables of the two data frames pre_code and post_code. Thus, we have to remove some columns. To be more precise, we only want to keep the following eight variables: elec, gas, sqfeet, baths, beds, stories, centralair and shingled. Hence, we have to remove four columns from our data frames. We will use the function select() out of the dplyr package with a minus in front of the variable name. For example, the command select(dat, -year) creates a new data frame which is a subset of the data frame dat without the column year.

You can get more information about the function select() at https://cran.r-project.org/web/packages/dplyr/dplyr.pdf.

If you haven't solved the previous task first press edit.

Task: Remove the columns home_id, year, code_change and EYB from the data frames pre_code and post_code. Use the command select() and assign the new data frames to pre_code2 and post_code2, respectively. Replace the ??? in the code by your commands and then uncomment these lines. Finally, press check.

#< task
# replace the ??? by correct commands and uncomment the lines
# pre_code2 = ???
#>
pre_code2 = select(pre_code,-home_id, -year, -code_change, -EYB)

#< hint
display("Your command should look like:
pre_code2 = select(dat_name,-col_name1, -col_name2, -col_name3, -col_name4). Insert the correct names of the data frame and columns you want to remove.")
#>

#< task
# replace the ??? by correct commands and uncomment the lines
# post_code2 = ???
#>
post_code2 = select(post_code,-home_id, -year, -code_change, -EYB)

#< hint
display("The command is similar to the one above. You only have to change the name of the data frame.")
#>

< award "dplyr-user level 3"

Great! You get more and more familiar with the dplyr package. Now, you know many functions of this package and you hopefully see how important this package is for preparing data frames.

>

The following task can be solved optionally.

Task: You may take a look at the prepared data frames pre_code2 and post_code2. If you like press check to show the first six lines of the data frames. Press data to see the whole data frames.

#< task
# show parts of the data frames pre_code2 and post_code2
head(pre_code2)
head(post_code2)
#>

All preparations are done. The data frames now contain the necessary variables representing the energy consumption and some housing characteristics.

Calculation of mean values

In order to compare the residences built before and after the code change, we now calculate the mean values of all these variables.

Press edit if you haven't solved the previous task.

We will compute these values in the following task. Again, we will use the function summarise_each() to perform the calculations for each column separately. We will save the results as beforeCC if we used the data frame pre_code2 and as afterCC for the data frame post_code2. The code is already given. Thus, press check to perform the calculations and to show the resulting values.

#< task
# calculate the mean values of all 8 variables of the data frame pre_code2 
beforeCC = summarise_each(pre_code2, funs(mean))
beforeCC

# calculate the mean values of all 8 variables of the data frame post_code2 
afterCC = summarise_each(post_code2, funs(mean))
afterCC
#>

Both vectors contain eight different values. These values represent the mean values of the corresponding variables. In the following part of this exercise we want to take a closer look at these values. It is really helpful to visualise them in order to determine the differences. Hence, we will draw some plots.

Further data preparation -- using the function gather()

We will generate a graphic consisting of smaller plots for each variable. Therefore, we have to transform the data such that each row represents one observation and each column contains the values of one of the variables. The function gather() out of the package tidyr transforms the data from wide format into long format. We already used this function in exercise 1.1. Take a look at the info box below the code chunk if you need further information about this function or if you like to see the resulting data frames which we will create in the following code chunk.

Click on the button check to vary the format of the data frames.

#< task
# prepare the data frames: change wide format to long format
# pre-code data frame
BCC_plot = gather(beforeCC)
#post-code data frame
ACC_plot = gather(afterCC)
#>

The info box below contains further information about the function gather() and an example. You can find additional information about wide and long format in Andreß et al. (2013), pp. 16-19.

! start_note "Package tidyr and function gather()"

The package tidyr contains different functions which transform the data into a clear and simple format. One important function out of this package is the function gather(). This function collapses several columns into key-value pairs such that each row represents one observation and each column contains the values of one of the variables. Hence, this function transforms the data from wide format into long format. This means, the resulting data frame has fewer columns but a lot more rows than the original data frame.

You can receive more information about this package and the function gather() at https://cran.r-project.org/web/packages/tidyr/tidyr.pdf.

To illustrate these explanations we take a look at the transformation of our data frame that we performed in the previous code chunk. We choose the data frame afterCC which consists of averaged data of the post-code residences. This data frame is saved in wide format with eight different columns. Press check to show the data frame.

#< task
afterCC
#>

Then, we use the function gather() to transform the data frame into long format. The resulting data frame is called ACC_plot. Click on the check button to show the data frame.

#< task
ACC_plot
#>

Now, you can clearly see the transformation. The resulting data frame ACC_plot has two columns and eight rows. The column key contains the different variable names and the column value contains the different mean values.

! end_note

Visualisation -- plotting the mean values

All preparations of the data frames are done. So, we can now plot the data. We want to plot the mean values of the energy consumption and the housing characteristics of pre- and post-code residences in order to compare them. Each of these variables should be plotted separately and the corresponding mean values should be drawn inside the smaller plots. However, all eight smaller plots together should form the whole graphic.

Again, we will use the functions out of the package ggplot2 for plotting. This time, we need several functions to specify the features of the plot. There are some comments written in the code box. However, you can find a more detailed description of the code lines in the following info box.

< info "Description of the code for plotting the mean values"

This info box gives detailed information about the code of the following chunk. The following code creates a graphic for the mean values of the different variables of our data frames with separate features (i.e. colour and shape) for pre- and post-code residences.

We want to plot the mean values on the y-axis and the different variables on the x-axis. Therefore, we write aes(x=key, y=value). We can't specify the data frame in the command ggplot() because we have two different data frames. Hence, we define the data frame with the command data=dat_name inside the function geom_point(). We use this function because we want to plot a few discrete values. We add this function in the second step. Back to the first step: we further add the titles of the x- and the y-axis.

To distinguish between pre- and post-code values we use different colours (colour = ???) and shapes (shape = ???) to draw the points. We have to specify both arguments manually by using the functions scale_colour_manual() and scale_shape_manual() to be able to add a legend. The different shapes are represented by numbers from 1 to 25. We add these manual scales in step 3. Back to the second step: the argument size=2 enlarges the points. In addition, we don't want just one graphic but several smaller plots next to each other. Every variable (saved in the column key of the data frames) should be contained in one smaller plot. To get such a graphic we use the function facet_grid(. ~ key). The arguments scales = "free" and space = "free" allow varying scales and panel sizes.

Last but not least, we want to add a legend. We use the two functions scale_colour_manual() and scale_shape_manual(). We additionally use the command theme(legend.position = "bottom") to change the position of the legend from the left side to the bottom. The arguments axis.text.x = element_blank() and axis.ticks.x = element_blank() are used to get rid of the axis ticks and the text of the x-axis. In this graphic they are not necessary because every subplot only contains one variable and already has a title. Furthermore, we use the other arguments of the function theme() to modify the sizes of the text elements of this graphic.

You can find further information about the function facet_grid() at http://docs.ggplot2.org/0.9.3.1/facet_grid.html. Additionally, you can find a table of the point shapes and their corresponding numbers at http://www.cookbook-r.com/Graphs/Shapes_and_line_types/. More details about the command theme() and its arguments are available at http://docs.ggplot2.org/current/theme.html and at http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/.

>

Just press check to create the plot. If you haven't solved the previous task you have to press edit first.

#< task_notest
# create the basic level with titles for the y- and x-axis
p = ggplot(NULL, aes(x=key, y=value)) + ylab("mean values") + xlab("variables")

p1 = p + 
  # add the mean values of the pre-code data frame 
  geom_point(data=BCC_plot, aes(colour="pre", shape="pre"), size=2) + 

  # add the mean values of the post-code data frame 
  geom_point(data=ACC_plot, aes(colour="post", shape="post"), size=2) +   

  # add panels, i.e. split the plot into several parallel subplots  
  facet_grid(. ~ key, scales="free", space="free") 

# add a legend with manually declared points (colour and shape), specify its position
# and modify the sizes of the text elements
p2 = p1 + 
  scale_colour_manual(name="", values=c("pre"="blue", "post"="red")) +

  scale_shape_manual(name="", values=c("pre"=4,"post"=3)) +

  theme(legend.position="bottom", axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), axis.text = element_text(size=12), 
        axis.title = element_text(size=15), legend.text = element_text(size=13),
        strip.text = element_text(size=13)) 


# call the variable to plot it
p2
#> 

The graphic shows the mean values of the pre-code buildings (coloured in blue) in comparison to the post-code residences (coloured in red) separately for eight different variables. It is difficult to detect the differences for the particular variables because most of the points lie very close to each other. Only the mean values of the electricity consumption and the square footage differ visibly. We can see that the mean values of the pre-code residences are higher than the ones of the post-code buildings. However, for the remaining six variables we cannot make a clear statement. Therefore, we will use another format with different scales on the y-axes.

Hence, we will use the function facet_wrap(~ key) instead of facet_grid(. ~ key). We specify ncol = 8 to receive a graphic with eight smaller plots next to each other. Furthermore, we add the command expand_limits(y=0). This command specifies the scale of the y-axis such that the value $0$ is included.

You can find further information about the function facet_wrap() at http://docs.ggplot2.org/0.9.3.1/facet_wrap.html. Additionally, you can find more details about the command expand_limits() at http://docs.ggplot2.org/0.9.3/expand_limits.html.

All other commands remain unchanged compared to the previous code chunk. Hence, just press check to see the graphic.

#< task_notest
q = ggplot(NULL, aes(x=key, y=value)) + ylab("mean values") + xlab("variables")

q1 = q + 
  geom_point(data=BCC_plot,aes(colour="pre", shape="pre"), size=2) +

  geom_point(data=ACC_plot, aes(colour="post", shape="post"), size=2) +

  # use this function to get a different format of the graphic
  facet_wrap(~ key, ncol=8, scales="free") + expand_limits(y=0) 

q2 = q1 + 
  scale_colour_manual(name="",values=c("pre"="blue", "post"="red")) +

  scale_shape_manual(name="", values=c("pre"=4,"post"=3)) +

  theme(legend.position="bottom", axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), axis.text = element_text(size=11), 
        axis.title = element_text(size=15), legend.text = element_text(size=13),
        strip.text = element_text(size=13)) 

q2
#>

Now, you can see the differences in the mean values much more clearly. However, keep in mind that every smaller plot has a different scale on the y-axis. Once more, we see the differences in the variables elec and sqfeet for the two groups. Additionally, we observe differences in the gas consumption which we haven't seen in the previous plot.

Take a closer look at the plot and answer the questions of the following quizzes.

< quiz "comparison of pre-code and post-code residences"

question: Which statement about the energy consumption of the residences is correct (including both electricity and natural gas)?
sc:
  - Post-code residences consume more energy than pre-code residences.
  - The natural gas consumption of post-code buildings is smaller but the energy consumption is higher in comparison to residences built before the code change.
  - Houses built after the building code change need less energy than houses constructed before the code change.*

success: Great, your answer is correct! failure: Not correct answered. Try again.

>


In general, our data indicates that post-code residences consume less energy than pre-code buildings. Now, we will look at the differences of the energy consumption in more detail.

In order to answer the next questions you can enter some code in the box below. If you need some advice press hint.

#< task_notest
# enter your code here...
#>
# answer of the first question:
round(beforeCC$elec - afterCC$elec, digits=2)
# answer of the second question:
round(beforeCC$gas - afterCC$gas, digits=1)

#< hint
display("Performing following calculations will help you to solve the questions: 
        for the 1. beforeCC$elec - afterCC$elec, 
        for the 2. beforeCC$gas - afterCC$gas.")
#>

< quiz "further comparison of pre-code and post-code residences"

parts:
  - question: 1. How many kWh of electricity do post-code residences use less than houses built before the code change, on average? Round the number to two digits after the decimal point.
    answer: 44.85
    roundto: 0.01

success: Great, all answers are correct! failure: Not all answers correct. Try again.

>


To sum up, we considered the differences in the mean values of the energy consumption. We found differences in both electricity and natural gas consumption.

Now, we will briefly analyse the mean values of several housing characteristics. These variables do not really differ between the two groups, as we already observed in the first graphic. However, there is one exception: residences which were built after the code change are noticeably smaller than houses which were built before the code change. The size is measured by the variable sqfeet. To be more precise, the residences built after the code change are on average about $4.5\%$ smaller than the ones built before the code change.
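If you like, you can verify this relative difference yourself with the mean values beforeCC and afterCC computed above; this is just a quick check, not part of the original analysis:

# relative difference of the average square footage in percent
round((beforeCC$sqfeet - afterCC$sqfeet) / beforeCC$sqfeet * 100, 1)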

The authors of the paper additionally calculated the standard deviation of these values. You can find the results in table 2 of the paper. If you like to calculate standard deviations in R you can use the command sd().

Exercise 2.1 refers to page 38 and to table 2 of the paper.

Exercise 2.2 -- Statistical analysis by using a two sample t-test

In the previous exercise we analysed the mean values of electricity and natural gas consumption and of several characteristics of the pre- and post-code residences. Our results were the following: residences built after the code change use less energy (i.e. electricity and natural gas) than pre-code buildings. Additionally, we found out that post-code residences are smaller than pre-code buildings.

In this exercise we want to verify these results based on a statistical test. We use the two sample t-test.

First of all, you get some theoretical background information about the t-test. Then, you will learn how to implement this test with R. Last but not least, you can solve some tasks and apply your knowledge to our data.

Theoretical information about the two sample t-test

The two sample t-test checks whether the mean values of two groups or samples are equal to each other or differ. In our case, we will test whether the characteristics of the pre-code residences are smaller than or equal to the characteristics of the residences built after the code change. We know that the two groups are independent. Let us assign the pre-code residences to group 0 (with mean value $\mu_0$) and the post-code residences to group 1 (with mean value $\mu_1$). We additionally define the difference $diff$ of the mean values as $diff = \mu_0 - \mu_1.$ Hence, we test whether the null hypothesis $$H0: diff \le 0$$ holds true. In contrast, the alternative hypothesis is $$H1: diff > 0.$$ To be more precise, we test a one-sided null hypothesis. In our case the null hypothesis is left-sided (according to Weerahandi (1995), p. 29).

You can find further information about the two sample t-test and about hypothesis testing in the following info box.

< info "Two sample t-test and hypothesis testing"

The test statistic TS of a two sample t-test is $$TS = \frac{(\overline{\mu_0}-\overline{\mu_1}) - 0}{SE_{pooled}(\overline{\mu_0}-\overline{\mu_1})}\ $$ if both samples have the same variance. $\overline{\mu_0}$ and $\overline{\mu_1}$ are estimators of $\mu_0$ and $\mu_1$, respectively and $SE_{pooled}(X)$ denotes the pooled standard error of X (according to Stock and Watson (2007), pp. 83-84, 88-92).

We assumed equality of variances in the previous formula and this holds true for our two samples. However, we will not prove the equality or homogeneity of variances. If you are interested in different proofs you can take a look at http://www.cookbook-r.com/Statistical_analysis/Homogeneity_of_variance/.

If H0 holds true, TS is Student-t-distributed with $m+n-2$ degrees of freedom (df) (according to Stock and Watson (2007), pp. 89, 91), where $m$ is the sample size of group 0 and $n$ the sample size of group 1. In our case, it holds that $m = 1293$ and $n = 946$ and hence $df = 2237$. This means that we have $2237$ linearly independent sample observations which we use for the calculation of the test statistic (see Kennedy (2008), p. 505). Because the sample sizes of both groups are large, we can conclude by the central limit theorem (CLT) that the test statistic is approximately standard normally distributed (according to Stock and Watson (2007), pp. 83-84, 91-92).

You can find more information about the CLT in Stock and Watson (2007) on pages 52 to 55.

Furthermore, we denote a realisation of the test statistic TS by ts.

To test whether a hypothesis holds true we need two additional values: the p-value and the significance level $\alpha$.

The p-value is the probability of obtaining a realisation of the test statistic which is at least as extreme as its observed value, given that the null hypothesis H0 holds true (see Kennedy (2008), p. 508 and Weerahandi (1995), p. 29). For our one-sided test you can compute the p-value as $$p = Pr(TS \ge ts ~|~ H0 ~holds ~true)$$ (according to Weerahandi (1995), pp. 30-31).

We employ the (typical) confidence level of $0.95$. This means, the significance level $\alpha$ is equal to $0.05$ (according to Stock and Watson (2007), p. 79).

We can conclude whether the null hypothesis holds true or not by comparing the significance level with the p-value. A small p-value indicates that the data doesn't support the null hypothesis well but favors the alternative hypothesis. Hence, if the p-value is smaller than $\alpha = 0.05$ we can reject the null hypothesis $H0: diff \le 0$ at a significance level of $5 \%$ (according to Weerahandi (1995), pp. 30-31, 110-111).
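To make these formulas concrete, here is a minimal sketch with two small made-up samples (not data from the paper):

# two hypothetical samples for group 0 and group 1
x0 = c(102, 98, 110, 95, 107, 101)
x1 = c(96, 91, 104, 89, 99, 94)
m = length(x0); n = length(x1)

# pooled standard error of the difference in means (equal variances assumed)
s2_pooled = ((m - 1) * var(x0) + (n - 1) * var(x1)) / (m + n - 2)
se_pooled = sqrt(s2_pooled * (1 / m + 1 / n))

# realisation ts of the test statistic TS and its degrees of freedom
ts = (mean(x0) - mean(x1)) / se_pooled
df = m + n - 2

# one-sided p-value: probability of a realisation at least as extreme as ts under H0
1 - pt(ts, df)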

>

Now, you are familiar with the theory behind the t-test and thus we will perform this test with our data.

Before doing this we have to load the data. We want to use prepared data frames because we already performed the data transformation in the previous exercise. The two data frames of the previous exercise were called pre_code2 and post_code2. Remember that these data frames only contain billing data of the year $2006$.

In this exercise we load the data frames pre_code2.dta and post_code2.dta which are equal to the transformed data frames of exercise 2.1. We load these data frames with the command read.dta() and assign them to pre_code and post_code, respectively. The commands are already given. First, press edit and then check to load the data.

#< task
# load the two data frames
pre_code = read.dta("pre_code2.dta")
post_code = read.dta("post_code2.dta")
#>

The following task can be solved optionally.

Task: You may like to look at the prepared data frames again. Use the command head() to show the first few lines of the two data frames pre_code and post_code.

#< task
# show parts of the data frames pre_code and post_code
#>
head(pre_code)
head(post_code)

Now, we want to perform the two sample t-test with our data.

Implementation in R

The corresponding R function is called t.test(). You may take a look at the info box below to get further information about this function and its arguments. If you are already familiar with this function you can directly solve the following tasks.

< info "Function t.test()"

We want to perform a t-test with our data. The corresponding R function is called t.test(). This function has several arguments which you can see in the following code box.

# t test with default options
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, 
       paired = FALSE, var.equal = FALSE, conf.level = 0.95)

We have to specify some arguments to perform an appropriate test for our data.

First of all, we have two samples pre_code and post_code. Therefore, we have to assign the corresponding data sets to x and y, respectively. Furthermore, we want to test several variables of the two data frames and not the whole data frames themselves. Hence, we have to define the sample and the variable name, for example we choose post_code$elec. Then, we have to use the alternative greater because we defined our alternative hypothesis H1 as $H1: diff > 0$. Our variable diff defined above corresponds to the argument mu. Last, we have to set var.equal = TRUE, or var.equal = T for short, because the variances are equal. If the variances of both samples were not equal, the Welch test would be used.

All other arguments can be left unchanged.

You can see the function t.test() with the adjusted arguments in the following code box. Additionally, we can omit all default options to get a function call which is shorter and therefore clearer.

# t test with specified options
t.test(x = pre_code$var_name, y = post_code$var_name, alternative = "greater", mu = 0, 
       paired = FALSE, var.equal = TRUE, conf.level = 0.95)

# t test with specified options and omitted default options
t.test(x = pre_code$var_name, y = post_code$var_name, alternative = "greater", 
       var.equal = TRUE)

You can find further information at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/t.test.html.

>

Application to our data

Task: Test if the average electricity consumption of pre- and post-code buildings is equal to each other. Use a two sample t-test. The code is already given. Just press check.

#< task
test1 = t.test(pre_code$elec, post_code$elec, alternative = "greater", var.equal = T)
# show the output
test1
#>

Now, you can see the test results. The next step is to interpret these results. The following three values are important for the interpretation: the t-statistic ($t = 1.7934$), the p-value ($p$-$value = 0.03652$) and the confidence level ($1-\alpha = 0.95$).

We compare the significance level with the p-value to decide whether we can reject the null hypothesis or not. The p-value of our test is equal to $0.03652$. Hence, our p-value is smaller than $\alpha = 0.05$ and thus we can reject the null hypothesis $H0: diff \le 0$ at a significance level of $5\%$. Therefore, we can accept the alternative hypothesis and say that pre-code residences consume significantly more electricity than post-code residences at the $5\%$ significance level.
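If you like, you can additionally recover the reported p-value from the reported t-statistic yourself, using the $2237$ degrees of freedom mentioned in the info box above:

# one-sided p-value corresponding to the reported t-statistic
1 - pt(1.7934, df = 2237)
# roughly 0.0365, in line with the p-value reported by t.test()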

Task: Perform another two sample t-test. Check whether the null hypothesis "The average natural gas consumption of residences built before and after the code change is equal to each other" can be rejected or not.

Save the test results in the variable test2. Replace the ??? by the correct code and then uncomment this line. Afterwards, press check. If you need some help press hint.

#< task
# perform the t-test
# test2 = ???
#>
test2 = t.test(pre_code$gas, post_code$gas, alternative = "greater", var.equal = T)

#< hint
display("Your command should look like: 
        test2 = t.test(pre_code$gas, post_code$gas, alternative = \"???\", var.equal = ???). Replace the ??? by the correct arguments. You can find further information in the info box.")
#>

# show the results
test2

< award "Hypothesis tester"

Great! You performed a two sample t-test on your own! You are now able to use a statistical test to check whether a hypothesis can be rejected or not.

>

Looking at the test results, we see that the p-value is $1.182e$-$15$ and thus approximately $0$. Therefore, it is below the significance level $\alpha = 0.05$. Hence, we can reject the null hypothesis and conclude that buildings constructed after the code change use significantly less gas for heating than houses built before the code change, at the $5\%$ significance level.

Summarising the results of both tests, we have evidence for the statement that post-code residences consume less energy. Post-code buildings need on the one hand less electricity for cooling and on the other hand less natural gas for heating. Hence, we can affirm the results of exercise 2.1.

Now, we consider the housing characteristics. We want to know whether the energy code change leads to statistically significant changes in the mean value of these variables.

We do this analysis exemplarily on the basis of two attributes, namely the square footage (sqfeet) and the number of bedrooms (beds). We expect significant changes in the variable sqfeet because of the results of the previous exercise. There, we observed visible differences in the mean values of the variable sqfeet.

Task: Run a two sample t-test for each of the variables sqfeet and beds. We will analyse the differences between the samples pre_code and post_code. Replace the ??? in the code with the correct commands and uncomment these lines. Then, press check.

#< task
# two sample t-test for the variable sqfeet
# test3 = ???
# test3
#>
test3 = t.test(pre_code$sqfeet, post_code$sqfeet, alternative = "greater", var.equal = T)
test3

#< task
# two sample t-test for the variable beds
# test4 = ???
# test4
#>
test4 = t.test(pre_code$beds, post_code$beds, alternative = "greater", var.equal = T)
test4


Now, answer the questions of the following two quizzes. The results of the previous t-tests provide the answers.

< quiz "interpretation of test results"

question: Consider the results of test3. Denote the correct statement.
sc:
  - The null hypothesis can not be rejected at a significance level of 5%.
  - Post-code buildings are significantly smaller than pre-code buildings at the 5% significance level.*

success: Great, your answer is correct! failure: Not correct answered. Try again.

>


< quiz "further interpretation of test results"

question: Look at test4 and the results. Denote the correct statement.
sc:
  - The p-value is smaller than alpha.
  - The null hypothesis can not be rejected at the 5% level.*
  - The null hypothesis "The average number of bedrooms of pre-code houses is greater than or equal to the number of post-code residences" is true.

success: Great, your answer is correct! failure: Not correct answered. Try again.

>

< award "Interpreter of test results"

Great! You are able to interpret the test results of two sample t-test correctly. Furthermore, you can decide if a null hypothesis can be rejected or not.

>


We won't perform t-tests for the remaining housing characteristics because the results are either not statistically significant or the differences are very small, as we already saw in the plots of exercise 2.1. If you would like to see the results you can take a look at table 2 of the paper.

You can find details about how to perform a two sample t-test with Stata in Kohler and Kreuter (2012), pages 232 to 235.

Summarising, we performed four different t-tests in this exercise. Due to the results of $test1$ to $test3$ we could reject the null hypothesis of each of these tests. Additionally, these results supported our expectations formulated in exercise 2.1. Therefore, we can conclude that post-code residences use significantly less electricity and natural gas at the $5\%$ significance level. Furthermore, we can conclude that post-code residences are significantly smaller than residences built before the code change -- also at the $5\%$ significance level. We will support these results, especially the first one, by further statistical analysis in exercises 3 and 4.

This exercise refers to page 38 and to table 2 of the paper.

Exercise 3 -- Introduction to regression analysis

In this exercise we will get some information about linear regression theory. This basic knowledge is important for understanding the following exercises and also to be able to perform the estimations in R.

Our aim is to calculate estimates for the differences in electricity and natural gas consumption between the residences built before and after the code change. For this purpose, we will set up a linear regression model and after that we will estimate the differences using ordinary least squares (OLS) estimation.

In exercise 3.1 you will learn about regression theory and how to fit linear models in R. In exercise 3.2 we will introduce a simple regression model of our data and estimate the coefficients. Additionally, you will learn how the regression results can be interpreted.

The following exercises -- exercises 4 and 5 -- build upon these methods. Hence, we will deepen the regression analysis of our data in these exercises.

Exercise 3.1 -- Regression theory and implementation in R

First, we state some basic information about regression theory. We introduce the ordinary least squares (OLS) estimator and learn about the assumptions which have to be fulfilled to obtain a good estimator. We also mention how to deal with violations of these assumptions. Then, we learn how to fit linear models in R. There are several ways to do so. In this problem set we will use the function felm().

Regression theory

In this section you will learn about linear regression theory and especially about OLS estimation.

Linear Regression

Our aim is to ascertain the relationships between the variables of our data frame. Above all, we want to know which variables affect the energy consumption of the residences.

This type of analysis is called regression (according to Sen and Srivastava (1990), p. 1). To be more precise, we formulate a linear equation which enables a good prediction of our dependent variable, i.e. the residential electricity or natural gas consumption, on the basis of one or more explanatory variables (according to Kohler and Kreuter (2012), pp. 247, 249-251). In our case, we will for example choose the variables code_change and logfeet as explanatory variables.

We will use the OLS estimator to estimate the unknown coefficients in our linear regression models. OLS estimation is a good method if certain conditions are fulfilled (according to Sen and Srivastava (1990), pp. 11-13). We will mention these conditions later.

Ordinary least squares (OLS) estimator

First, you will get further information about the OLS estimator. Take a look at the info box below. Then, we will learn about the assumptions the regression model has to fulfill in order to receive good estimates.

< info "OLS estimator"

The ordinary least squares estimator generates estimates by minimising the sum of squared residuals.

To understand the meaning of this we first set up a linear regression model of the form $$y = X \cdot \beta + \epsilon$$ where $y$ is the dependent or response variable and $X$ is a matrix containing the explanatory variables. Furthermore, $\beta$ is the vector of the true coefficients and the last parameter of this equation is the disturbance $\epsilon$.

Then, we give a definition of the residuals -- which are the estimated values of the disturbance $\epsilon$. It holds that $\hat\epsilon = y - \hat{y} = y - X \cdot \hat\beta$ with $\hat\beta$ being the estimate of the true coefficient vector $\beta$ and $\hat y$ being the predicted or fitted values of y.

Last, the formula for the OLS estimate is $$\hat\beta = \underset{\beta \in \mathbb{R}^{K+1}}{\arg\min}{\sum^T_{t=1} \hat\epsilon_t^2}$$ where $T$ is the number of observations, indexed by $t=1,...,T$, and $K+1$ is the number of regression coefficients.

Hence, the OLS estimator $\hat\beta$ minimises the sum of the squared residuals.
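If you want to see this minimisation at work, the following minimal sketch with simulated data (not part of our data set) computes the well-known closed-form solution $\hat\beta = (X'X)^{-1} X'y$ by hand and compares it with the estimates returned by lm():

# minimal sketch with simulated data (not part of our data set)
set.seed(1)
x = rnorm(100)
y = 2 + 3*x + rnorm(100)

X = cbind(1, x)                           # design matrix with an intercept column
beta_hat = solve(t(X) %*% X, t(X) %*% y)  # closed-form OLS solution (X'X)^(-1) X'y
beta_hat

# lm() minimises the same sum of squared residuals and returns the same estimates
coef(lm(y ~ x))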

reference: Kennedy (2008), pp. 12-13, Weerahandi (1995), pp. 14-17

>

If the regression model and its parameters fulfill the following five assumptions we consider the OLS estimator to be the optimal estimator -- that is, a consistent, unbiased and efficient estimator.

You can find more details in Kennedy (2008), pp. 12-18, 40-44. In particular, on pages 41 and 42 you can find more information about the five assumptions.

Let $T$ be the number of observations, indexed by $t=1,...,T$.

Assumption 1: The response variable $Y$ can be represented as a linear function of a set of $K$ explanatory variables (which are stored in the matrix $X$) and of a disturbance term $\epsilon$. Let's call the unknown coefficients $\beta_i$ with $i = 0,...,K$. These coefficients should be constant and they form the vector $\beta$. Hence, the formula of this regression model is $$Y = X \cdot \beta + \epsilon$$ with $Y = (y_1,...,y_T)$, $\epsilon = (\epsilon_1,...,\epsilon_T)$ and $X$ being a $\textrm{T} \times \textrm{(K+1)-matrix}$.

Assumption 2: The expected value of $\epsilon$ is zero. Thus, $E[\epsilon] = 0$.

Assumption 3: All $\epsilon_t$ have the same variance and are uncorrelated.

Assumption 4: The "observations on the independent variable can be considered fixed in repeated samples" (Kennedy (2008), p. 41).

Assumption 5: There are more observations than explanatory variables. Hence, it holds that $T > K$. And additionally, "there are no exact linear relationships between the independent variables" (Kennedy (2008), p. 42).

Violations of one of these assumptions lead to different problems and thus we may get an inconsistent or biased estimator. If you would like detailed information you can look at Kennedy (2008), pp. 41-44 and chapters 6 to 12.

Autocorrelated error terms

In this problem set we only take a closer look at violations of assumption 3. A typical violation of this assumption is if we have autocorrelated error terms. Autocorrelation means that "the disturbance term for one observation is correlated with the disturbance term for another observation" (Kennedy (2008), p. 113). Autocorrelated error terms cause an increase in the variance of our estimator (according to Sen and Srivastava (1990), p. 133) and thus our OLS estimator is no longer efficient (according to Andreß et al. (2013), pp. 109, 122-123).

In our case, the error terms are correlated because of clustering: all observations of one residence form a cluster. To be more precise, the energy consumption of a household is influenced by the housing characteristics of its residence and therefore the errors of one residence are correlated with each other (according to Kennedy (2008), pp. 112-113, 118-122, 128 and paper, p. 39).

You can find more information about cluster analysis in Everitt and Hothorn (2011), chapter 6.

The following graphic will visualise the different clusters which we can observe in our data.

But first of all, we have to load the data. The data frame which we will use in this exercise is called cluster_plot.dta. It has already been prepared in order to generate a nice plot. You may take a look at exercise 1.1 again; there you can find similar data preparations.

Press edit and afterwards check to load the data frame and to show the first few lines of it.

#< task 
# load the data
dat = read.dta("cluster_plot.dta")

# show a part of the data frame
head(dat)
#>

To exemplarily show the clusters in our data we will draw a scatterplot of the electricity consumption.

Hence, we use the variable elec. Additionally, we arbitrarily select some residences to obtain a plot with only a few observations. For that reason we use the command slice(dat,2400:3590). The remaining functions are used to generate the plot. We used all of these commands before. Thus, there will be no further explanations.

Just press check to see the graphic.

#< task_notest
# select an arbitrary part of the data frame
dat_clust = slice(dat,2400:3590) 

# draw the plot 
c = ggplot(data=dat_clust, aes(x=home_id, y=elec)) 

c1 = c + geom_point(colour="red") + xlab("identification number of residence") + 
  ylab("electricity consumption") +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15))

c1
#>

You do not see uniformly scattered points but points which lie on vertical lines. Hence, you can observe the clusters. All observations of one residence form a cluster.

We handle this problem of autocorrelated error terms by using clustered standard errors (refer to paper, p. 39). You will get more details in the following part of this exercise.

Implementation of linear regressions in R

Now, we learn how to implement a linear regression in R. There are several functions we could use to estimate linear models. You may have already used the function lm() for this purpose. However, in this problem set we will use the function felm() out of the package lfe because our linear models have a lot of factors with many levels and additionally we are not interested in their estimates.

You can get further information about the lfe package at https://cran.r-project.org/web/packages/lfe/lfe.pdf. Additionally, you can find more details about the function felm() in the info box below.

< info "Function felm()"

In general, you can use the function felm() to run a linear regression. This function has additional features in comparison to the function lm(). An important argument of the function felm() is formula. Besides the response variable you can determine four additional features of the formula specification. These features are written as a four-part-formula separated by $|$. In case of a simple linear regression lm() and felm() have the same arguments and work similarly.

The following code shows how to use this function in R.

First, we load the package lfe. Then, we choose a simple linear regression which is part of a regression we use in a later exercise. We want to regress elec on code_change and logfeet. Hence, elec is the dependent or response variable and the others are the explanatory variables. Therefore, we write $elec \sim code_change + logfeet$. This term forms the first part of the four-part-formula. As additional argument of the function we specify the data frame. In our case it is called dat.

In the following code chunk you see the structure of the function felm(). We will omit the output this time. However, we will perform exactly the same regression in part a) of exercise 3.2.

# load the package
library(lfe)

# perform a simple linear regression
felm(elec ~ code_change + logfeet, data=dat)

Now, we will see one additional feature of the function felm().

In this problem set we will deal with regression models with many factors which have a lot of levels. Furthermore, we only use the factors for controlling the group effects but are not interested in their estimates. Hence, we use the function felm() instead of lm() because the former is able to project the factors out before performing the OLS estimation.

To illustrate this we expand our linear model and add some factors named controls. However, we don't want to estimate them. We implement this model in R by writing $elec \sim code_change + logfeet ~|~ controls$. Hence, the second part of the four-part-formula of the function felm() consists of factors which we want to project out.

In the following code chunk you see how to implement a linear regression with control variables -- which we want to project out -- in R.

# perform a linear regression with projecting out the factors 
felm(elec ~ code_change + logfeet | controls, data=dat)

You can find further information about the function felm() at http://finzi.psych.upenn.edu/R/library/lfe/html/felm.html and https://cran.r-project.org/web/packages/lfe/lfe.pdf.

>

We have to be careful when performing regressions with our data because our error terms are correlated, as we already mentioned in the first part of this exercise. Therefore, we use standard errors that are clustered at the residence level. Take a look at the following info box if you want to know how to deal with clustered standard errors when using the function felm().

< info "Clustered standard errors in felm()"

We have to specify another part of the four-part-formula of the function felm() in order to get clustered standard errors. In our case, we use the variable home_id for clustering. Therefore, we add this variable to the formula of the function felm(). The cluster option is the fourth part of the formula.

Hence, we have to insert $0$ in the third part. This part contains the instrumental variables and in this problem set we don't need them. Therefore, the third part is always zero. You can get more information about instrumental variables in Stock and Watson (2007), chapter 12.

In the following code chunk you see how to implement an expanded regression model -- with clustered standard errors at the residence level but without instrumental variables -- in R.

# regression with robust standard errors which are clustered at the residence level
felm(elec ~ code_change + logfeet | controls | 0 | home_id, data = dat)

You can get further information at http://finzi.psych.upenn.edu/R/library/lfe/html/felm.html and https://cran.r-project.org/web/packages/lfe/lfe.pdf.

>

To sum up, the R-function felm() consists of several arguments. One important argument is the formula specification (formula) which contains the response variable in the first place. Then, a four-part-formula follows. However, these four parts got mixed up. Answer the question of the following quiz and arrange the four parts of the formula correctly.

We find the following situation:

< quiz "function felm"

question: Bring the four parts in the correct order! Write the correctly ordered numbers without blanks in the response field.
answer: 4213
roundto: 1

success: Great, your answer is correct!
failure: Try again.

>


< award "Quiz master - function felm()"

Great! You are familiar with the R-function felm(). You know that the structure of the formula is the following: felm(response variable ~ ordinary explanatory variables | controls | IV | clust, data frame). Hence, you are able to use this function to implement linear regressions in R.

>

This exercise refers to page 39 of the paper.

Exercise 3.2 -- Simple regression model based on our data

In this exercise we will set up simple regression models of our data. These regression models are the basis of the regression models of the following exercises. Hence, these models will be expanded later. Additionally, we will learn how to interpret regression results. Last, we will also visualise some of the regression results.

We will divide this exercise into two parts and analyse the effects of the code change on electricity and natural gas consumption separately.

Before we set up a regression model of our data we have to load the data. The data frame is called CCC_regression1.dta and we will also use this data frame in exercises 4.1 and 4.2. The abbreviation CCC stands for code change comparisons which briefly describes the type of analyses we perform in exercise 4. The data frame CCC_regression1.dta contains $12$ different columns. We need these variables to set up the regression model 1 which you will find in exercise 4.1. In this exercise we only need the variables elec, gas, code_change and logfeet.

First, press edit and then check to load the data and to show a part of the data frame.

#< task
# load the data
dat = read.dta("CCC_regression1.dta")

# show a part of the data frame
head(dat)
#>

In part a) of this exercise we will analyse the effects of the code change on electricity consumption and in part b) we will focus on residential gas consumption.

a) Electricity consumption

We start with a simple regression model. We want to find out how the code change affects the residential electricity consumption. Hence, we regress elec on code_change. The regression model looks like $$Y_{it} = \beta_0 + \delta \cdot code_change_i + \epsilon_{it}$$ with the dependent variable $Y_{it}$ being the monthly electricity consumption. The index $i$ indicates the residence and $t$ denotes the month and year of the billing record. $\beta_0$ is an intercept and $\epsilon_{it}$ the error term. The variable code_change is the only explanatory variable in this regression.

Task: Fit this linear model by using the function felm() out of the package lfe.

Because we use the function felm() for the first time we have to load the package lfe with the command library() at first.

#< task
# load the package
library(lfe)
#>

Run the simple regression and save the results as reg1_elec. Then, show the regression results with the command summary().

You can get more information about the function summary() at https://www.rdocumentation.org/packages/lfe/versions/1.5-1045/topics/summary.felm.

Press check to perform the regression and to show the results.

#< task
# run the regression
reg1_elec = felm(elec ~ code_change, data=dat)

# show the regression results
summary(reg1_elec)
#>

Here you can see the results of our first simple linear regression. We want to interpret these results now. We first take a look at the coefficient estimate of the variable code_change along with its significance level. The coefficient is called $\delta$ and thus we denote the estimate by $\hat\delta$. Hence, we find $\hat\delta = -10.36$ which means that post-code residences consume roughly $10.4$ kWh per month less electricity than pre-code buildings.

Let us remember exercise 2.1: among others, we calculated the differences of the mean values of the energy consumption of the pre- and post-code residences. The difference in the electricity consumption should be the same as the value of $\hat\delta$. However, the difference computed in exercise 2.1 was roughly $-44.85$ kWh. This does not result from a mistake we made but from the fact that we used a different data frame in exercise 2.1. When taking the same data frame in both exercises we will get the same results. In the following info box you can find the corresponding calculations.

< info "Repeated calculations of the results of exercise 2.1 -- using different data now"

We will perform a part of the calculations of exercise 2.1 again. Our aim is to show that the calculated differences of the electricity and natural gas consumption for pre- and post-code residences correspond (up to the sign) to the estimated coefficients of the regressions of this exercise.

In exercise 2.1 we used a data frame which only contained billing data from the year $2006$. Now, we use the complete data frame with billing data from $2004$ to $2006$.

We have already loaded the corresponding data frame in this exercise and denoted it with dat. Hence, we can directly start with the calculations. Press check to show the result.

# repeating calculations of exercise 2.1 with the complete data frame dat

# form pre-code group and post-code group
pre_code = filter(dat, code_change == 0)
post_code = filter(dat, code_change == 1)

# calculate mean values
beforeCC = summarise_each(pre_code, funs(mean))
afterCC = summarise_each(post_code, funs(mean))

# calculate the difference of electricity consumption
diff_elec = beforeCC$elec - afterCC$elec
diff_elec

We find that diff_elec is roughly $10.36$. Since diff_elec is computed as the pre-code mean minus the post-code mean, this is consistent with the result $\hat\delta = -10.36$.

We also show the result for the differences of the natural gas consumption. You will find the regression results in part b).

# calculate difference of gas consumption
diff_gas = beforeCC$gas - afterCC$gas
diff_gas

We find a difference in the residential gas consumption of roughly $5.64$ therms. Anticipating the results of reg1_gas in part b): the coefficient estimate for code_change is approximately $-5.64$. Thus, these results are consistent, too (again with the opposite sign because of the order of subtraction).

>

To sum up, the coefficient estimate of code_change indicates a negative effect on the electricity consumption but it is not significant. Hence, we cannot be confident that the true effect differs from zero. To be more exact, we cannot reject the null hypothesis that the true coefficient of code_change is equal to zero (refer to Stock and Watson (2007), pp. 149-155).

Generally, we can interpret the regression coefficient $\beta_1$ in a simple regression model like $Y = \beta_0 + \beta_1 \cdot X_1 + \epsilon$ in the following way: if the explanatory variable $X_1$ increases by one unit, then the dependent variable $Y$ changes on average by $\beta_1$ units.

When we expand the simple regression model to a multiple regression model we only have to add a supplement to the interpretation which says "holding the other regressors constant" (according to Stock and Watson (2007), pp. 193-194).

However, the interpretation of our previous estimate $\hat\delta$ is different because the variable code_change is a dummy variable and not a continuous variable. Hence, we cannot refer to $\hat\delta$ as a slope. Instead, $\hat\delta$ describes the difference in the sample means of elec for the two groups (according to Stock and Watson (2007), pp. 158-159). Therefore, we can interpret $\hat\delta$ as the difference in the electricity consumption of pre- and post-code residences in our data set.

Now, we continue with the interpretation of the regression results and we take into account the coefficient of determination ($R^2$).

Coefficient of determination

We are mainly interested in the values of the adjusted and the multiple $R^2$ of the full model. Regarding the results of reg1_elec: both values are close to $0$ which means that the explanatory variable code_change explains roughly $0\%$ of the variance of the dependent variable, which is the electricity consumption in this model (according to Stock and Watson (2007), p. 125). To sum up, we can say the model doesn't fit the data well. You can find more information about the coefficient of determination in the info box below.

< info "(Adjusted) coefficient of determination"

When estimating the coefficients of a regression model you're not only interested in the resulting estimates but also in the goodness of fit, that is how well the data is predicted by your estimates. You can interpret the goodness of fit visually as how well the regression line fits the data.

The $R^2$ can take values between $0$ and $1$, although these two boundary values are typically not attained. A value close to $1$ indicates that the explanatory variables predict the dependent variable $Y_t$ well. This means "that the variance of the OLS residual is small compared to the variance of the dependent variable" (Stock and Watson (2007), p. 238). However, if the $R^2$ takes a value close to $0$ the opposite holds true.

The $R^2$ can be mathematically defined in two different ways. First of all, let us specify some notations: the different observations are denoted by $t=1,...,T$, $\hat{Y}_t$ are the predicted values of the dependent variable $Y_t$ and $\bar{Y}$ is the average of $Y_t$. Thus, the coefficient of determination can be defined as $$R^2 = \frac{ESS}{TSS} = \frac{\sum^T_{t=1}(\hat{Y}_t-\bar{Y})^2}{\sum^T_{t=1}(Y_t-\bar{Y})^2}$$ with $\textrm{ESS}$ being the explained sum of squares and $\textrm{TSS}$ the total sum of squares (according to Stock and Watson (2007), pp. 123-124). Hence, you can describe the $R^2$ as being the "fraction of the sample variance of $Y_{[...][t, A/N]}$ explained by (or predicted by) the regressors" (Stock and Watson (2007), p. 200). An alternative but equivalent definition is: $$R^2 = 1 - \frac{SSR}{TSS} = 1 - \frac{\sum^T_{t=1}\hat{\epsilon}^2_t}{\sum^T_{t=1}(Y_t-\bar{Y})^2}$$ with $\textrm{SSR}$ being the sum of squared residuals (according to Stock and Watson (2007), pp. 123-124). This means that "the $R^2$ is 1 minus the fraction of the variance of $Y_{[...][t, A/N]}$ not explained by the regressors" (Stock and Watson (2007), p. 200).

The problem with the $R^2$ in multiple regression is that its value increases as soon as a new explanatory variable (with an estimated coefficient different to $0$) is added to the regression model. Hence, we use the adjusted $R^2$ to avoid this problem.

The adjusted $R^2$ is always smaller than the $R^2$ and can also be negative. Mathematically you have to correct the $R^2$ by the factor $\frac{T-1}{T-K-1}$ to get a formula for the adjusted $R^2$, where $K$ is the number of explanatory variables. The exact formula of the adjusted $R^2$ is $$adj.~R^2 = 1 - \frac{T-1}{T-K-1} \cdot \frac{SSR}{TSS}.$$ You can find more details in Stock and Watson (2007), p. 201.
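The following small sketch (with simulated data, not our data frame) shows how the $R^2$ and the adjusted $R^2$ could be computed by hand according to these formulas and compares them with the values reported by summary():

# small illustration with simulated data (not our data frame)
set.seed(1)
x = rnorm(50)
z = rnorm(50)
y = 1 + 2*x + rnorm(50)
fit = lm(y ~ x + z)

T = length(y)                        # number of observations
K = 2                                # number of explanatory variables
SSR = sum(residuals(fit)^2)          # sum of squared residuals
TSS = sum((y - mean(y))^2)           # total sum of squares

1 - SSR/TSS                          # R squared
1 - (T-1)/(T-K-1) * SSR/TSS          # adjusted R squared

# both values are also reported by summary()
summary(fit)$r.squared
summary(fit)$adj.r.squared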

However, do not solely rely on this measure when checking for a good regression model.

reference: Kennedy (2008), pp. 13-14, 26-28, 79-80, Stock and Watson (2007), pp. 123-124, 200-202, 237-238.

>

To conclude, the simple regression model reg1_elec is not good enough to explain the effect of the code change on the residential energy consumption, as the coefficient estimate is not significant and the value for the (adjusted) coefficient of determination is close to $0$.

One reason for these results is that there are further important factors that influence the response variable and we haven't considered these factors in our model yet (according to Kohler and Kreuter (2012), pp. 262-263). In our case, the residential energy consumption is for example also affected by the size of the residence. Omitted variables which influence the response and are correlated with the included regressors cause biased estimates (according to Sen and Srivastava (1990), pp. 233-238 and Andreß et al. (2013), pp. 98-99). Hence, we have to expand our simple regression model with additional explanatory variables.
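To make the problem of omitted variables concrete, here is a small simulation sketch (with made-up data, not our data frame): a variable size drives the outcome and is correlated with the variable of interest, so omitting it biases the estimate.

# small simulation of omitted variable bias (made-up data)
set.seed(1)
size = rnorm(1000)                        # important driver of the outcome
treat = rbinom(1000, 1, plogis(-size))    # treatment is more likely for small units
y = 10 - 2*treat + 5*size + rnorm(1000)   # true treatment effect is -2

coef(lm(y ~ treat))           # omitting size biases the estimate of treat
coef(lm(y ~ treat + size))    # controlling for size recovers roughly -2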

A linear regression model with several explanatory variables is called multiple regression model (according to Kohler and Kreuter (2012), p. 262).

Therefore, we add an additional explanatory variable to our simple regression model.

Adding a second explanatory variable

We choose the variable logfeet. We already studied this variable in exercise 1.2. It describes the size of a residence and thus likely affects the electricity consumption of the residence.

The new regression model is of the form $$Y_{it} = \beta_0 + \delta \cdot code_change_i + \gamma \cdot logfeet_i + \epsilon_{it}.$$

Before running the regression you should solve the following task and calculate the correlation of the two variables code_change and logfeet. Use the function cor().

The correlation measures the linear relationship of these two variables and is independent of measurement units. It lies between $-1$ and $1$, indicating a negative or positive relationship, respectively (according to Everitt and Hothorn (2011), p. 14 and Sen and Srivastava (1990), p. 285).

The following result will help you to answer the questions of the quiz below. The code is already given. Just press check.

#< task
# calculate the correlation of code_change and logfeet
cor(dat$code_change,dat$logfeet)
#>

We find a value of roughly $-0.023$. Thus, the correlation between these two variables is negative.

Answer the questions of the following quiz.

< quiz "effect of adding variable logfeet"

question: What do you expect? Two answers are correct.
mc:
- The variable logfeet influences the residential electricity consumption positively, i.e. larger residences consume more electricity than smaller ones.
- The direct influence of the variable logfeet on the residential electricity consumption is negative.
- Hence, the estimate of code_change will be more negative in the expanded regression model.
- Hence, we will get a more positive estimate of code_change in the expanded regression model.

success: Great, all answers are correct!
failure: Not all answers correct. Try again.

>


If you haven't solved the previous task press edit first.

Now, we want to examine our expectations and run the extended regression.

We will call the expanded regression reg2_elec. Afterwards, we will show the regression results with the command summary(). Click on the button check to solve this task.

#< task
# run the regression
reg2_elec = felm(elec ~ code_change + logfeet, data=dat)

# show the regression results
summary(reg2_elec)
#>

< award "Master of regression theory"

Great! Now, you know a lot of regression theory: especially about simple and multiple linear models, about OLS estimation and about the assumptions to receive a good estimator. Additionally, you learned about autocorrelated error terms and how to interpret regression results.

>

We find the following results: the coefficient estimate of code_change is positive now and takes a value of roughly $8.72$. However, it is not significant. The coefficient estimate of the variable logfeet is approximately $1113.1$ and highly significant (at the $0.1\%$ level). The explanation for these results -- especially for the increase of the estimate $\hat\delta$ -- is the following: the residences which were built after the code change are smaller and thus consume less electricity. In regression reg1_elec the coefficient estimate of code_change also captured this effect of the smaller post-code residences; once we control for the residence size via logfeet, this effect is removed and the estimate turns positive.

When we consider the (adjusted) coefficient of determination we find a value of roughly $0.25$. Hence, its value increased in comparison to the simple regression model.

We can conclude that the expanded regression model is better compared to the simple regression model. However, the expanded model is not good enough yet because the estimate of code_change is still not significant and even has a positive sign. Therefore, we have to add further control variables to our regression model. You will find the complete regression model in exercise 4.1.

b) Natural gas consumption

Now, we will analyse the effects of the code change on residential gas consumption. In this part of the exercise we will proceed similarly to part a). However, you have to solve some tasks on your own.

The code of the simple linear regression is already given. The only difference from the regression model of part a) is that we now use the dependent variable gas. Press check to show the results.

#< task
# run the simple regression
reg1_gas = felm(gas ~ code_change, data=dat)

# show the results
summary(reg1_gas)
#>

We find a highly significant coefficient estimate of the variable code_change. It takes the value $-5.64$. However, the $R^2$ is roughly $0.007$, which is very low.

Therefore, we add a further variable to our regression model.

Adding a second explanatory variable

We choose the variable logfeet again.

Task: Fit the expanded linear model with the function felm(). Replace the ??? with correct commands and uncomment this line. Then, press check.

#< task
# enter your code here...
# reg2_gas = ???
#>
reg2_gas = felm(gas ~ code_change + logfeet, data=dat)

< award "Regression master level 1"

Great! You are able to perform a linear regression. Note: you can either use the function felm() or lm() to fit linear models with only a few explanatory variables.

>

Task: Show the results of the regression reg2_gas. Enter your commands and press check afterwards.

#< task
# enter your code here...
#>
summary(reg2_gas)

We find $\hat\delta = -5.07$. This estimate is significant at the $0.1\%$ level. In comparison to the first regression it is a little bit more positive. Furthermore, the coefficient estimate of logfeet is approximately $33.41$ and highly significant.

Now, we will show the regression results in a different form.

Visualisation of regression results -- function effectplot()

We choose a graphical presentation using the function effectplot() out of the package regtools. This function is especially useful when the explanatory variables are measured in different units. It shows the strength of the impact of the explanatory variables on the response variable. The individual effects shown in the plot are the effects if an explanatory variable changes, ceteris paribus, from its $10\%$ quantile to its $90\%$ quantile.

A $10\%$ quantile divides the data in two parts such that $10\%$ of the data lie below this value and $90\%$ above (according to Kohler and Kreuter (2012), pp. 173-174).
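If you would like to compute such quantiles yourself, you could for example use the function quantile(). The following line is just an illustration (not one of the required tasks) and uses the data frame dat loaded above:

# the 10%, 50% and 90% quantiles of logfeet in our data frame
quantile(dat$logfeet, probs = c(0.1, 0.5, 0.9))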

However, if the explanatory variable is a dummy variable, the visualised effects come from a change from value $0$ to $1$. The function effectplot() also shows the different signs of the coefficient estimates of the explanatory variables in different colours.

You can find more details about the function effectplot() at https://github.com/skranz/regtools/blob/master/man/effectplot.Rd.

Task: Load the package regtools and show the results of the regression reg2_gas with the function effectplot(). Press check to show the results.

#< task
# load the package 
library(regtools)

# show the regression results graphically
effectplot(reg2_gas)
#>

Now, we can better compare the impact of the two explanatory variables on the natural gas consumption. First, we see that the greatest impact results from the variable logfeet. Second, the influence of logfeet is positive and the influence of code_change negative. Furthermore, we can interpret the effects of the explanatory variables more precisely. Take a closer look at the graphic: below the variable names you can find the $10\%$, $50\%$ and $90\%$ quantile of each variable. On the basis of these values we can determine the impact of the explanatory variables on the response variable.

For the variable logfeet the following holds: if logfeet changes from $7.19$ (i.e. its $10\%$ quantile) to $8.05$ (i.e. its $90\%$ quantile) then the natural gas consumption increases by $28\%$ per month.
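As a rough plausibility check, you can reproduce the magnitude of this effect from the regression output: the coefficient estimate of logfeet multiplied by the difference between the two quantiles gives roughly $28$. The numbers below are simply taken from the output above.

# rough check of the logfeet effect shown in the plot
gamma_hat = 33.41          # coefficient estimate of logfeet from reg2_gas
q10 = 7.19                 # 10% quantile of logfeet
q90 = 8.05                 # 90% quantile of logfeet
gamma_hat * (q90 - q10)    # roughly 28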

Interpret the effect of the variable code_change by solving the following quiz.

< quiz "interpretation of effectplot"

question: Denote the correct statement.
sc:
- The natural gas consumption increases by 28 percent per month if the variable code_change changes from 0 to 1.
- The natural gas consumption decreases by 5.1 percent per month if the variable code_change changes from 0 to 1.*
- The natural gas consumption increases by 5.1 percent per month if the variable code_change changes from 0 to 1.
- The variable code_change decreases by 5.1 percent if the natural gas consumption changes from 0 to 1.

success: Great, your answer is correct!
failure: Not correctly answered. Try again.

>


< award "Effect plotter"

Great! You can show the results of a regression with the function effectplot() and you additionally are able to interpret the plot correctly.

>

Exercise 4 -- Regression analysis: pre- and post-code change comparisons

In this exercise we will perform several regression analyses with our data to determine the effects of the building code change on the actual energy consumption of Gainesville residences. The authors of the paper used two different empirical strategies to analyse the data. This exercise deals with the first strategy and exercise 5 with the second one. Note that exercise 4 reproduces the main results of the article. Especially, the results of exercise 4.1 are essential (according to paper, p. 47).

The aim of this exercise is to calculate estimates for the differences in electricity and natural gas consumption between the residences built before and after the code change. Hence, we perform a comparison of the pre- and post-code buildings and produce the results with several regression analyses.

In exercise 3.1 we already learned a bit about regression theory and about the implementation of regressions with the function felm(). In exercise 3.2 we set up a simple regression model with our data and determined that this simple regression model does not have enough explanatory power. Hence, we need to set up a multiple linear regression model to fit our data better.

In this exercise we will formulate several multiple regression models and estimate the differences in the energy consumption of pre- and post-code residences by using ordinary least squares estimation. To be more precise, in exercise 4.1 we will calculate estimates for the annual differences in the energy consumption of pre- and post-code residences. Then, in exercise 4.2 we will test the regression model of exercise 4.1 for robustness. In exercise 4.3 we will consider an expanded version of the regression model of exercise 4.1. With help of this alternative specification we will analyse the effects of the code change on different months. Last, we will discuss possible limitations of our chosen regression models.

Exercise 4.1 -- Estimating annual differences in energy consumption

In this exercise we will calculate the estimates for the annual differences in the energy consumption of pre- and post-code buildings. First, we will focus on the differences in the electricity consumption and then in part b) we will consider the natural gas consumption. We will use a regression analysis to determine the differences and additionally we will visualise the results of the regression.

Regression model -- model 1

At first, we set up our regression model which we denote by model 1. This model has the form $$Y_{it} = \delta \cdot code_change_i + \beta \cdot X_i + v_t + \epsilon_{it}$$ where $Y_{it}$ is the dependent variable with $i$ being the index of the residence and $t$ being a time index. In part a) the dependent variable is given by the monthly electricity consumption and in part b) by the monthly natural gas consumption. Furthermore, the time is the month and year of the billing record. Another parameter is $code_change_i$. It is an indicator variable and we already used this parameter in the previous exercises. As a reminder, $code_change_i$ is equal to $1$ if the residence has been constructed after the energy code change and $0$ otherwise. Moreover, $X_i$ contains our explanatory variables. In the first part of this exercise it includes the variables logfeet, centralair, shingled and dummy variables of the variables beds, baths, stories and zip. In part b) we will use different explanatory variables (as you will see later). A further parameter is the intercept $v_t$ which depends on the month and year of the billing records. The last parameter is the error term $\epsilon_{it}$.

To sum up, this model explains the differences in energy consumption by controlling for observable housing characteristics. Especially, the dummy variables for the zip code zip "account[...] for unobserved heterogeneity that is common among all residences within the same zip code" (paper, p. 41). Additionally, the intercept $v_t$ controls for the effects between the different months which influence all residences -- like changes in weather or in energy prices.

In the following info box you find information about dummy variables.

< info "Dummy variables -- definition, usage in regressions and implementation in R"

A dummy variable only takes the values $0$ and $1$. To be more precise, we can define a dummy variable as "an artificial variable constructed such that it takes the value unity whenever the qualitative phenomenon it represents occurs, and zero otherwise" (Kennedy (2008), p. 232).

We need dummy variables in a regression model to represent explanatory variables when they are "qualitative in nature" (Kennedy (2008), p. 232). However, we have to be careful when the regression model contains an intercept: to avoid perfect multicollinearity and to be able to compute the OLS estimator, we have to omit one of the dummy variables (according to Kennedy (2008), pp. 232-234).

Perfect multicollinearity occurs when one explanatory variable can be represented as a linear function of the other explanatory variables (according to Stock and Watson (2007), p. 204). You can get further information in Stock and Watson (2007), pp. 203-208.
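The following small sketch (with made-up data, not our data frame) illustrates what happens in case of perfect multicollinearity: R cannot estimate a separate coefficient for the collinear variable and reports NA instead.

# small illustration of perfect multicollinearity (made-up data)
set.seed(1)
x1 = rnorm(30)
x2 = 2 * x1                      # x2 is an exact linear function of x1
y  = 1 + x1 + rnorm(30)

coef(lm(y ~ x1 + x2))            # the coefficient of x2 is reported as NA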

You can implement a dummy variable in R by using the command factor(). If a variable has already been transformed into a factor you can see the different levels of this factor by using the function levels().

You will find an example in the following code chunk. We use the data frame CCC_regression1.dta which we already used in the previous exercise. To show the implementation of a dummy variable in R we choose the variable beds.

# load the data 
dat = read.dta("CCC_regression1.dta")

# create a factor or dummy variable
Dbeds = factor(dat$beds)

# show the levels of the dummy variable Dbeds
levels(Dbeds)

In this case, there are four levels. For each value of the variable beds one dummy variable has been created. Let us denote them by $\textrm{Dbed2}$ to $\textrm{Dbed5}$. The variable $\textrm{Dbed2}$ takes the value $1$ if the residence has $2$ bedrooms and $0$ otherwise. Similarly for $\textrm{Dbed3}$: it is $1$ if the building has $3$ bedrooms and $0$ otherwise, and so on.
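If you would like to see which dummy columns R would actually create in a regression, you can (as a small illustration, not one of the required tasks) look at the design matrix with the function model.matrix(). Note that the first level is dropped as the reference category, which is exactly the omission of one dummy variable mentioned above:

# the dummy columns R creates from the factor beds (first few rows only)
head(model.matrix(~ factor(beds), data = dat))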

You can find further information at https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html and https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html.

>

We will estimate our regression model 1 by using OLS estimation. We are mainly interested in the estimate of $\delta$ because it states the annual difference in the energy consumption between pre- and post-code residences. In part a) of this exercise we regress on the electricity consumption and in part b) on the consumption of natural gas.

We expect estimates of $\delta$ being smaller than zero. This would indicate that the building code change leads to a reduced energy consumption in our sample.

Before we can estimate the parameters of our regression model we have to load the data. As you have already loaded several data frames, the code is given: click on the button edit first and then press check. We will also see a part of the data frame.

The required data frame for this exercise is called CCC_regression1.dta. We already used this data frame in the previous exercise and in the info box above. This data frame contains all variables which we need in order to run the following regressions. We will also use this data frame in exercise 4.2.

#< task
# load the data set CCC_regression1.dta and assign it to dat
dat = read.dta("CCC_regression1.dta")

# show a part of the data frame
head(dat)
#>

We will use the function felm() to perform the regression analysis. We have already loaded the package lfe in exercise 3.2 and thus it's not necessary to load it again.

We will perform our regression analysis in two steps. First, we look at the electricity consumption and in part b) we consider the natural gas consumption.

a) Electricity consumption

Now, we will perform the first regression analysis of this problem set. We want to calculate the estimates for the annual differences in electricity consumption between the residences built before and built after the code change. You can find the equation of regression model 1 at the beginning of this exercise. You may take a look at this formula again.

Regression model 1 -- application to our data and implementation in R

On the basis of this model we will set up the regression formula by using the specific variables of our data frame. We regress elec on code_change. Furthermore, we include eight additional variables which also affect our dependent variable elec in order to get a precise and unbiased OLS estimator. We estimate the effect of the regressor code_change on the electricity consumption by holding those other eight variables constant. Hence, we can refer to these variables as control variables (according to Stock and Watson (2007), pp. 186, 193-194). The first three control variables are logfeet, centralair and shingled. The other five variables -- baths, beds, stories, zip and monthyear -- have to be transformed into dummy variables by using the function factor(). However, we are not interested in the estimates of the five dummy variables. Thus, we will project them out. Therefore, we write the dummy variables in the second part of the four-part-formula of the function felm(). Note that in this part of the formula the variables are automatically treated as factors and thus we don't have to write $factor(baths), factor(beds), factor(stories)$ and so on. It suffices to write $baths, beds, stories$ and so on.

Hence, the complete regression formula is $$elec \sim code_change + logfeet + centralair + shingled ~|~ baths + beds + stories + zip + monthyear.$$ We additionally use clustered standard errors. Therefore, we have to add the variable home_id in the fourth part of the formula of the function felm(). However, we don't need instrumental variables and thus the third part contains $0$ as an argument.

Task: Run the regression and save the results in the variable reg1a. Then, show the results with the command summary().

The code is already given. Just press check.

#< task
# run the regression
reg1a = felm(elec ~ code_change + logfeet + centralair + shingled | baths + beds +
               stories + zip + monthyear | 0 | home_id, data=dat)

# output of the results
summary(reg1a)
#>

You can see the results of our first linear regression. We will interpret these results now.

Interpretation of regression results

Before taking a look at the estimated coefficients we take the coefficient of determination into account. Remember, we already analysed different values of the $R^2$ in exercise 3.2 and we mentioned that the $R^2$ will increase if we add new explanatory variables. Hence, we consider the adjusted $R^2$ to avoid this problem. You can find the exact formula in the corresponding info box of exercise 3.2.

In regression model reg1a both measures -- the adjusted $R^2$ and the $R^2$ of the full model -- are very similar to each other and roughly take the value $0.499$. This means that the explanatory variables explain approximately $50\%$ of the variance of the dependent variable (according to Stock and Watson (2007), p. 125). To sum up, we can say the explanatory variables of our model predict the electricity consumption well.

Then, we take a look at the different coefficient estimates along with their significance niveaus. Two of the estimates are significant which is indicated by the stars printed next to the p-values.

Explanatory variable code_change

One of the coefficients is the estimate for $\delta$ which we denote by $\hat\delta$. It is the coefficient of the variable code_change and thus of our main interest. To be more precise, $\hat\delta = -47.62$ at a $5\%$ significance level. For a correct interpretation we have to consider both the sign and the magnitude of the estimate. To sum up, the estimate $\hat\delta$ indicates that post-code residences consume monthly about $47.6$ kWh less electricity than pre-code buildings, when holding all other regressors constant.

The last term "when holding all other regressors constant" has to be added when we interpret the regression results of a multiple regression. We already mentioned this in exercise 3.2. However, in the following part, we won't indicate this term explicitly at every interpretation.

A possibility to better compare the results is to transform the estimates into percentage differences. We convert the estimate $\hat\delta = -47.62$ and thus get a result of approximately $-4\%$. Hence, we find a $4\%$-decrease in the monthly electricity consumption of Gainesville households caused by the building code change.

Take a look at the following info box if you are interested in the computation of percentage differences.

< info "Exemplarily calculation of percentage difference"

This info box shows an example calculation of a percentage difference. We will transform our estimate $\hat\delta$ into a percentage difference. To be more precise, $\hat\delta$ is the coefficient estimate of the variable code_change and the variable elec is the dependent variable of the regression model.

You can see the calculation in the following code chunk. We have to divide delta_hat by mean_elec and then multiply this value by $100\%$.

A positive value indicates an increase and on the opposite a negative value means a decrease.

delta_hat = -47.62
mean_elec = 1146.4

# transform the estimate delta into a percentage difference
delta_pd = delta_hat/mean_elec * 100  # in percent
delta_pd

You see the result, which is roughly $-4\%$, i.e. a decrease of about $4\%$. It coincides with the value mentioned above.

>

Explanatory variable logfeet

The other significant OLS estimate is the coefficient for logfeet. This estimate is positive and highly significant at a level of $1\%$. This means in general that "larger residences consume more electricity" (paper, p. 39).

Now, we want to interpret this coefficient estimate in more detail. We cannot use the interpretation rule from exercise 3.2 because in this case the explanatory variable is in logarithm. Hence, we state a new rule.

Let us assume a simple regression model of the form $Y = \beta_0 + \beta_1 \cdot ln(X_1) + \epsilon$. Thus, we can generally interpret the regression coefficient $\beta_1$ in the following way: if $X_1$ increases by $1\%$, then $Y$ changes on average by approximately $0.01 \cdot \beta_1$ units.

Let's come back to our data and regression results. The coefficient estimate for the variable logfeet takes the value $958.94$. This means, if the square footage of a building increases by $1\%$, then the electricity consumption of this household will increase by roughly $9.6$ kWh per month. We can also consider a $10\%$ increase in the square footage. Then, the electricity consumption of this household would increase by roughly $96$ kWh per month or by $8.3\%$ (according to paper, p. 39).
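As a small illustration of the arithmetic behind this interpretation (using the estimate reported above):

# arithmetic behind the interpretation, using the reported estimate
gamma_hat = 958.94
gamma_hat * 0.01    # effect of a 1% increase in square footage: about 9.6 kWh per month
gamma_hat * 0.10    # effect of a 10% increase: about 96 kWh per month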

We additionally want to visualise the results of regression reg1a.

Visualisation of regression results

We will draw a bar plot with the function effectplot(). We already used this function in exercise 3.2. Thus, just press check to show the graphic.

#< task
# show the regression results graphically
effectplot(reg1a)
#>

You find a bar plot with four bars where one bar represents one explanatory variable. Answer the questions of the following quiz. Hint: three statements are correct.

< quiz "effect of coefficients of reg1a"

question: You see the effects of the explanatory variables of regression reg1a. Which statements are true?
mc:
- If code_change changes from 0 to 1 then residential electricity consumption decreases by 48 percent per month.
- The influence of the variable logfeet on the residential electricity consumption is negative.
- The variable logfeet influences the residential electricity consumption positively and its influence is greatest in comparison to all other variables.
- The 10 percent quantile of logfeet is 7.55.
- We find information about the significance niveaus in the graphic.
- Three explanatory variables are dummy variables.*

success: Great, all answers are correct!
failure: Not all answers correct. Try again.

>

b) Natural gas consumption

Now, we will run the next regression to determine the estimates for the annual differences in natural gas consumption between pre- and post-code residences.

Hence, there is only one change in the regression formula, namely the dependent variable is gas now instead of elec.

Press check to perform the regression. The results will be saved in the variable reg1b.

#< task
# run the regression
reg1b = felm(gas ~ code_change + logfeet + centralair + shingled | baths + beds +
               stories + zip + monthyear | 0 | home_id, data=dat)
#>

Before we print the regression results we will show the results graphically.

Visualisation of regression results

Task: Use the function effectplot() to show the regression results graphically. Enter your code and then press check.

#< task
# enter your code here...
#>
effectplot(reg1b)

The bar graph looks very similar to the one in part a). Two explanatory variables have a positive influence and the other two have a negative impact on the gas consumption. Furthermore, the ranking of the impact on the dependent variable is the same. The variable logfeet has the greatest impact and code_change the smallest.

We will also print a table which contains the exact regression results.

Comparison of regression results with screenreg()

This time we won't use the function summary(). Instead, we will show the results of both regressions reg1a and reg1b next to each other in order to compare them. For this purpose, we use the function screenreg() out of the package texreg.

First, we have to load this package. Then, we modify the arguments of the function screenreg(). We specify the regression models with the command list(regression1, regression2,...). Furthermore, we set individual names for the regression models (instead of Model 1 and Model 2) by using the argument custom.model.names. The arguments omit.coef and digits are explained directly in the code box. The last argument is stars = c(0.01, 0.05, 0.10). It changes the significance levels. We will change the default significance levels because we want to use the same levels as the authors used in the paper. Hence, it will print $\textrm{***}$ next to the coefficients where the p-value is below $0.01$, $\textrm{**}$ when the p-value is between $0.01$ and $0.05$ and $\textrm{*}$ if the p-value is greater than $0.05$ but lower than $0.1$.

You can find more information about the package texreg and its functions at https://cran.r-project.org/web/packages/texreg/texreg.pdf.

Click on the button check to show the resulting table which contains the regression outputs of reg1a and reg1b.

#< task
# load the package
library(texreg)

# output regression results
screenreg(list(reg1a, reg1b), custom.model.names = c("Electricity","Natural gas"),
          # output without intercept
          omit.coef = "(Intercept)", 
          # round numbers to 3 decimal places
          digits = 3, 
          # define different significance niveaus
          stars = c(0.01, 0.05, 0.10))
#>

This table reports the results of our two regressions of model 1. In the model reg1a we used the dependent variable elec and in reg1b we regressed on the variable gas. The clustered standard errors are written in brackets underneath the estimates.

When comparing both regressions we can see that the respective estimates have the same signs. We are mainly interested in the estimates of the variable code_change which are negative in both regression models.

However, we will now take a closer look at the results of regression reg1b for natural gas consumption.

Interpretation of regression results reg1b

The $R^2$ and the adjusted $R^2$ are again equal and take the value $0.452$. This value is a little bit lower compared to our first regression. However, it also shows that the model and the data match well (according to paper, p. 39). In this case, the explanatory variables explain roughly $45\%$ of the variance of the natural gas consumption (according to Stock and Watson (2007), p. 125).

All four estimates are statistically significant, but the estimates of centralair and shingled are only weakly significant and hence not very reliable (according to paper, pp. 39-40). Therefore, we focus on the coefficient estimates for the variables code_change and logfeet.

Answer the questions of the following two quizzes and you will get a detailed interpretation of these two estimates.

< quiz "regression results - code_change"

question: Denote the correct statement.
sc:
- Post-code residences consume roughly 1.5 therms more natural gas per month than pre-code buildings. This result is significant at the 5% niveau.
- Post-code residences consume roughly 1.5 therms less natural gas per month than pre-code buildings. This result is significant at the 5% niveau.*

success: Great, your answer is correct!
failure: Try again.

>


To sum up, the coefficient estimates for the variable code_change in reg1a and reg1b show that the building code change leads to a reduction in electricity and natural gas consumption. To be more exact, the code change causes roughly a $4\%$ monthly decrease in electricity consumption and a reduction of $1.5$ therms per month in natural gas consumption. Translated into percentage difference, this is a decrease of $6.4\%$ in natural gas consumption per month. If you don't remember how to transform the estimate into percentage difference look at the corresponding info box of part a).

< quiz "regression results - logfeet"

question: Complete the following sentence. If the square footage of a residence increases by 10 percent, then the natural gas consumption will increase by ... therms per month. Round the number to 2 decimal places.
answer: 2.98
roundto: 0.01

success: Great, your answer is correct!
failure: Try again.

>

< award "Regression master level 2"

Great! You are able to interpret the results of a multiple regression correctly!

>

This exercise refers to pages 39 and 40 and to table 3 of the paper.

Exercise 4.2 -- Robustness check of regression model 1

You must have solved exercise 4.1 to be able to solve this exercise.

In this exercise we perform a robustness check of both regressions reg1a and reg1b of the previous exercise.

A robustness check is a test "where the researcher examines how certain 'core' regression coefficient estimates behave when the regression specification is modified by adding or removing regressors" (Lu and White (2013), p. 194). Core variables are the regressors which are of main interest for us. "If the coefficients are plausible and robust, this is commonly interpreted as evidence of structural validity" (Lu and White (2013), pp. 194-195). Hence, a robustness check tests if the results remain similar although the model has been slightly modified. If this holds true we can conclude that this model is robust against changes.

Before we test for robustness we have to load the data. We choose the same data frame as in the previous exercise. This data frame is called CCC_regression1.dta.

First, press edit and then check to load the data. Again, we denote the data frame by dat.

#< settings
import.var = list("3.1.1"="reg1a", "3.1.1"="reg1b")
#>

#< task
# load the data set CCC_regression1.dta and assign it to dat
dat = read.dta("CCC_regression1.dta")
#>

As a reminder, the formula of the regression model of the previous exercise (model 1) is $$Y_{it} = \delta \cdot code_change_i + \beta \cdot X_i + v_t + \epsilon_{it}.$$

Now, we vary this model and allow "time effects to differ by each zip code" (paper, p. 39). This means we no longer use dummy variables for the variable monthyear (which is represented by $v_t$) and for the variable zip (which is an element of $X_i$) separately. Instead, we interact the time effect $v_t$ with the dummy variables for each zip code.

In the following info box you find details about how to implement an interaction of two factors in R.

< info "Interaction of factors -- implementation in R"

We want to compute the interaction of two factors in R. The R command to compute interactions is simply called interaction(). You can abbreviate this command by just typing : between the two factors which should be interacted. Hence, we have to assure that the variables we want to interact are factor variables. If this is not the case we can transform these variables with the command factor().

As described above we allow the time variable monthyear to differ by the variable zip. The two variables are part of the data frame dat and thus we have to write for example dat$zip. However, in the regression formula we just write zip because we specify the data frame separately.

In the following code chunk you can see the R commands for implementing the interaction of the variables zip and monthyear.

# long version
interaction(factor(dat$zip), factor(dat$monthyear))

# short form
factor(dat$zip):factor(dat$monthyear)

You can convince yourself of a correctly implemented interaction by showing the different levels of the factors. First, show the levels of the individual factors. Second, compute the number of levels of the interacted factor. The number of levels of the interacted factor should be the product of the numbers of levels of the two individual factors. In our case, the factor zip has nine levels and monthyear has $36$ levels and thus there are $9*36=324$ levels in total.

# individual factors
levels(factor(dat$zip))
levels(factor(dat$monthyear))

# interacted factor
length(levels(factor(dat$zip):factor(dat$monthyear)))

You can find further information about implementation of interactions at https://stat.ethz.ch/R-manual/R-devel/library/base/html/interaction.html. Furthermore, you can get more details about how to use factor interactions in combination with the function felm() at https://cran.r-project.org/web/packages/lfe/lfe.pdf and http://finzi.psych.upenn.edu/R/library/lfe/html/felm.html.

>

Similar to the previous exercise we will calculate the estimates for the electricity consumption first. In part b) we will run a second regression in order to get the estimates for the natural gas consumption.

a) Electricity consumption

Now, we will test the alternative specification of the regression model which we already described above. Again, we will run the regression with the function felm(). The estimation methods and the arguments of the function felm() are the same as in the previous exercise, except for the following: we now use interacted dummy variables of zip and monthyear. When performing such a regression we have to make sure that these variables are factor variables.

In the info box above you could find out how to implement an interaction of two factors in R.

In this part of the exercise we regress on the electricity consumption. This means our dependent variable is elec. We save the regression results in the variable reg2a.

Press check to run the regression.

#< task
# run the regression               
reg2a = felm(elec ~ code_change + logfeet + centralair + shingled | baths + beds + 
               stories + factor(zip):factor(monthyear) | 0 | home_id, data=dat)
#>

Let us show the results of the regression. We don't want to consider the results of this regression in isolation but in comparison with the results of the original regression (model 1). The results of the original regression with the dependent variable elec are stored in reg1a. Hence, we will show the results of reg1a and reg2a by using the function screenreg(). We will specify the same arguments of the function screenreg() as in the previous tasks.

Click on the button check to show the results.

#< task
# output of regression results of reg1a and reg2a
screenreg(list(reg1a, reg2a), 
          custom.model.names = c("Electricity (model 1)","Electricity (modified)"),
          omit.coef = "(Intercept)", digits = 3, stars = c(0.01, 0.05, 0.10))
#>

We find very similar results for both regressions. The OLS estimates all have the same signs and the estimated values are really close to each other. Additionally, the significance levels of all estimates are the same and the differences between the standard errors are small. Comparing the adjusted $R^2$ of both regressions we see that the adjusted coefficient of determination of the second regression is a little bit higher. It takes the value $0.511$ instead of $0.499$.

All in all, both models fit the data. However, in regression reg2a a greater part of the variation in electricity consumption is explained. To sum up, the coefficient estimates seem to be robust towards a variation of the regression model.
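
If you want to compare the core coefficient estimates directly as numbers, a minimal sketch (assuming reg1a and reg2a from above are available in your workspace) could look like this:

# sketch: core coefficient estimates of model 1 and the modified model side by side
cbind("model 1" = coef(reg1a), "modified" = coef(reg2a))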

b) Natural gas consumption

Now, we perform a robustness check for the natural gas consumption. Hence, the response variable of our regression is the variable gas. All other parts of the regression formula stay the same as in part a).

Task: Estimate the annual differences in the natural gas consumption between pre- and post-code residences. Choose the regression model with the additional specification of time effects that vary with each zip code. Furthermore, use the same methods as in the previous regressions, i.e. OLS estimation with standard errors clustered at the residence level, and project the factors out.

Enter your code and press check afterwards. If you need an additional advice press hint.

#< task
# enter your code here...
# reg2b = ???
#>
reg2b = felm(gas ~ code_change + logfeet + centralair + shingled | baths + beds + 
               stories + factor(zip):factor(monthyear) | 0 | home_id, data=dat)

#< hint
display("Your code should look like: reg2b = felm(??? ~ code_change + logfeet + centralair + shingled | baths + beds + stories + factor(zip):factor(monthyear) | 0 | home_id, data=dat). Replace the ??? by the correct dependent variable.")
#>

< award "Regression master level 3"

Great! You performed a complex regression on your own by using the function felm(). Additionally, you took the interaction of two factors into account.

>

You estimated the coefficients of the expanded regression model. The results are stored in the variable reg2b.

In the following, we will compare this model with our original model with uniform time effects (reg1b) in order to check for robustness. Hence, we use the function screenreg() to show both regression results next to each other.

#< task
# show regression results 
screenreg(list(reg1b, reg2b), custom.model.names=c("Natural gas (model 1)","Natural gas (modified)"),
          omit.coef="(Intercept)", digits=3, stars = c(0.01, 0.05, 0.10))
#>

Interpret these results by solving a quiz.

< quiz "robustness check - natural gas"

question: Denote the correct statement.
sc:
- Coefficient estimates of both regression models are very similar.*
- The two regressions produce really different estimates.

success: Great, your answer is correct!
failure: Try again.

>


To sum up, both regression models for the annual differences in natural gas consumption fit the data well.

All in all, the resulting estimates are stable between the corresponding two regression models, that is, reg1a and reg2a for electricity and reg1b and reg2b for natural gas consumption. The regression models of exercise 4.1 and exercise 4.2 differ only in the time effects. We used uniform month-year effects in the regression models of the previous exercise and allowed the time effects to vary by the variable zip in the regression models of this exercise. Hence, the regression models are robust.

This exercise refers to pages 39 to 41 and to table 3 of the paper.

Exercise 4.3 -- Effect of the code change on different months

In exercise 4.1 we estimated the annual differences of the energy consumption of pre- and post-code residences. Our main findings were that the code change led to a reduction in residential energy consumption.

In this exercise we no longer want to consider estimates which are averaged across months. Instead, we will calculate individual estimates for each month of the year. The reason for this approach is that the effects of the code change most likely differ between the months within one year because the variability of the weather throughout the year is really high. Therefore, the demand for cooling and heating also differs heavily among the seasons. Because natural gas is used for heating and electricity for air-conditioning -- and therefore for cooling -- there is a dependency between weather variability and residential energy consumption (according to paper, pp. 35 and 40).

To conclude, it is essential to estimate monthly differences to get more precise estimates.

Regression model -- model 2

Hence, we expand the regression model of exercise 4.1 (model 1) by dummy variables for the different months of the year. This model, which we denote by model 2, is given by $$Y_{it} = \delta \cdot code_change_i \times month_t + \beta \cdot X_i + v_t + \epsilon_{it}.$$ All parameters but one are the same as in regression model 1. The new parameter is called $month_t$ and it contains dummy variables for each month of the year. Additionally, the variables code_change and month are interacted. Therefore, we get $12$ different estimates ($\hat\delta$) for the effects of the code change on energy consumption. Each estimate is associated with one month.

Furthermore, we still use the same estimation methods, like OLS estimation and standard errors clustered by the variable home_id.

In the first part of this exercise we will run two regressions of model 2. We will first use electricity consumption as dependent variable and then in part b) natural gas consumption. In part two of this exercise we will visualise the regression results by two figures.

Let's start with our analysis and load the data.

The data frame we will use in this exercise is called CCC_regression2.dta. Remember, the abbreviation CCC stands for code change comparisons. There is only one difference to the data frame CCC_regression1.dta of the previous exercises: the data frame CCC_regression2.dta contains an additional column which is called month. We need this variable because we want to determine the effects of the code change for different months.

First, press edit and afterwards check to load the data frame. Additionally, show a part of it.

#< task
# load the data set CCC_regression2.dta and assign it to dat
dat = read.dta("CCC_regression2.dta")

# show a part of the data frame dat
head(dat)
#>

Before we run the first regression of model 2 we will take a closer look at the interaction of the two variables code_change and month. The R code for implementing this interaction is code_change:factor(month). The interaction term contains $12$ different variables which are again dummy variables. Let us denote the first interacted variable by CodeMonth1. This variable takes the value $1$, if $code_change = 1$ and $month = 1$ (meaning January) and $0$ otherwise.

Let the second variable be CodeMonth2. This variable takes the value $1$, if $code_change = 1$ and $month = 2$ (meaning February) and $0$ otherwise. The corresponding holds true for the further $10$ code-month variables.
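
To make this concrete, the following minimal sketch builds such a dummy variable by hand. CodeMonth1 is just a hypothetical helper name; the sketch assumes that dat from above is loaded and that month is coded from 1 to 12.

# sketch: CodeMonth1 equals 1 only for post-code residences in January billing months
CodeMonth1 = as.numeric(dat$code_change == 1 & dat$month == 1)
table(CodeMonth1)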

We will perform our regression analysis in two steps. Similarly to the previous exercises we will first look at the electricity consumption and in part b) we will consider the natural gas consumption.

a) Electricity consumption

Now, let's run the first regression of model 2. You can find the equation of this model at the beginning of this exercise.

We want to calculate the estimates for the monthly differences in electricity consumption between pre- and post-code residences. Hence, we use the dependent variable elec.

Before we start with the regression analysis answer the question of the following quiz.

< quiz "effect of code change on electricity"

question: What do you expect? Denote the correct answer.
sc:
- The influence of the code change on electricity consumption is greater during the summer months than in the winter months.*
- The variable code_change affects residential electricity consumption mainly in the winter months.

success: Great, your answer is correct!
failure: Try again.

>


Now, we will run the regression and check if our expectations are correct.

We will save the regression results in the variable reg1a_month because we need this variable in the following part of this exercise for plotting. Further, we will display the results of the regression with the function screenreg(). We are only interested in estimates of the monthly code change effects. Thus, we omit all other coefficients in the output table.

Press check to run the regression and show the results.

#< task
# run the regression
reg1a_month = felm(elec ~ code_change:factor(month) + logfeet + centralair + shingled | 
                     baths + beds + stories + zip + monthyear | 0 | home_id, data=dat)

# show the regression results 
screenreg(reg1a_month, custom.model.names = c("Electricity"), 
          # omit all coefficients apart from the monthly code change variables
          omit.coef = "(Intercept)|(logfeet)|(centralair)|(shingled)", 
          digits = 3, stars = c(0.01, 0.05, 0.10))
#>

We shortly interpret the results of reg1a_month. You see that the $R^2$ and the adjusted $R^2$ are approximately $0.5$ which means that the regressors explain roughly $50\%$ of the variance of the monthly electricity consumption. Seven coefficient estimates are significant at a level of $10\%$ or even lower. These estimates correspond to the months April to October and are all negative. This means that post-code residences consume less electricity in the warmer months of the year than pre-code residences. The greatest difference between the electricity consumption of pre- and post-code residences occurs in June, where post-code buildings consume roughly $114.3$ kWh per month less electricity than pre-code residences.

You can find a detailed interpretation and further explanations of the results at the end of this exercise.

Now, we turn to the estimation of the monthly effect of the code change on residential natural gas consumption.

b) Natural gas consumption

Again, we use regression model 2 but in this case the dependent variable is gas.

State your expectations first. After that we will start with the regression analysis.

< quiz "effect of code change on natural gas"

question: What do you expect? Denote the correct answer.
sc:
- The influence of the code change on natural gas consumption is greater during the summer months than during the winter months.
- The variable code_change affects residential gas consumption mainly during the colder months.*

success: Great, your answer is correct!
failure: Try again.

>


Task: Estimate the monthly differences in natural gas consumption between pre- and post-code residences. Use OLS estimation and the function felm(). Furthermore, choose regression model 2 and the same methods as in the previous exercises, i.e. project factors out and use standard errors that are clustered by the variable home_id. Store the results in the variable reg1b_month.

Enter your code and then press check.

#< task
# enter your code here...
# reg1b_month = ???
#>
reg1b_month = felm(gas ~ code_change:factor(month) + logfeet + centralair + shingled | 
                     baths + beds + stories + zip + monthyear | 0 | home_id, data=dat)

#< hint
display("Your command should look like reg1b_month = felm(??? ~ ??? : ??? + logfeet + centralair + shingled | baths + beds + stories + zip + monthyear | 0 | home_id, data=dat). Replace the ??? by the response variable and two explanatory variables.")
#>

Task: Show the regression results. Don't use the function summary(). The necessary package is already loaded. Some arguments of the function are given.

Replace the ??? by correct commands. Then, uncomment these lines and press check.

#< task
# replace the ??? by correct commands

# ???(???, custom.model.names=c("Natural gas"),
# omit.coef="(Intercept)|(logfeet)|(centralair)|(shingled)", digits = 3, 
# stars = c(0.01, 0.05, 0.10))
#>
screenreg(reg1b_month, custom.model.names = c("Natural gas"), 
          omit.coef = "(Intercept)|(logfeet)|(centralair)|(shingled)", 
          digits = 3, stars = c(0.01, 0.05, 0.10))

#< hint
display("The function you should use is called screenreg and the regression model reg1b_month.")
#>

< award "Regression master level 4"

Great! You can run a regression with the function felm(), show its results with the function screenreg() and interpret these results correctly.

>

You find three highly significant coefficient estimates -- at a $1\%$-significance level. They belong to the months January, February and December. These estimates are again all negative and thus show a reduced monthly natural gas consumption of post-code residences in comparison to pre-code buildings. The lower gas consumption ranges between $7.3$ and $9.3$ therms per month. Additionally, the coefficient of determination indicates a good model fit.

You can find further interpretation of the $R^2$ in the following quiz.

< quiz "interpretation R-squared"

question: Complete the following sentence. The regressors explain roughly ... % of the variance of the monthly natural gas consumption. Round the result to zero decimal places.
answer: 46
roundto: 1

success: Great, your answer is correct!
failure: Try again.

>


You can find a detailed interpretation of the previous regression results at the end of this exercise.

Now, we will visualise the main results of the regressions reg1a_month and reg1b_month by two figures.

Visualisation of regression results

We will draw two graphs which will show the estimates for the code change effects for each month. This means we will plot the estimates $\hat\delta$ of regression model 2. For better comparability, we will transform the estimates $\hat\delta$ into percentage differences. Additionally, we will draw the $95\%$ confidence intervals of the estimates in order to illustrate sampling uncertainty (according to Stock and Watson (2007), pp. 155-156).

Hence, we have to do several steps of data preparation.

Data preparation

First, we calculate the average monthly electricity and natural gas consumption. We need these mean values for transforming the estimates into percentage differences.

#< task
# prepare data set and calculate the monthly mean values
dat_month = dat %>%
  select(elec,gas,month) %>%
  group_by(month) %>%
  summarise(Melec = mean(elec), Mgas = mean(gas))

# show a part of the data frame dat_month
head(dat_month)
#>

Hence, the mean values are stored in the variables Melec and Mgas for electricity and natural gas consumption, respectively.

In the next step, we select the estimates $\hat\delta$ of the regressions reg1a_month and reg1b_month. The first regression yields the electricity results and the second one the results for natural gas consumption. You get these estimates by using the command reg_name$coef. Store them in the vectors deltaElec and deltaGas.

Then, transform the estimates into percentage differences. If you can't remember how to do this take a look at the info box in exercise 4.1.

The following code chunk contains these preparations. Just press check to perform the transformations.

#< task
# store the estimates in separate vectors
deltaElec = reg1a_month$coef[4:15] 
deltaGas = reg1b_month$coef[4:15] 

# transform deltaElec into percentage differences (pd)
pd_Elec= deltaElec/dat_month$Melec * 100#%
# show the vector
pd_Elec

# transform deltaGas into percentage differences
pd_Gas= deltaGas/dat_month$Mgas * 100#%
# show the vector
pd_Gas
#>

In the following step, we will calculate the confidence intervals. If you like to get more information about confidence intervals you can look at the following info box.

< info "Confidence intervals"

A confidence interval for a parameter $\beta$ is an interval which contains the true value of this parameter with a pre-defined probability (according to Stock and Watson (2007), p. 776). Hence, a $95\%$ confidence interval for the regressor $\beta_1$ contains the true value of $\beta_1$ with a probability of $95\%$.

A second but equivalent definition of a $95\%$ confidence interval is that we cannot reject this set of values when using a two-sided hypothesis test with a significance level of $5\%$.

We can construct confidence intervals for a single regressor by using its OLS estimate and its standard error (SE). The sample size of our data is large and thus we can assume an asymptotic standard normal distribution. Therefore, the critical value of a two-sided test statistic is $\pm 1.96$ (according to Stock and Watson (2007), pp. 78-79). Hence, the $95\%$ confidence interval for the regressor $\beta_1$ is $$[\hat\beta_1 - 1.96 \cdot SE(\hat\beta_1), \hat\beta_1 + 1.96 \cdot SE(\hat\beta_1)]$$ (according to Stock and Watson (2007), pp. 155-157).

Similarly, you can construct a confidence interval for a single coefficient in a multiple regression model (according to Stock and Watson (2007), p. 223).

reference: Stock and Watson (2007), pp. 78-79, 155-157, 223, 776

>
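
To see this formula in action, the following minimal sketch computes a single $95\%$ confidence interval by hand from an estimate and its clustered standard error. It assumes that reg1a_month from above is available and that the fourth coefficient is the first code-change-by-month estimate.

# sketch: 95% confidence interval for one coefficient, built as estimate +/- 1.96 * SE
est = coef(reg1a_month)[4]
se = reg1a_month$cse[4]  # clustered standard error
c(lower = est - 1.96 * se, upper = est + 1.96 * se)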

The R function confint() computes confidence intervals for the parameters of a fitted model. We will use $95\%$ confidence intervals. Thus, we have to choose level=0.95. The confidence intervals for the monthly estimates $\hat\delta$ are stored in particular rows. In our case, we have to select rows $4$ to $15$ to receive the proper confidence intervals.

You can find more details about the function confint() at https://stat.ethz.ch/R-manual/R-devel/library/stats/html/confint.html.

Furthermore, we have to transform the confidence intervals into percentage differences to make them suitable for drawing the figures.

The following code chunk contains these two preparation steps. You only have to press check.

#< task
# compute the confidence intervals:
# for first regression reg1a_month (electricity consumption)
CI_Elec = confint(reg1a_month, level=0.95)
CI_E_low = CI_Elec[4:15,1] # lower borders
CI_E_up = CI_Elec[4:15,2] # upper borders

# for second regression reg1b_month (natural gas consumption)
CI_Gas = confint(reg1b_month, level=0.95)
CI_G_low = CI_Gas[4:15,1] # lower borders
CI_G_up = CI_Gas[4:15,2] # upper borders

# transform the confidence intervals into percentage differences
# electricity consumption
# lower borders
CI_pdE_low = CI_E_low/dat_month$Melec * 100#% 
# upper borders
CI_pdE_up = CI_E_up/dat_month$Melec * 100#%

# natural gas consumption
# lower borders
CI_pdG_low = CI_G_low/dat_month$Mgas * 100#%
# upper borders
CI_pdG_up = CI_G_up/dat_month$Mgas * 100#%
#>

In the next step of data preparation we create a data frame with the necessary variables which we will need for drawing the figures.

First, we generate a data frame with the two columns pd_Elec and pd_Gas which contain the estimates $\hat\delta$ (measured in percentage differences). The function cbind() combines the two vectors by columns. Look at http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html for further information.

Second, we generate a time column date which contains the months of the year. Last, we add four columns representing the upper and lower limits of the confidence intervals (also in percentage differences).

Press check to create the data frame dat_plot. If you click on the button data the Data Explorer will open and you will see the complete data frame.

#< task
# create a data frame with the necessary variables for plotting 
dat_plot = as.data.frame(cbind(pd_Elec,pd_Gas)) %>%
  # add time column
  mutate(date=seq(from=as.Date("2002-01-01"), to=as.Date("2002-12-01"), by="month")) %>%
  # add columns with upper and lower borders of confidence intervals
  mutate(CI_pdE_low, CI_pdE_up, CI_pdG_low, CI_pdG_up)

# show a part of the data frame
head(dat_plot)
#>

Now, all preparations are done and we can plot the figures.

Plotting the figures

The first figure will illustrate the monthly effects of the code change on electricity consumption and the second graph will show the results for the natural gas consumption.

The code for plotting these two figures contains a lot of functions. In the following paragraph you will find a short explanation of the commands we have not used so far in this problem set.

The function Sys.setlocale() is used to get month names in English instead of German. You can find more details at https://stat.ethz.ch/R-manual/R-devel/library/base/html/locales.html. The function ggtitle("...") adds a title to the figure and with the command ylim() we specify the limits of the y-axis. You can find further information about the function ylim() at http://docs.ggplot2.org/0.9.3/xylim.html. The command theme_bw() creates a white plot background with grid lines. Another new function is geom_hline(). We want to add a horizontal line at $y=0$ and thus we use this function with the argument aes(yintercept=0). You can find further information at http://docs.ggplot2.org/0.9.3.1/geom_hline.html. Additionally, we can specify the line type with the argument lty, where lty=2 creates a dashed line. Last, we add the confidence intervals by using the function geom_errorbar(). With the argument aes(ymin=..., ymax=...) we can specify the lower and upper borders of the confidence intervals. You can read more about this function at http://docs.ggplot2.org/0.9.3.1/geom_errorbar.html.

Now, just press check to draw the figures.

#< task_notest
# get month names in English
Sys.setlocale("LC_ALL", "English")

# create the figure for electricity consumption
e = ggplot(data=dat_plot, aes(x=date)) + xlab("") + ylab("percent differences") + 
  ggtitle("electricity consumption") + ylim(-14,8) + theme_bw() 

e1 = e + geom_point(aes(y=pd_Elec),size=4,col="red") + 
  geom_line(aes(y=pd_Elec),size=1,col="red") +
  scale_x_date(labels=date_format("%b"), breaks = date_breaks("1 month")) +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17))

e2 = e1 + geom_hline(aes(yintercept=0), lty=2, size=1)

e3 = e2 + geom_errorbar(aes(ymin=CI_pdE_low, ymax=CI_pdE_up),lty=2)

# create the figure for natural gas consumption
g = ggplot(data=dat_plot, aes(x=date)) + xlab("") + ylab("percent differences") + 
  ggtitle("natural gas consumption") + ylim(-35,25) + theme_bw() 

g1 = g + geom_point(aes(y=pd_Gas),size=4,col="red") + 
  geom_line(aes(y=pd_Gas),size=1,col="red") + 
  scale_x_date(labels=date_format("%b"),breaks = date_breaks("1 month")) +
  theme(axis.text = element_text(size=12), axis.title = element_text(size=15), 
        plot.title = element_text(size=17))

g2 = g1 + geom_hline(aes(yintercept=0), lty=2, size=1)

g3 = g2 + geom_errorbar(aes(ymin=CI_pdG_low, ymax=CI_pdG_up),lty=2)

# draw the figures 
grid.arrange(e3,g3)
#>

These graphics are based on figures 1 and 2 of the paper.

Now, you see the figures which illustrate the monthly effects of the code change on residential electricity and natural gas consumption. Additionally, the figures show the $95\%$ confidence intervals of the estimates.

In the following paragraph we will describe and interpret these figures.

First, answer the questions of the following quiz.

< quiz "plot code change effect on month"

question: What do you find? Two answers are correct.
mc:
- Post-code residences consume between 4 and 8 percent less electricity in the warmer months in comparison to pre-code buildings.*
- The code change effects regarding improved efficiency of air-conditioning systems are mainly present in the winter months of the year.
- In December, January and February you find an increase of the natural gas consumption of about 14 to 24 percent.
- The effects of the code change regarding efficiency gains for heating cause a reduction of natural gas consumption during the winter months.*

success: Great, all answers are correct!
failure: Not all answers correct. Try again.

>


A detailed interpretation follows now.

Interpretation of the regression results

First, we take a look at the electricity results and then at the results of natural gas consumption.

Overall, we find that post-code buildings consume less electricity during the warmer months than pre-code buildings. In the winter months all residences consume roughly the same amount of electricity. When we additionally consult the output of our regression analysis from the beginning of this exercise (reg1a_month) we find that the coefficient estimates for the warmer months -- from April to October -- are significant at the $10\%$ level or lower. To sum up, we find that residences built after the code change consume between $4.7\%$ and $8\%$ less electricity in the warmer months in comparison to pre-code buildings (according to paper, p. 40). In contrast, the results for the colder months are not significant, but this is not a problem here.

A reason for the previous results is the central air-conditioning of the residences and its varying usage during the year. A more detailed explanation follows now. First of all, we already know from exercise 1.2 that most of the residences of our data set have central air-conditioning. Additionally, we mentioned several times before that a huge part of the residential electricity consumption is used for central air-conditioning. For example, in $2009$ residences in the South Atlantic Region of the U.S. used $21.2\%$ of their total electricity consumption for air-conditioning (according to U.S. EIA (2009), Table HC1.10). Last, air-conditioning is only used during the warmer months and not during the colder ones. Hence, the effects of the code change -- especially the effects regarding improved efficiency of residential air-conditioning systems -- are mainly present in the warmer months and therefore lead to a reduced electricity consumption in those months. This is exactly what the first figure shows (according to paper, p. 40).

Now, let's turn to the second figure.

The results for natural gas consumption show the opposite pattern. The post-code residences consume less natural gas during the winter months in comparison to the pre-code buildings. In all other months of the year the residences built after the code change consume more natural gas. However, only three estimates are significant (at the $1\%$ level). These are the estimates for the colder months -- January, February and December -- and they indicate a reduction in natural gas consumption ranging from $14.3\%$ to $23.6\%$.

An explanation for these results is that natural gas is mainly used for space and water heating. We already mentioned this relationship in exercise 1.2. You can find exact numbers in U.S. EIA (2009), Table HC1.10. Additionally, heating is only necessary during the winter months. Hence, the effects of the code change -- especially the efficiency gains for heating -- cause savings in the natural gas consumption of these residences during the winter months. This is shown by the second figure (according to paper, p. 40).

To sum up, we performed a monthly comparison of the differences in the energy consumption of pre- and post-code residences. We found out that the building code change leads to a reduction of residential electricity (in the summer months) and of natural gas consumption (in the winter months) because of more efficient central air-conditioning systems and improved heating systems, respectively.

This exercise refers to page 40 and to figures 1 and 2 of the paper.

Exercise 4.4 -- Possible limitations

In the last two exercises we observed a downward trend in the energy consumption of the Gainesville residences. We attributed the reduction in energy consumption to the building code change. However, are there other possible reasons for the reduced residential energy consumption besides the code change? We want to find an answer to this question.

Jacobsen and Kotchen -- the authors of the paper -- already tried to rule out other possible explanations through their empirical strategy. To be more precise, they only used data for residences which were built within three years before and three years after the code change. Additionally, they set up regression models with uniform time effects (as we did in exercise 4.1) and zip-code specific time effects (compare exercise 4.2). We found out that the resulting estimates of exercises 4.1 and 4.2 were really close to each other. However, the authors also tested for another possible reason for the downward trend in energy consumption. They considered a further regression model with the explanatory variable EYB instead of the variable code_change and analysed if there could be "a potential time trend independent of the code change" (paper, p. 41).

Regression model -- model 3

To be more precise, they estimated the differences in energy consumption separately for each effective year built (EYB) of the residences. Hence, they regressed either elec or gas on EYB. They used the same control variables as in the previous two exercises. Therefore, the new regression model -- model 3 -- looks like $$Y_{it} = \theta \cdot EYB_i + \beta \cdot X_i + v_t + \epsilon_{it}.$$ The new parameters of this model are $EYB_i$, which stands for seven different dummy variables for the EYB of each residence (from the years $1999$ to $2005$), and $\theta$. The vector $\theta$ is of primary interest. It contains the EYB specific estimates for the differences in energy consumption.

Jacobsen and Kotchen wanted to find out if the estimates $\hat\theta$ would show a downward trend across the different values of the variable EYB. However, none of the resulting estimates were significant and they did not show a downward trend. Hence, we do not replicate this regression analysis in this problem set. You can take a look at pages 41 and 42 and at figures 3 and 4 of the paper if you are interested in the exact regression results.
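
If you nevertheless want to try it yourself, a purely hypothetical sketch of model 3 could look like the following. The data frame name dat_eyb and the column EYB are assumptions -- the sketch only works on a data frame that actually contains the effective year built together with the controls used above.

# hypothetical sketch of model 3 -- requires a data frame dat_eyb with a column EYB
reg3 = felm(elec ~ factor(EYB) + logfeet + centralair + shingled | 
              baths + beds + stories + zip + monthyear | 0 | home_id, data = dat_eyb)
summary(reg3)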

To sum up, the authors considered a potential limitation of their empirical strategy and searched for other possible reasons for the reduction in residential energy consumption besides the building code change. However, they did not find any other reason. Hence, we can conclude that our regression results are driven by the code change rather than by an unobserved time trend.

This exercise refers to pages 41 and 42 and to figures 3 and 4 of the paper.

Exercise 5 -- Effects of the code change for hot and cold days

In this exercise we turn to the second empirical strategy of the authors. This strategy is used to determine the effects of the code change on energy consumption related to weather variability. To be more precise, we analyse the "differences in the responsiveness of energy consumption to variability in weather" (paper, p. 47). We can interpret this strategy as a robustness check against the first empirical strategy -- especially against the estimation of the effects of the code change on different months (according to paper, p. 47). We performed the monthly estimations in exercise 4.3.

In exercise 5.1 you will learn about the theory of difference-in-differences estimation and of fixed effects regression. Then, in exercise 5.2 we will introduce a regression model and perform estimations with our data. We will examine the effects on electricity consumption and natural gas consumption separately.

Exercise 5.1 -- Theory of fixed effects regression and difference-in-differences estimation

In this exercise we will describe the two estimation methods which we will use in the following exercise. These methods are called difference-in-differences estimation and fixed effects regression. We did not use either of these two methods in the previous exercises. Hence, we will get detailed information about them now.

The structure of this exercise is the following: first, we will get information about the weather variables, second we will learn about difference-in-differences estimation and last, the method of fixed-effects regression will be explained.

Weather variables

Now, we will get some more information about the weather variables of this data set. The changes in weather are measured by the two weather variables AHDD (average heating degree days) and ACDD (average cooling degree days). We already considered these variables in exercise 1.3. However, you find a short repetition of the most important facts of these variables in the following few lines.

The weather data has been collected by one weather station and it has been averaged to monthly values. The two corresponding variables are called HDD and CDD in our data frame. We already plotted these variables in exercise 1.3 and found a high variability of the weather variables. We additionally found that ACDD is much higher than AHDD and that ACDD peaks in the summer months whereas it is close to zero during the winter months. AHDD behaves in exactly the opposite way. Furthermore, we know that the variability of weather is the main reason for fluctuations in residential energy consumption. However, if you need further information about these variables you can take a look at exercise 1.3 again.

Difference-in-differences estimation

You can find details about the difference-in-differences estimation method in the following info box.

< info "Difference-in-difference estimation"

This estimation method requires a situation "in which one group has experienced the treatment, whereas another, comparable group has not" (Kennedy (2008), p. 382). Then, we can use this method to estimate the causal effect of the treatment (according to Stock and Watson (2007), p. 480).

The difference-in-difference estimator is the difference between the average changes in $Y$ (the response variable) in the two groups -- the treatment and control group -- over the same time (according to Kennedy (2008), p. 382 and Stock and Watson (2007), pp. 480-481). The exact formula of the difference-in-differences (DiD) estimator looks like $$\hat\beta^\textrm{DiD}_1 = (\overline{Y}^\textrm{treatment,after} - \overline{Y}^\textrm{treatment,before}) - (\overline{Y}^\textrm{control,after} - \overline{Y}^\textrm{control,before}) = \Delta\overline{Y}^\textrm{treatment} - \Delta\overline{Y}^\textrm{control}$$ with $\overline{Y}^\textrm{treatment,after}$ being the sample average of $Y$ in the treatment group after the experiment and $\Delta\overline{Y}^\textrm{treatment}$ being the average change in $Y$ in the treatment group. The other averages are defined analogously (Stock and Watson (2007), p. 481).

If you like further information about the difference-in-differences estimation take a look at Stock and Watson (2007), pp. 480-484.

>
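
To see the formula in action, here is a toy sketch with made-up group averages (the numbers are purely hypothetical and not taken from the paper):

# toy sketch: difference-in-differences estimate from four hypothetical group averages
Y_treat_before = 1100;   Y_treat_after = 1000    # treatment group, before/after
Y_control_before = 1050; Y_control_after = 1040  # control group, before/after
(Y_treat_after - Y_treat_before) - (Y_control_after - Y_control_before)  # = -90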

Now, we adapt this method to our data.

First, we define pre-code residences (i.e. the residences which were built within the three years before the code change) as the control group and post-code residences (i.e. residences which were constructed within the three years after the code change) as the treatment group (according to paper, p. 47). Second, we note that our data only contains monthly utility bills for the years $2004$ to $2006$. This means that the billing records were only collected in the years after the code change. Hence, we have $36$ time periods instead of two -- where two is the typical number of time periods for difference-in-differences estimation.

However, the greater number of time periods causes no problems. We can estimate the coefficients by using a fixed effects regression model (according to Stock and Watson (2007), pp. 483-484, 517).

Fixed effects regression

Now, we will learn about fixed effects regression.

First, we will state why we use fixed effects in our data. In the two regression models (model 4 and model 5) of exercise 5.2 we will use residence-specific intercepts as fixed effects. The aim of these intercepts is to control for "any unobserved, time-invariant heterogeneity among residences" (paper, p. 43). In the following part you will get a detailed explanation. First of all, you can take a look at the following info box.

< info "Definitions - panel data, cross-sectional data and heterogeneity"

This info box provides some basic definitions and explanation about panel data, cross-sectional data and about heterogeneity.

Our underlying data is panel data. This means we have "data for multiple entities in which each entity is observed at two or more time periods" (Stock and Watson (2007), p. 13). We later also mention cross-sectional data or cross-sectional units. The difference between panel data and cross-sectional data is that cross-sectional data only contains observations of different entities for one time period and not for two or more time periods (refer to Stock and Watson (2007), pp. 11, 15).

Another important term is heterogeneity. It means that the "micro units [of panel data, A/N] are all different from one another in fundamental unmeasured ways" (Kennedy (2008), p. 282). In our case, we have heterogeneity among residences. This means, in a plot of the data we would observe the following situation: the observations of one residence would lie on one line. However, if we consider several residences all lines would have the same slope but different intercepts. Hence, we would observe parallel lines and one line per residence.

The reason for the different intercepts are unobserved variables that influence the response variable (according to Kennedy (2008), pp. 282-283) -- in our case, it is the energy consumption. For example, in our situation there could be "an unobserved trend in average energy consumption based on EYB" (paper, p. 43) where EYB stands for effective year built. We already considered this possible influence in exercise 4.4.

>

We already mentioned in the info box that heterogeneity in our data comes from unobserved variables that influence the energy consumption and are correlated with other explanatory variables.

Hence, the general problem is omitted variables, which cause our OLS estimates to be biased (according to Kennedy (2008), p. 283). To be more exact, we have omitted variables that "vary across entities but do not change over time" (Stock and Watson (2007), p. 356). Thus, these variables are time-invariant. To avoid this problem we will include dummy variables for each residence (according to Stock and Watson (2007), p. 356 and Kennedy (2008), pp. 282-283, 290).

To sum up, "fixed effects regression is a method for controlling for omitted variables in panel data when the omitted variables vary across entities but do not change over time" (Stock and Watson (2007), p. 356). You can shortly describe the fixed effects regression model as a regression model that contains one intercept for each entity -- in our case, for each residence. Each intercept can be represented by one dummy variable and these dummy variables control for the influence of omitted variables.

Hence, a fixed effects estimator is an OLS estimator "applied to the fixed effects model" (Kennedy (2008), p. 283). The fixed effects estimator is also called within estimator because it "uses variation within each cross-sectional unit" (Kennedy (2008), p. 286).

You can find detailed information about the fixed effects regression model in the info box below.

< info "Fixed effects regression model"

This info box provides some more information about the fixed effects regression model. You already got some basic knowledge. However, this info box will show you how the fixed effects regression model is exactly set up.

We start with an ordinary regression model like $$Y_{it} = \beta_0 + \beta_1 \cdot X_1 + \epsilon_{it}.$$ The index $i=1,...,n$ refers to entity (in our case, it is the index of the residence) and the index $t=1,...,T$ refers to time (in our case, it is the month and year of the billing record). For simplicity we only use a model with one explanatory variable $X_1$.

Then, we include the (unobserved) variable $Z_i$ which varies between different residences but not over time. Hence, the new regression model looks like $$Y_{it} = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot Z_i + \epsilon_{it}.$$ Our aim is to estimate $\beta_1$. To be more precise, we want to estimate the effect of $X_1$ on $Y$, holding $Z$ constant.

We can transform this new model to a model which has $n$ different intercepts -- in our case, each intercept standing for one residence. The transformed model has the form $$Y_{it} = \beta_1 \cdot X_1 + \alpha_i + \epsilon_{it} ~~\textrm{with}~~ \alpha_i = \beta_0 + \beta_2 \cdot Z_i$$ and is called fixed effects regression model with the (unknown) parameters $\alpha$ as residence-specific intercepts. We also want to estimate these parameters.

The $\alpha$'s are often referred to as residence-specific intercepts because the slope $\beta_1$ of the regression line is the same for all residences but the intercepts are different for each residence. The reason for the variation of the intercepts is the omitted variable -- in our case, it is $Z_i$.

There is another, but equivalent method to define the fixed effects regression model: we include dummy variables for each residence instead of the residence-specific intercepts.

We have to be careful because our regression model already includes a common intercept. To avoid multicollinearity we have to omit one arbitrary dummy variable. We choose the first one. Therefore, the model looks like $$Y_{it} = \beta_0 + \beta_1 \cdot X_1 + \gamma_2 \cdot D2_i + \gamma_3 \cdot D3_i + ... + \gamma_n \cdot Dn_i + \epsilon_{it}$$ with the dummy variables $D1_i$,...,$Dn_i$, where $D1_i = 1$, if $i=1$ and $0$ otherwise. Furthermore, $D2_i = 1$, if $i=2$ and $0$ otherwise and so on.

Now, we can estimate the coefficients by OLS estimation. The estimation gets easier if you transform the data and subtract the corresponding averages because you avoid estimating all the dummy variables. However, in the following exercise we will use the function felm() for performing the OLS estimation. Hence, including the dummy variables is no problem because we will project them out.

When we compare both fixed effects regression models we find the following relationship: $\alpha_1 = \beta_0$ and for all $i \ge 2$ it holds that $\alpha_i = \beta_0 + \gamma_i$.

If we have more than two time periods or unobserved variables that vary over time we can expand our model and include further dummy variables -- so called time fixed effects. You can get more information in Stock and Watson (2007), chapter 10.4 (pp. 361-364).

On the other hand, there are also disadvantages of the fixed effects regression model. One disadvantage is the loss of degrees of freedom due to the inclusion of dummy variables and thus a reduction in the power of tests. You can read more in Kennedy (2008), pp. 283-284.

references: Stock and Watson (2007), pp. 356-359, 372, Kennedy (2008), pp. 282-283, 292-293

You can find further information about fixed effects estimation in Andreß et al. (2013), pp. 128-147.

>
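
The following minimal sketch illustrates this equivalence with simulated data (the values are hypothetical and have nothing to do with the paper's data set): the slope estimate from an OLS regression with entity dummies and from felm() with the entity projected out coincide.

# sketch: fixed effects via dummy variables vs. projecting out the entity with felm()
library(lfe)
set.seed(1)
n_id = 50; n_t = 12
d = data.frame(id = rep(1:n_id, each = n_t))
alpha = rnorm(n_id)                # entity-specific intercepts
d$x = rnorm(nrow(d))
d$y = 2 * d$x + alpha[d$id] + rnorm(nrow(d))

coef(lm(y ~ x + factor(id), data = d))["x"]   # dummy-variable OLS
coef(felm(y ~ x | factor(id), data = d))["x"] # fixed effects with felm()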

You can test your knowledge by answering the following quiz.

< quiz "fixed effects regression theory"

question: Denote the correct statement.
sc:
- The fixed effects estimator is also called "between estimator".
- The fixed effects regression is a method that controls for omitted variables which vary across entities but are constant over time.*
- The fixed effects regression model contains a different intercept for each entity and in our case we have weather-specific intercepts.

success: Great, your answer is correct!
failure: Try again.

>

< award "Quiz master - fixed effects regression "

Great! You understood the theory of fixed effects regression.

>

Now, we are able to apply this knowledge to our data. We will determine the effects of the code change for hot and cold days in the following exercise.

This exercise refers to pages 37, 39, 42 and 43 and to figure 5 of the paper.

Exercise 5.2 -- Estimating differences towards variability in weather

In the previous exercise you have learned a lot about the theory of difference-in-differences estimation and of fixed effects regression.

In this exercise we will introduce a fixed effects regression model based on our data and then we will perform the estimation. We will divide the regression analysis into two parts: part a) will deal with the effects of the code change on electricity consumption and part b) with the effects on natural gas consumption. Additionally, each part will contain a robustness check. In part c) of this exercise we will compare all results and draw a conclusion.

Now, we will determine the effects of the code change on energy consumption with the main focus on the relationship of weather variability and residential energy consumption.

Regression model -- model 4

Thus, we consider the following regression model which we denote by model 4: $$Y_{it} = \beta \cdot [CDD_t,HDD_t] + \delta \cdot code_change_i \times [CDD_t,HDD_t] + month_t + year_t + \mu_i + \epsilon_{it}.$$ We already know five of these parameters: first, the weather variables $CDD_t$ and $HDD_t$. Second, the variable $code_change_i$ which indicates if the residence was built after the code change or not. Last, the two time parameters $month_t$ and $year_t$ which are transformed into dummy variables. A so far unknown parameter is $\mu_i$. It is the residence-specific intercept and thus the fixed effect of our fixed effects regression.

When estimating this model we will again use clustered standard errors. As a result we will get estimates for the coefficients $\beta$ and $\delta$. The $\beta$'s are estimates for the uninteracted weather variables and thus belong to the pre-code residences. As opposed to these estimates, the $\delta$'s are the coefficients of the interacted weather variables (interacted with the variable $code_change_i$) and therefore belong to the post-code residences. In our regression model we denote the interacted variables by CodeCDD and CodeHDD. The coefficient estimates of these variables are of primary interest for us. To be more precise, these estimates are difference-in-differences estimates and thus indicate "how residences before and after the energy code change differ in their energy consumption responses to changes in weather" (paper, p. 43).

Now, we start with the analysis of our data. Therefore, we have to load the data.

The data frame is called DiD_regression.dta. The abbreviation DiD stands for difference-in-differences. This data frame consists of eleven different columns which contain information about the energy consumption, the weather situation, the time and the two variables home_id and code_change. Especially important are the four weather variables CDD, HDD, CodeCDD and CodeHDD because we want to determine the effects of the code change for hot and cold days.

Enter your commands and then click on the button check. But first of all, press edit.

#< task
# load the data set DiD_regression.dta and assign it to dat
# enter your code here...
#>
dat = read.dta("DiD_regression.dta")
#< hint
display("Your command should look like: dat = read.dta(\"???\").")
#>

If you would like to see the first few lines of the data frame, solve the following task by clicking check.

The following task can be solved optionally.

#< task
head(dat)
#>

a) Electricity consumption

In this part of the exercise we will determine the differences in electricity consumption of pre- and post-code residences towards variability in weather. Hence, our response variable $Y_{it}$ is elec.

Before we start with the analysis we will state our expectations towards the results.

< quiz "expectations regarding electricity consumption"

question: Complete the following sentence. Regarding electricity consumption we expect that...
sc:
- ... post-code residences are less responsive to increases in AHDD (average heating degree days).
- ... post-code residences are less responsive to increases in ACDD (average cooling degree days).*

success: Great, your answer is correct!
failure: Try again.

>


Now, we perform the regression analysis.

Regression analysis

We use the function felm() to estimate the coefficients. We are only interested in the coefficients of the weather variables CDD and HDD and the coefficients of the interacted variables CodeCDD and CodeHDD. These coefficients were called $\beta$ and $\delta$ in model 4. Thus, we project the other explanatory variables out. These variables describe the time (month and year) and also the residence (home_id) because we use fixed effects estimation. We additionally need the variable home_id in the last part of the formula to get clustered standard errors.

The R code is already given. Just press check to run the regression reg4_1 and show its result. If you haven't solved the previous task you first have to press edit.

#< task
# run the regression
reg4_1 = felm(elec ~  CDD + HDD + CodeCDD + CodeHDD | month + year + home_id | 0 | 
                home_id , data=dat) 

# show the results
summary(reg4_1)  
#>

Here, you find the results of regression reg4_1.

Interpretation of regression results

All four coefficient estimates are significant. First, we consider the uninteracted weather variables CDD and HDD. Both estimates are positive, which indicates that an increase in ACDD or AHDD leads to higher electricity consumption. These results are consistent with the fact that residences in Florida mainly use electricity for both cooling and heating (according to paper, p. 43). To be more precise, air-conditioning accounted for $6.7\%$ of the electricity consumption of households in Florida in the year $2009$. Furthermore, space heating accounted for $5.8\%$ and water heating for $6.2\%$ (according to U.S. EIA (2009), Table HC1.10).

Second, we take a look at the interacted variables CodeCDD and CodeHDD. The two coefficients have different signs. The coefficient estimate of CodeCDD is roughly $-2.5$ and thus shows that the electricity consumption of post-code residences is "less responsive to an increase in ACDD" (paper, p. 43). To be more exact, an increase of ACDD by one unit is associated with a $2.5$ kWh decrease in electricity consumption of post-code residences per month. On the other hand, the coefficient estimate for CodeHDD is approximately $2.4$ which shows an increasing responsiveness of post-code buildings towards rising AHDD and thus suggests lower efficiency of the heating systems of post-code residences. However, this result is not really problematic because in Florida there are only a few days where space heating is necessary (according to paper, p. 44). We already observed this situation in the plot of exercise 1.3.

To conclude, the results are consistent with our expectations. The electricity consumption of post-code residences is $7.8\%$ less responsive to increases in ACDD compared to pre-code buildings. The regression results are also consistent with our results of exercise 4.3 and again show that the energy code change leads to a higher efficiency of the central air-conditioning of the residences (according to paper, p. 44).

However, if you take a look at the estimation results illustrated in table 4 of the original paper you will notice that the results for the clustered standard errors are different in comparison to our results. In the following info box you will find the reason of the differences and a method how to correct the standard errors.

< info "Correction of standard errors"

The clustered standard errors in the previous code chunk are a little bit higher than those reported in the paper. The reason is that R and Stata use slightly different calculation methods. Hence, the (adjusted) $R^2$ is also higher than in the paper and the t-statistics and p-values are not exactly the same.

However, it is possible to correct our standard errors to get the same results as Jacobsen and Kotchen. We have to multiply the standard errors by the term $c = \frac{1}{\sqrt{\frac{N-1} {N-J-1}}}$ (according to Burling (2016) and Princeton University), where $N$ is the number of observations and $J$ the number of clusters -- which is in our case the number of residences. The authors of the paper used the Stata function xtreg to calculate the fixed effects regression. However, in this problem set we perform the calculations with the R function felm(). These two functions don't produce the same standard errors. However, we know that the two estimation functions felm() and areg, which is a Stata function, calculate the same standard errors (compare Princeton University). According to Fiona Burling (2016) we only have to multiply the areg standard errors by the correction term c to get the same results as with the Stata function xtreg.

More details about the Stata functions xtreg and areg you can find in Baum (2006), pp. 221-224. Additionally, on pages 136 to 139 you can find the formulas for robust and clustered variance-covariances of the regression coefficient estimators.

In the following code chunk we will correct the standard errors of regression reg4_1. With the command reg4_1$cse we get the clustered standard errors of our regression. We are only interested in the standard errors of the coefficients for CDD, HDD, CodeCDD and CodeHDD. Therefore, we choose the first four entries. Then, we multiply our standard errors with the correction term c.

This task can be solved optionally.

Press check to perform the calculations.

# correction of standard errors
N = 64471 # number of observations
J = 2239 # number of residences

# correction term: 
c = 1/(sqrt((N-1)/(N-J-1)))

# multiply standard errors by c
correct_se = c * reg4_1$cse[1:4]
correct_se

>

Now, we introduce a further regression model in this exercise to be able to make comparisons and to test if the results are robust between the two specifications.

Robustness check

In this new regression model we only estimate coefficients for the interacted weather variables. We additionally use month-year dummies (monthyear) instead of separate dummy variables for the billing months and years. We also need dummy variables for the different residences to be able to perform a fixed effects regression. Hence, the regression model -- which we denote by model 5 -- looks like $$Y_{it} = \delta \cdot \mathit{code\_change}_i \times [CDD_t, HDD_t] + v_t + \epsilon_{it}.$$ For the estimation we also need clustered standard errors.

Press check to perform the regression. If you haven't solved the previous task you have to press edit first.

#< task
# run the regression
reg4_2 = felm(elec ~ CodeCDD + CodeHDD | monthyear + home_id | 0 | home_id , data=dat)

# show the results
summary(reg4_2)  
#>

The results of reg4_2 are similar to those of reg4_1. The corresponding estimates have the same signs and stand in a similar relation to each other. However, the coefficient estimates of model 5 are larger in magnitude, meaning that the differences between pre- and post-code residences appear greater than in model 4.
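
If you want to inspect both electricity specifications side by side -- analogous to the comparison of the natural gas models in part b) below -- you could place them in one table with screenreg() from the texreg package. This is only an optional sketch and assumes that reg4_1 and reg4_2 are still in memory.

# optional: show the two electricity models (model 4 and model 5) in one table
library(texreg)  # already loaded in this problem set
screenreg(list(reg4_1, reg4_2),
          custom.model.names=c("Electricity (model 4)", "Electricity (model 5)"),
          digits=3, stars = c(0.01, 0.05, 0.10))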

b) Natural gas consumption

In this part of the exercise we will determine the differences in how the natural gas consumption of pre- and post-code residences responds to changes in weather. Hence, we will use the variable gas as the response variable in our regression models.

Before we start with the regression analysis, we will first state our expectations regarding the results.

< quiz "expectations regarding natural gas consumption"

question: Complete the following sentence. Regarding natural gas consumption we expect that...
sc:
- ... post-code residences are less responsive to increases in ACDD (average cooling degree days).
- ... post-code residences are less responsive to increases in AHDD (average heating degree days).*

success: Great, your answer is correct!
failure: Try again.

>


Regression analysis

Task: Determine the difference in how the natural gas consumption of pre- and post-code residences responds to changes in weather according to model 4. Use the function felm() to estimate the coefficients. We are only interested in the coefficients of the variables CDD, HDD, CodeCDD and CodeHDD. Hence, project out the other variables. Don't forget to use fixed effects estimation and clustered standard errors at the residence level.

Replace the ??? with correct commands and uncomment the code. Then, press check. If you need some advice press hint.

#< task
# run the regression
# reg4_3 = ???
#>
reg4_3 = felm(gas ~ CDD + HDD + CodeCDD + CodeHDD | month + year + home_id | 0 | 
                home_id , data=dat) 
#< hint
display("Your code should have the following structure: reg4_3 = felm(??? ~ ??? | month + year + home_id | 0 | home_id , data=dat). Replace the ??? by the correct response and explanatory variables.")
#>

< award "Regression master - fixed effects regression"

Great! You successfully performed a fixed effects regression on your own.

>

Now, we will show the regression results of reg4_3. The R code is already given. Just press check.

#< task
# show regression results
summary(reg4_3)
#>

Here you see the results of the regression. In the following paragraph we will interpret them.

Interpretation of regression results

All four coefficient estimates are highly significant. First, we consider the uninteracted weather variables CDD and HDD. The estimates have different signs: the coefficient of CDD is roughly $-0.27$ and the one of HDD is approximately $2.33$. This indicates that the natural gas consumption of pre-code residences decreases with rising ACDD and, conversely, increases with rising AHDD. This is in line with our expectations because natural gas is mainly used for space and water heating in residences. You can find suitable data in U.S. EIA (2009), Table HC1.10.

Second, we consider the interacted variables CodeCDD and CodeHDD. The two coefficient estimates are both negative. The estimate of CodeHDD is approximately $-1.37$, which shows that the natural gas consumption of post-code residences is less responsive to increases in AHDD. Comparing pre-code residences with residences built after the code change, the responsiveness of gas consumption to AHDD differs by roughly $58.8\%$ (calculation: $58.8\% = \frac{1.37}{2.33} \cdot 100\%$). On the other hand, the estimate of CodeCDD is roughly $-0.17$. It indicates that for an increase of ACDD by one unit the monthly natural gas consumption of post-code residences is about $0.17$ therms lower than that of pre-code residences. One might assume that this latter estimate is problematic, but it is not, since the effect of AHDD predominates because natural gas is primarily used for heating (according to paper, p. 44).
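
The figure of roughly $58.8\%$ can also be recovered directly from the fitted model. The following lines are a small sketch, assuming that reg4_3 is still available in your workspace.

# optional sketch: relative responsiveness of post-code residences to AHDD
beta_gas = coef(summary(reg4_3))[, 1]          # coefficient estimates as a named vector
beta_gas["CodeHDD"] / beta_gas["HDD"] * 100    # roughly -58.8, i.e. about 58.8% less responsive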

Hence, the results suggest that the heating systems of post-code residences are more efficient and therefore the natural gas consumption of these residences is lower. Additionally, we can conclude that the results are consistent with our expectations.

We also test the models for natural gas consumption for robustness.

Robustness check

Hence, we will perform another regression analysis for natural gas consumption -- based on model 5. This regression should be denoted by reg4_4.

Furthermore, we will show the output of this regression together with the results of reg4_3 in order to compare both specifications. For this purpose we will use the function screenreg().

The code is already given. Press check to see the solution.

#< task
# robustness check
reg4_4 = felm(gas ~ CodeCDD + CodeHDD | monthyear + home_id | 0 | home_id , data=dat)

# show the regression results of reg4_3 and reg4_4 in one table
screenreg(list(reg4_3, reg4_4), 
          custom.model.names=c("Natural gas (model 4)", "Natural gas (model 5)"),
          omit.coef="(factor)", 
          digits=3, stars = c(0.01, 0.05, 0.10))
#>

Now, we can compare the results of both regressions. Answer the question of the following quiz.

< quiz "robustness check - natural gas, fixed effects regression"

question: Select the correct statement.
sc:
- The two regression models are not robust.
- The coefficient estimates of both specifications are very similar. Hence, the results are robust.*

success: Great, your answer is correct!
failure: Try again.

>

The following and last part is a summary of the regression analyses of this exercise.

c) Comparison of regression results of part a) and b)

We will briefly compare our expectations with the regression results and conclude whether they match. First, we take a look at the results of part a).

expectations with respect to electricity | results | conclusion
-----------------------------------------|---------------|-------------------------------------
CodeCDD: post-code residences are less responsive to increases in ACDD | -2.498 kWh per month | results and expectations are consistent
CodeHDD: smaller responsiveness of post-code buildings to increases in AHDD | +2.45 kWh per month | not consistent, but heating degree days are rare in Florida


Second, we consider the results of part b).

expectations with respect to natural gas | results | conclusion
-----------------------------------------|--------------|-------------------------------------
CodeHDD: smaller responsiveness of post-code buildings to increases in AHDD | -1.37 therms per month | results and expectations are consistent
CodeCDD: post-code residences are less responsive to increases in ACDD | -0.17 therms per month | not consistent, but the effect of ACDD is dominated by the effect of AHDD


Overall, we can conclude that the central air-conditioning and the heating systems of post-code residences are much more efficient than those of pre-code residences.

This exercise refers to pages 39 and 42 to 44 and to table 4 of the paper.

Exercise 6 -- Conclusion

This last exercise presents some final information. First, you will get a few details about a cost-benefit analysis of the code change. Second, you will see a summary of the results and, last, some restrictions will be mentioned.

Costs and benefits

In this section we will compare the costs and benefits of the code change.

Generally, a cost-benefit analysis is "a systematic categorization of impacts as benefits (pros) and costs (cons), valuing in dollars (assigning weights)" (Boardman et al. (1996), p. 2). Furthermore, a comparison of these weights helps to make a decision (according to Zerbe and Dively (1994), p. 2). Now, let's define the costs and benefits of the code change. On the one hand, the tightening of the building code leads to "increased compliance costs" (paper, p. 45) and on the other hand, to "lower expenditures on utility bills and avoided social costs of air pollution emissions" (paper, p. 45).

In the following info box you can find a more detailed description and a calculation of the costs and benefits of the code change. However, if you just want to see the results, take a look at the table below.

You can find further information about the steps of a cost-benefit analysis in Boardman et al. (1996), pp. 6-19 and in Zerbe and Dively (1994), pp. 2-3.

< info "Calculation of costs and benefits"

First, we estimate the higher compliance costs. The code change in the year 2002 provides a "reduction in the solar heat gain coefficient (SHGC) on windows from $0.61$ to $0.4$" (paper, p. 45). Hence, builders can comply with the new guidelines of the building code by putting in windows with "low-emissivity (low-E) coating" (paper, p. 46). However, these windows are more expensive than regular windows -- typically by $10\%$ to $15\%$. Thus, the construction costs go up by \$$675$ to \$$1012$ (according to paper, p. 46).

Second, we estimate the savings on utility bills. The authors assumed \$$0.146$ per kWh and \$$1.22$ per therm as average marginal prices for electricity and natural gas, respectively (according to paper, p. 46). Furthermore, we recall the results of exercise 4.1: we estimated monthly electricity savings of roughly $48$ kWh and monthly natural gas savings of approximately $1.5$ therms. Thus, the average annual savings are $576$ kWh and $18$ therms. Hence, we can conclude that an average household of the post-code regime saves a total of roughly \$$106$ per year on its electricity and natural gas utility bills. To be more precise, these savings of about \$$106$ consist of \$$84.10$ per year on the electricity bill and \$$21.96$ per year on the natural gas bill.
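
The following lines reproduce this back-of-the-envelope calculation. The monthly savings are the rounded estimates from exercise 4.1 and the marginal prices are the values assumed in the paper.

# assumed average marginal prices (paper, p. 46)
price_elec = 0.146   # dollars per kWh
price_gas  = 1.22    # dollars per therm

# estimated monthly savings of post-code residences (exercise 4.1, rounded)
saving_elec_month = 48    # kWh
saving_gas_month  = 1.5   # therms

# annual savings on the utility bills in dollars
saving_elec_year = 12 * saving_elec_month * price_elec   # about 84.10 dollars
saving_gas_year  = 12 * saving_gas_month  * price_gas    # about 21.96 dollars
saving_elec_year + saving_gas_year                       # about 106 dollars in total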

Third, we estimate the social benefits of reduced emissions. The authors used a "standard benefits-transfer approach for four categories of emissions" (paper, p. 46) in order to estimate these benefits. However, we will only present the results: the avoided emissions due to lower consumption of electricity and natural gas cause social benefits which roughly range between \$$14$ and \$$85$ per residence and year. The reduction in electricity consumption accounts for more than $90\%$ of the total avoided damages. If you would like more information about the calculations take a look at pages 46 and 47 and at table 5 of the paper.

>

The table below summarises the results of the cost-benefit analysis. Additionally it shows the payback period, which "is defined as the time required for a project's total benefits to exceed its total costs" (Zerbe and Dively (1994), p. 194). You can find more details in Zerbe and Dively (1994), pp. 194-198.

The assumptions of the best-case scenario are: a $10\%$ premium for low-emissivity windows (i.e. an increase of \$$675$ compared to ordinary windows) and a discount rate of $0$ (paper, p. 46).

costs | benefits | payback period under best-case scenario
---------------------- | ---------------------- | ------------------------
higher construction costs because of windows with low-emissivity coating: increase of \$$675$ to \$$1012$ | savings on utility bills of about \$$106$ per year (for both electricity and natural gas) | private payback period: roughly $6.4$ years
~ | social benefits of reduced air-pollution emissions range between \$$14$ and \$$85$ per residence and year (for both electricity and natural gas) | total or social payback period: $3.5$ years (without carbon dioxide benefits: $5.3$ years)
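
As a rough plausibility check of the payback periods in the table, the following sketch divides the additional construction costs by the annual benefits under the best-case assumptions. The social benefit of \$$85$ per year is the upper bound of the estimated range and is used here purely for illustration; the exact calculation of the paper may differ slightly.

# best-case scenario: 10% window premium and a discount rate of zero
extra_cost     = 675    # dollars, additional construction costs
saving_bills   = 106    # dollars per year, lower utility bills
social_benefit = 85     # dollars per year, upper bound of the estimated range (assumption)

# private payback period: construction costs versus bill savings only
extra_cost / saving_bills                      # roughly 6.4 years

# total (social) payback period: bill savings plus avoided emission damages
extra_cost / (saving_bills + social_benefit)   # roughly 3.5 years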

You can find details about possible reasons for the exclusion of carbon dioxide benefits in the social payback period on page 47 of the paper.

This section refers to pages 44 to 47 and to table 5 of the paper.

Summary

The aim of the paper "ARE BUILDING CODES EFFECTIVE AT SAVING ENERGY? EVIDENCE FROM RESIDENTIAL BILLING DATA IN FLORIDA" -- and thus also of this problem set -- was to determine the effects of the building code change on the actual energy consumption of Gainesville residences. The evaluation was based on monthly billing data for electricity and natural gas consumption for the years after the code change. We used different methods to compare the pre- and post-code residences -- that is, residences built within the three years before and the three years after the code change, respectively.

The structure of this problem set was the following: in exercise 1 we got familiar with the data, especially by calculating summary statistics and by visualising the variables of our data set with different plots. In exercise 2 we learned more details about the code change and performed initial comparisons of pre- and post-code residences, among others by using two sample t-tests. We considered not only the energy consumption but also several housing characteristics. In exercise 3 we studied regression theory and introduced a simple regression model whose results corresponded to the t-statistics of exercise 2.2. In exercise 4 we continued the regression analysis and compared the energy consumption at either the annual or the monthly level. In order to obtain reliable results we assumed, first, that we could appropriately control for the housing characteristics we observe and, second, that there is no unobserved housing characteristic which differs between the code regimes and significantly influences residential energy consumption. In exercise 5 we used difference-in-differences estimation and fixed effects regression to determine the effects of the code change for hot and cold days. The advantage of the latter method was that we could control for omitted variables which vary between residences but not over time. We implicitly used the regression models of exercise 5.2 to check the robustness of the monthly estimates from the regression models of exercise 4.3. Last, we compared the costs and benefits of the code change.

Now, let us summarise the most important findings. We obtained the main result in exercise 4.1. The regression analysis showed the following outcome: the code change led to a decrease in the annual energy consumption. To be more precise, it caused a reduction of roughly $4\%$ in the annual electricity consumption and a decrease of approximately $6\%$ in the annual natural gas consumption.

The following task can be solved optionally.

If you would like to see the exact results for the decreases in energy consumption, take a look at the following code chunk. First, click on the button edit and then press check to see the results. We copied the coefficient estimates delta_elec and delta_gas from exercise 4.1.

#< task
# coefficient estimate of the variable code_change, regressed on electricity consumption
delta_elec = -47.6
# average electricity consumption
mean_elec = 1146.4 

# result in percentage differences:
(delta_elec / mean_elec) * 100 # in %


# coefficient estimate of the variable code_change, regressed on natural gas consumption
delta_gas = -1.5
# average natural gas consumption
mean_gas = 23.6 

# result in percentage differences:
(delta_gas / mean_gas) * 100 # in %
#>

Furthermore, the regression analyses which determined the monthly effects of the code change (see exercise 4.3) showed that the reduction in residential electricity and natural gas consumption was caused by more efficient central air-conditioning systems and improved heating systems, respectively. Moreover, the fixed effects regression analyses in exercise 5.2 supported our findings of exercise 4.3 and indicated that the electricity and natural gas consumption of post-code residences is less responsive to increases in ACDD and AHDD, respectively.

Hence, we can answer the initial question of whether the building energy code actually reduces the energy consumption of residences in Gainesville, Florida. The answer is yes, at least for the building code change of the year $2002$.

Restrictions

In this last section we think about possible restrictions in order to be sure that there is no other explanation for our findings.

A possible alternative explanation could be "a shift from natural gas to electric heating, caused for reasons independent of the code change" (paper, p. 47). In order to examine this possibility, data on the individual heating type of each residence would be needed. However, such data is not available. But even if this explanation were true, a shift in the heating type would not change the main result -- a decrease in overall energy consumption caused by the code change.

Furthermore, a limitation of the paper is the generalizability of the results, because we only considered Florida's building codes and data of Gainesville residences. However, based on this study we can deduce some more generally valid facts about the effects of building energy codes (according to paper, p. 48). One reason is that a lot of other U.S. states lie in the same climate region as Gainesville. Thus, we can conclude that the code change in Gainesville is "somewhat representative of how energy codes affect more general regions of the country" (paper, p. 49). You can find more details on pages 48 and 49 of the paper.

However, for more northern states -- with a different climate, different needs for air-conditioning and heating and also different sources of energy -- we cannot deduce the effects of a code change from data of Gainesville residences alone. Therefore, further research should be conducted.

If you would like to see an overview of all the awards you have gained, press check. If you haven't solved the previous task you first have to click on the button edit.

#< task
awards()
#>

< award "Finisher"

Congratulations! You have finished this problem set about the effects of the building code change. I hope you have enjoyed it and also learned about programming in R and about some econometric methods.

>

Exercise 7 -- References

Bibliography

R packages


