Wall Street and the Housing Bubble - An Interactive Analysis with R

Author: Marius Wentz

< ignore

library(RTutor)
library(yaml)
setwd("C:/Users/Marius/OneDrive/Bachelorarbeit/WallStreet/Final")
ps.name = "WallStreet"; sol.file = paste0(ps.name,"_sol.Rmd")
libs = c("foreign", "ggplot2", "stargazer", "dplyr",  "yaml", "scales", "lfe", "lmtest", "reshape2", "sandwich", "broom", "ggmap", "directlabels", "grid", "ggrepel", "gridExtra") # all packages used in the problem set

create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL, libs=libs, extra.code.file="extracode.r", addons = "quiz", var.txt.file="variables.txt")

#opens the set in the default browser
show.ps(ps.name,launch.browser=TRUE, load.sav=FALSE, sample.solution=FALSE,catch.errors = TRUE, is.solved=FALSE, warning=FALSE)

>

I would like to welcome you to this interactive Problem Set based on "Wall Street and the Housing Bubble". The Problem Set, which is part of my bachelor thesis (Ulm University), will guide you through the paper "Wall Street and the Housing Bubble" by Ing-Haw Cheng, Sahil Raina and Wei Xiong, published in the "American Economic Review" in 2014. The paper, alongside the data sets used for the calculations, can be downloaded here: link. It is not necessary to download the provided data to work through this Problem Set, but if you want to perform your own calculations, the data is available via the link above.

To work properly, this Problem Set requires an internet connection!

Exercise 0 - Content

1 Introduction

2 Sampling

3.1 The Strong Full Awareness Hypothesis

3.2 Omitted Variable Bias

3.3 Cluster Robust Standard Errors

4 The Weak Full Awareness Hypothesis

5 Performance

6.1 Income Shocks

6.2 Consumption

7 Financing

8 Conclusion

9 References & Changes on Data sets

About the Exercises

It is not necessary to solve the exercises in their original order, but since later exercises may build on knowledge acquired in earlier ones, it is recommended to solve them in the given order.

Within an exercise, you must solve the code blocks one after another; only the infoboxes and quizzes are optional. You will get an introduction on how to solve those code blocks within the first exercise.

To ease calculation and visualization, several changes have been made to the original data sets. Those changes are documented in Exercise 9.

Exercise 1 - Introduction

The paper "Wall Street and the housing bubble" (Cheng, Raina and Xiong, 2014) deals with the question if Wall Street, or the people working there to be more precise, showed signs of awareness of the bubble and an upcoming crisis in the US real estate market. Answering this question is helpful if we want to understand the reasons that led to the financial crisis, since it is widely acknowledged that the securitization business, especially the originate to distribute model contributed largely to the crisis. This is not only due to enabling excessive credit expansion but also deteriorating credit quality and therefore financial stability as mentioned by Bernanke (2008) or Jobst (2008). Awareness of the bubble would reveal even more serious incentive problems as already thought.

The Process of Securitization

As you can see in Figure 1.1, the process of securitization can be described as a pass-through of assets from the originator to the market. The originator of assets such as housing loans pools them and sells them to an arranger (usually an investment bank), who sets up a special purpose vehicle ("SPV"). The SPV builds principal-bearing securities from those pooled assets and sells them through asset managers or other services to investors. Those securities may be structured into various classes or tranches of different asset quality, which are then rated by several rating agencies and may be insured by insurers. These securities are backed by the underlying assets as collateral, in our case the homes for which the mortgages were issued. (Paligorova, 2009)

Figure 1.1: The process of securitization (Paligorova, 2009)

This process has several advantages for the originating institutions. Assets are taken off their balance sheets after being passed through to the markets, institutions can exploit a different funding channel, and borrowing costs are reduced (Jobst, 2008). At the same time, it can lead to severe incentive problems. If an institution simply originates the securities and passes them through (originate to distribute), it is not necessarily in its interest to originate high-quality securities (ECB, 2008). Deteriorating credit quality was also possible because the rating agencies could not or did not want to evaluate the quality of the assets properly, which could be caused by poor valuation methods or by an incentive problem of the rating agencies, since they were mostly paid by the issuer rather than by the investor (IMF, 2009). Additionally, regulatory institutions could not properly monitor possible misbehavior of the originating and issuing institutions (IMF, 2013). Altogether, these factors led to instability in the financial sector.

Why is it interesting to conduct such an analysis?

After the housing bubble and the financial crisis, a lot of research suggested that distorted beliefs may have had a significant impact on the formation of the bubble. Benabou (2013) found that overoptimism concerning prices might have arisen from wishful thinking of people working in financial services. Additionally, Brunnermeier and Julliard (2008) emphasized the effect of money illusion, implying that decisions whether to buy or rent were based on nominal interest rates, underestimating the effect of future mortgage payments in an environment of low inflation. Shiller (2007) even called the boom a "social epidemic that encourages a view of housing as an important investment opportunity". Even though Smith and Smith (2006) saw homes as a robust investment, they also mentioned "unrealistic expectations" about the housing market. Nevertheless, relatively little research has dealt with the beliefs of people working on Wall Street, which gives rise to the question at hand. If we find that Wall Street managers were aware, there would be an even more severe incentive problem than previously thought, which would make changes in contracts necessary. If they were not aware, even though there were signs that could have led them to recognize the bubble, there must be a problem in the way information is processed and beliefs are formed in financial enterprises. (Cheng et al, 2014)

How will we do this analysis?

Throughout this problem set we will roughly follow the structure of "Wall Street and the Housing Bubble" (Cheng et al, 2014). We have just gotten an overview of the situation that led to the housing bubble. Next, we will look at the history of the bubble, analyzing data from the Case-Shiller Home Price Index.

Afterwards we will analyze differences in housing transactions, purchase behavior and financing terms between three different groups concerning home transactions between 2000 and 2010 (2010 being the last year for which data for the complete year is available). The group of informed agents consists of securitization agents (non-executives working in the field of securitization), including both investors in and issuers of securitized products. The group of S&P 500 equity analysts forms the first control group, excluding those who cover home-building firms. This group is chosen because it is a self-selected group comparable to the members of the securitization group (e.g. concerning career risk or life cycle) that does not have access to the specific information available to the securitization agents, and because it experienced comparable income shocks. The third group, the second control group, consists of lawyers selected such that their location and age match those of the securitization agents. They represent the wealthy part of society that has neither access to specific information nor special financial education and are therefore another suitable control group.

Like Cheng et al. (2014), we will compare the three groups regarding their exposure to the housing market, the performance of their "investment portfolio", their financing terms and their consumption relative to their income by testing the following hypotheses.

Hypotheses to test:

1) Full Awareness I. Market Timing: The securitization agents timed the market by divesting in the boom period.

2) Full Awareness II. Cautious Form: The agents didn't increase the exposure of their portfolio to the housing market.

3) Performance: The portfolios of securitization agents performed significantly better than those of the other groups.

4) Conservative Consumption: The securitization agents used less of their available income for their purchases.

After testing these four hypotheses, we will check whether another factor, namely the interest rates faced by the three groups, could have influenced their investment behavior. Before starting the analysis: what do you think, will the analysis present evidence that the securitization agents showed signs of awareness of the housing bubble?

< quiz "outcome"

question: Do you think the analysis presents evidence that the securitization agents showed signs of awareness?

sc:
- No evidence is found.*
- Evidence is found.
- Strong evidence is found.

success: Nice work, you are right, no evidence was found in the paper!
failure: The answer is incorrect. Please try again.

>

< award "Master of Quizzes lvl. 1"

Congratulations, you solved your first single-choice quiz! This problem set contains further quizzes, including single-choice and multiple-choice questions as well as quizzes where you have to calculate a value and type it in.

Throughout this problem set you will receive awards for solving tasks on your own. To see which awards you have already earned, type in awards() to show all your awards.

>

Overview over the development of the Housing Bubble

Before we look at the data we will try to get a quick overview of the development of the bubble. We will look at the 20-City Composite Case-Shiller Home Price Index, the main real estate index of the USA, published by S&P, which comprises the 20 most important regions in the USA (see http://us.spindices.com/indices/real-estate/sp-corelogic-case-shiller-20-city-composite-home-price-nsa-index). We want to visualize the development of these indices, so we have to load the data into R and plot it. I prepared a data set that contains three regions that are of interest, namely New York, Los Angeles and Chicago, and the 20-City Composite. We will focus on those regions since most of the properties we will analyze are located there. (Cheng et al, 2014a)

Map of the Case-Shiller Home Price Indices Regions

Before plotting the indices of the regions New York, Los Angeles and Chicago and the 20-City Composite, we will take a look at the locations of the regions included in the 20-City Composite, illustrated in Figure 1.2. The code that was used to create this map is shown in a so-called infobox below the map. Infoboxes will be used throughout this problem set to add information about R functions, used code or additional explanations.

Figure 1.2: Illustration of the locations of the Case-Shiller Composite-20 home price index. Created using ggmap which accesses Google Maps.

< info "Map of Case-Shiller Locations"

The map above was created as follows:

# Below are the libraries we need for the plot:
library(ggmap)
library(grid)
library(directlabels)
library(ggrepel)
# The data of the locations has to be loaded (contains city and state name, taken from S&P (link see above), longitude and latitude, a color specification and a marker specification, in our case circle (16) or rectangle (15)). Longitude and Latitude were obtained using geocode() from the ggmap package.
mapdata <- read.table("caseshiller_location_data.txt")

# The command below gets map data of the USA from the google servers.
us_map <- get_map(location = "USA", maptype = "roadmap", source = "google", zoom = 3)

# The commands below create the plot of the map.
ggmap(us_map)+
  # The limits of x and y are changed to obtain the map we need.
  scale_x_continuous(limits = c(-127,-60), expand = c(0, 0)) +
  scale_y_continuous(limits = c(23,50), expand = c(0, 0)) +
  # The x and y axis are omitted.
  theme_void() +
  # The legend is omitted.
  theme(legend.position = "none") +
  # Points contained in mapdata are plotted.
  geom_point(data = mapdata, aes(x = lon, y = lat, color = col), size = 3, pch = mapdata$pch) +
  # Labels are added, it is ensured that they do not overlap.
  geom_label_repel(data = mapdata, aes(x = lon, y = lat, label = city, color = "white"), fill = "white", box.padding = unit(0.2, "lines"), label.padding = unit(0.15, "lines"), max.iter = 1000000) +
  # The color of the labels is changed.
  scale_color_manual(values = c('red', 100,'grey20'))

>

The Case-Shiller Home Price Index from 2000 to 2010

First, we will use the command read.table() to load the caseshiller data set which is stored in .txt format.

< info "read.table()"

The read.table() function reads data stored in table format (in our case a txt file) and creates a data frame consisting of the columns and rows of the files specified by a separator (by default space).

It is used as follows:

data <- read.table("c:/path/dataset.txt")

If you already set the working directory to the folder the data set you want to load is located, you can simply type in:

data <- read.table("dataset.txt")

See: https://www.rdocumentation.org/packages/utils/versions/3.4.1/topics/read.table

>

Task: Load the Case-Shiller data stored in caseshiller.txt using read.table() and store it in cs (cs will then be the name of the object of class data frame). First you have to press edit to be able to type your code, then type your code and press check when finished. If you made a mistake and can't figure out how to fix it, you can get a hint by pressing the hint button. If you can't figure it out despite the hint, you can always press the solution button to jump to the solution immediately. If you want to take a look at the data you are working on, press data after solving the chunk to see the data set. You will be led to the Data Explorer tab and have to click on the tab you were working on after you finished looking at the data set to get back to the exercise.

#< task
# Enter your code below:
#>
cs <- read.table("caseshiller.txt")
#< hint
display("If you can't solve the chunk, take a look at the infobox above again.")
#>

< award "Table Loader"

Congratulations, you successfully loaded your first data set into R!

>

Now, let's look at the loaded data. We will not use the data button, but the function head(), which gives a brief overview by showing the first few rows contained in cs. You may already know the function colnames(). The advantage of head() is that it shows the first entries rather than only the names of the columns, which gives us a nicer overview than colnames().

Task: Output the column names as well as the first entries of cs. You can simply press edit and check since the code is already provided.

#< task_notest
colnames(cs)
head(cs)
#>
#< hint
display("The code is already provided, simply press check!")
#>

As you can see, cs contains the date and the index values for New York, Chicago, Los Angeles and the 20-City Composite, normalized to 100 in January 2000.

< info "Long and wide format"

We will use the cs data set which you already got to know to demonstrate the difference between long and wide data:

First let's look at the wide form of cs:

cs <- read.table("caseshiller.txt")
head(cs)

The wide data set consists of one column indicating the date and four columns containing the index values for the three areas Chicago, Los Angeles and New York and for the Composite-20 Case-Shiller index. This is the typical structure of a wide data set: the levels of one key (in our case the area) are spread across the columns that hold the values (the index values).

Let's look at the long form of cs:

cs2 <- melt(cs, id = "date")
colnames(cs2) <- c("date", "area", "value")
head(cs2)

This data set in the so-called long format has the two key variables date and area and one value column, containing the index at the given date for the given area. Now, we have two separate key variables with only one corresponding value.

We will use data sets in long format as input to ggplot() to create plots. Furthermore, long data creates clarity, since it is intuitive that date and area together represent the key for the value.

(Ejdermy, 2016)

>

Before plotting the data series, they must be converted from wide to long format using the R function melt() from the reshape2 package, which enables us to use the resulting data frame as input for ggplot(). (Wickham, Chapter 7, 2009)

< info "melt"

The reshape2 package offers a nice function called melt(), which converts data from wide to long format, which can then be easily processed by ggplot().

It is used as follows:

library(reshape2)
data2 <- melt(data, id = "id")

Loading the library is only necessary once; once you have loaded it, it stays active until you detach it or terminate the R session.

The inputs needed are data, which represents the data set, and id, which stands for the column that will later be the first column and the x axis in our plot.

See: https://www.rdocumentation.org/packages/reshape2/versions/1.4.2/topics/melt

>

After melting the data set, we will use tail() to show the last entries of the molten data frame. tail() is used analogously to head(); the only input needed is the data set.

Task: Use melt() to bring cs into long format (the id is the date), store the result in the object cs_melted and show the last rows of that data set using tail(). Before using melt(), the package reshape2 must be loaded.

#< task
# Enter your code below:
#>
library(reshape2)
cs_melted <- melt(cs, id = "date")
tail(cs_melted)
#< hint
display("Take a look at the infobox above, the code is quite similar. After the similar code, think about the function used earlier to see the first entries of a data frame.")
#>

< award "Melting Pot"

Congratulations, you melted your first data set and visualized it using tail()!

>

< award "Output lvl. 1"

You visualized your first data set using tail()!

>

After preparing our data, we are ready to plot the Case-Shiller Home Price Indices, which are now stored in cs_melted. To do so, we will use ggplot(), an R function which enables us to create many different, professional-looking plots that can be modified by adding further commands. We will plot the time series for all three areas and the 20-City Composite. If you require more knowledge about ggplot(), take a look at the associated infobox.

< info "ggplot()"

ggplot() is a function that allows us to create neat plots. It is possible to manipulate almost everything, such as the scales, names and labels or the color of the lines.

A simple plot is written as follows:

library(ggplot2)
ggplot(data = data, aes(x = x, y = y, color = color))+
  geom_line()

The input is data, representing your data, and aes, the aesthetics, requiring the x values, the y values and a variable to group by, which is assigned to color. The added geom_line() draws a line through the y values of each group specified in color. Other characteristics like the title, axis labels or the position of the legend can also be changed from the default (by adding those commands with a +). A small overview of functions that manipulate the plot:

labs() - Enables the user to manipulate the title and the axis labels of the plot.
theme_bw() - A nice theme for scientific plots.
scale_x_date() - With that command, the user can manipulate the x scale if it is in date format and set the display format and the breaks.
scale_x_continuous() - Similar to scale_x_date(), can be used if the x values are numeric.
scale_y_continuous() - Similar to scale_x_continuous(), just for the y values.
theme() - Allows the user to manipulate the theme. Useful for positioning the legend and adjustments like rotating the axis labels.
annotate() - With that command, the user can add a shaded region to the plot.
geom_point() - Draws points, specified by x and y.
geom_errorbar() - Draws error bars or confidence intervals.
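To see how these pieces fit together, here is a minimal sketch combining several of the modifiers above (the data frame df is a hypothetical toy example, not data from the paper):

library(ggplot2)
# Hypothetical toy data: two groups observed on five yearly dates
df <- data.frame(date = rep(seq(as.Date("2000-01-01"), by = "year", length.out = 5), 2),
                 value = c(1:5, 2:6),
                 group = rep(c("A", "B"), each = 5))
ggplot(data = df, aes(x = date, y = value, color = group)) +
  geom_line() +
  theme_bw() +                                                    # scientific-looking theme
  labs(title = "Toy Example", x = "Date", y = "Value") +          # title and axis labels
  theme(legend.position = "top", legend.title = element_blank())  # legend on top, no legend title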

If you want further information on how to build nice plots with ggplot2, take a look at Wickham (2009). Some additional features are also explained during the analysis.

>

Task: Load the library ggplot2, create a plot of cs_melted and store it in p1 using ggplot(). Use the command as.Date() on the x values while determining the aesthetics to tell ggplot() to handle those values like dates. Add a line with geom_line() and change the theme to theme_bw(), a theme often used to display scientific data by adding that command with a +. Output p1 afterwards by simply typing p1 in.

#< task
# Enter your command below:
#>
library(ggplot2)
p1 <- ggplot(data = cs_melted, aes(x = as.Date(date), y = value, color = variable))+
  geom_line()+
  theme_bw()
p1
#< hint
#library(ggplot2)
#p1 <- ggplot(data = cs_melted, aes(x = as.Date(date), y = value, color = variable))+
#  geom_line()+
#  theme_bw()
#p1
#>

< award "ggplotter lvl. 1"

Congratulations, your first plot using the function ggplot()!

>

Figure 1.3: Development of the Case-Shiller Home Price Indices Composite-20 and New York, Los Angeles and Chicago. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

Figure 1.3 is sufficient if you only want a quick glance at some data series, but it isn't professional yet. There is no title, the labels of the x and y axes are not precise, and the x axis does not show enough labels. We will start with the plot we did above and add the other features step by step. as.Date() tells ggplot() to handle the x values as if they were dates.

Task: Add a title and the axis labels using labs(). To add the title we set the parameter title to the name we want to assign to it; the same procedure is used for both the x and y axes. We store the result in the object p2 and plot it afterwards. The code is provided, simply press check.

#< task_notest
p2 <- p1 + 
  labs(title = "Case-Shiller Home Price Indices", x = "Year", y = "Normalized Index Value")
p2
#>
#< hint
display("The code is already provided, simply press check!")
#>

Figure 1.4: Development of the Case-Shiller Home Price Indices Composite-20, New York, Los Angeles and Chicago. Differs from Figure 1.3 in that the names of the x and y axes and the title are changed. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

To enhance clarity, we modify the plot such that every year is displayed.

Task: Modify the x axis in a way that every year is displayed, using scale_x_date, which configures default scales for date/time classes. Store the result in p3 and plot it again afterwards. To be able to use the helpers date_format() and date_breaks() inside scale_x_date, we must load the package scales beforehand. The code is already provided, so simply press check.

#< task_notest
library(scales)
p3 <- p2 + 
  scale_x_date(labels = date_format("%b %Y"), breaks = date_breaks("1 year"))
p3
#>
#< hint
display("The code is already provided, simply press check!")
#>

Figure 1.5: Development of the Case-Shiller Home Price Indices Composite-20, New York, Los Angeles and Chicago. Differs from Figure 1.4 in that the x axis is modified so that every year is shown. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

The problem with Figure 1.5 is obviously the position of the x axis labels; there is simply not enough space. We can solve that issue by rotating the labels by 45 degrees. Additionally, we can gain more space for the plot by putting the legend on top and remove redundant information by removing the legend title.

Task: Rotate the x axis labels by 45 degrees, put the legend on top and remove the title of the legend using theme(). Since you don't have to write any code here, simply press check.

#< task_notest
p4 <- p3 + 
  theme(axis.text.x = element_text(angle=45, hjust = 1), legend.position = "top", legend.title = element_blank())
p4
#>
#< hint
display("The code is already provided, simply press check!")
#>

Figure 1.6: Development of the Case-Shiller Home Price Indices Composite-20, New York, Los Angeles and Chicago. Differs from Figure 1.5 in that the legend is on top, the legend title is removed and the x axis labels are rotated by 45 degrees. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

Figure 1.6 is a nice plot with a fitting title, axis titles, axis labels and a legend position that allows the plot itself to be larger.

Why is this plot helpful for our analysis?

As you can see in Figure 1.6, all four indices climbed moderately from 2000 to 2003, rose steeply until mid-2006 and plunged after that. This leads to the three periods the authors focus on: the pre-boom period (2000 to 2003), the boom period (2004 to 2006) and the bust period (2007 to 2010). This is important for the analysis since we will not only analyze differences between the three groups, but also differences over time within the securitization group. Especially the boom period will be of interest for us.

We will add a shaded region to the plot to highlight the boom period.

Task: Add a shaded region to the plot using annotate(). Press check to do so.

#< task_notest
p5 <- p4 +
  annotate("rect", xmin = as.Date("2004-01-01", "%Y-%m-%d"), xmax = as.Date("2007-01-01", "%Y-%m-%d"), ymin = 75, ymax = 300, alpha = .2, fill = "red")
p5
#>
#< hint
display("The code is already provided, simply press check!")
#>

Figure 1.7: Development of the Case-Shiller Home Price Indices Composite-20, New York, Los Angeles and Chicago. Differs from Figure 1.6 in that the boom period is shaded. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

This exercise refers to the pages 2797 - 2805 of "Wall Street and the Housing Bubble".

Exercise 2 - Sampling

After getting a first overview of the situation that gives rise to our analysis, let's take a look at how the samples of the three groups were constructed before turning our focus towards the four hypotheses.

Cheng et al. (2014) sampled three groups (one group of informed agents and two control groups) consisting of 400 people each (and therefore 1200 overall) following some sampling rules.

The groups

The group of informed agents, the securitization agents, is sampled from the attendee list of the 2006 American Securitization Forum, attended by 1,760 people working for all kinds of service providers in the securitization business, including the most important international investment banks like Deutsche Bank or UBS, US investment banks like Lehman Brothers or Merrill Lynch, large commercial banks like Wells Fargo and monoline insurance companies like AIG (Cheng, Raina and Xiong, 2014a). The authors' strategy was to randomly sample enough people so that they ended up with 400 after excluding some for the reasons dealt with in the paragraph Sampling Rules. People working at large institutions and at institutions particularly associated with the crisis were over-sampled.

The first control group, the equity analysts, consists of analysts randomly chosen from those covering S&P 500 companies during 2006, excluding analysts covering home-building companies. This group was chosen since it is a self-selected group comparable to the informed agents concerning wealth and career risk, while not having the specific information about the housing market that a securitization agent might have. Furthermore, they might show similar income patterns since they work for comparable companies. Here, again, enough people were selected so that after excluding people according to the rules we will deal with later, 400 equity analysts are sampled from a total of 2,978 analysts.

The second control group, the lawyers, was sampled randomly from the Martindale-Hubbell Law Directory, a national lawyer directory, matched on age and location to the securitization agents, to cover the wealthy part of society without access to housing market information. Real estate lawyers were excluded. A total of 406 were sampled to obtain the desired 400.

All in all, we end up with 1,200 people to be analyzed. The authors collected the data using the LexisNexis Public Records database. LexisNexis provides public records such as property records, addresses, vehicle titles, business records, social media and employment information and thus enables the user to track down individuals. (https://www.lexisnexis.com/en-us/products/public-records.page) Additionally, they used data from the Home Mortgage Disclosure Act and LinkedIn. If you are interested in the way the data was collected or want to collect the data yourself, take a look at Cheng et al. (2014a).

The randomly selected people are stored in the data set person_dataset.dta with keyident, age, age category, group information and a column named Mreason containing information about the exclusion rules.

< info "read.dta()"

The function read.dta() from the foreign package is designed to load data stored in the Stata binary format .dta.

It is used as follows:

# If we use a package we did not use before we must load it first
library(foreign)
# Then we read the dataset and store it in "data"
data <- read.dta("c:/path/dataset.dta")

If you have set the current working directory to the location where the data set that should be loaded is stored, you can simply type in the name of the file instead of using the whole path:

library(foreign)
data <- read.dta("dataset.dta")

After performing that operation, the data set is stored as a data frame in data.

See: https://www.rdocumentation.org/packages/foreign/versions/0.8-69/topics/read.dta

>

Task: Load the data set person_dataset.dta using read.dta() from the foreign library and store it in pers. To do so replace the question marks and remove the leading # in the code below.

#< task
# li???(for???)
# pe?? <- rea?.???("perso?_????.d??")
#>
library(foreign)
pers <- read.dta("person_dataset.dta")
#< hint
display("Remove the leading # and replace the question marks. If you don't know how to use the function take a look at the infobox above!")
#>

< award "Stata Data Loader"

Congratulations, you loaded your first data set stored in the Stata binary format!

>

Task: Show the first rows of pers.

#< task
# Enter your code below:
#>
head(pers)
#< hint
display("Think about Exercise 1, how did we do it there?")
#>

< award "Output lvl. 2"

Congratulations, you used head() for the first time!

>

Sampling Rules

Why are some people not included in the final sample?

To answer that question, let's take a look at the variable Mreason, which specifies the reason why someone is excluded. Be aware that not every entry is necessarily a reason to exclude a person. We will extract the unique entries of that variable using unique() from the base package (it's not necessary to load that package since it is already loaded by default).
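Since unique() has no infobox of its own, here is a minimal sketch of what it does (using a hypothetical toy vector, not the actual data):

# unique() returns every distinct value of a vector exactly once, in order of first appearance
unique(c("International", "", "Deceased", "International"))
# returns: "International" "" "Deceased"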

< info "column reference in R"

If you want to extract a column of e.g. a data frame, this can be done either by referring to the name...

dataframe$column

... or the number of the column.

dataframe[,number_of_column]

I recommend using name reference since it enhances the clarity of your code, not only for you but also for others who would like to understand it.

>

Task: Extract the unique entries of Mreason from pers by referring to the name of the column.

#< task
# Enter your code below:
#>
unique(pers$Mreason)
#< hint
display("If you don't know how to refer to a specific column of a data frame, look at the infobox above!")
#>

< quiz "inclusion"

question: Which entries do you expect to be a reason to exclude a person? You must check six answers!
mc:
- International*
- Not in housing*
- Multiple by same idents*
- No properties
- Not found*
- C[E/F/O]O*
- Deceased*
success: Nice work, you were able to find the relevant variables!
failure: The answer is incorrect. Please try again.

>

< award "Master of Quizzes lvl. 2"

You solved your first multiple choice quiz, congratulations!

>

What's the reasoning of Cheng et al. (2014) behind this exclusion?

People with the entry international do not live in the US. Those not working in housing are excluded because they don't have sensitive information about the housing business. Those who are multiple by same ident could not be clearly assigned. The people that were not found are obviously useless for any analysis, while those with the entry C[E/F/O]O are not mid-level but top-level managers. The people with the entry deceased are also excluded since they don't provide data for the whole period.

Why does it make sense to focus on mid-level and exclude top-level managers?

As discussed by the authors on pages 2801 and 2802 of the paper, mid-level managers are very familiar with securitized products since they are the ones selling and buying them and are therefore more closely tied to those securities. Thus, they can identify problems that might exist in this sector much earlier than any C-level executive.

After applying those rules, we end up with 400 securitization agents, 400 equity analysts and 400 lawyers to be analyzed.

This exercise refers to the pages 2801 - 2807 of "Wall Street and the Housing Bubble".

Exercise 3.1 - The Strong Full Awareness Hypothesis

Now we shift our focus towards the analysis of possible differences in home transactions between securitization agents and their controls during the boom period.

We will start with the first hypothesis: that the securitization agents timed the market by divesting during the boom period. We will therefore try to answer the question of whether securitization agents tried to ride the bubble by divesting more during the boom (2004 to 2006) compared both to their own pre-boom numbers and to their controls. Why is this called the strong full awareness hypothesis? As mentioned by Cheng et al. (2014), this is due to the costs associated with selling a home and the possible problems of timing the market properly. If, for these reasons, the agents merely avoided increasing the wealth invested in real estate rather than divesting, we end up with testing the weak form of full awareness, which we will do in Exercise 4.

Before doing so, let's take another look at the plot from Exercise 1.

Task: Plot the Case-Shiller Home Price Indices. The code below performs the same steps as the ones we did in Exercise 1, press check to plot the indices.

#< task_notest
# First we load the caseshiller dataset
cs <- read.table("caseshiller.txt")

# Then we convert it to long format
cs_melted <- melt(cs, id = "date")

# And we finally plot the indices like in Exercise 1
ggplot(data = cs_melted, aes(x = as.Date(date), y = value, color = variable)) +
  geom_line() +
  theme_bw() +
  labs(title = "Case-Shiller Home Price Indices", x = "Year", y = "Normalized Index Value") +
  scale_x_date(labels = date_format("%b %Y"), breaks = date_breaks("1 year")) +
  theme(axis.text.x = element_text(angle=45, hjust = 1), legend.position = "top", legend.title = element_blank()) +
  annotate("rect", xmin = as.Date("2004-01-01", "%Y-%m-%d"), xmax = as.Date("2007-01-01", "%Y-%m-%d"), ymin = 75, ymax = 300, alpha = .2, fill = "red")
#>

Figure 1.7: Development of the Case-Shiller Home Price Indices Composite-20, New York, Los Angeles and Chicago. Based on Figure 1 - Home Price Indices from "Wall Street and the Housing Bubble" (p. 2801).

As you can see in Figure 1.7, prices were high during the boom period (shaded red), so, given awareness of the bubble and the ability to time the market, it would have been a good decision to sell the properties, especially from 2005 until the end of 2006. Even after 2006 there was a time frame in which it was possible to sell at a high price (approximately until September 2007), so why did the authors not set the time frame symmetrically around the peak? The paper explains why (see page 2821) and presents evidence in online appendix B (Table B12) that the divestment behavior of the securitization group after 2006 is mainly driven by those who lost their jobs after 2006.

3.1.1 Divestiture Intensity

We will answer the question raised in the paragraph above by analyzing the so-called divestiture intensity which is defined as follows:

$$\textrm{Divestiture Intensity}_t =\frac {\textrm{Divestitures}_t} { \textrm{People Eligible for Divestiture}_t}\ .$$

(Cheng et al, 2014)

The divestiture intensity is simply all divestitures (selling a home without buying another one) divided by the number of people eligible for divestiture, namely all people who currently own a home. This formula has a disadvantage, though. As mentioned by Cheng et al. (2014), a person who buys a home in January and divests in November will be excluded from our considerations due to the way eligibility is defined. But why should this intensity represent a good measure of full awareness?

As mentioned by Cheng et al. (2014), the reason is that the agents had maximum incentive to avoid losses in their housing portfolio since it usually represents a significant share of their wealth. If they had been aware that the bubble existed, they would have tried to decrease the share of wealth exposed to the housing market to minimize the potential losses faced.
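To illustrate the formula with hypothetical numbers (not the actual data): if in a given year 300 members of a group already own a home and 12 of them sell without buying another one, the divestiture intensity is 12/300 = 0.04. In R this corresponds to taking the mean of a 0/1 divestiture indicator among the homeowners:

# Toy example: 300 homeowners, 12 of whom divest during the year
divestiture <- c(rep(1, 12), rep(0, 288))  # 1 = divested, 0 = did not divest
mean(divestiture)                          # 12 / 300 = 0.04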

< quiz "Divestiture Intensity"

question: A person who assumes awareness expects that the divestiture intensity of our informed agents during the boom is...
sc:
- lower compared to the control groups
- higher compared to the control groups*
- equal compared to the control groups

success: Congratulations, you were able to find the right answer!
failure: You checked the wrong answer, try again!

>

< award "Master of Quizzes lvl. 3"

Congratulations, you understood which effect we are looking for!

>

3.1.2 Divestiture Intensities from 2000 to 2010

First, we will look at the raw (no controls included) divestiture intensities by simply computing the intensities for every year and every group and plotting them.

Task: Load the data set personyear_trans_panel.dta and store it in trans. This data set is modified in such a way that the levels of group are in a different order, which was done to ease the visualization of the regressions later; the column names are changed as well. Simply press check to do so.

#< task_notest
trans <- read.dta("personyear_trans_panel.dta")
#>
#< hint
display("The code is already provided, simply press check!")
#>

Task: Visualize the first rows of trans using head(). Press check to do so.

#< task_notest
head(trans)
#>
#< hint
display("The code is already provided, simply press check!")
#>

< info "personyear_trans_panel.dta"

As you can see, the data set contains 12 variables. To get a description of the variables, move your mouse cursor over the column headers of the table above. The data set is a transaction panel in which the number of transactions of every person in every year is stored, alongside personal information.

Cheng et al. (2014) used the LexisNexis Public Records database to collect the transaction history. LexisNexis provides public records for, inter alia, properties, employment information and vehicle titles.

See: https://www.lexisnexis.com/en-us/products/public-records.page

>

If we want to extract specific rows defined by characteristics of different variables, the dplyr package provides the function filter(), which extracts rows from an existing data frame. If you haven't used this function yet, or if you are not sure how to use it anymore, take a look at the infobox below for additional information, including some examples.

< info "filter()"

filter(), a function from the dplyr package, can be used to create a data frame from an existing one that contains only entries with the desired characteristics. The input is a data table and logical expressions referring to the columns of this data table. It is possible to connect several logical expressions using the & operator.

To extract the transaction information from 2005, we do as follows:

# First, we load the package
library(dplyr)
# Now we can use filter()
filter(trans, Year == 2005)

To extract and save the transaction information from 2005, with the precondition that the information should also be from securitization agents only, we type in:

data <- filter(trans, Year == 2005 & group == "Securitization")

"Securitization" has to be put in quotation marks because the column group contains strings and strings must be referred to as a string.

filter() works not only with the logical operator ==, but also with other logical operators like !=, >= or >, and the expressions can be connected not only with & but also with | (or) or xor() (with the syntax xor(expression1, expression2)).
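For example (a small sketch, not part of the original analysis), the boom years could be extracted with | and a group dropped with !=:

# Keep the boom years 2004, 2005 and 2006 ...
boom <- filter(trans, Year == 2004 | Year == 2005 | Year == 2006)
# ... and drop the securitization agents from that subset
boom_controls <- filter(boom, group != "Securitization")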

See: https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/filter

>

Now we need a data frame that contains only those entries where the person is eligible for divestiture, namely those where homeowner is equal to 1 (since only people already owning a home are eligible for divestiture). This allows us to take the mean of the divestiture variable of the resulting data frame as the divestiture intensity, since everyone in this sample is eligible for divestiture.

Task: Construct a data frame that contains all entries from trans and has the properties that the variable homeowner is equal to 1 and the variable Year is not equal to 2011 using the filter() function. Store it in trans again. To be able to use filter(), load the dplyr package first. To do so, delete the leading # and replace the question marks with your own code.

#< task
#library(????)
#trans <- filter(???, ??? == 1 & ! ??? == 2011)
#>
library(dplyr)
trans <- filter(trans, homeowner == 1 & !Year == 2011)
#< hint
display("If you don't know how to use the filter() function, take a look at the infobox above.")
#>

< award "Extractor"

Well done, you are now able to filter data sets for specific characteristics.

>

Now that we have the required data set for our plot, we want to compute the divestiture intensities for every year and every group. To do so, the two R functions group_by() and summarize() will be helpful. We will first group the set trans by the two variables Year and group, then summarize to get the divestiture intensity for every year and group, and plot it.

< info "group_by()"

The R function group_by() from dplyr turns an existing table into a grouped table, enabling the user to manipulate the data set by group.

It is used as follows:

group_by(dataset, variable1, variable2, ...)

Where dataset is, as the name implies, the existing data table, and variable1, variable2, ... are the variables we group by.

See: https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/group_by

>

< info "summarize()"

The R function summarize() from dplyr can perform operations like calculation of mean or median. If applied on grouped objects, those statistics will be computed by group.

It is used as follows:

summarize(dataset, mean = mean(variable), median = median(variable))

See: https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/summarize

>

Task: Group the trans data set by group and Year and summarize the mean of divestitures, save it in div and output it. The code is already provided, simply press check.

#< task_notest
div <- group_by(trans, group, Year)%>%
  summarize(mean = mean(divestitures))

div
#>
#< hint
display("The code is already provided, simply press check!")
#>

You are probably wondering why the %>% operator was used and what it does. The so-called pipe operator chains several operations together, so that one is not forced to first store the result of group_by() in an object and then pass that object on to summarize(). Here, summarize() simply takes the object created by group_by() and applies its own operation to this table; the outcome is then saved in div. Thanks to the pipe operator %>%, it was not necessary to pass the data set to summarize() explicitly. (Wickham, 2017)
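To make the difference concrete, the two variants below (a sketch using the objects from above) produce the same result:

# Without the pipe: store the grouped table first, then pass it on explicitly
grouped <- group_by(trans, group, Year)
div <- summarize(grouped, mean = mean(divestitures))

# With the pipe: the result of group_by() is handed directly to summarize()
div <- group_by(trans, group, Year) %>%
  summarize(mean = mean(divestitures))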

Task: Plot the divestiture intensities contained in div using ggplot(). You can plot all three groups at once, since we used group_by() earlier. Input the data and the aes (x values (Year) and y values (mean); the x values have to be converted to numeric using as.numeric() since they are stored as characters to be easier to process by our regression function lm() later), assign the groups to color and draw a line. Set the theme to theme_bw(). Change the title to "Divestiture Intensity", the x axis label to "Year" and the y axis label to "Divestiture Intensity". Rotate the x axis labels by 45 degrees, put the legend on top and remove the legend title. We don't use a date axis but a normal continuous one here, since we only have years and not complete dates, so we don't use scale_x_date but scale_x_continuous, where we fill in the argument breaks, a vector of the years contained in the data (in our case 2000 to 2010). Finally, add the shaded region from 2003.5 to 2006.5 (so that the values of 2004, 2005 and 2006 lie within the rectangle).

#< task
# Remove the leading # and replace the question marks:
#ggplot(data = ???, aes(x = as.????(???), y = ????, color = ????)) +
#  geom_line() +
#  theme_bw() +
#  labs(title = "???", x = "???", y = "????") +
#  theme(axis.text.x = element_text(angle = ??, hjust = 1), legend.position = "top", legend.title = element_blank()) +
#  scale_x_continuous(breaks = c(????:????))+
#  annotate("rect", xmin = ????, xmax = ????, ymin = 0, ymax = 0.07, alpha = .2, fill = "red")
#>
ggplot(data = div, aes(x = as.numeric(Year), y = mean, color = group)) +
  geom_line() +
  theme_bw() +
  labs(title = "Divestiture Intensity", x = "Year", y = "Divestiture Intensity") +
  theme(axis.text.x=element_text(angle=45,hjust = 1),legend.position = "top", legend.title=element_blank()) +
  scale_x_continuous(breaks = c(2000:2010))+
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = 0, ymax = 0.07, alpha = .2, fill = "red")
#< hint
display("If you can't figure out how to do the plot take a look at the information in the task or at the plot at the beginning of the exercise again!")
#>

Figure 3.1: Raw Divestiture Intensities from 2000 to 2010. Replication of Figure 2 - Transaction Intensities, Panel A from "Wall Street and the Housing Bubble" (p. 2812).

< award "ggplotter lvl. 2"

Congratulations, your first advanced ggplot!

>

What is this plot telling us?

According to Figure 3.1, the divestiture intensities of the securitization agents were lower than those of the equity analysts from 2000 to 2006 and higher afterwards. Compared to the lawyers, the intensities were higher from 2000 to 2004, lower in 2005 and again higher after 2006.

Can those findings imply awareness?

If the informed agents had been aware, we would have expected their divestiture intensities from 2004 to 2006 to be much higher than those of their controls and higher than in the other years. This is obviously not the case, since the divestiture intensity of the securitization agents reached its all-time low in 2005. But the difference could be due to factors other than membership in one of those groups. To rule out that other factors are responsible for this outcome, we will have to run a regression controlling for them.

3.1.3 Regression analysis

We will run a regression using the following linear model:

$$\textrm{E}[\textrm{Divestitures}_{i,t}\,|\,\textrm{HO}_{i,t-1} = 1] = \alpha_t+\beta_t \times Securitization_i + \sum_{j=1}^7\delta_j Age_j(i,t)+ \lambda MultiHO_{i,t-1}\ .$$

(Cheng et al, 2014)
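Translated into R, the full model could look roughly like the sketch below (a hedged sketch only: it assumes the column names divestitures, Year, group, age_cat and multi_homeowner of the trans panel and anticipates controls that we will add step by step in the following regressions):

# Year effects (alpha_t), year-specific group effects (beta_t), age-category dummies (delta_j)
# and a multi-homeownership dummy (lambda); lm() turns the character columns into factors
lm(divestitures ~ Year + Year:group + age_cat + multi_homeowner, data = trans)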

Before we do those regressions, you might have a look at the infobox below. It provides basic knowledge about multiple linear regressions and ordinary least squares.

< info "Multiple Linear Regression using OLS"

Multiple Linear Regression

A multiple linear regression assumes the dependent variable to have a linear relationship with the independent variables so that a model can be constructed that is written down in the following form:

$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + ... +\beta_k x_{k,i} + u_i \ .$$

which can also be expressed in matrix form as:

$$ Y = X \beta + U \ .$$

In this model, $\beta_0$ is the constant, $\beta_1$ to $\beta_k$ are the effects of the independent variables on the dependent variables $x_{1,i}$, $x_{2,i}$, ... ($X$ in matrix form) and $u_i$ is the error term ($U$ is the vector of the error terms).

This model relies on several assumptions:

A. Assumptions
+ A1: Every relevant independent variable is included in the formula and no independent variable in the formula is irrelevant.
+ A2: There is a linear relationship between the independent and dependent variable.
+ A3: $\beta_0$, $\beta_1$, $\beta_2$, ..., $\beta_k$ are constant for all observations.

B. Assumptions
+ B1: $\mathbb{E}\left[u_i\right] = 0 \quad \forall i\ $ - The error term has expected value of 0 for all observations.
+ B2: $\mathrm{Var}[u_i] = \sigma^2 \quad \forall i\ $ - Constant variance of the error term throughout the sample.
+ B3: $\mathrm{Cov}[u_i, u_j] = 0 \quad \forall i,j \quad \textrm{where} \quad i \neq j\ $ - Uncorrelated error terms.
+ B4: $u_i \sim N(\mu, \sigma^2) \quad \forall i \ $ - Normal distribution of the error terms.

C. Assumptions
+ C1: The exogenous variables are not random variables.
+ C2: The exogenous variables are not a linear combination of each other for all observations.

(von Auer, 2016)

Ordinary Least Squares

The method of ordinary least squares (OLS) is one of the most important and most used ones in empiric literature. It determines the values of $\beta_0$, $\beta_1$, $\beta_2$, ... in such a way that the sum of all squared residuals is minimized.

$$\min \sum_{i}(y_i - \hat{y}_i)^2 \quad \hat{=} \quad \min \sum_{i}\hat{u}_i^2\ .$$

Squaring the residuals gives larger residuals more weight in the minimization.

For models that use ordinary least squares to estimate the coefficients $\beta$ (which is a vector containing $\beta_0$, $\beta_1$, ..., $\beta_k$) it is well known, as mentioned in von Auer (2016) or Stock and Watson (2007), that the coefficients and their variances can be calculated as in the following paragraphs.

The estimate can be computed by calculating:

$$\hat {\beta} = (X'X)^{-1} X'y \ .$$

For the variance of $\hat{\beta}$ holds:

$$Var(\hat{\beta}) = (X'X)^{-1} X' E[uu'] X (X'X)^{-1}\ ,$$

leading to:

$$\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}\qquad \textrm{where} \qquad \hat{\sigma}^2 = (1/(N-k-1)) \sum_{i=1}^N{\hat u_i^2}\ .$$

Under the assumptions A1 to C2 except B4, the OLS estimator is the so-called BLUE (Best Linear Unbiased Estimator); if B4 is also fulfilled, the OLS estimator is even the BUE (Best Unbiased Estimator).

See von Auer (2016)
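To see these formulas at work, here is a small numerical sketch (simulated toy data, not data from the paper) that computes $\hat{\beta}$ by hand and compares it with lm():

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)         # true coefficients: 1, 2, -0.5

X <- cbind(1, x1, x2)                          # design matrix including the constant
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^(-1) X'y
beta_hat
coef(lm(y ~ x1 + x2))                          # lm() delivers the same estimates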

>

Since we already loaded the data set containing the transactions and filtered it so that only homeowners are in the set, we can skip this part and continue with additional data manipulation. For the first regression, we only need the securitization agents and the equity analysts for whom an age category is available.

Task: Use filter() to subset the data frame trans and store it in dta so that only securitization agents and equity analysts are included whose age_cat is available (meaning not equal to "NA").

#< task
# Enter your code below:
# dta <- filter(????, ???? %in% c("Securitization","Equity Analysts"), ???? != "NA")
#>

dta <- filter(trans, group %in% c("Securitization","Equity Analysts"), age_cat != "NA")
#< hint
display('If you don\'t know how to use filter(), take a look at the infobox above!')
#>

< info "lm()"

The R function lm() from the stats package (you don't have to load that one since it is already loaded by default) allows you to fit linear models such as a linear regression.

It is used as follows:

regression <- lm(formula, data, weights)

Here, formula stands for the regression formula you want to use. If you simply want to regress one variable against another, it would be $y \sim x$; if controls are used, it would be something like $y \sim x1 + x2 + ...$. If you want separate coefficients for each year because you are particularly interested in some years, use a formula analogous to $y \sim x : year + year$. data specifies the data set and weights is an optional argument which enables you to specify weights and perform a weighted least squares regression. If the user does not specify weights, lm() uses OLS.

See: https://www.rdocumentation.org/packages/stats/versions/3.4.1/topics/lm and https://www.rdocumentation.org/packages/stats/versions/3.4.1/topics/formula

>

Now we will run the first regression. We won't apply the whole formula stated above immediately, but rather start by simply regressing the variable divestitures against group. Additionally, we will take the years into account, because it is not very helpful to lump all years together: we would like to take a closer look at the boom period from 2004 to 2006.

If we need the effect in a given year, in our case the effect of membership in a group in a given year, we have to use Year:group. Since we additionally want to control for the years, we also have to add + Year as a control. The years, the age category and the groups are stored as characters in the data set I created. This is necessary because we don't have "normal" numeric input values as in other regressions, where we have something like population numbers as input and GDP as output; instead, we have so-called categorical variables like group membership, an age category or multi-homeownership, which have to be encoded somehow to make them usable in R. This is done by converting the variable type to a factor. When a variable is stored as character, lm() automatically converts it to a factor when running the regression, allowing us to skip the step of converting it ourselves. (See: https://www.rdocumentation.org/packages/stats/versions/3.4.1/topics/formula)
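If you want to see what lm() does with such character columns internally, you can inspect the design matrix it builds (a short sketch; model.matrix() shows the 0/1 dummy encoding of the resulting factor levels):

# The character column group is converted to a factor and encoded as 0/1 dummy columns
head(model.matrix(~ group, data = dta))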

Task: Run the regression of divestitures against Year:group including Year as a control using lm() and store it in r1. Use the object dta to do so.

#< task
# Enter your command below:
#>
r1 <- lm(divestitures ~ Year:group + Year, data = dta)
#< hint
display("If you can't figure out how to solve the task, take a look at the infobox below!")
#>

< award "Regressor lvl. 1"

Well done, your first regression with year effects!

>

You might be familiar with functions like summary() for outputting regression results, but unlike many of those functions, stargazer() enables us to extract exactly the coefficients and statistics that are of interest to us. Furthermore, it is possible to add information about controls.

< info "stargazer()"

We will use the eponymous function stargazer() from the stargazer package to output regression or analysis results side by side by producing LaTeX code, HTML code or ASCII text.

It is used as follows:

# First, we load the stargazer package
library(stargazer)
# After loading the package we can use stargazer
stargazer(regression1, regression2, ..., type = "type", keep = c("variable1", "variable2"), report = "vct*", add.lines = list(c("Expression","Yes/No")), column.labels = c("namec1","namec2"))

Here, regression stands for the regression you would like to output and type for the output type, by default latex, in our case html. keep is a vector stating those independent variables for which the $\beta$ should be shown, and report can be a combination of v (variable names), c (coefficients), s (standard errors/confidence intervals), t (t-statistics), p (p-values) and * (the letter followed by the asterisk is the one next to which the significance level will be reported). add.lines enables the user to add a line at the bottom to report e.g. effects controlled for but not included in the output. column.labels assigns names to the columns (if we have more than one regression to show).

To learn more about stargazer and its possibilities, look at http://www.jakeruss.com/cheatsheets/stargazer/.

>

Task: Load the library stargazer and use the function stargazer() to output the regression results of r1. Ignore keep and report, only input the regression and set the type to "html". Your computer may need a moment to process the command.

#< task
# Enter your code below
#>
library(stargazer)
stargazer(r1, type = "html")
#< hint
display("If you can't figure out how to use stargzer, the infobox above provides an example!")
#>

Table 3.1: Regression adjusted differences of Divestiture Intensities of securitization agents and equity analysts (without controls). Based on Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

< award "Stargazer lvl. 1"

Congratulations, you used stargazer() for the first time in this problem set!

>

As you can see, this command produces Table 3.1, which is quite large and shows entries for the effects of the years, information we are not interested in. We can use keep to include only those entries that are of particular interest and add the information that year effects were applied using add.lines.

Task: Output r1 using stargazer so that only the cross of year and group are reported (use group as input for the keep vector). Add the information that year effects were applied.

#< task
# Remove the leading # and replace the question marks below
# stargazer(????, type = "html", keep = c("????"), add.lines = list(c("Year Effects?", "Yes")))
#>
stargazer(r1, type = "html", keep = c("group"), add.lines = list(c("Year Effects?", "Yes")))
#< hint
display("If you are not familiar with the required arguments i recommend you to take a look at the infobox of stargazer!")
#>

Table 3.2: Regression adjusted differences of Divestiture Intensities of securitization agents and equity analysts (without controls). Differs from Table 3.1 in that the output is limited to the effects we are interested in. Based on Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

< award "Stargazer lvl. 2"

Congratulations, your first output of a regression with an added line and specified parts of the regression information using stargazer().

>

The terms of interest in Table 3.2 are those entries named Year(20XX):groupSecuritization, the $\beta_t$. They show that the securitization agents had lower divestiture intensities than the equity analysts from 2000 to 2006 and higher ones afterwards. That means that we cannot derive evidence from this regression that the securitization agents showed awareness of the bubble. However, there might be other factors that could influence the outcome of the regression.

What could those other factors be?

To identify the other factors which might influence the regression outcome, let's take a look at the trans data set once again. To do so we will use the function sample_n() which is described below.

< info "sample_n()"

The dplyr package contains the function sample_n(), which allows us to get random rows from a table. It can be used to display rows like head() or tail(), with the advantage that the rows are randomly picked. This provides more variety in the displayed variables if the data set is sorted in some way.

It is used as follows:

sample_n(dataset, size = size)

dataset specifies the data set you want to display and size the size of the sample.

See: https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/sample

>

Task: Use sample_n() to select and show ten random rows from trans.

#< task
# Enter your command below:
#>
sample_n(trans, size = 10)
#< hint
display("Take a look at the infobox above again!")
#>

< award "Output lvl. 3"

Well done, you showed ten random rows using sample_n()!

>

< quiz "regression variables"

question: Check all variables that you think will serve as controls later. You can get a description of the variables by moving your mouse cursor over the variable names in the output above.
mc:
- keyident
- added_houses
- houses_bought
- age_cat
- multi_homeowner
- prop_NYC
- prop_SoCal
- company_id

success: Nice work, you were able to find all relevant variables!
failure: Not all answers correct. Please try again.

>

< award "Master of Quizzes lvl. 4"

You successfully selected the controlling variables for our regression.

>

Why should the age distribution and the multi homeownership be considered in our analysis at all?

First, people of different ages show different patterns of risk aversion; older people tend to be more risk averse than younger ones (Albert and Duffy, 2012). Second, as stated by Cheng et al. (2014), there could be life cycle and career selection risk. To control for those possible effects, it is necessary to include age in our regression. Being a multi homeowner could influence divestiture behavior in the following way: a person who owns more than one home might be more likely to sell one of those homes, while a person owning only one home might be slower to sell it, since it most probably is the house she is living in rather than a speculative object. Additionally, transaction costs are lower for selling a second home than for selling the home the person is living in, since in the latter case the person would have to move and find another place to live (p. 2803).

If we fail to include all the relevant independent variables, our analysis is biased; this effect is called omitted variable bias (Stock and Watson, 2007).

This exercise refers to the pages 2802, 2803, 2808 - 2816 of "Wall Street and the Housing Bubble".

Exercise 3.2 - Omitted Variable Bias

What is omitted variable bias and when does it occur?

Excluding a variable from the regression leads to an omitted variable bias if the excluded variable fulfills two conditions. As mentioned by Stock and Watson (2007), the variable must

1) be correlated with the included independent variable and
2) be a determinant of the dependent variable y.

Leaving a relevant variable out violates assumption A1 (no relevant variable is left out of the regression), which was postulated in the infobox Multiple Linear Regression using OLS (in exercise 3.1.3). Thus, the estimator that OLS yields is biased.

How can we explain the existence of omitted variable bias mathematically?

If we assume that we have a linear regression model with one independent variable $X_1$, which follows the formula:

$$(1) \qquad y_i = \beta_0^*+\beta_1^* \times X_{1,i} + u_i^* \qquad \textrm{leading to the estimated regression formula} \qquad \hat y_i = \hat \beta_0^* + \hat \beta_1^* \times X_{1,i} + \hat u_i^*\ ,$$

where $\hat\beta_1^*$ is estimated as follows:

$$(2) \qquad \hat \beta_1^* = \frac {cov(X_1,\hat y)} {var(X_1)} = \frac {cov(X_1, \hat \beta_0^* + \hat \beta_1^* X_1 + \hat u^*)}{var(X_1)}\ .$$

If the real relationship includes another independent variable $X_2$ and not only $X_1$, the regression formula from $(1)$ changes to $(3)$:

$$(3) \qquad y_i = \beta_0+\beta_1 \times X_{1,i} + \beta_2 \times X_{2,i}+ u_i \qquad \textrm{leading to the estimated regression formula} \qquad \hat y_i = \hat \beta_0 + \hat \beta_1 \times X_{1,i} +\hat \beta_2 \times X_{2,i} + \hat u_i\ .$$

If we calculate $\hat \beta_1^*$ like in formula $(2)$ even though we would have to calculate $\beta_1$, that fits formula $(3)$, we end up with:

$$(4) \qquad \hat \beta_1^* = \frac {cov(X_1,\hat \beta_0 + \hat \beta_1 \times X_{1} + \hat \beta_2 \times X_{2}+ \hat u)} {var(X_1)}\ .$$

After performing some arithmetical operations, we end up with:

$$(5) \qquad \hat \beta_1^* = \beta_1 + \beta_2 \frac{cov(X_1,X_2)}{var(X_1)}\ .$$

This means that our estimate of $\beta_1$, namely $\hat \beta_1^*$, is biased by $\beta_2 \frac{cov(X_1,X_2)}{var(X_1)}$ due to omitting the variable $X_2$.

Taking a closer look also explains why the omitted variable must be correlated with the included one ($cov(X_1,X_2)\neq0$) and be a determinant of y ($\beta_2\neq0$). If one of those conditions is violated, no omitted variable bias occurs, since the part of $(5)$ responsible for the bias would be $0$.

Von Auer (2016) and Williams (2015).
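
< info "omitted variable bias - a small simulation"

To see formula (5) at work, the following small simulation (not part of the original problem set; all variable names and parameter values are purely illustrative) generates two correlated regressors and compares the short regression, which omits $X_2$, with the long regression:

set.seed(1)
n  <- 10000
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)               # x1 and x2 are correlated
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true coefficients: beta_1 = 2, beta_2 = 3
# Short regression omitting x2: the coefficient of x1 is biased
coef(lm(y ~ x1))["x1"]
# Value predicted by formula (5): beta_1 + beta_2 * cov(x1, x2) / var(x1), roughly 3.2
2 + 3 * cov(x1, x2) / var(x1)
# Long regression including x2 recovers beta_1 = 2 (up to sampling noise)
coef(lm(y ~ x1 + x2))["x1"]

>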

Calculation of the omitted variable bias for multi homeownership in 2006

To start, we have to load the data set personyear_trans_panel.dta. Then we will extract all transaction information from securitization agents and equity analysts who were homeowners in 2006 and for whom an age category is available. Note that the year 2006 was chosen arbitrarily; any other year would serve as well to illustrate the effect of omitted variable bias.

Task: Load personyear_trans_panel.dta and store it in trans. Then extract the transaction information of securitization agents and equity analysts having age information for 2006 from trans where homeowner is equal to one and store it in ovb.

#< task
# Enter your code below:
#>
trans <- read.dta("personyear_trans_panel.dta")
#< hint
display("Load the dataset and use filter to create the desired data frame!")
#>
#< task
#>
ovb <- filter(trans, Year==2006 & group %in% c("Securitization","Equity Analysts") & age_cat != "NA", homeowner == 1)
#< hint
display('Your command should look like that: ovb <- filter(trans, Year==2006 & group %in% c("Securitization","Equity Analysts") & age_cat != "NA", homeowner == 1)!')
#>

To calculate the estimated omitted variable bias, we need the regression without the relevant variable $X_2$, the regression including the relevant variable $X_2$, $Var(X_1)$, and $Cov(X_1, X_2)$.

In order to be able to compute those statistics, a dummy variable for the group was introduced while changing the data set personyear_trans_panel.dta. This is necessary because R cannot compute $Var(X_1)$ and $Cov(X_1, X_2)$ otherwise: $X_1$ is the membership to the group, which is stored as character, the R data class that stores string values.

< info "dummy variable"

A so-called dummy variable is a variable that takes the value one or zero depending on a condition. It is used to encode characteristics like membership to a group or gender (e.g. treatment group = 1, control group = 0). It allows us to include those factors in the regression.

See Stock and Watson (2007).

>
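
< info "creating a dummy variable in R"

The dummy variable groupd used in the next chunk is already contained in the modified data set. It was presumably created along the following lines (only a sketch, assuming the column names group and groupd from the data above):

# 1 for securitization agents, 0 for the control group
ovb$groupd <- ifelse(ovb$group == "Securitization", 1, 0)

>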

Task: Run two regressions, both of divestitures against group; don't control for multi_homeowner in reg1, but control for it in reg2. Both regressions use the data set ovb. Compute the variance of groupd (the group dummy variable) and the covariance of groupd and multi_homeowner, and output both regressions using stargazer(). Press check to do so.

#< task_notest
reg1 <- lm(divestitures ~ group, data = ovb)

reg2 <- lm(divestitures ~ group + multi_homeowner, data = ovb)

var(ovb$groupd)

cov(ovb$groupd, ovb$multi_homeowner)

stargazer(reg1, reg2, type = "html")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 3.3: Regression adjusted difference in divestitures in 2006, one omitting multi_homeownership (column one), one including multi_homeownership as a control (own illustration).

The estimated omitted variable bias is equal to $\hat\beta_2 \frac{cov(X_1,X_2)}{var(X_1)}$. It is only an estimate because we can use neither the real $\beta_2$ nor the real $cov(X_1,X_2)$ and $var(X_1)$ (we only have the sample covariance and variance, not those of the population), so we have to use their estimates instead. Use the information above and the chunk below (where you can perform your calculations) to calculate the estimated bias. You may skip this part if you don't want to do it.

#< task_notest
# Enter your command below

#>
#< hint
display("You can calculate the omitted variable bias here using the formula above!")
#>

< quiz "Omitted Variable Bias"

question: How big is the estimated omitted variable bias?
answer: -0.006
roundto: 0.001
success: Nice work, you were able to calculate the omitted variable bias!
failure: Take another look at the formulas above and try again. If you can't figure out how to do it, take a look at the infobox below!

>

< award "Master of Quizzes lvl. 5"

Well done, you calculated the omitted variable bias!

>

< info "Calculation of the Omitted Variable Bias"

The calculation of the estimated Omitted Variable Bias for 2006 for omitting multi homeownership looks as follows:

$\hat \beta_2 \frac{cov(X_1,X_2)}{var(X_1)} = 0.096 \times \frac{-0.01564173}{0.249774} = -0.006\ .$

>

Excluding multi homeownership would lead to an estimated bias of -0.006, while the real value is -0.005. So the negative effect of membership to the group is reduced if we take multi homeownership into account.

Figure 3.2: Own illustration of the effect of including multi_homeownership to the regression.

As you can derive from the regression output and as you can see in Figure 3.2, including multi homeownership reduces the direct negative effect (which used to be -0.011, see the left of Figure 3.2), because membership to the securitization group is negatively correlated with multi homeownership. Multi homeownership has a positive effect on divestitures, and therefore being a securitization agent has a negative indirect effect on divestitures through multi homeownership.
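
< info "reproducing the bias in R"

If you skipped the calculation chunk above, the following sketch shows one way the numbers could be reproduced in R (assuming the objects reg1, reg2 and ovb from the previous chunk are still in memory and that Equity Analysts is the reference category, so the group coefficient is labeled groupSecuritization):

# Estimated bias according to the formula above
beta2_hat <- coef(reg2)["multi_homeowner"]
beta2_hat * cov(ovb$groupd, ovb$multi_homeowner) / var(ovb$groupd)
# Equivalently, the bias is the difference between the group coefficients
# of the short and the long regression
coef(reg1)["groupSecuritization"] - coef(reg2)["groupSecuritization"]

>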

Regression Including Multi Homeownership

Now let's run two regressions, one where we control for multi_homeowner, the other where we don't. To do so we must first construct the same data set as in exercise 3.1.3 (the one saved in dta).

Task: Extract the entries from trans that are from securitization agents or equity analysts, ensure that only the years 2000 to 2010 are included, that the persons were homeowners and that an age category exists. Store the result in the object dta. Simply press check to do so.

#< task_notest
dta <- filter(trans, group %in% c("Securitization", "Equity Analysts") & Year != 2011 & homeowner == 1 & age_cat != "NA")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Run the regressions r1, where you don't control for multi_homeowner (like we did earlier), and r2, where you control for multi_homeowner, both using dta. Output r1 and r2 with stargazer() so that only the effect of multi_homeowner and the cross of Year and group are reported. Add the information that year effects were applied (ensure that you write "Year Effects").

#< task
# Remove the leading '#' and replace the question marks below:
# r1 <- lm(??? ~ ???:??? + ???, data = dta)
# r2 <- lm(??? ~ ???:??? + ??? + ???, data = dta)
#>
r1 <- lm(divestitures ~ Year:group + Year, data = dta)
r2 <- lm(divestitures ~ Year:group + Year + multi_homeowner, data = dta)
#< hint
display("The first regression is the same like in Exercise 3.1.3, the second also controls for multi_homeowner. The visalization will be done with stargazer!")
#>
#< task
# Remove the leading '#' and replace the question marks below to output the regressions:
# stargazer(??, ??, type = "html", keep = c("group", "multi_homeowner"), add.lines = list(c("???", "???", "???")))
#>
stargazer(r1, r2, type = "html", keep = c("group", "multi_homeowner"), add.lines = list(c("Year Effects", "Yes", "Yes")))

Table 3.4: Regression adjusted differences of Divestiture Intensities of securitization agents and equity analysts (the first column without controls, the second controlling for multi homeownership). Based on Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

< award "Regressor lvl. 2"

Congratulations, your first regression with controls!

>

In Table 3.4 you can see what changed by including multi_homeownership as a control in our regression. The outcome of this regression is slightly different from the one we did without controlling for multi_homeownership.

$\lambda$ - the effect of multi_homeowner - is significantly positive (at the 1% level), which indicates that people are more likely to divest if they are multi homeowners. This is consistent with what we expected earlier.

$\beta_t$ increased for every year, meaning that we previously overestimated the negative effect of being a member of the securitization group. The intensities are still not significantly larger; between 2004 and 2006 they are even smaller for the securitization group compared to the equity analysts group.

The Age Distribution

Let's turn towards the second control in the regression, the age distribution of the three groups. As stated earlier, to be a relevant factor in the analysis, the age distribution must have a reasonable effect on the divestiture behavior and has to be correlated with the independent variable, the membership to the securitization group. As mentioned earlier, different age patterns lead to differences in risk aversion, and the different groups might face different life cycles and career risks. Before running the regression, we will look at the age distributions of the groups.

Task: Visualize the age distributions of the three groups using the variable age_cat from the person data with ggplot(). We will visualize the distribution in a histogram rather than in a line diagram. Since the code is already provided, you can simply press check to run it.

#< task_notest
# First of all we have to load the dataset person_included.dta which was already manipulated so that it contains only the 400 people per group that were not excluded due to the sampling rules stated in Exercise 2, since this data allows us to observe the age distribution of our sample doing less data manipulation
person <- read.dta("person_included.dta")
# Now we have to filter so that only people with an available age category are in the resulting dataset
per <- filter(person, age_cat != "NA")
# After that step we will compute the relative distribution of age_cat for every group. We take the object per, group_by() it by age_cat and group, summarize() it by counting the rows per combination, group_by() it again by group and create a new column that computes share/sum(share) (the relative frequency in %) for every age category per group.
distribution <- per %>%
  group_by(age_cat, group) %>%
  summarize(share = n()) %>%
  group_by(group)%>%
  mutate(share = round(share/sum(share)*100,2))
# And we finally plot it with a header, axis titles and the legend on top
ggplot(data = distribution, aes(x = age_cat, y = share, fill = group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Age Distribution", x = "Age Category", y = "Relative Frequency in %") +
  theme_bw() +
  theme(legend.position = "top", legend.title = element_blank())
#>

#< hint
display("Simply press check!")
#>

Figure 3.3: Illustration of the age distribution of the three groups. The illustration is based on Table 1 - People, Panel B from "Wall Street and the Housing Bubble" (p. 2806).

As you can see in Figure 3.3, even though the distributions are quite similar, there are differences between the groups when it comes to age. While the securitization agents tend to be the youngest group on average, the lawyers tend to be the oldest and the equity analysts lie in between.

Now we will apply the whole formula stated above and regress divestitures against group controlling for multi_homeowner and age_cat including Year effects.

< info "t-statistic"

The t-statistic is a measure that allows to test if an observed effect is significant.

It is defined as follows:

$$ t = \frac {\textrm{estimator}-\textrm{hypothesized value}}{\textrm{standard error of the estimator}}\ .$$ If we test the hypothesis that an effect is different from 0, it is significantly different if the absolute t-statistic is bigger than a specific critical value (for a significance level of 5% (10%, 1%), this would be 1.96 (1.645, 2.575)).

So - as you can see in the formula - a different way to compute standard errors corresponds with a different t-statistic and therefore a different significance level. Keep that in mind since we will get back to this at the end of this exercise.

See Stock and Watson (2007)

>
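
< info "computing a t-statistic by hand"

As a quick illustration (the numbers below are purely made up and not taken from our regressions), the t-statistic for testing whether an effect differs from zero is simply the estimate divided by its standard error:

estimate <- 0.05                  # illustrative coefficient
std_err  <- 0.02                  # illustrative standard error
t_stat   <- (estimate - 0) / std_err
t_stat                            # 2.5, which is larger than 1.96
abs(t_stat) > qnorm(0.975)        # TRUE: significant at the 5% level

>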

Task: Run the regression r3 (using dta), where you control for both age_cat and multi_homeowner. Output r2 and r3 afterwards using stargazer(). Add the information that year effects and age indicators were applied and report the t-statistics. Simply press check to do so.

#< task_notest
r3 <- lm(divestitures ~ Year:group + Year + age_cat + multi_homeowner, data = dta)

stargazer(r2, r3, type = "html", keep = c("group","multi_homeowner"), add.lines = list(c("Year Effects?", "Yes", "Yes"), c("Age Indicators?", "No", "Yes")), report = "vct*")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 3.5: The regression adjusted differences between securitization agents and equity analysts in divestiture intensity controlling for multi homeownership only (column one) and both multi homeownership and age category (column two). Based on Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

Table 3.5 makes it evident that including age category had minor influences on the $\beta_t$ (recall, $\beta_t$ was the effect of being a member to the securitization group in year t) in each year. We still observe only one significantly positive effect (at the 10% level) in 2005.

Divestiture intensities of Securitization Agents and Lawyers

Until now, our analysis was limited to the regression adjusted differences between securitization agents and equity analysts. However, remember we also have a second control group, the lawyers, so we have to repeat the regression for the securitization agents and the lawyers as well.

We start by filtering the trans data set in such way that only entries of securitization agents and lawyers with information about an age category from 2000 to 2010 are included in the data set.

Task: Press check to create a data set from trans including only securitization agents and lawyers with an age category, where the year is between 2000 and 2010 and the persons are homeowners. It will be stored in dta2.

#< task_notest
dta2 <- filter(trans, group %in% c("Securitization", "Lawyers") & age_cat != "NA" & Year %in% c(2000:2010) & homeowner == 1)
#>
#< hint
display("Simply press check, the code is provided!")
#>

After filtering we can easily run the regression using the same formula as in r3, changing only the underlying data from dta to dta2.

Task: Run the regression of the securitization agents against the lawyers and store it in r4. Control for both age_cat and multi_homeowner and use dta2 as data input.

#< task
# Enter your code below:

#>
r4 <- lm(divestitures ~ Year:group + Year + age_cat + multi_homeowner, data = dta2)

#< hint
display("If you are not sure how to do the regression, take a look at r3, the code is quite similar.")
#>

< award "Complete Control"

Congratulations, you controlled for both relevant variables!

>

Task: Output the two regressions r3 and r4 side by side using stargazer(). Show only the entries for Year:group and multi_homeowner in the table. Report the t-statistics and add the information that year effects are considered. Name the columns of the regression output Equity Analysts and Lawyers. Simply press check, the code is already provided.

#< task_notest
stargazer(r3, r4, type = "html", keep = c("group", "multi_homeowner"), report = "vct*", column.labels = c("Equity Analysts", "Lawyers"), add.lines = list(c("Year Effects", "Yes", "Yes")))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 3.6: The regression adjusted differences between securitization agents and equity analysts (lawyers) in divestiture intensity controlling for age category and multi homeownership. Based on Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

While column one of Table 3.6 keeps the outcome of the regression concerning securitization agents and equity analysts, column two represents the regression of securitization agents and lawyers. In column two we can observe that there is no significantly positive difference between securitization agents and lawyers from 2004 to 2006; the difference is even significantly negative (at the 5% level) in 2005. This leads to the conclusion that the informed agents were rather not aware of the bubble, or at least didn't react by divesting real estate.

This table looks exactly like Table 4 - Divesting Houses from the paper. Yet, if you take a closer look, the t-statistics and the significance levels differ from the ones reported there. This is due to differences in the standard errors, as mentioned in the infobox above. This issue will be handled in the next exercise.

This exercise refers to the pages 2806, 2808 - 2816 of "Wall Street and the Housing Bubble".

Exercise 3.3 - Cluster Robust Standard Errors

Cluster Robust Standard Errors

In Multiple Linear Regressions Using OLS (exercise 3.1.3), we postulated several assumptions. We already dealt with a violation of assumption A1, but what if assumption B2 (the error terms are normally distributed with $\mu$, $\sigma^2$) is violated in such a way that the variance $\sigma^2$ is not constant but depends on the independent variables $X$? In that case, we speak of so-called heteroscedasticity (Von Auer, 2016). If assumption B3 is violated so that $cov(e_i, e_j) \neq 0$, autocorrelation exists (Von Auer, 2016).

According to Zeileis (2008), economic data typically shows at least some autocorrelation and heteroscedasticity. He states that, even if the form of heteroscedasticity is unknown, the OLS estimator still yields useful estimates, but the calculation of the standard errors has to be changed to obtain heteroscedasticity-robust results. If autocorrelation exists, it additionally makes sense to use clustered standard errors (Stock and Watson, 2007). In our case it seems plausible that autocorrelation exists, since we observe the behavior of the same persons in different years.

To this end, heteroscedasticity consistent (HC) covariance matrix estimators have been designed. (Zeileis, 2008)

Recall, in Exercise 3.1 we stated that the variance of the OLS estimator is:

$$ \widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1} \qquad \textrm{where} \qquad \hat{\sigma}^2 = \frac{1}{(N-k-1)}\sum_{i=1}^N{\hat u_i^2}\ .$$

Stata's way to compute heteroscedasticity robust standard errors is to scale the variance matrix with $\frac{n}{n-k-1}$. Additionally, they use the calculated residuals $\hat u_i$, meaning that they use HC1 robust standard errors. (Zeileis, 2004, p. 4 and Stata Manual regress, p. 3)

This leads to the variance being computed as follows:

$$ \widehat{Var}(\hat{\beta}) = (\textrm{X'X})^{-1} * \bigg[ \sum_{i=1}^N(\hat u_i * x_i)' * (\hat u_i * x_i) \bigg] *(\textrm{X'X})^{-1}\ .$$

If there is intra-group correlation, meaning that we have several mutually uncorrelated groups whose values are correlated within each group (in our case, one group would be a person identified by the corresponding keyident), a cluster robust variance matrix is computed:

$$ \widehat{Var}(\hat{\beta}) = (\textrm{X'X})^{-1} * \bigg[ \sum_{j=1}^{n_c}v_j' * v_j \bigg] *(\textrm{X'X})^{-1} \qquad \textrm{where} \qquad v_j = \sum_{i \in \textrm{cluster } j}\hat u_i * x_i\ .$$

See Sribney (StataCorp)
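
< info "robust standard errors with sandwich and lmtest"

The same kind of correction could also be obtained for an ordinary lm() object with the sandwich and lmtest packages. The following is only a sketch (it assumes that an lm object like r3 from Exercise 3.2 and the corresponding data set dta, which contains the column keyident, are still in memory); in this problem set we will instead use felm(), which is introduced below:

library(sandwich)
library(lmtest)
# Coefficient table with HC1 heteroscedasticity robust standard errors
coeftest(r3, vcov = vcovHC(r3, type = "HC1"))
# Coefficient table with standard errors clustered by person
coeftest(r3, vcov = vcovCL(r3, cluster = ~ keyident))

>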

This can be done in R using the felm() function from the lfe package. With felm() it is possible to specify a cluster, which leads to the same outcome as in Stata. (Gaure, 2016)

< info "felm()"

The R function felm(), contained in the lfe package enables users to include clustered standard errors in their OLS regression.

It is used as follows:

# First, we load the lfe package
library(lfe)
# Now we can use felm()
regression <- felm(formula, data)

A simple formula would be something like y ~ x | 0 | 0 | cluster which represents the regression of y against x with no other factors that are projected out (the first 0), no IV specification (the second 0) and standard errors clustered by cluster.

(Gaure, 2016)

>

Before we start with the regressions using cluster robust standard errors, we have to load and modify the data set again, like in the previous exercise.

Task: Press check to load personyear_trans_panel.dta and store it in trans. This chunk will construct the two data sets dta and dta2 from trans, which are identical to the ones we used one exercise earlier.

#< task_notest
trans <- read.dta("personyear_trans_panel.dta")
dta <- filter(trans, Year %in% c(2000:2010) & group %in% c("Securitization","Equity Analysts") & age_cat != "NA", homeowner == 1)
dta2 <- filter(trans, Year %in% c(2000:2010) & group %in% c("Securitization","Lawyers") & age_cat != "NA", homeowner == 1)
#>

Now that we have our two data sets we can continue with performing the regressions with clustered standard errors using felm().

Task: Load lfe and use felm() to do a regression with person clustered standard errors. Run one regression for the securitization agents and the equity analysts and one for the securitization agents and the lawyers, using the data sets dta and dta2 and the formulas from Exercise 3.2. Save them in r3 and r4.

#< task
# Delete the leading '#' and repace the question marks:
#library(???)
#r3 <- felm(????? | 0 | 0 | keyident, data = ???)
#r4 <- felm(????? | 0 | 0 | keyident, data = ???)
#>
library(lfe)
r3 <- felm(divestitures ~ Year:group + Year + age_cat + multi_homeowner | 0 | 0 | keyident, data = dta)
r4 <- felm(divestitures ~ Year:group + Year + age_cat + multi_homeowner | 0 | 0 | keyident, data = dta2)
#< hint
display("Take a look at the infobox above to learn more about felm()!")
#>

< award "Regressor lvl. 3"

Your first regression replicating Stata robust clustered standard errors with felm()!

>

Task: Press check to output r3 and r4 using stargazer().

#< task_notest
stargazer(r3, r4, type = "html", keep = c("group", "multi_homeowner"), report = "vct*", column.labels = c("Equity Analysts", "Lawyers"), add.lines = list(c("Year Effects?", "Yes", "Yes"), c("Age Indicators?", "Yes", "Yes")))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 3.7: Regression adjusted difference in Divestiture Intensities of the securitization agents and the equity analysts (column one) and lawyers (column two) respectively. Replicated Table 4 - Divesting Houses of "Wall Street and the Housing Bubble" (p. 2814).

Findings

Table 3.7 replicates the regression adjusted differences from Table 4 of "Wall Street and the Housing Bubble" with the correct t-statistics reported below the estimates. We still observe insignificantly lower divestiture intensities for the informed agents compared to the equity analysts, and two insignificantly higher intensities and one significantly lower intensity (at the 10% level) compared to the lawyers during the boom. We can conclude that the agents didn't divest significantly more, but rather divested insignificantly less than the equity analysts group. Compared to the lawyers group, they divested insignificantly more in 2004 and 2006 and significantly less (at the 10% level) in 2005.

Confidence Intervals of Regression Adjusted Divestiture Intensities

Now that we have replicated the outcomes of the regression adjusted divestiture intensities, we will look at confidence intervals. This is one approach to visualizing significance. To learn more about confidence intervals, take a look at the corresponding infobox below.

< info "confidence intervals"

Confidence intervals are intervals with center $\widehat\beta$ (the estimate of $\beta$) which have the property that they cover the real value with a probability of $1-a$, where $a$ denotes the so-called significance level (usually set to 5%, but 10% or 1% are also common; the smaller the level, the lower the probability that a significant effect is random rather than systematic). A significance level of $a$ leads to a $1-a$ confidence interval.

Its notation is as follows: $$ [\widehat\beta-k_{a/2}\ ; \ \widehat\beta+k_{a/2}] \qquad \textrm{where} \qquad P(\widehat\beta-k_{a/2}\ \leq \ \beta \ \leq \ \widehat\beta+k_{a/2}) = 1- a\ .$$

$\widehat\beta$ is the estimated effect and the center of the interval; $k_{a/2}$ defines the bounds and therefore the width of the interval and depends on the distribution of $\widehat\beta$ and on the significance level $a$ (in principle arbitrary, but usually set to 5%).

(Von Auer, 2016)

>
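
< info "computing a confidence interval by hand"

As a small illustration (the numbers below are purely made up), a 95% confidence interval based on the normal approximation can be computed from an estimate and its standard error as follows:

beta_hat <- 0.05                      # illustrative estimate
std_err  <- 0.02                      # illustrative standard error
a        <- 0.05                      # significance level
k <- qnorm(1 - a / 2) * std_err       # half-width of the interval
c(lower = beta_hat - k, upper = beta_hat + k)

>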

To be able to plot the estimates alongside their confidence intervals, we must use tidy() to create a tidy regression output that is usable with ggplot().

< info "tidy()"

The function tidy() from the broom package creates tidy outputs from the outputs of R functions like felm(), t.test() or lm(). It creates and returns a data frame which is easily processable by graphic functions like ggplot().

It is used as follows:

library(broom)
tidy(output)

If you require confidence intervals, simply type in:

library(broom)
tidy(output, conf.int = TRUE, conf.level = 0.95)

conf.level specifies the confidence level which is 0.95 by default.

See: https://cran.r-project.org/web/packages/broom/vignettes/broom.html

>

Task: Load broom and use tidy() on r3 to create a tidy regression output with confidence intervals for further use with ggplot(). Store the output in td. Since we only want to plot the confidence intervals and the estimates of the group differences for the given years, we only need the entries from rows 19 to 29. Output the first rows of td afterwards.

#< task
# Enter your code below
#>
library(broom)
#< hint
display("Load broome, use tidy(regression, conf.int = TRUE)[x:y,] and output the first rows of the resulting dataframe!")
#>
#< task
#>
td <- tidy(r3, conf.int = TRUE)[19:29,]

head(td)

< award "tidy"

You successfully converted a regression output to a tidy regression output usable for graphical analysis.

>

As you can see, td contains seven variables: the term, the estimate, the standard error, the t-statistic, the p-value, and the left and right bounds of the confidence interval. Of interest for our visualization are the estimate and the bounds of the confidence interval. Before plotting those figures, we will substitute the term column with the years from 2000 to 2010 to get a clearer plot later.

Task: Substitute the term column of td with a vector from 2000 to 2010. Press check since the code is provided.

#< task_notest
td$term <- c(2000:2010)
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Plot the estimates for the effect of being member of the securitization agents group with their corresponding confidence intervals. Press check to use ggplot() on td and add points with geom_point().

#< task_notest
ggplot(td,aes(term, estimate)) +
    geom_point() +
    theme(legend.title = element_blank()) +
    geom_hline(yintercept =  0, color = "red") +
    scale_x_continuous(breaks = c(2000:2010)) +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = .4, color = "black") +
    labs(x = "Year", y = "Estimated Difference", title = "Estimated Difference of Divestiture Intensities between Securitization Agents and Equity Analysts") +
    annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = -0.1, ymax = 0.15, alpha = .2, fill = "red")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 3.4: Regression adjusted difference in Divestiture Intensities between securitization agents and equity analysts with their corresponding confidence intervals (on the 5% significance level). Based on Table 4 - Divesting Houses from "Wall Street and the Housing Bubble" (p. 2814).

Figure 3.4 illustrates the estimates alongside with their respective 95% confidence intervals. What does that imply?

As you might already know, a 95% confidence interval covers the real $\beta$ with a probability of 95%. We find a significantly positive (or negative) effect if the confidence interval lies completely above (or below) the red line (y = 0). In the boom period, this is not the case for any year, which suggests that there is no positive relationship between divestitures and membership to the securitization group that is significant at the 5% level. (Von Auer, 2016)

Now we perform the exact same steps as before for the other regression; the only change is that the significance level is set to 10%.

Task: Press check to use tidy() to create a tidy regression output for r4 and produce the same plot as before, with the difference that the significance level is set to 10% (which corresponds to conf.level = 0.9).

#< task_notest
# First we create tidy regression output including bounds of the confidence interval 
td2 <- tidy(r4, conf.int = TRUE, conf.level = 0.9)[19:29,]

# We substitute the term column with the years
td2$term <- c(2000:2010)

# And plot the confidence intervals
ggplot(td2,aes(term, estimate)) +
  geom_point() +
  theme(legend.title = element_blank()) +
  geom_hline(yintercept =  0, color = "red") +
  scale_x_continuous(breaks = c(2000:2010)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = .4, color = "black") +
  labs(x = "Year", y = "Estimate Difference", title = "Estimated Difference of Divestiture Intensities between Securitization Agents and Lawyers") +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = -0.1, ymax = 0.15, alpha = .2, fill = "red")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 3.5: Regression adjusted difference in Divestiture Intensities between securitization agents and lawyers with their corresponding confidence intervals (on the 10% significance level). Based on Table 4 - Divesting Houses from "Wall Street and the Housing Bubble" (p. 2814).

As you can see in Figure 3.5, there are significant effects for the years 2005 (negative), 2007 and 2009 (both positive). The advantage of this visualization is its clarity, but recall that in the table above we were able to observe different significance levels, while here we can see only one. As we already know, 2009 is significant at the 5% level, but looking only at this visualization, we would not be able to see that (the reverse also holds: if we set the level to 1%, for example, we would not be able to observe effects that are only significant at the 5% level).

< award "Stamina"

You made it through the longest exercise. Don't worry, the other exercises will be shorter than Exercise 3!

>

This exercise refers to the pages 2813 - 2816 of "Wall Street and the Housing Bubble".

Exercise 4 - The Weak Full Awareness Hypothesis

After rejecting the first hypothesis, we shift towards the second one. This is the weak form, which assumes awareness that manifests itself in such a way that the agents didn't divest, due to their uncertainty concerning the point in time when the bubble would burst and due to the transaction costs associated with moving out of one's home (Cheng et al., 2014). It is assumed that they rather avoided buying additional homes or swapping up during the boom period (2004 - 2006) compared to both the pre-boom period (2000 - 2003) and their control groups. The reason why Cheng et al. (2014) excluded first home purchases is that those purchases are made out of necessity rather than out of calculation.

Before we start with our regression, we will take a look at how the Second Home Purchase and Swap Up Intensity developed from 2000 to 2010. The intensity is defined analogously to the divestiture intensity.

4.1 Second Home Purchase and Swap Up Intensity

As implied by Cheng et al. (2014), the formula for the Second Home Purchase and Swap Up Intensity is as follows:

$$\textrm{Second Home Purchases and Swap Up Intensity}_t =\frac {\textrm{Second Home Purchases or Swap Ups}_t} { \textrm{People eligible for Second Home Purchases and Swap Ups}_t}\ .$$

According to this formula, the Second Home Purchase and Swap Up Intensity consists of all second home purchases (buying a new home given that the person already owns one or more homes) and swap ups (swapping to a more expensive home) divided by the people eligible for second home purchases or swap ups, namely all people that currently own a home. Why should this intensity represent a good measurement for weak full awareness?

As discussed earlier, the agents had maximum incentive to avoid losses in their housing portfolio, since it usually represents a significant share of their wealth. If they were aware that the bubble existed but didn't think they were able to ride it, they may have avoided increasing the share of wealth exposed to the housing market, so as not to put additional money at stake.

4.2 Second Home Purchase and Swap Up Intensities from 2000 to 2010

We will plot that intensity for every year and every group throughout 2000 to 2010. We will reload the data set personyear_trans_panel.dta and perform similar steps as before.

Task: Load the data set personyear_trans_panel.dta and store it in trans by pressing edit and check afterwards.

#< task_notest
trans <- read.dta("personyear_trans_panel.dta")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Construct a data frame that includes only homeowners and the years from 2000 - 2010 and store it in trans. Press check to do so.

#< task_notest
trans <- filter(trans, homeowner == 1 & Year %in% c(2000:2010))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Now there will be a small difference to Exercise 3. In Exercise 3, divestitures was the dependent variable we were interested in, but which variable captures second home purchases and swap ups? Let's look at the trans data set again to identify the variable of interest.

Task: Show ten randomly selected rows of trans by simply pressing check.

#< task_notest
sample_n(trans, size = 10)
#>
#< hint
display("Simply press check, the code is provided!")
#>

As stated earlier, it is possible to get a brief description of a variable if you move your mouse over the header of one column. Move over the headers of the trans data set and try to determine the variable we are interested in.

< quiz "Search Variable"

question: Which variable stands for second home purchases and swap ups?

sc:
- added_houses
- houses_bought*
- age_cat
- multi_homeowner
- prop_NYC
- prop_SoCal
- company_id

success: Nice work, you were able to find the relevant variable!
failure: Not all answers correct. Please try again.

>

< award "Master of Quizzes lvl. 6"

Congratulations, you successfully found the dependent variable for our next regressions!

>

Task: Construct a table that contains the mean of houses_bought contained in trans for every group and every year. To do so, use the functions group_by() and summarize() with the pipe operator %>% and save it in second:

#< task
#second <- group_by(.data, var1, var2) %>%
#  summarize(mean = ???(???))
#>
second <- group_by(trans, group, Year) %>%
  summarize(mean = mean(houses_bought))
#< hint
display("Think about the two variables we want to group by, they are represented by var1 and var2. If you struggle with summaraise, which variable did you check in the quiz above?")
#>

< award "Summarizer"

Congratulations, you successfully applied group_by and summarize() to get descriptive statistics!

>

Task: Press check to produce a plot based on the data set second using ggplot(). This will be done similarly to the plot of the divestiture intensities in exercise 3.1

#< task_notest
ggplot(data = second, aes(x = as.numeric(Year), y = mean, color = group)) +
  geom_line() +
  theme_bw() +
  labs(title = "Second Home Purchase and Swap Up Intensity", x = "Year", y = "Second Home Purchase and Swap Up Intensity") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1), legend.position = "top", legend.title = element_blank()) +
  scale_x_continuous(breaks = c(2000:2010)) +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = 0, ymax = 0.125, alpha = .2, fill = "red")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 4.1: Raw Second Home Purchase and Swap Up Intensities for the three groups. Replication of Figure 2 - Transaction Intensities, Panel B from "Wall Street and the Housing Bubble" (p. 2812).

What do those findings imply?

If we look at the raw intensities in Figure 4.1, we find that the securitization agents' intensities were higher throughout the whole boom period compared to both control groups. In fact, if you remember that the divestiture intensity reached its low in 2005, you will notice that the second home purchase and swap up intensity reaches its all-time high in exactly the same year. But this graph cannot account for other possible explanatory factors, so we will have to focus on regression adjusted differences, which we will do in the following section.

4.3 Regression analysis

We will run a regression using the following linear model:

$$\textrm{E[BuySecondHomeOrSwapUp}_{i,t}\textrm{|HO}_{i,t-1} = 1] = \alpha_t+\beta_t \times Securitization_i + \sum_{j=1}^7\delta_j Age_j(i,t)+ \lambda MultiHO_{i,t-1}\ .$$

The term of interest is $\beta_t$, expressing the difference in buying a second home or swapping up regarding membership to the group of informed agents.

Task: Construct two data sets from trans, one including only the securitization agents and the equity analysts, named set1 and the other one containing only securitization agents and lawyers, named set2. Both sets should contain only rows with persons for which an age category is available.

#< task
#set1 <- filter(???, group %in% c("?????", "Equity Analysts"), ??? != "NA")
#set2 <- filter(???, group %in% c("?????", "Lawyers"), ??? != "NA")
#>
#< hint
display("Remove the leading # and replace the question marks!")
#>
set1 <- filter(trans, group %in% c("Securitization", "Equity Analysts"), age_cat != "NA")
set2 <- filter(trans, group %in% c("Securitization", "Lawyers"), age_cat != "NA")

Now that we have the two data sets for our regressions, we can run them.

Task: Press check to regress houses_bought against Year:group using the controls Year, age_cat and multi_homeowner with both sets using felm().

#< task_notest
r1 <- felm(houses_bought ~ Year:group + Year + age_cat + multi_homeowner | 0 | 0 | keyident, data = set1)
r2 <- felm(houses_bought ~ Year:group + Year + age_cat + multi_homeowner | 0 | 0 | keyident, data = set2)
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Output the two regressions using stargazer(). Report t-statistics and set the type to "html". Add the information that year effects and age indicators were applied. Press check, the code is already provided.

#< task_notest
stargazer(r1, r2, type = "html", keep = c("group", "multi_homeowner"), report = "vct*", column.labels = c("Equity Analysts", "Lawyers"), add.lines = list(c("Year Effects?", "Yes", "Yes"), c("Age Indicators?", "Yes", "Yes")))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 4.1: Regression adjusted difference in Second Home Purchase and Swap Up Intensities of the securitization agents and the equity analysts (column one) and lawyers (column two) respectively. Replicated Table 5 - Buying a Second Home or Swapping Up of "Wall Street and the Housing Bubble" (p. 2815).

Findings

In contrast to the expectation that the agents avoided buying second homes or swapping up, we observe in Table 4.1 that the intensities of both control groups are lower between 2004 and 2006, but the difference is only significant for the comparison between securitization agents and equity analysts in 2005 (at the 1% level). This leads to the conclusion that, instead of decreasing the amount at stake in the real estate market, securitization agents were rather buying and swapping up. This is consistent with the findings from Exercise 3. Before we continue with hypothesis three, let's look at the confidence intervals of that regression once again.

< quiz "Confidence Intervals"

question: Given Table 4.1 above, for which years do you expect to be able to observe a significant effect at the default significance level of 5%?
mc:
- 2002
- 2003
- 2004
- 2005
- 2006
- 2007

success: Well done, you understood the kind of plot we will produce!
failure: Try again, you might need to take a look at the concept of confidence intervals again!

>

< award "Master of Quizzes lvl. 7"

Congratulations, you understand how confidence intervals work!

>

Task: Plot the estimates and confidence intervals of r1. Create a tidy regression output using tidy(), extract the rows from 19 to 29, rename the first column with the years 2000 - 2010 and plot the estimates and the confidence intervals. Simply press check to do so.

#< task_notest
td <- tidy(r1, conf.int = TRUE)[19:29,]

td$term <- c(2000:2010)

ggplot(td,aes(term, estimate)) +
  geom_point() +
  theme(legend.title = element_blank()) +
  geom_hline(yintercept =  0, color = "red") +
  scale_x_continuous(breaks = c(2000:2010)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = .4, color = "black") +
  labs(x = "Year", y = "Estimate Difference", title = "Estimated Difference of Second Home Purchase and Swap Up Intensities between Securitization 
Agents and Equity Analysts") +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = -0.1, ymax = 0.15, alpha = .2, fill = "red")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 4.2: Regression adjusted difference in Second Home Purchase and Swap Up Intensities between securitization agents and equity analysts with their corresponding confidence intervals (on the 5% significance level). Based on Table 5 - Buying a Second Home or Swapping Up from "Wall Street and the Housing Bubble" (p. 2815).

As already suggested in the quiz above, we can observe in Figure 4.2 that two estimates are significant at least at the 5% level. From the table above we even know that they are significant at the 1% level.

Task: Plot the confidence intervals for the regression with the control group lawyers. Create a tidy regression output using tidy() on r2. Then extract the rows from 19 to 29, rename the first column with the years 2000 - 2010 and plot the estimates and the confidence intervals. Press check to do so.

#< task_notest
td2 <- tidy(r2, conf.int = TRUE)[19:29,]

td2$term <- c(2000:2010)

ggplot(td2,aes(term, estimate)) +
  geom_point() + 
  theme(legend.title = element_blank()) + 
  geom_hline(yintercept =  0, color = "red") +
  scale_x_continuous(breaks = c(2000:2010)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = .4, color = "black") +
  labs(x = "Year", y = "Estimate Difference", title = "Estimated Difference of Second Home Purchase and Swap Up Intensity between Securitization 
Agents and Lawyers") +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = -0.1, ymax = 0.15, alpha = .2, fill = "red")
#>

Figure 4.3: Regression adjusted difference in Second Home Purchase and Swap Up Intensities between securitization agents and lawyers with their corresponding confidence intervals (on the 5% significance level). Based on Table 5 - Buying a Second Home or Swapping Up from "Wall Street and the Housing Bubble" (p. 2815).

Figure 4.3 yields exactly the same results as the table above, since the only significant effect that we observe in both illustrations is the one for 2002 (at the 5% level). In this special case, the table doesn't give us a lot more information.

This exercise refers to the pages 2802, 2812 - 2816 of "Wall Street and the Housing Bubble".

Exercise 5 - Performance

After testing and rejecting the first two hypotheses, let's focus on the third one. We will try to answer the question whether the portfolios of the informed agents performed better on average, by comparing portfolios constructed in accordance with the assumptions stated below. Sales and purchase information, the timing of sales and purchases and the location of the properties will be used to determine the value of the properties and of the constructed portfolios every year.

Cheng et al. (2014) made several assumptions to construct the individual portfolios:

The whole data set, which can be created using these assumptions, already exists and is provided by Cheng et al. (2014). It can be downloaded following the link in Exercise 0 of this problem set.

We will include the two different strategies analyzed by Cheng et al. (2014) in our analysis: first, the strategy our groups actually followed by doing their transactions, and second, the buy-and-hold strategy. The buy-and-hold strategy is the strategy where the initial endowment of homes and cash is held from 2000:I to 2010:IV. Based on that, we will look at regression adjusted differences between the total return and the buy-and-hold return (this difference will be called "performance index") and regression adjusted differences in total return for the time between 2006:IV and 2010:IV.

5.1 Performance indices and accumulated return

Let's look at the performance indices of our three groups. As implied by the authors on page 2822, the performance indices for each person $i$ at time $t$ are calculated as:

$$\textrm{Performance Index}_{i,t} = \frac {\textrm{totalvalue}_{i,t}-\textrm{totalvalue}_{i,t_0}} {\textrm{totalvalue}_{i,t_0}} - \frac {\textrm{totalvalue\_buyhold}_{i,t}-\textrm{totalvalue\_buyhold}_{i,t_0}} {\textrm{totalvalue\_buyhold}_{i,t_0}}\ .$$

Verbally, the individual performance index is the accumulated difference in return between the trading strategy and the buy-and-hold strategy. This formula can be simplified using the property that in 2000:I $\textrm{totalvalue}_{i,t_0} = \textrm{totalvalue\_buyhold}_{i,t_0}\ \forall i$ (this holds because the initial portfolio value is simply the initial value of the buy-and-hold portfolio, since no transactions have been made yet in $t_0$):

$$\textrm{Performance Index}_{i,t} =\frac {\textrm{totalvalue}_{i,t} - \textrm{totalvalue\_buyhold}_{i,t}} {\textrm{totalvalue}_{i,t_0}} \ .$$

As mentioned by Cheng et al. (2014) on page 2822 of the paper, the weighted cumulative performance index (for all individuals together) is the sum of all individual performance indices at time t weighted by the initial portfolio values, or mathematically, the weighted arithmetic mean of the individual performance indices:

$$\textrm{Performance Index}_t =\frac {\sum_{i}{(w_{i} \times \textrm{Performance Index}_{i,t})}} {\sum_{i}{w_{i}}}, \quad w_i = \textrm{totalvalue}_{i,t_0}\ .$$

(Dodge, 2008)

The second indicator we are interested in is calculated as the weighted arithmetic mean of the cumulative returns of each individual (from 2006:IV on):

$$ \textrm{Return}_t = \frac {\sum_{i}{(w_{i} \times \textrm{Return}_{i,t})}} {\sum_{i}{w_{i}}}, \quad \textrm{Return}_{i,t} = \frac {\textrm{totalvalue}_{i,t} - \textrm{totalvalue}_{i,\textrm{2006:IV}}} {\textrm{totalvalue}_{i,\textrm{2006:IV}}},\quad w_i = \textrm{totalvalue}_{i,t_0}\ .$$ Remark: The weights are still the values of the initial portfolio in 2000:I.
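
< info "computing the weighted indices"

As a purely illustrative sketch of how such a weighted index could be computed, assume a long-format data frame df with one row per person and quarter and the hypothetical columns group, quarter, totalvalue, totalvalue_buyhold and w0 (the portfolio value in 2000:I); none of these column names are taken from the actual data sets used below, which are structured differently:

library(dplyr)
index_sketch <- df %>%
  mutate(perf_index = (totalvalue - totalvalue_buyhold) / w0) %>%   # individual performance index
  group_by(group, quarter) %>%
  summarize(index = weighted.mean(perf_index, w = w0))              # weighted mean per group and quarter

>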

5.2 Visualization of the Performance Index and Return

Now we will visualize the performance index from 2000:I to 2010:IV and the return from 2006:IV to 2010:IV. We start by loading the data sets performance_index.txt, which we store in index, and performance.dta, which we store in perf. Both are based on indivperformance_wide.dta, the original data set described in the infobox below, from which the performance_index.txt data set was derived.

Task: Load the two data sets performance_index.txt and performance.dta and store them in index and perf. Simply press check since the code is already provided.

#< task_notest
index <- read.table("performance_index.txt")
perf <- read.dta("performance.dta")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Press check to output ten random rows of the two data sets using sample_n().

#< task_notest
sample_n(index, size = 10)
sample_n(perf, size = 10)
#>
#< hint
display("Simply press check, the code is provided!")
#>

< info "indivperformance_wide.dta"

The original data set indivperformance_wide.dta is derived by the authors from the property_data.dta data set by applying the assumptions stated above alongside information about the region in which the properties are located.

It contains the keyident, the number of properties in each quarter, the value of properties, the value of the cash account, the resulting total value, the total value of the buy-and-hold strategy and the share of the wealth invested in housing.

>

As you can see, the index data set contains only four columns: the date, the group, the performance index and the cumulative return from 2006:IV on. This makes it easy and fast to plot, while in the original data set the index levels are only available at the person level (but can be calculated using the formulas above). The data set is already in long format, so no further steps are necessary to plot the index.
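
< info "long vs. wide format"

If the index levels were stored in wide format (one column per group), they would first have to be reshaped before plotting them with a single ggplot() call. A purely illustrative sketch using the reshape2 package (the data frame wide and its values are made up):

library(reshape2)
wide <- data.frame(date = c("2000-03-31", "2000-06-30"),
                   Securitization = c(0.00, 0.01),
                   Lawyers = c(0.00, 0.02))
# melt() stacks the group columns into a 'group' / 'index' pair of columns
long <- melt(wide, id.vars = "date", variable.name = "group", value.name = "index")
long

>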

Task: Press check to plot the indices using ggplot().

#< task_notest
ggplot(data = index, aes(x = as.Date.character(date), y = index, color = group)) +
  geom_line() +
  theme_bw() +
  scale_x_date(labels = date_format("%Y"), breaks = date_breaks("1 year")) +
  labs(title = "Trading Performance Indices", x = "Year", y = "Index Value") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1), legend.position = "top", legend.title = element_blank()) +
  annotate("rect", xmin = as.Date("2004-01-01", "%Y-%m-%d"), xmax = as.Date("2007-01-01", "%Y-%m-%d"), ymin = -0.04,ymax = 0.08, alpha = .2, fill = "red") +
  scale_y_continuous(breaks = seq(-0.04, 0.08, by = 0.02))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 5.1: Performance Indices consisting of the accumulated difference of the returns of trading and buy-and-hold strategy. Replication of Figure 4 - Trading Performance Indices from "Wall Street and the Housing Bubble" (p. 2823).

Task: Press check to plot the cumulative return from 2006 to 2010 using ggplot().

#< task_notest
ggplot(data = index, aes(x = as.Date.character(date), y = return2006, color = group)) +
  geom_line() +
  theme_bw() +
  scale_x_date(labels = date_format("%Y"), breaks = date_breaks("1 year")) +
  labs(title = "Cumulative Return from 2006 to 2010", x = "Year", y = "Cumulative Return") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1), legend.position = "top", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(-0.09, 0.03, by = 0.02)) +
  geom_hline(yintercept = 0, alpha = .4)
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 5.2: Cumulative return of the trading strategy after 2006:IV. Based on Table 7 - Performance Index, Panel B from "Wall Street and the Housing Bubble" (p. 2824).

What are the implications of those findings?

As you can see in Figure 5.1, the performance indices of securitization agents and lawyers moved quite similarly until 2006:IV. The equity analysts performed slightly worse during this period, but all groups generated increasing additional value compared to the buy-and-hold strategy. After that date, all three groups lost ground relative to the buy-and-hold strategy. Lawyers and equity analysts performed more and more similarly, while the securitization agents ended up being the worst of the three groups. Looking at the cumulative return after 2006:IV in Figure 5.2, the securitization agents performed slightly better than the lawyers but worse than the equity analysts. Which of those two indicators is the better one? Since the performance index is a benchmark against a constructed buy-and-hold portfolio, it captures differences due to trading better than the raw return does, as argued by Cheng et al. (2014, p. 2822). These findings might indicate optimism about the housing market, which supports the earlier findings, but this is raw data without any controls for other possible factors. To rule out that those factors are responsible for the observed differences, we must conduct a deeper analysis.

5.3 Regression Analysis

We will run four regressions in total, two for the performance indices and two for the returns between 2006:IV and 2010:IV:

1) Weighted performance index in 2010:IV for i) securitization agents versus equity analysts and ii) securitization agents versus lawyers.

2) Weighted return between 2006:IV and 2010:IV for i) securitization agents versus equity analysts and ii) securitization agents versus lawyers.

As mentioned above, we will add weights to ensure that every portfolio is represented properly in the regression. The weights are the portfolio values at the end of the first quarter of our observation period, 2000:I.

First, we have to load the additionally needed data set person_dataset.dta and merge it with perf to obtain perfo, which adds the age category age_cat required as a control variable in the regressions.

< info "merge"

The function merge() from the base package merges two data frames by their common columns or by row names.

It is used as follows:

merge(dataframe1, dataframe2)

See: https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/merge

>
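
As a quick toy illustration (independent of our data sets; the age categories below are made up), merge() joins two data frames on their common column:

df1 <- data.frame(keyident = 1:3, value = c(10, 20, 30))
df2 <- data.frame(keyident = 1:3, age_cat = c("30-40", "40-50", "30-40"))
merge(df1, df2)  # joins the two data frames on the common column keyident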

Task: Load person_dataset.dta, store it in person and use merge() to merge perf and person into a data frame which you assign to perfo.

#< task
# Enter your command below:
#>
#< hint
display("Take a look at the infobox above to learn more about merge()!")
#>
person <- read.dta("person_dataset.dta")
perfo <- merge(perf, person)

< award "Merger"

Well done, you merged two data frames into a new one!

>

Now we need two data sets: one containing the portfolios of securitization agents and equity analysts and one containing the portfolios of securitization agents and lawyers, with the additional condition that an age category must be available, since we will control for age.

Task: Construct two data frames, one containing the portfolio development of securitization agents and equity analysts which will be stored in set1 and one containing the portfolio development of securitization agents and lawyers which will be stored in set2. Additionally, the age_cat must be available. You don't have to type in the commands yourself, just press check to proceed.

#< task_notest
set1 <- filter(perfo, group %in% c("Securitization", "Equity Analysts") & age_cat != "NA")
set2 <- filter(perfo, group %in% c("Securitization", "Lawyers") & age_cat != "NA")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Now we will start with regressions 1) i) and ii), the performance index regressions. We estimate robust standard errors using coeftest() from the lmtest package and vcovHC() from the sandwich package, specifying HC1 as the method to compute those errors (we don't need to cluster them, though, since each person appears only once in the performance index data).

< info "coeftest() and vcovHC()"

We can combine the R functions coeftest() and vcovHC() to get heteroskedasticity-robust standard errors. coeftest() performs significance tests on the coefficients, while vcovHC() estimates a heteroskedasticity-consistent variance-covariance matrix.

The functions are used together as follows:

library(sandwich)
library(lmtest)
coeftest(regression, vcov = vcovHC(regression, "type"))

See https://www.rdocumentation.org/packages/lmtest/versions/0.9-35/topics/coeftest and https://www.rdocumentation.org/packages/sandwich/versions/2.4-0/topics/vcovHC

>
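
If you would like to try the combination on a small, unrelated example first, the built-in mtcars data works fine:

library(sandwich)
library(lmtest)

fit <- lm(mpg ~ wt + hp, data = mtcars)   # ordinary least squares fit
coeftest(fit, vcov = vcovHC(fit, "HC1"))  # coefficient tests with HC1 robust standard errors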

Task: Run the regressions r1 and r2: r1 for the regression-adjusted difference in performance between securitization agents and equity analysts and r2 for the regression-adjusted difference in performance between securitization agents and lawyers. Compute robust standard errors using coeftest() from the lmtest package and vcovHC() from the sandwich package.

#< task_notest
# First, we have to load the packages for the coefficient test
library(sandwich)
library(lmtest)

# Second the performance index for the securitization agents and equity analysts:
r1 <- lm(performanceindex ~ group + age_cat, weights = set1$totalvalue_buyhold_2000I, data = set1)
reg1 <- coeftest(r1, vcov = vcovHC(r1,"HC1"))

# Third the performance index for the securitization agents and lawyers:
r2 <- lm(performanceindex ~ group + age_cat, weights = set2$totalvalue_buyhold_2000I, data = set2)
reg2 <- coeftest(r2, vcov = vcovHC(r2,"HC1"))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Press check to output the two regressions side by side using stargazer().

#< task_notest
stargazer(reg1, reg2, type = "html", keep = c("group"), report = "vct*", covariate.labels = c("Securitization"), column.labels = c("Equity Analysts", "Lawyers"), add.lines = list(c("Age Category", "Yes", "Yes")))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 5.1: Regression adjusted differences of the performance indices between securitization agents and the control groups. This table replicates parts of Table 7, Panel B of "Wall Street and the Housing Bubble" (p. 2824).

< quiz "coefficients"

question: What does the coefficient of -0.027 in the regression of securitization agents and equity analysts tell us?
sc:
- The securitization agents performed 2.7% worse, and the difference is significant.*
- The securitization agents performed 2.7% better, and the difference is significant.
- There is no real difference since the numbers are so small.
success: Congratulations, you are right!
failure: Not correct yet, you may try again!

>

< award "Master of Quizzes lvl. 8"

Nice work, you know how to interpret the regression output

>

As you can see in Table 5.1, there is a difference of -2.7% (significant at the 5% level) in performance compared to the equity analysts and -1.8% (insignificant) compared to the lawyers. Hence, the trading behavior of the agents did hurt their performance compared to their controls.

Now let's run regressions 2) i) and ii) to see if we can observe differences in the return between 2006:IV and 2010:IV.

Task: Regress return_from_2006 (which contains the individual returns) against group controlling for age_cat using the data sets set1 and set2 which we already created and store them in r3 and r4. Perform a significance test of the coefficients after each regression and save the outcome in reg3 and reg4 using the HC1 covariance matrix. Don't forget to specify the weights!

#< task
# Enter your command below
#>
#< hint
display('The regressions should follow the form: r3 <- lm(return_from_2006 ~ group + age_cat, data = set1, weights = set1$totalvalue_buyhold_2000I), reg3 <- coeftest(r3, vcov = vcovHC(r3,"HC1"))')
#>
r3 <- lm(return_from_2006 ~ group + age_cat, data = set1, weights = set1$totalvalue_buyhold_2000I)
reg3 <- coeftest(r3, vcov = vcovHC(r3,"HC1"))

r4 <- lm(return_from_2006 ~ group + age_cat, data = set2, weights = set2$totalvalue_buyhold_2000I)
reg4 <- coeftest(r4, vcov = vcovHC(r4,"HC1"))

< award "Robust"

Congratulations, your first calculation of robust errors using coeftest()!

>

Task: Use stargazer to output the regressions. Set the type to html, set keep so that only the effect of the group is shown, report so that the t-statistic is reported, add the information that we controlled for age category and add column labels with the two control groups. Press check to do so.

#< task_notest
stargazer(reg3, reg4, type = "html", keep = c("group"), report = "vct*", covariate.labels = c("Securitization"), column.labels = c("Equity Analysts", "Lawyers"), add.lines = list(c("Age Category", "Yes", "Yes")))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 5.2: Regression adjusted differences of the cumulative return after 2006:IV between securitization agents and the control groups. This table replicates parts of Table 7, Panel B of "Wall Street and the Housing Bubble" (p. 2824).

Table 5.2 shows that there is a difference of -2.2% (significant at the 1% level) in the cumulative return compared to the equity analysts and of +0.4% (insignificant) compared to the lawyers.

Findings

As you can see in Table 5.1, we find negative effects of being part of the securitization group on the performance index, while Table 5.2 shows a significantly negative and an insignificantly positive effect on the return between 2006:IV and 2010:IV. How is it possible that the securitization agents earned a higher return than the lawyers while being worse off in terms of the performance index? The answer lies in the definition of the performance index. Recall that it was defined as the accumulated difference between the returns of the trading and the buy-and-hold strategy, whereas the return regression only considers the return of the trading strategy. As mentioned above, the performance index is the better indicator of trading performance, so we can reject our third hypothesis: the agents did not perform significantly better than their controls.

This exercise refers to the pages 2822 - 2825 of "Wall Street and the Housing Bubble".

Exercise 6.1 - Income Shocks

Now we will deal with the question raised by the fourth hypothesis: were the agents aware that their high bonuses and high income were not sustainable but only temporary? We do this to mitigate the concern that the agents may have bought second homes out of a consumption motive rather than as the investment decision we assumed in Exercise 4 (Cheng et al., 2014). We will test whether the securitization agents showed awareness of the temporary income shock by analyzing the value-to-income ratio (vti) of the securitization agents and their controls (equity analysts and lawyers).

< info "vti"

The vti is a purchase-based measure: the value of a purchased property divided by the income of the person who bought it. A low vti therefore indicates a purchase that is small relative to income, while a high vti indicates an expensive purchase relative to income.

As implied by the authors of "Wall Street and the Housing Bubble" on page 2803, the vti is calculated as follows:

$$\textrm{vti} = \frac {\textrm{value of purchased property}} {\textrm{income of person that bought}}\ .$$

>
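
For example, a person earning \$200k who buys a property for \$600k has a vti of 3, i.e. the property is worth three times the annual income:

property_value <- 600    # in $k
income <- 200            # in $k
property_value / income  # vti = 3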

If the securitization agents showed signs of awareness that the high income they received was temporary, they would have decreased their vti significantly during the boom period, ending up with a considerably smaller vti than their controls and than in the pre-boom period (Cheng et al., 2014).

Before analyzing the vti, we will analyze the income the three groups received during our three different periods to examine whether our claim that the securitization agents received income shocks is true.

Income Shocks

To make data manipulation easier and shorter, I already customized the hmda_matches.dta data set and stored the customized version in income_desc.dta. The main differences between the two data sets are that income_desc.dta contains an additional column indicating the period, that income information outside our observation period is excluded and that income and vti are already collapsed at a person-period level.

Before we take a closer look at our income data, we have to load it.

Task: Load the data set income_desc.dta and store it in income. Press edit and check to do so.

#< task_notest
income <- read.dta("income_desc.dta")
#>
#< hint
display("Simply press check, the code is provided!")
#>

< info "income_desc.dta and vti_desc.dta"

Both data sets are derived from hmda_matches.dta, as documented in Exercise 9. hmda_matches.dta was created by Cheng et al. (2014) by matching information from the Home Mortgage Disclosure Act (HMDA) with the census tracts of the properties.

The derived data sets contain information about the group, keyident, year of purchase, income at purchase, the real income at purchase (base year 2006) and period.

>

Let's look at the data stored in income.

Task: Output ten random rows of income using sample_n(). Press check to do so.

#< task_notest
sample_n(income, size = 10)
#>

Now we would like to get some summary statistics at the group-period level. To do so, we will group the data by group and period and compute the mean, median and standard deviation of the real income, which is stored in the variable income_real.

< info "grid.table()"

The function grid.table() from gridExtra enables the user to output nicely formatted tables from data frames.

If we don't want to print row numbers it is used as follows:

library(gridExtra)
grid.table(dataframe, rows = NULL)

See: https://cran.r-project.org/web/packages/gridExtra/vignettes/tableGrob.html

>
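
For instance, to display the first rows of the built-in iris data as a formatted table without row numbers, you could run:

library(gridExtra)
grid.table(head(iris), rows = NULL)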

Task: Load gridExtra and use group_by() to group the data set income by group and period, then compute the mean, median and standard deviation of income_real as well as the number of observations using summarize(). Connect the two functions with the pipe operator %>%, store the result in income1 and round the values to one digit. After performing those steps, output income1 using grid.table() from gridExtra. Simply press check, the code is already provided.

#< task_notest
library(gridExtra)
income1 <- group_by(income, group, period)%>%
  summarize(mean = round(mean(income_real),1), median = round(median(income_real),1), sd = round(sd(income_real),1), persons = length(income_real))

grid.table(income1, rows = NULL)
#>

Table 6.1: Mean and median real income (in $k) with standard deviation for the three periods. This table replicates parts of Table 3 from "Wall Street and the Housing Bubble" (p. 2811).

As you can see in Table 6.1, the difference for the securitization group from pre-boom to boom is \$92.4k, which corresponds to an increase of 37.5%, while the difference for the equity analysts is \$58.0k (+16.1%) and for the lawyers \$3.6k (+2.1%). Since we don't know whether the income shock for the securitization agents is significant or just coincidence, we will run a regression to examine whether the income numbers from the boom period are significantly different from those of the pre-boom period. We could also do a t-test, but since we can easily get clustered t-statistics from a felm() regression, we will stick with felm(). As mentioned by Cheng et al. (2014), we only observe taxable income, so the reported income may be downward biased, which could cause problems when comparing the vtis later if the bias is not constant over time.

But such a table is clearly not the most elegant and clearest way to display the outcome of the income analysis. Let's plot it instead using ggplot().

< info "geom_label_repel()"

The function geom_label_repel() from the ggrepel package can be used to add labels to a plot made with e.g. ggplot(). It arranges the labels so that they don't overlap. The related function geom_text_repel() works the same way but draws plain text instead of boxed labels. geom_label_repel() is used as follows:

library(ggrepel)
plot +
  geom_label_repel(aes(y = y, label = label))

Where y denotes the y value of the object to which we would like to add a label and label is a vector containing the name (or value) we want to label it with.

See: https://www.rdocumentation.org/packages/ggrepel/versions/0.6.5/topics/geom_label_repel

>
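
A small self-contained example with made-up points could look like this:

library(ggplot2)
library(ggrepel)

pts <- data.frame(x = 1:3, y = c(2, 5, 3), name = c("A", "B", "C"))

ggplot(pts, aes(x = x, y = y)) +
  geom_point() +
  geom_label_repel(aes(label = name))  # non-overlapping labels next to the points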

Task: Press check to use ggplot() to plot a scatter plot of the mean income in each period for every group from the data set income1 and to add labels to the data points using geom_text_repel() from the ggrepel package.

#< task_notest
library(ggrepel)

ggplot(data = income1, aes(x = period, y = round(mean, 2), color = group)) +
  geom_point(size = 3.5)+
  theme_bw()+
  labs(title = "Income", y = "Income")+
  scale_y_continuous(limits = c(100, 500)) +
  theme(legend.position = "top", legend.title = element_blank()) + 
  geom_text_repel(aes(y = mean, label = round(mean, 2)))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 6.1: Plot of the mean real income (in $k) for the three periods. Based on Table 3 - Income from "Wall Street and the Housing Bubble" (p. 2811).

Boom - Preboom

< quiz "income"

question: What do you think, did any of the groups experience a significant income shock?
sc:
- no
- yes, the securitization group and the equity analysts
- yes, but only the securitization group*
- yes, but only the equity analysts
success: Congratulations, you are right, the securitization agents received a significant income shock!
failure: You are not right, you may try again or continue with the problem set.

>

< award "Master of Quizzes lvl. 9"

Well done, you expect the right outcome!

>

To check whether the boom minus pre-boom difference is significant, we will construct a data frame containing only the income data of securitization agents in the boom or pre-boom period, run a regression with clustered standard errors and output it.
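
As a reminder, the formula of felm() from the lfe package has four parts separated by |: the covariates, the factors to be projected out (fixed effects), an instrumental variable specification and the cluster variables; a 0 marks an unused part. A minimal toy sketch (the variables y, x and id are made up and unrelated to our data) would be:

library(lfe)

toy <- data.frame(id = rep(1:20, each = 3), x = rnorm(60))
toy$y <- 1 + 2 * toy$x + rnorm(60)

# no fixed effects, no IV, standard errors clustered by id
summary(felm(y ~ x | 0 | 0 | id, data = toy))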

Task: Construct a data frame called sample using filter() from income containing only those rows that have information about securitization agents and where the period is pre-boom or boom. Run a regression that estimates robust clustered errors with felm() and regresses income_real against period, clustered by keyident. Output the regression using stargazer() where t statistics are reported. Press check to do so.

#< task_notest
sample <- filter(income, group == "Securitization" & period %in% c("pre-boom", "boom"))  

r1 <- felm(formula = income_real ~ period | 0 | 0 | keyident, data = sample)

stargazer(r1, type = "html", report = "vct*")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 6.2: Difference between the mean income in pre-boom and boom, with significance test for the securitization agents. This Table replicates parts of Table 3 - Income of "Wall Street and the Housing Bubble" (p. 2811).

The difference in income between boom and pre-boom is significant for securitization agents at the 10% level (Table 6.2). Thus, we can conclude that the securitization agents received a significant income shock. If the other groups did not receive that income shock (at all or only to a smaller extent), the securitization agents should, given awareness, have decreased their vti during the boom period compared to the pre-boom period as well as compared to the control groups.

Before focusing on the vti analysis, let's first see whether the income of the equity analysts and the lawyers changed significantly from pre-boom to boom.

Task: Press check to perform the same regression as above, this time for the equity analysts and the lawyers.

#< task_notest
sample2 <- filter(income, group == "Equity Analysts" & period %in% c("pre-boom", "boom"))

sample3 <- filter(income, group == "Lawyers" & period %in% c("pre-boom", "boom"))

r2 <- felm(income_real ~ period | 0 | 0 | keyident, data = sample2)

r3 <- felm(income_real ~ period | 0 | 0 | keyident, data = sample3)

stargazer(r2, r3, type = "html", report = "vct*", column.labels = c("Equity Analysts", "Lawyers"))
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 6.3: Difference between the mean income in pre-boom and boom, with significance test for the equity analysts and lawyers. This table replicates parts of Table 3 - Income of "Wall Street and the Housing Bubble" (p. 2811).

What do those Findings Imply?

As you can see in Table 6.3, the effect is smaller and not significant for either of the two control groups. Given awareness, we would therefore expect the vti of the securitization sample to be lower in the boom period than in the pre-boom period, both in a direct comparison and in a difference-in-difference analysis of group membership and period.

This exercise refers to the pages 2810 - 2811 of "Wall Street and the Housing Bubble".

Exercise 6.2 - Consumption

Now that we know what we are looking for, let's check whether we can find evidence for awareness, i.e. whether the agents decreased their vti compared to the controls. We will look at the raw numbers, at the significance of a potential difference between boom and pre-boom and at a difference-in-difference analysis between the groups and periods.

< info "difference-in-difference"

A difference-in-difference analysis measures the effect of a treatment on one group compared to a control group.

In our case, the "treatment" is the period: in the boom period the groups are "treated", in the pre-boom period they are not. The treatment group is the securitization group, while the controls are the equity analysts and the lawyers, respectively.

(Torres Reina, 2015)

>
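
In regression form, the difference-in-difference estimate is the coefficient of the interaction term. In generic notation (not the variable names of our data sets):

$$ y_{it} = \beta_0 + \beta_1 \textrm{treated}_i + \beta_2 \textrm{post}_t + \beta_3 (\textrm{treated}_i \times \textrm{post}_t) + \varepsilon_{it}\ , $$

where $\beta_3$ is the difference-in-difference estimate. In our regressions below, it corresponds to the coefficient of the interaction between period and group.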

To analyze the vti, hmda_matches.dta was modified, which led to the data set vti_desc.dta. This data set has one column indicating the period, excludes all entries with an income below 100 (to minimize the effect of outliers) and is aggregated at a person-period level.

Task: Load the modified data set vti_desc.dta, store it in vti and show ten random entries. Simply press check to do so.

#< task_notest
vti <- read.dta("vti_desc.dta")
sample_n(vti, size = 10)
#>
#< hint
display("Simply press check, the code is provided!")
#>

As you can see, vti_desc.dta, just like income_desc.dta, contains the group (group), the year of the purchase (purchaseyear), the vti and the period, all of which are of interest since we want to take a closer look at the vti by group and period.

Task: Construct a data frame containing summary statistics of the vti (mean, median, standard deviation and number of observations) for every group in every period and store it in vti1. Output vti1 afterwards using grid.table(). Round the values to one digit. Hint: it works quite similarly to the income summary with group_by() and summarize().

#< task
# Enter your code below:
#>
vti1 <- group_by(vti, group, period)%>%
  summarize(mean = round(mean(vti),1), median = round(median(vti),1),sd = round(sd(vti),1), persons = length(vti))
#< hint
display('the part where you summarize should look like that: summarize(mean = round(mean(vti),1), median = round(median(vti),1),sd = round(sd(vti),1), persons = length(vti))')
#>
grid.table(vti1, row = NULL)

Table 6.4: Mean and median vti with standard deviation for the three periods. This Table replicates parts of Table 9 - Value-to-Income from "Wall Street and the Housing Bubble" (p. 2826).

< award "Describer"

Congratulations, you prepared and visualized several summary statistics using grid.table()!

>

Since Table 6.4 is quite messy, we will again plot the means using ggplot() to get a tidier visualization that is easier to grasp. This time the data points will be labeled with the median of the vtis.

Task: Use ggplot() to plot a scatter plot of the mean vti in each period for every group from the data set vti1. Give the plot the title "VTI" and name the y axis "VTI" as well. Set the limits of the y axis to 2.25 and 3.5 using scale_y_continuous(), set the size of the points to 3.5 and use the black and white theme we used earlier. Finally, label the points with the median vti using geom_text_repel().

#< task
# Delete the leading '#' and replace the question marks:
# ggplot(data = ???, aes(x = ???, y = ???, color = group)) +
#   geom_point(size = ???)+
#   theme_bw()+
#   labs(title = "VTI", y = "VTI")+
#   scale_y_continuous(limits = c(2.25, 3.5)) +
#   theme(legend.position = "top", legend.title=element_blank()) +
#   geom_text_repel(aes(x = period, y = median, label = ???))
#>
ggplot(data = vti1, aes(x = period, y = mean, color = group)) +
  geom_point(size = 3.5)+
  theme_bw()+
  labs(title = "VTI", y = "VTI")+
  scale_y_continuous(limits = c(2.25, 3.5)) +
  theme(legend.position = "top", legend.title=element_blank()) +
  geom_text_repel(aes(x = period, y = median, label = median))
#< hint
display("When we plotted income, we performed almost the same steps!")
#>

Figure 6.2: Plot of the mean vti for the three periods; the labels show the median vti. Based on Table 9 - Value-to-Income from "Wall Street and the Housing Bubble" (p. 2826).

< award "ggplotter lvl. 3"

Congratulations, you plotted the data using ggplot()!

>

< quiz "vti"

question: Are the outcomes of the raw analysis what we would expect assuming awareness that the income is transitory?
sc:
- yes, we would have expected the vti to increase
- no, we would have expected the vti to decrease*
- no, we would have expected the vti to be the same
success: Congratulations, we would have expected the vti to decrease!
failure: You checked the wrong answer, try again!

>

< award "Master of Quizzes lvl. 10"

Congratulations, you got your expectations right!

>

According to Figure 6.2, the securitization agents increased their mean vti by 6.3% compared to the pre-boom period, the equity analysts by 6.9% and the lawyers by 13.8%. The securitization agents still have the highest vti of the three groups, while the equity analysts have the lowest. It is noticeable that, unlike the other groups, the securitization agents' mean vti increased while their median decreased, which indicates that some people in the sample bought at relatively high vtis. The following questions arise: is the difference significant, and what does the difference-in-difference between the periods and groups look like?

Boom - Preboom

As you can see, the securitization agents did not decrease their vti from pre-boom to boom but rather increased it, suggesting that they expected their high income not to be transitory but to remain constant or even increase in the future. Let's check whether that difference is significant.

We will perform the same steps as in Exercise 6.1 to obtain the significance of the difference in vti between pre-boom and boom, starting with the securitization agents.

Task: Construct a data frame which will be called sample from vti in which all entries of securitization agents from pre-boom and boom are included. Run a felm() regression and output it with stargazer(). Press check to do so.

#< task_notest
sample <- filter(vti, group == "Securitization" & period %in% c("pre-boom", "boom"))

r1 <- felm(formula = vti~period | 0 | 0 | keyident, data = sample)

stargazer(r1, type = "html", report = "vct*")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Table 6.5: Difference between the mean of vtis in pre-boom and boom, with significance test for the securitization agents. This Table replicates parts of Table 9 from "Wall Street and the Housing Bubble" (p. 2826).

The effect of the period is not significant, so we can conclude that the securitization agents did not decrease their vti significantly.

Difference in Difference

How do we perform a difference-in-difference analysis?

We will use three columns: vti, our dependent variable; period, our first independent variable, indicating the treatment; and group, our second independent variable, indicating group membership (see Torres Reina, 2015). The coefficient of the interaction period * group is the difference-in-difference estimate. Since we have two control groups, we must perform two difference-in-difference analyses. For more information on difference-in-difference analysis, see the infobox above.

Task: Filter the vti data set twice, once for the securitization agents and the equity analysts and once for the securitization agents and the lawyers. Keep only entries where period is pre-boom or boom and save the resulting data frames in did1 and did2. Press check to do so.

#< task_notest
did1 <- filter(vti, group %in% c("Securitization", "Equity Analysts") & period %in% c("pre-boom", "boom"))
did2 <- filter(vti, group %in% c("Securitization", "Lawyers") & period %in% c("pre-boom", "boom"))
#>
#< hint
display("Use filter() to extract the entries needed!")
#>

Now we will run the two regressions using felm() from the lfe package, which enables us to reproduce the authors' results exactly, including standard errors clustered at the person level.

Task: Use felm() to run the two regressions of vti against period * group (one for did1 and one for did2) and store them in r2 and r3. Specifications for fixed effects and instrumental variables are not needed; the standard errors should be clustered by keyident. Output the regressions with stargazer().

#< task
#Enter your command below:
#>
#< hint
display("felm(dependent ~ independent1 * independent2 | 0 | 0 | cluster, data = data")
#>
r2 <- felm(vti ~ period * group | 0 | 0 | keyident, data = did1)

r3 <- felm(vti ~ period * group | 0 | 0 | keyident, data = did2)

stargazer(r2, r3, type = "html", report = "vct*", column.labels = c("Equity Analysts", "Lawyers"))

Table 6.6: Difference-in-Difference analysis between securitization agents and the control groups concerning pre-boom and boom. This Table replicates parts of Table 9 from "Wall Street and the Housing Bubble" (p. 2826).

< award "Regressor lvl. 4"

Congratulations, your first difference-in-difference using felm()!

>

Findings

The terms of interest here are the ones in the row periodboom:groupSecuritization, which contain the difference-in-difference estimates. They are not significantly negative, and neither was the simple pre-boom to boom difference for the securitization agents (Table 6.5). This leads to the conclusion that the securitization agents did not expect their high income to be temporary and therefore did not decrease their vti, neither compared to their pre-boom vti nor compared to their controls.

This exercise refers to the pages 2825 - 2827 of "Wall Street and the Housing Bubble".

Exercise 7 - Financing

In this part, we will examine if another possible explanation not covered by our four hypotheses - the terms of financing - could influence our results. We will focus on two different concerns. First, on interest rates and second on the loan-to-value of the groups.

7.1 Interest Rates

It might be that the securitization agents had easier and cheaper ways to finance their purchases compared to their controls, the equity analysts and lawyers. If true, that might have offset the effect of possible awareness. To examine whether this is true or not, we will take a look at the interest rates faced by the securitization agents and the two control groups.

Task: Load the data set properties.dta and store it in prop. Show the last rows of prop afterwards using tail(). Press check to do so.

#< task_notest
prop <- read.dta("properties.dta")
tail(prop)
#>
#< hint
display("Simply press check, the code is provided!")
#>

After loading the data successfully, we have to exclude the properties that were purchased before 2000 or after 2010. Additionally, we are only interested in entries where a mortgage interest rate is available, since otherwise we cannot use them to visualize interest rates.

Task: Keep only the purchases made between 2000 and 2010 for which a mortgage interest rate is available, group the result by group and store it in intrate using the pipe operator. Press check to do so.

#< task_notest
intrate <-  filter(prop, purchaseyear %in% c(2000:2010) & mrtgintrate != "") %>%
  group_by(group)
#>
#< hint
display("Simply press check, the code is provided!")
#>

< info "properties.dta"

The data set properties.dta is derived from property_data.dta which was constructed collecting property data from LexisNexis and matching them with the people in the sample.

It contains the keyident, purchaseprice, saleprice, the number of total houses, the mortgage interest rate, the group, purchaseyear, saleyear, the conforming interest rate, the jumbo interest rate, a dummy if the person had a mortgage, the real purchaseprice and the real saleprice.

>

< info "aggregate"

aggregate(), contained in the stats package, enables the user to compute summary statistics from data frames. It is particularly useful since grouping is supported.

The function is used as follows:

aggregate(list(name = x), list(name = y, name = z), stat)

x is the variable to aggregate, y and z are the group variables and stat is the statistic we want (e.g. mean, median, sd, etc.).

See: https://www.rdocumentation.org/packages/stats/versions/3.4.0/topics/aggregate

>
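
A quick toy example with the built-in mtcars data, computing the mean mpg for every combination of cylinder count and transmission type:

aggregate(list(mpg = mtcars$mpg),
          list(cylinders = mtcars$cyl, transmission = mtcars$am),
          mean)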

Task: Construct a data frame named intrate_agg containing the mean of the interest rates (mrtgintrate) for every year and every group using aggregate() on the data set intrate. To do so remove the leading '#' and replace the question marks.

#< task
# Remove the leading '#' and replace the question marks below:
# intrate_agg <- aggregate(list(interestrate = intrate$m???te), list(year = ???$purch???, group = intrate$group), ???)
#>
intrate_agg <- aggregate(list(interestrate = intrate$mrtgintrate), list(year = intrate$purchaseyear, group = intrate$group), mean)
#< hint
display("Just replace the question marks and the leading # so that the operation constructs your data frame! Aditionally, aggregate() is specified in the infobox above.")
#>

< award "Aggregator"

Congratulations, you managed to use aggregate() in the right way!

>

Task: Press check to plot intrate_agg using ggplot().

#< task_notest
ggplot(data = intrate_agg, aes(x = year,y = interestrate,color = group)) +
  geom_line() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "top", legend.title = element_blank()) +
  labs(x = "Year", y = "Interest Rate", title = " Mean Interest Rates Faced from 2000 to 2010") +
  scale_x_continuous(breaks = c(2000:2010)) +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = 3, ymax = 9, alpha = .2, fill = "red") +
  scale_y_continuous(breaks = seq(3, 9, by = 1))
#>

Figure 7.1: Development of the interest rates for the three groups between 2000 and 2010. Replicates parts of Figure 3 - Financing, Panel A from "Wall Street and the Housing Bubble" (p. 2820).

Findings

As you can see in Figure 7.1, the interest rates are quite similar throughout the whole period, with similar variation over time for the securitization agents and the equity analysts. The rates also move similarly for securitization agents and lawyers, except in 2006 and 2007. Still, those numbers should be treated with some caution since the sample is very small. In 2007, for instance, there were only four purchases with interest rate information in the lawyer group, which might be a reason why the authors decided not to include the lawyers in their visualization.

7.2 Tail risk laid off to lenders (loan-to-value analysis)

Another strategy could have been that the securitization agents laid off their tail risk to lenders by increasing their loan-to-value (ltv). We will check this by computing the ltv for all three groups for every year between 2000 and 2010 and comparing the groups in a plot. But before doing so, what is tail risk? Tail risk describes the risk associated with extreme and very unlikely events, such as a crash of home prices (see http://lexicon.ft.com/Term?term=tail-risk).

The securitization agents could have shown awareness by expecting the tail risk to materialize when the bubble bursts and by insuring themselves against it through decreasing their skin in the game, which is analogous to increasing the ltv (Cheng et al., 2014). This would have been possible especially for so-called non-recourse debt, which grants the lender access to the collateral of the mortgage but not to the remaining wealth of the individual (Gerardi, 2010).

< info "ltv"

The loan-to-value ratio measures the risk of a mortgage for the lender by taking the ratio of the loan to the value of the property. Mathematically:

$$ \textrm{ltv} = \frac {\textrm{loan taken to finance property}}{\textrm{value of the property}}$$

See Reserve Bank of New Zealand (2016).

>
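
For example, financing a \$500k property with a \$400k mortgage gives an ltv of 0.8, i.e. 80% of the property value is borrowed:

loan <- 400    # in $k
value <- 500   # in $k
loan / value   # ltv = 0.8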

The first step is to construct a data frame from prop that contains all relevant purchases. In our case, the relevant purchases are all purchases made between 2000 and 2010 that have an available ltv (all ltvs bigger than 100 are already excluded to restrain the effect of outliers).

Task: Construct a data frame from prop that contains all properties purchased between 2000 and 2010 and have ltv information available and assign it to ltvdata.

#< task
#???dat?? <- ??????(prop, ???!="" & purch?????ar?c(2000:2010))
#>
ltvdata <- filter(prop, ltv!="" & purchaseyear %in% c(2000:2010))
#< hint
display("Just replace the question marks and the leading # so that the operation constructs your data frame!")
#>

Task: Use aggregate() to get the median ltv for all the three groups for every year from 2000 to 2010 and store the aggregated data set in ltvdata_agg. Press check, the command is already provided.

#< task_notest
ltvdata_agg <- aggregate(list(ltv = ltvdata$ltv), list(year = ltvdata$purchaseyear, group = ltvdata$group),median)
#>
#< hint
display("Simply press check, the code is provided!")
#>

Task: Plot the median ltv of all groups using ggplot().

#< task_notest
ggplot(data = ltvdata_agg, aes(x = year, y = ltv, color = group)) +
  geom_line() +
  theme_bw() +
  labs(x = "Year", y = "LTV", title = "Median LTV from 2000 to 2010") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom", legend.title = element_blank()) +
  scale_x_continuous(breaks = c(2000:2010)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  annotate("rect", xmin = 2003.5, xmax = 2006.5, ymin = 0, ymax = 1, alpha = .2, fill = "red")
#>
#< hint
display("Simply press check, the code is provided!")
#>

Figure 7.2: Development of the ltv for the three groups between 2000 and 2010. Replication of Figure 3 - Financing, Panel B from "Wall Street and the Housing Bubble" (p. 2820).

Findings

Figure 7.2 shows only small changes in the median ltv of the three groups. These changes are quite similar and move in the same direction with slightly different amplitudes. Thus, we cannot conclude that the agents laid off their tail risk to lenders, which could otherwise have been a possible explanation for the home purchases observed earlier.

This exercise refers to the pages 2799, 2819 and 2820 of "Wall Street and the Housing Bubble".

Exercise 8 - Conclusion

Our goal was to test whether mid-level managers were aware of the US housing bubble by testing four hypotheses. We tested two full awareness hypotheses: first, whether the securitization agents divested more before 2007, and second, whether they at least did not increase their exposure to the housing market compared to the two control groups. Third, we tested whether the performance of their constructed portfolios was better. Additionally, we tested whether they foresaw that their income during the boom was not sustainable. We then considered whether they faced different interest rates that could have led them to invest more in housing and, finally, whether they tried to get rid of the tail risk associated with the purchases by excessively increasing the loan-to-value.

During our analysis we rejected all four hypotheses, since the effects were either insignificant or significant but mostly moving in the direction one would not expect given awareness of the agents. It seems that they were divesting less, buying more second homes or swapping up more, and rather increasing their value-to-income, which indicates that they did not believe their high income to be temporary. We also showed that they neither faced different interest rates nor tried to increase the loan-to-value in the boom period. This leads to the conclusion that, based on this analysis, the agents were not aware of the bubble but rather optimistic about the housing market.

This leads to some questions that were also considered by Cheng et al. (2014). Are there information deficits? Should information processes and flows receive more attention? How can the formation of beliefs be guided in a better way? Improving information flows and processes could enable analysts to identify potential bubbles earlier; market mechanisms would then correct prices earlier and the bust might not materialize. Instead, however, the main focus in this sector today lies on employees and their contracts.

But we should be aware of the fact that we did not study the beliefs of the subprime segment of the market, since it is very unlikely that the securitization agents were subprime borrowers, and that we do not observe the whole balance sheet of a household, which should not be a big issue since it is remarkably hard to short the housing market (Cheng et al., 2014).

< award "Finisher"

Congratulations, you finished the problem set! I hope it was interesting and you were able to learn some new functions and procedures in R!

>

This exercise refers to the page 2827 of "Wall Street and the Housing Bubble".

Exercise 9 - References and Changes on Datasets

Bibliography

R and R Packages

Changes on Data Sets

I made several changes to the data sets to ease the analysis. To enable you to do the same analysis on your own with the original data sets, I explain below which changes were made to the original data sets to obtain the data sets used in this problem set.

i) casehiller.txt - Derived from: caseshiller_metros.dta - Changes made: Extraction and renaming of data from the three metropolitan areas New York City (original name: nyxr), Chicago (chxr), Los Angeles (lxxr) and the composite 20 (spcs20r), cutting them so that they start at 01.01.2000 and end at 01.01.2011 (using dt_m, which specifies the month, where dt_m = 480 stands for 01.01.2000 and dt_m = 612 stands for 01.01.2011)

ii) person_dataset.dta - Derived from: person_data.dta - Changes made: The fourth column renamed to group (original name datsource) and age_cat converted to class character.

iii) personyear_trans_panel.dta - Derived from: personyear_transactions_panel.dta - Changes made: The level rank of group changed so that equity analysts are first, lawyers second and securitization agents third, columns renamed to Year (year), added_houses (nummaddn_prev), group (datsource), divestitures (numdvst), houses_bought (numbuy_spec), homeowner (l_homeowner_adj), multi_homeowner (l_homeowner_adj_multi), prop_NYC (Lyrprop_NYC) and prop_SoCal (Lyrprop_SoCal), age_cat and Year converted to class character. A dummy variable that indicates membership to the securitization group is introduced (1 if yes, 0 if not).

iv) person_included.dta - Derived from: person_data.dta - Changes made: The fourth column is renamed to group; the data set is filtered so that only the people of the final sample are included in the resulting data set. age_cat is converted to class character.

v) performance_index.dta - Derived from: indivperformance_wide.dta - Changes made: The index levels are calculated by computing the weighted mean of totalvalue_eoq/totalvalue_buyhold_eoq for every quarter where the weights are given by totalvalue_eoq_10160 (the initial quarter 2000:I) for all three groups.

vi) performance.dta - Derived from: indivperformance_wide.dta - Changes made: Columns renamed.

vii) income_desc.dta - Derived from: hmda_matches.dta - Changes made: Rows without real income data and outside the observation period are excluded, a new column named period is assigned based on the period (pre-boom, boom, bust), values are aggregated on a person, period level, columns are renamed and unused columns dropped.

viii) vti_desc.dta - Derived from: hmda_matches.dta - Changes made: Rows where the income is below 100 or not available are excluded, as well as rows where the vti is not available or rows which are not inside the observation period, a new column named period is assigned based on the period (pre-boom, boom, bust), values are aggregated on a person, period level, columns are renamed and unused columns dropped.

ix) properties.dta - Derived from: property_data.dta - Changes made: Some columns are renamed, specifically total_houses (housetot), group (datsource) and has_mrtg (hasmrtg), and the level rank of group changed so that equity analysts are first, lawyers second and securitization agents third.


