In UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project: Replication of 'A New Approach to Estimating the production function for housing'

``` {r setup, echo = FALSE, message = FALSE, warning = FALSE}

Loading the libraries directly used in this Rmd

library(knitr) library(tidyverse) library(here) library(stargazer)

Setting global options for the document

knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, message = FALSE, echo = FALSE, comment = "#>", fig.width = 7, fig.height = 4.5)

Reading the R scripts

knitr::read_chunk('001-estimate-supply-production-function-ols.R') knitr::read_chunk('002-HNIP-estimator.R')

A reference implementation of:

> Epple, Dennis, Brett Gordon, and Holger Sieg. 2010. *"A New Approach to Estimating the Production Function for Housing."* American Economic Review, 100 (3): 905-24.DOI: 10.1257/aer.100.3.905

# Introduction

The purpose of the paper is to do a replication of the paper "A New Approach to Estimating the Production Function for Housing" [@epple2010new] that estimates the production function for housing. Based on two factors - the observed variation in land prices and housing values per unit of land, the original paper has provided an algorithm to identify the housing supply function per unit of land. This in turn is used to derive an estimate of the underlying production function. The approach yielded plausible estimates for the price elasticity of housing supply per unit of land, based on data from Allegheny County in Pennsylvania. In the paper,we propose a reference implementation by adding two additional models to replicate an estimation function that relates land price and home value per unit land and produce the corresponding tables and plots.Estimating housing production functions is challenging, as the quantity and price per unit of the housing services are not observed by the econometrician. Replicating the study helps us to understand how the underlying production function is estimated by treating prices and quantities of housing services as latent variables, without relying on strong functional form assumptions.


# Methods

The authors in the paper[@epple2010new] proposed a flexible estimation procedure based on duality theory to estimate the production functions with minimal functional form assumptions. The underlying assumption is that the production function exhibits constant returns to scale and that there exists a monotonic relationship that captures a relationship $r(\nu)$ at equilibrium between the value of housing per unit of land $(\nu)$ and the price of land $(p_l)$. Thus estimating the function $r(\nu)$ becomes the first step in the process. The parametric functions used for the purpose is OLS and the paper has used multiple transformations on data to produce linear, log linear and polynomial(quadratic and cubic) models to best identify the relationship.

Since, the function $r(\nu)$ forms the base of further analysis and calculation, we decided to replicate the study of these models to test the robustness and the assumptions behind the models. For this purpose, we followed a two-variant approach. The first variant is to use a different model for replication and the second is to use a different loss function to calculate the estimates in the model.

For the first case, we decided to use Generalized Linear Model with different distribution families. Using different families allows us to verify the condition of normality of error in the models used by authors. We noticed that the gaussian distribution family with log independent and dependent variable produced a line that fitted the best with the data and the corresponding coefficients were very similar to the log linear model used in the paper.For the loss function, we wrote a function to implement log linear regression with gradient descent loss function. The gradient descent loss function provides a more flexible approach because of the presence of hyperparameters *learning rate* and *number of iteration*. Since the approach is not used in the paper, we ran a simulation study in order to determine the hyperparameters that delivered values of coefficients very similar to that of the models in the paper[@epple2010new]

Based on the equilibrium relationship $r(\nu)$,the authors proposed that there exists a differential equation that characterizes the normalized housing supply function $s(p)$. For this purpose, the base paper used Ordinary Differential Equations (using 'odesolve' package) for calculating the supply function, $s(p)$ for parametric, semi-nonparametric and nonparametric estimates of $r(\nu)$. However, since ‘odesolve’ has been removed from the CRAN repository and the more recent package 'deSolve' completely supersedes it,  we have used 'deSolve' for our replication.

The paper[@epple2010new] also estimated $r(\nu)$ using instrumental variable estimator suggested by Hausmann, Newey, Ichimura, and Powell (1991) [@hausman1991identification]. In this approach,the estimates for higher order polynomials are realized using a linear model. For our replication,we replaced the OLS based HNIP estimator with a bayesian regression function. Bayesian model is used as it is more flexible to further model development and can directly model posteriors of derived/calculated quantities. 'Bolstad' package provides a bayes.lm function which has been used to get the estimates for the cubic model. Upon analysis, we can conclude that the replicated estimator produced coefficients that are similar to the actual paper. 


## Data

The dataset used by the paper [@epple2010new] comes from Allegheny County Web site (Pennsylvania state) maintained by the Office of Property assessments, which is based on the appraisals conducted by Sabre Systems and uses land value appraisals as well as property appraisals. Using geocoding, property assignment to travel zones and subsetting only for functioning residential properties, the count came down to 358,677 properties. Of these properties,the authors used a subsample of housing units that were built after 1995 for the estimation procedure which was 6362 houses. The data captures some important metrics for the houses such as the price of land, value per unit land, plot area, travel time and its geo-location (latitude and longitude). We will be using the same data for our replication in this paper. The descriptive statistics of the dataset with 6362 residential properties is shown in Table 1.

```r

data = read.csv(here("Data/Pittsburgh_post1995.txt"), header=TRUE)
data %>% select(v,pland,lotarea,tcog) %>%
  stargazer(type = "text",summary.stat = c('mean','median','sd','min','max'),
            covariate.labels = c("Value per unit of land, v", "Price of land, p", "Lot area (sq. ft.)","Travel time (minutes)"),
            digits =2,
            title = "Table 1 - Descriptive statistics of Residential data")

Results

For our replication study, we have the results for the following variants introduced in the original paper:
1. Excluded the top 1 and bottom 1 percentile records from the residential data based on value of unit per land. This is based on the claims by the author that the results were robust to the presence of extreme values.
2. Added two models to the OLS based estimates- GLM log-linear model and a log-linear model based on gradient descent loss function.
3. Replicated the supply and production function estimates for generalized log-linear model along with the original models.
4. Replicated the plots of supply function and production function with 95% CI bands for the generalized log-linear model
5. Replicated the instrumental variable estimator model(suggested by Hausman, Newey, Ichimura, and Powell (Journal of Econometrics, 1991)[@hausman1991identification] for third order polynomial using Bayesian regression.

# Define the squared error cost function    
cost <- function(X, y, theta) {
  sum( (X %*% theta - y)^2 ) / (2*length(y))
}

#Gradient descent model function
gradientDescentLinearModel <- function(x, y, alpha, num_iters){

   alpha <- alpha # Specify the learning rate
   num_iters <- num_iters # Specify the number of iterations
   cost_history <- rep(0,num_iters) # will be used to store the value of cost function after
   # every iteration 
   theta_history <- list(num_iters) # will be used to store the value of theta after every 
   # iteration 
   theta <-  c(0,0) # Initial values of theta
   X <- cbind(1,x) # Add a column vector with all values  to be 1 to x so that hypothesis 
   # function has an intercept 
   for (i in 1:num_iters) {
      theta[1] <- theta[1] - alpha * (1/length(y)) * sum(((X%*%theta)- y))
      theta[2] <- theta[2] - alpha * (1/length(y)) * sum(((X%*%theta)- y)*X[,2])
      cost_history[i] <- cost(X, y, theta)
      theta_history[[i]] <- theta
   } 
   return(theta)
}
#Prediction function for the model
calculatePredictionValues <- function(x, y, slope, intercept){
   y <- slope * x + intercept
   return(y)
}

Regression estimates and plots

knitr::include_graphics("Estimate_coeff_table.PNG")

The original paper estimates the r(v) functions using OLS for log-linear, linear, quadratic and cubic models. We obtained similar results using gaussian log-linear model and gradient descent log linear model in addition to the four models. For both of these two models, they fit the main features of the residential data substantially well and their performance is similar to the earlier four models. All the p-values were calculated using heteroskedasticity-robust standard errors and they were significant at 1% level.

As observed in Table 2, while the base paper had slope to be $log(v)$ as 0.909 and the intercept coefficient as -1.605 for the log-linear model in Figure \@ref(fig:original-coef-table), we obtained similar results with (0.919,-1.632) and (0.918,-1.631) as (slope, intercept) for our GLM model and gradient descent models respectively.

For effective comparison of all the six models used, we have also enclosed a plot as seen in Figure \@ref(fig:plot-the-fitted-models). We find all the models to fit the data very well. We were able to establish the monotonicity condition of $r(v)$ to be satisfied for the two replication models as well under all the polynomial estimation cases for the range of values of v observed in the data.

Supply and production function plots

As seen in Figures \@ref(fig:derive-supply-functions-for-each-model) and \@ref(fig:production-function-estimation), we find that the supply functions and the corresponding production functions of all the base models as well as the GLM log-linear model have similar shapes.

Supply and production functions with 95% confidence band for GLM Log-linear model

knitr::include_graphics("supply_f_plot_orig.PNG")

knitr::include_graphics("prod_f_plot_orig.PNG")

Using simulation, standard errors are calculated based on which the corresponding 95% confidence interval bands are estimated for the supply and production functions of GLM Log-Linear model.

As observed, we were able to replicate the supply and production functions with a 95% confidence interval band for GLM log-linear model as seen in Figures \@ref(fig:glm-supply-func-confidence-band) and \@ref(fig:glm-production-func-confidence-band) and are similar to the plots produced in the original paper, as seen in Figures \@ref(fig:original-supply-plot) and \@ref(fig:original-production-plot) .

Replication of non-linear regressor estimation using HNIP

The $r(v)$ function replicated using bayesian regression for third order polynomial is similar to the one estimated in the original paper using OLS based Linear Regression model. The original paper produced a mean of 0.1732, --0.0005 and 0.000004 for V, V^2 and V^3 coefficients respectively and the replicated bayesian model produced similar means of r round(mean(beta.hat[,1]),6),r round(mean(beta.hat[,2]),6) and r round(mean(beta.hat[,3]),6) as seen in Table 3.

Conclusion

By introducing new models and removing outliers from the Allegheny residential data,we were able to successfully replicate the following key claims from the paper [@epple2010new] 1) by considering price and quantities as latent variables, the observed variation in land prices and housing values per unit of land is sufficient to identify the housing supply function per unit of land and 2) Given the supply function, the underlying production function for housing can be estimated, without strong functional form assumptions. While the team faced some difficulties during the replication process due to a couple of outdated packages, the codes and the data from the original paper performed as intended and made our replication study a lot easier. Compared to the models used in [@epple2010new] for estimating the $r(v)$, our Gaussian Generalized Log Linear Model and Gradient Descent Log Linear Model had similar performances and the Generalized log linear model had similar supply and production function estimates.

References

UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project documentation built on March 17, 2020, 12:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project
Replication of 'A New Approach to Estimating the production function for housing'

In UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project: Replication of 'A New Approach to Estimating the production function for housing'

Loading the libraries directly used in this Rmd

Setting global options for the document

Reading the R scripts

Results

Regression estimates and plots

Supply and production function plots

Supply and production functions with 95% confidence band for GLM Log-linear model

Replication of non-linear regressor estimation using HNIP

Conclusion

References

R Package Documentation

Browse R Packages

We want your feedback!

UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project Replication of 'A New Approach to Estimating the production function for housing'

In UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project: Replication of 'A New Approach to Estimating the production function for housing'

Loading the libraries directly used in this Rmd

Setting global options for the document

Reading the R scripts

Results

Regression estimates and plots

Supply and production function plots

Supply and production functions with 95% confidence band for GLM Log-linear model

Replication of non-linear regressor estimation using HNIP

Conclusion

References

R Package Documentation

Browse R Packages

We want your feedback!

UW-MSDS-DATA-598-Reproducibility-WI20/dutta-gupta-manohar-prabhakar-rathod-replication-project
Replication of 'A New Approach to Estimating the production function for housing'