knitr::opts_chunk$set(cache=TRUE, fig.width=7, fig.height=5.5)
options(width=120)
Machine dependent: set cores equal to the number of cores of your machine that you want to use. Some routines (like BackFitting) can make use of more than one thread. Be careful: if R is linked to BLAS / LAPACK routines which are themselves parallelized and you also parallelize at the top level, this may destroy the performance of the machine. Set the value of cores so that no more threads are started than the machine can accommodate.
cores=2
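If you prefer not to hard-code the value, a sensible default can be derived with the base parallel package; a minimal sketch, leaving one core free for the rest of the system:
# detectCores() may return NA on some platforms, hence the fallback to 1
cores <- max(1, parallel::detectCores() - 1, na.rm=TRUE)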
In the process of constructing a housing price index for the Basque Autonomous Community (CAPV) and some of its subareas using real estate data from web sources (mainly idealista.com), it has been necessary to write a number of functions, which are now collected in package ipv.
These are spur-of-the-moment functions, which are usable and have in fact supported our work; but they lack the degree of polish and checking that one would expect from a carefully planned and designed package.
The data have been collected in two .rda files, venta.dep.rda and alquiler.dep.rda. We have cleaned the raw data as made available to us by removing anomalous or repeated observations, etc. The files reside in the data folder of the package and can be loaded by typing:
comienzo <- Sys.time()
library(ipv)
data("venta.dep")
data("alquiler.dep")
(Beware: if you are using a publicly available version of this package, such data may have been obscured, distorted and shuffled to preserve confidentiality. You will not be able to reproduce the results in this vignette, although the data provided will suffice to test the use of functions.)
Other than package ipv we will need the following packages, all of which can be obtained from CRAN:
library(ggplot2)
library(mgcv)
library(tidyverse, verbose=FALSE)
library(sp)
library(rgdal)
library(rgeos)
library(spgwr)
library(zoo)
The package targets two types of models for housing price index estimation:
Global models plus non-parametric trend: these are models where the parameters of the hedonic part are common to the whole area of estimation. Location does matter in prices, but it enters the model via a proxy like SUBREGION, AF (functional area), etc.
The general form of a global model would be:
$$ \log(\textrm{Precio/m2})_{it} = \sum_{j=1}^K\beta_jx_{ij} + \sum_{l=1}^L\beta_{l}\textrm{loc}_{il} + s(t) + \epsilon_{i} $$
$\textrm{loc}_{il}$ is a dummy which takes value 1 if observation $i$ belongs to region $l$, zero otherwise; $\beta_l$ measures the impact on the response of the fact that the dwelling is in region $l$; and $s(t)$ is a non-parametric function of time, global to all regions, from which the index is computed.
Geographically weighted regression (GWR) models plus non-parametric trend: these are models where the parameters change across space. The effect on the price of a dwelling of including a garage, for instance, is not the same everywhere, and the model takes this into account. The general form of a GWR with non-parametric trend is: $$ \log(\textrm{Precio/m2})_{it} = \sum_{j=1}^K\beta_j(u_i,v_i)x_{ij} + s(t) + \epsilon_{i} $$ The location dummies are no longer present, as the spatial effect is taken up by the dependence of the betas of the hedonic part on the coordinates $(u_i,v_i)$.
Both types of models have a hedonic part for quality correction and a non-parametric trend, from which the price index is derived. Although not imperative, it makes more sense to take prices in the log scale and per unit of surface (square foot or square meter) of the dwelling.
We prepare the data by computing the log price per unit of surface and an auxiliary time variable x (days elapsed from the date of the oldest observation in the sample), and by filtering out observations for which no geographical information is available in coordinates DX_ETRS89 and DY_ETRS89:
datos <- alquiler.dep %>%
  mutate(logpm2 = log(IMP_CREACION / NUM_SUPERF),
         x = as.integer(FEC_CREACION - min(FEC_CREACION))) %>%
  filter(!is.na(DX_ETRS89) & !is.na(DY_ETRS89)) %>%
  as.data.frame()
Geographical coordinates are not needed in a global model, but are required for a GWR model.
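The GWR-related functions below take the coordinate column names through their coords= argument, so datos can stay a plain data frame. Should an actual spatial object be needed (for instance to call spgwr directly), here is a minimal sketch of the promotion with sp; the CRS is an assumption on our part:
# Promote a copy of datos to a SpatialPointsDataFrame
spdatos <- datos
coordinates(spdatos) <- c("DX_ETRS89","DY_ETRS89")
proj4string(spdatos) <- CRS("+init=epsg:25830")   # ETRS89 / UTM zone 30N (assumed)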
Global models are estimated using function gam. For instance,
mod1 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
              NUM_SUPERF + s(x, bs="cr", k=10),
            data=datos)
It is up to the analyst to decide which variables to include. Clearly, we want a model as descriptive as possible, incorporating every variable which conceivably describes the quality of the dwellings; however, too ambitious a model will force us to discard many observations for which the included variables are not observed. The k parameter sets the basis dimension of the smooth, an upper bound on its effective degrees of freedom (EDF): we have found one or two EDF per year to be plenty. s(x, bs="cr", k=10) is a cubic regression spline with the prescribed degrees of freedom.
Results can be inspected by:
summary(mod1)
anova(mod1)
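If in doubt about whether the basis dimension k is large enough, mgcv offers diagnostics; a short sketch:
# gam.check() reports, among other things, a k-index and p-value per smooth,
# flagging basis dimensions which may be too small; plot() shows the fitted s(x)
gam.check(mod1)
plot(mod1, select=1, shade=TRUE)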
The object returned by gam is all we need to compute the index, using function ConsInd:
ind <- ConsInd(modelo=mod1,
               basedate="2007-12-31",
               tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
               fechas=as.Date(datos$FEC_CREACION),
               conf=0.95)
ConsInd returns (invisibly) the index, as a time series with daily dates.
class(ind)
head(ind)
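A sketch, assuming (as class(ind) above shows) that the index is a zoo series: the daily values can be aggregated, for instance into monthly means,
# Monthly means of the daily index, via zoo's aggregate with a yearmon grouping
ind.monthly <- aggregate(ind, as.yearmon, mean)
head(ind.monthly)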
The method used fits a spline to a number of points. Each time we add new observations to the sample, normally "to the right" (i.e. the most recent data), a global fit is recomputed. This has the potential of changing the values of the index for previous periods, which need not be a bad thing: it is quite common for past figures to be revised in the light of new data.
If, however, we absolutely want the previously computed values of the index to remain unchanged, we may pass a copy of such a 'frozen' index as argument congelado, and the frozen values will be inserted in place of the newly computed ones. For instance,
datos1 <- subset(datos, datos$FEC_CREACION < as.Date('2016-01-01'))
mod1.1 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
                NUM_SUPERF + s(x, bs="cr", k=10),
              data=datos1)
mod1.2 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
                NUM_SUPERF + s(x, bs="cr", k=10),
              data=datos)
Here we estimate the index with data before '2016-01-01':
ind1.1 <- ConsInd(modelo=mod1.1,
                  basedate="2007-12-31",
                  tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
                  fechas=as.Date(datos1$FEC_CREACION),
                  conf=0.95)
head(ind1.1)
head(ind)
ind1.2 <- ConsInd(modelo=mod1.2,        # model estimated with the full sample
                  basedate="2007-12-31",
                  tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
                  congelado=ind1.1,
                  fechas=as.Date(datos$FEC_CREACION),
                  conf=0.95)
head(ind1.2)
tail(ind1.2)
tail(ind)
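A quick sanity check, assuming the indices are zoo series: over the dates covered by ind1.1, the frozen index ind1.2 should reproduce it exactly.
# Compare the frozen stretch of ind1.2 with ind1.1 over their common dates
common <- index(ind1.1)
all.equal(coredata(window(ind1.2, common)),
          coredata(window(ind1.1, common)))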
We might want to compute indices for different zones. Function IndZonas is meant to simplify the task: rather than a separate invocation of ConsInd for each zone, we can make a single invocation of IndZonas, passing an argument zonas whose levels define the different zones.
Hence, if we want a separate index for each province,
ind.prov <- IndZonas(datos, base="2007-12-31", zonas="PROVINCIA", frm=formula(mod1))
We have taken the model specification from mod1; we could have typed it directly.
sel <- (datos$FEC_CREACION > as.Date("2007-11-30"))   # Only after this date is there enough data
datos <- datos[sel,]
ind.prov <- IndZonas(datos, zonas="PROVINCIA", base="2007-12-31",
                     frm=mod1$formula)
ind.af <- IndZonas(datos, zonas="AF", base="2007-12-31",
                   frm=update(mod1$formula, . ~ . - SUBREGION))
The results in ind.prov are lists of indices which can be represented in separate panels. Plotting several indices is facilitated by function mggplot, which takes the object returned by IndZonas and constructs the plot. Below we obtain the names of the plots from the names in the list ind.prov, but we could supply different names in the argument columnas.
mggplot(ind.prov, titulo="Indice precios alquiler")
mggplot(ind.af, titulo="Indice precios alquiler")
We might opt for plotting all indices on a single scale:
mggplot(ind.prov, tipo="multiple", titulo="Indice precios alquiler")
mggplot(ind.af, tipo="multiple", titulo="Indice precios alquiler")
The dataframes venta.dep and alquiler.dep include UTM projected coordinates (DX_ETRS89, DY_ETRS89 and the corresponding DX_ED50 and DY_ED50). We can set a desired bandwidth directly or select it by cross-validation, which with objects of the size of the ones we have here would be quite expensive. We will set a bandwidth of 5000 m; this is the radius of the circle around each point within which observations receive substantial weight in the local fitting procedure.
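For reference, cross-validated bandwidth selection is available in spgwr through gwr.sel; on a modest random subsample the cost becomes bearable. A sketch (not run here), using the parametric part of the hedonic model:
# Cross-validated bandwidth on a 10% subsample; illustrative only, the
# vignette proceeds with the fixed 5000 m bandwidth instead
sub   <- datos[sample(nrow(datos), nrow(datos) %/% 10), ]
bw.cv <- gwr.sel(logpm2 ~ ROOMNUMBER + HASLIFT + IND_GARAJE + NUM_SUPERF,
                 data=sub,
                 coords=as.matrix(sub[, c("DX_ETRS89","DY_ETRS89")]))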
The backfitting algorithm alternates the spatial and time adjustments until convergence. To reduce computation time, for illustration a subsample can be selected here by uncommenting the line which sets sel. Results with the full sample will be different.
frm <- formula(mod1)
frm.param <- update(frm, ~ . - s(x, bs="cr", k=10))
smooth.term <- 's(x,bs="cr",k=10)'
sel <- rep(TRUE, nrow(datos))            # use the full sample by default
# sel <- ((1:nrow(datos)) %% 5) == 0     # One observation out of every 5
if (!file.exists("bfg.rda")) {
  indice.bfg <- BackFitting(frm.param=frm.param,
                            smooth.term=smooth.term,
                            var.loc="SUBREGION",
                            coords=c("DX_ETRS89","DY_ETRS89"),
                            datos=datos[sel,],
                            cores=cores,
                            var.fecha='FEC_CREACION',
                            baseday='2008-01-01')
  save(indice.bfg, file="bfg.rda")
} else {
  load("bfg.rda")
}
The result is
plot(indice.bfg)
The previous result corresponds to a model in which the parametric part displays spatial variation (accounted for by GWR) while the non-parametric part (which gives rise to the index) is common to all the observations. Function BackFittingLocal affords the estimation of indices at specific points, whose coordinates are given in the argument cal.pts. A second bandwidth, bws, is specified, which need not coincide with the bw used for the GWR: the former gives the spatial scope of observations which are deemed to share a time trend, while the latter gives the scope of observations which are deemed to share a similar valuation of attributes.
We will select, as calculation points for the local indices, the centroids of neighbourhoods which appear to have registered a different evolution:
barrios <- c("Indautxu", "Abando - Albia", "Deusto")
cal.pts <- matrix(0, length(barrios), 2)
b <- venta.dep[venta.dep$BARRIO %in% barrios, ]
dimnames(cal.pts) <- list(barrios, NULL)
for (i in barrios) {
  cal.pts[i,] <- apply(b[b$BARRIO==i, c("DX_ETRS89","DY_ETRS89")],
                       2, FUN="median", na.rm=TRUE)
}
cal.pts
xy <- data.frame(ID=c("PlazaEnsanche","GeneralEguia"),
                 X=c(-2.931413, -2.946283),
                 Y=c(43.263538, 43.259257))
coordinates(xy) <- c("X","Y")
proj4string(xy) <- CRS("+proj=longlat +datum=WGS84")
res <- spTransform(xy, CRS("+proj=utm +zone=30 +ellps=WGS84"))
res <- as.data.frame(res)
cal.pts <- as.matrix(res[,c("X","Y")])
rownames(cal.pts) <- res[,"ID"]
We can now compute several local indices as follows:
#
# Remove observations far away from Bilbao
#
sel <- (datos$DX_ETRS89 > 504000) & (datos$DX_ETRS89 < 506000)
sel <- sel & (datos$DY_ETRS89 > 4788000) & (datos$DY_ETRS89 < 4791000)
sel <- sel & datos$FEC_CREACION > as.Date("2010-01-31")   # Only where there are enough data
lista.ind <- BackFittingLocal(frm.param=formula(logpm2 ~ ROOMNUMBER + HASLIFT +
                                                  IND_GARAJE + NUM_SUPERF),
                              smooth.term='s(x,bs="cr",k=10)',
                              cal.pts=cal.pts,
                              datos=datos[sel,],
                              coords=c("DX_ETRS89","DY_ETRS89"),
                              var.fecha='FEC_CREACION',
                              baseday="2010-12-31",
                              cores=cores,
                              bw=500, bws=200, tol=0.005)
names(lista.ind[[2]]) <- rownames(lista.ind[[1]])
p <- mggplot(lista.ind[[2]], columnas=rownames(lista.ind[[1]]), tipo="multiple")
p <- p + ggtitle("Local indices", subtitle=paste("Base: ", "31-12-2010", " = 100"))
# pdf(file="localindices.pdf")
print(p)
# scratch <- dev.off()
An alternative way to compute local indices is time slicing: for each location, a separate model is fitted to the data in each slice of time and used to predict the price of a reference dwelling, the sequence of predictions tracing the index. We select data in the Bilbao area, from 2012 onwards (observations before that date are scarce).
The points for which we want predictions must be in the form of a data frame with complete values of the variables used in the model. The easiest way to construct values for the argument cal.pts below is to take arbitrary rows of datos (one per calibration point), replace the values of the variables with the ones we want, and then set the coordinates conveniently. As a "median house" for this area we use a flat with three bedrooms, garage space and elevator.
median.houses <- datos[1000:1001,]
median.houses$ROOMNUMBER <- 3
median.houses$HASLIFT <- factor("SI", levels=levels(datos$HASLIFT))
median.houses$NUM_SUPERF <- 85
median.houses
Now we set the coordinates:
median.houses <- median.houses[, -match(c("LAT","LNG"), colnames(median.houses))]
coordinates(median.houses) <- c("DX_ETRS89","DY_ETRS89")
We will fit a model to each slice of time, and obtain predictions of the prices for the median houses:
preds <- SlicedIndex(cal.pts = median.houses,
                     datos = datos,
                     coords = c("DX_ETRS89","DY_ETRS89"),
                     from = as.yearqtr("2015 Q1"),
                     to = as.yearqtr("2019 Q1"),
                     bw = 1000,
                     frm = formula(logpm2 ~ ROOMNUMBER + NUM_SUPERF +
                                     HASLIFT + IND_GARAJE),
                     date = "FEC_CREACION",
                     slice.by = as.yearqtr)
plot(exp(preds))
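For intuition, each slice amounts to one geographically weighted fit plus a prediction at the calibration points. A rough sketch of a single slice done by hand with spgwr::gwr (the pred column of the output is an assumption about spgwr's naming):
# Fit a GWR on 2015 Q1 data and predict the log price/m2 of the median
# houses; SlicedIndex loops this over all the slices
slice <- datos[as.yearqtr(datos$FEC_CREACION) == as.yearqtr("2015 Q1"), ]
fit <- gwr(logpm2 ~ ROOMNUMBER + NUM_SUPERF + HASLIFT + IND_GARAJE,
           data=slice,
           coords=as.matrix(slice[, c("DX_ETRS89","DY_ETRS89")]),
           bandwidth=1000,
           fit.points=median.houses,
           predictions=TRUE)
fit$SDF$pred    # predicted log prices at the two calibration points (assumed column name)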
final <- Sys.time()
final - comienzo