knitr::opts_chunk$set(cache=TRUE, fig.width=7, fig.height=5.5)
options(width=120)
Machine dependent: set cores equal to the number of cores of your machine that you want to use. Some routines (like BackFitting) can make use of more than one thread. Be careful: if R is linked to BLAS / LAPACK routines which are themselves parallelized and you also parallelize at the top level, this may destroy the performance of the machine. Set the value of cores so that no more threads are started than the machine can accommodate.
cores=2
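If you prefer not to hard-code the value, a sensible default can be derived with the base parallel package; a minimal sketch, leaving one core free for the rest of the system:
# detectCores() may return NA on some platforms, hence the fallback to 1
cores <- max(1, parallel::detectCores() - 1, na.rm=TRUE)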
In the process of constructing a housing price index for the Basque Autonomous Community (CAPV) and some of its subareas using real estate data from web sources (mainly idealista.com), it has been necessary to write a number of functions, which are now collected in package ipv.
These are spur-of-the-moment functions, which are usable and have in fact supported our work; but they lack the degree of polish and checking that one would expect from a carefully planned and designed package.
The data have been collected in two .rda files, venta.dep.rda and alquiler.dep.rda. We have cleaned the raw data as made available to us by removing anomalous or repeated observations, etc. The files reside in the data folder of the package and can be loaded by typing:
comienzo <- Sys.time()
library(ipv)
data("venta.dep")
data("alquiler.dep")
(Beware: if you are using a publicly available version of this package, such data may have been obscured, distorted and shuffled to preserve confidentiality. You will not be able to reproduce the results in this vignette, although the data provided will suffice to test the use of functions.)
Other than package ipv we will need the following packages, all of which can be obtained from CRAN:
library(ggplot2)
library(mgcv)
library(tidyverse, verbose=FALSE)
library(sp)
library(rgdal)
library(rgeos)
library(spgwr)
library(zoo)
The package targets two types of models for housing price index estimation:
Global models plus non-parametric trend: these are models where the parameters of the hedonic part are common to the whole area of estimation. Location does matter in prices, but it enters the model via a proxy like SUBREGION, AF (functional area), etc.
The general form of a global model would be:
$$ \log(\textrm{Precio/m2})_{it} = \sum_{j=1}^K\beta_jx_{ij} + \sum_{l=1}^L\beta_{l}\textrm{loc}_{il} + s(t) + \epsilon_{i} $$
$\textrm{loc}_{il}$ is a dummy which takes value 1 if observation $i$ belongs to region $l$, zero otherwise; $\beta_l$ measures the impact on the response of the fact that the dwelling is in region $l$; and $s(t)$ is a non-parametric function of time, global to all regions, from which the index is computed.
Geographically weighted regression (GWR) models plus non-parametric trend: these are models where the parameters change across space. The effect on the price of a dwelling of including a garage, for instance, is not the same everywhere, and the model takes this into account. The general form of a GWR with non-parametric trend is: $$ \log(\textrm{Precio/m2})_{it} = \sum_{j=1}^K\beta_j(u_i,v_i)x_{ij} + s(t) + \epsilon_{i} $$ The location dummies are no longer present, as the spatial effect is taken up by the dependence of the betas of the hedonic part on the coordinates $(u_i,v_i)$.
Both types of models have a hedonic part for quality correction and a non-parametric trend, from which the price index is derived. Although not imperative, it makes more sense to take prices in the log scale and per unit of surface (square foot or square meter) of the dwelling.
We prepare the data by computing the log price per unit of surface and an auxiliary time variable x (days elapsed from the date of the oldest observation in the sample), and by filtering out observations for which no geographical information is available in coordinates DX_ETRS89 and DY_ETRS89:
datos <- alquiler.dep %>%
  mutate(logpm2 = log(IMP_CREACION / NUM_SUPERF),
         x = as.integer(FEC_CREACION - min(FEC_CREACION))) %>%
  filter(!is.na(DX_ETRS89) & !is.na(DY_ETRS89)) %>%
  as.data.frame()
Geographical coordinates are not needed in a global model, but are required for a GWR model.
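The GWR-related functions below take the coordinate column names through their coords= argument, so datos can stay a plain data frame. Should an actual spatial object be needed (for instance to call spgwr directly), here is a minimal sketch of the promotion with sp; the CRS is an assumption on our part:
# Promote a copy of datos to a SpatialPointsDataFrame
spdatos <- datos
coordinates(spdatos) <- c("DX_ETRS89","DY_ETRS89")
proj4string(spdatos) <- CRS("+init=epsg:25830")   # ETRS89 / UTM zone 30N (assumed)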
Global models are estimated using function gam. For instance,
mod1 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
              NUM_SUPERF + s(x, bs="cr", k=10),
            data=datos)
It is up to the analyst to decide which variables to include. Clearly, we want a model as descriptive as possible, incorporating every variable which conceivably describes the quality of the dwellings; however, too ambitious a model will force us to discard many observations for which the included variables are not observed. The k parameter sets the basis dimension of the smooth, an upper bound on its effective degrees of freedom (EDF): we have found one or two EDF per year to be plenty. s(x, bs="cr", k=10) is a cubic regression spline with the prescribed degrees of freedom.
Results can be inspected by:
summary(mod1)
anova(mod1)
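If in doubt about whether the basis dimension k is large enough, mgcv offers diagnostics; a short sketch:
# gam.check() reports, among other things, a k-index and p-value per smooth,
# flagging basis dimensions which may be too small; plot() shows the fitted s(x)
gam.check(mod1)
plot(mod1, select=1, shade=TRUE)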
The object returned by gam is all we need to compute the index, using function ConsInd:
ind <- ConsInd(modelo=mod1,
               basedate="2007-12-31",
               tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
               fechas=as.Date(datos$FEC_CREACION),
               conf=0.95)
ConsInd returns (invisibly) the index, as a time series with daily dates.
class(ind)
head(ind)
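A sketch, assuming (as class(ind) above shows) that the index is a zoo series: the daily values can be aggregated, for instance into monthly means,
# Monthly means of the daily index, via zoo's aggregate with a yearmon grouping
ind.monthly <- aggregate(ind, as.yearmon, mean)
head(ind.monthly)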
The method used fits a spline to a number of points. Each time we add new observations to the sample, normally "to the right" (i.e. the most recent data), a global fit is recomputed. This has the potential of changing the values of the index for previous periods, which need not be a bad thing: it is quite common for past figures to be revised in the light of new data.
If, however, we absolutely want the previously computed values of the index to remain unchanged, we may pass a copy of such a 'frozen' index as argument congelado, and the frozen values will be inserted in place of the newly computed ones. For instance,
datos1 <- subset(datos, datos$FEC_CREACION < as.Date('2016-01-01'))
mod1.1 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
                NUM_SUPERF + s(x, bs="cr", k=10),
              data=datos1)
mod1.2 <- gam(logpm2 ~ SUBREGION + ROOMNUMBER + HASLIFT + IND_GARAJE +
                NUM_SUPERF + s(x, bs="cr", k=10),
              data=datos)
Here we estimate the index with data before '2016-01-01':
ind1.1 <- ConsInd(modelo=mod1.1,
                  basedate="2007-12-31",
                  tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
                  fechas=as.Date(datos1$FEC_CREACION),
                  conf=0.95)
head(ind1.1)
head(ind)
ind1.2 <- ConsInd(modelo=mod1.2,        # model estimated with the full sample
                  basedate="2007-12-31",
                  tit="Índice precio vivienda CAPV.\nEf. espacial: SUBREGION",
                  congelado=ind1.1,
                  fechas=as.Date(datos$FEC_CREACION),
                  conf=0.95)
head(ind1.2)
tail(ind1.2)
tail(ind)
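A quick sanity check, assuming the indices are zoo series: over the dates covered by ind1.1, the frozen index ind1.2 should reproduce it exactly.
# Compare the frozen stretch of ind1.2 with ind1.1 over their common dates
common <- index(ind1.1)
all.equal(coredata(window(ind1.2, common)),
          coredata(window(ind1.1, common)))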
We might want to compute indices for different zones. Function IndZonas is meant to simplify the task: rather than a separate invocation of ConsInd for each zone, we can make a single invocation of IndZonas, passing an argument zonas whose levels define the different zones.
Hence, if we want a separate index for each province,
ind.prov <- IndZonas(datos, base="2007-12-31", zonas="PROVINCIA", frm=formula(mod1))
We have taken the model specification from mod1; we could have typed it directly.
sel <- (datos$FEC_CREACION > as.Date("2007-11-30"))   # Only after this date is there enough data
datos <- datos[sel,]
ind.prov <- IndZonas(datos, zonas="PROVINCIA", base="2007-12-31",
                     frm=mod1$formula)
ind.af <- IndZonas(datos, zonas="AF", base="2007-12-31",
                   frm=update(mod1$formula, . ~ . - SUBREGION))
The results in ind.prov are lists of indices which can be represented in separate panels. Plotting several indices is facilitated by function mggplot, which takes the object returned by IndZonas and constructs the plot. Below we obtain the names of the plots from the names in the list ind.prov, but we could supply different names in the argument columnas.
mggplot(ind.prov, titulo="Indice precios alquiler")
mggplot(ind.af, titulo="Indice precios alquiler")
We might opt for plotting all indices on a single scale:
mggplot(ind.prov, tipo="multiple", titulo="Indice precios alquiler")
mggplot(ind.af, tipo="multiple", titulo="Indice precios alquiler")
The dataframes venta.dep and alquiler.dep include UTM projected coordinates (DX_ETRS89, DY_ETRS89 and the corresponding DX_ED50 and DY_ED50). We can set a desired bandwidth directly or select it by cross-validation, which with objects of the size of the ones we have here would be quite expensive. We will set a bandwidth of 5000 m; this is the radius of the circle around each point within which observations receive substantial weight in the local fitting procedure.
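For reference, cross-validated bandwidth selection is available in spgwr through gwr.sel; on a modest random subsample the cost becomes bearable. A sketch (not run here), using the parametric part of the hedonic model:
# Cross-validated bandwidth on a 10% subsample; illustrative only, the
# vignette proceeds with the fixed 5000 m bandwidth instead
sub   <- datos[sample(nrow(datos), nrow(datos) %/% 10), ]
bw.cv <- gwr.sel(logpm2 ~ ROOMNUMBER + HASLIFT + IND_GARAJE + NUM_SUPERF,
                 data=sub,
                 coords=as.matrix(sub[, c("DX_ETRS89","DY_ETRS89")]))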
The backfitting algorithm alternates the spatial and time adjustments until convergence. To reduce computation time, for illustration a subsample can be selected here by uncommenting the line which sets sel. Results with the full sample will be different.
frm <- formula(mod1)
frm.param <- update(frm, ~ . - s(x, bs="cr", k=10))
smooth.term <- 's(x,bs="cr",k=10)'
sel <- rep(TRUE, nrow(datos))            # use the full sample by default
# sel <- ((1:nrow(datos)) %% 5) == 0     # One observation out of every 5
if (!file.exists("bfg.rda")) {
  indice.bfg <- BackFitting(frm.param=frm.param,
                            smooth.term=smooth.term,
                            var.loc="SUBREGION",
                            coords=c("DX_ETRS89","DY_ETRS89"),
                            datos=datos[sel,],
                            cores=cores,
                            var.fecha='FEC_CREACION',
                            baseday='2008-01-01')
  save(indice.bfg, file="bfg.rda")
} else {
  load("bfg.rda")
}
The result is
plot(indice.bfg)
The previous result corresponds to a model in which the parametric part displays spatial variation (accounted for by GWR) while the non-parametric part (which gives rise to the index) is common to all the observations. Function BackFittingLocal affords the estimation of indices at specific points, whose coordinates are given in the argument cal.pts. A second bandwidth, bws, is specified, which need not coincide with the bw used for the GWR: the former gives the spatial scope of observations which are deemed to share a time trend, while the latter gives the scope of observations which are deemed to share a similar valuation of attributes.
We will select, as calculation points for the local indices, the centroids of neighbourhoods which appear to have registered a different evolution:
barrios <- c("Indautxu", "Abando - Albia", "Deusto")
cal.pts <- matrix(0, length(barrios), 2)
b <- venta.dep[venta.dep$BARRIO %in% barrios, ]
dimnames(cal.pts) <- list(barrios, NULL)
for (i in barrios) {
  cal.pts[i,] <- apply(b[b$BARRIO==i, c("DX_ETRS89","DY_ETRS89")],
                       2, FUN="median", na.rm=TRUE)
}
cal.pts
xy <- data.frame(ID=c("PlazaEnsanche","GeneralEguia"),
                 X=c(-2.931413, -2.946283),
                 Y=c(43.263538, 43.259257))
coordinates(xy) <- c("X","Y")
proj4string(xy) <- CRS("+proj=longlat +datum=WGS84")
res <- spTransform(xy, CRS("+proj=utm +zone=30 +ellps=WGS84"))
res <- as.data.frame(res)
cal.pts <- as.matrix(res[,c("X","Y")])
rownames(cal.pts) <- res[,"ID"]
We can now compute several local indices as follows:
#
# Remove observations far away from Bilbao
#
sel <- (datos$DX_ETRS89 > 504000) & (datos$DX_ETRS89 < 506000)
sel <- sel & (datos$DY_ETRS89 > 4788000) & (datos$DY_ETRS89 < 4791000)
sel <- sel & datos$FEC_CREACION > as.Date("2010-01-31")   # Only where there are enough data
lista.ind <- BackFittingLocal(frm.param=formula(logpm2 ~ ROOMNUMBER + HASLIFT +
                                                  IND_GARAJE + NUM_SUPERF),
                              smooth.term='s(x,bs="cr",k=10)',
                              cal.pts=cal.pts,
                              datos=datos[sel,],
                              coords=c("DX_ETRS89","DY_ETRS89"),
                              var.fecha='FEC_CREACION',
                              baseday="2010-12-31",
                              cores=cores,
                              bw=500, bws=200, tol=0.005)
names(lista.ind[[2]]) <- rownames(lista.ind[[1]])
p <- mggplot(lista.ind[[2]], columnas=rownames(lista.ind[[1]]), tipo="multiple")
p <- p + ggtitle("Local indices", subtitle=paste("Base: ", "31-12-2010", " = 100"))
# pdf(file="localindices.pdf")
print(p)
# scratch <- dev.off()
An alternative way to compute local indices is time slicing: for each location, a separate model is fitted to the data in each slice of time and used to predict the price of a reference dwelling, the sequence of predictions tracing the index. We select data in the Bilbao area, from 2012 onwards (observations before that date are scarce).
The points for which we want predictions must be in the form of a data frame with complete values of the variables used in the model. The easiest way to construct values for the argument cal.pts below is to take arbitrary rows of datos (one per calibration point), replace the values of the variables with the ones we want, and then set the coordinates conveniently. As a "median house" for this area we use a flat with three bedrooms, garage space and elevator.
median.houses <- datos[1000:1001,]
median.houses$ROOMNUMBER <- 3
median.houses$HASLIFT <- factor("SI", levels=levels(datos$HASLIFT))
median.houses$NUM_SUPERF <- 85
median.houses
Now we set the coordinates:
median.houses <- median.houses[, -match(c("LAT","LNG"), colnames(median.houses))]
coordinates(median.houses) <- c("DX_ETRS89","DY_ETRS89")
We will fit a model to each slice of time, and obtain predictions of the prices for the median houses:
preds <- SlicedIndex(cal.pts = median.houses,
                     datos = datos,
                     coords = c("DX_ETRS89","DY_ETRS89"),
                     from = as.yearqtr("2015 Q1"),
                     to = as.yearqtr("2019 Q1"),
                     bw = 1000,
                     frm = formula(logpm2 ~ ROOMNUMBER + NUM_SUPERF +
                                     HASLIFT + IND_GARAJE),
                     date = "FEC_CREACION",
                     slice.by = as.yearqtr)
plot(exp(preds))
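For intuition, each slice amounts to one geographically weighted fit plus a prediction at the calibration points. A rough sketch of a single slice done by hand with spgwr::gwr (the pred column of the output is an assumption about spgwr's naming):
# Fit a GWR on 2015 Q1 data and predict the log price/m2 of the median
# houses; SlicedIndex loops this over all the slices
slice <- datos[as.yearqtr(datos$FEC_CREACION) == as.yearqtr("2015 Q1"), ]
fit <- gwr(logpm2 ~ ROOMNUMBER + NUM_SUPERF + HASLIFT + IND_GARAJE,
           data=slice,
           coords=as.matrix(slice[, c("DX_ETRS89","DY_ETRS89")]),
           bandwidth=1000,
           fit.points=median.houses,
           predictions=TRUE)
fit$SDF$pred    # predicted log prices at the two calibration points (assumed column name)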
final <- Sys.time()
final - comienzo