library(here)
library(tidyverse)
library(MASS)
source(here("R", "cleaning.R"), local = TRUE)

Pricing Model

Finally, we would like develop a model to predict the price of a listing as well as determine the effect of the co-variates on that price. The first model considered was: Price ~ availability + neighborhood + room type + minimum nights + number of reviews + reviews per month. This first model presented many problems, including a small $R^2$ value of 0.099. Also, observing the diagnostic plots we can see that this model also posed problems of normality, homogeneity of the variance and some really influential points.

mod1 <- lm(price ~ avail + num_reviews + rpm + nb + room_type + min_nights, data = listings)
par(mfrow = c(2,2))
plot(mod1, which = c(1,2,5))

Seeing these plots we can also identify several listings that are clear outliers. These listings had prices above \$1000. It is important to note that in the whole data set only r sum(listings$price > 1000) listings have prices above that price range. For exploration purposes, the three influential observations were removed from the model and then fitted again and the results were that minimum nights was no longer a significant variable in the model.

In addition, we considered doing a Box-Cox transformation in order to improve the goodness-of-fit of the model. The following displays the log-likelihood of the box-cox fit:

bc <- boxcox(lm(price ~ avail + num_reviews + rpm + nb + room_type, 
          data = listings), lambda = seq(-0.5,0,by = 0.01))

We can see from the confidence interval that we can use the value of $\lambda = -0.4$ to transform price. We then fitted the model again but with the transformed price. Observing the diagnostic plots we can see that the original problems haven been fixed. The $R^2$ was also increased to 0.3139.

listings$price.bc <- listings$price^-0.4
mod2 <- lm(price.bc ~ avail + num_reviews + rpm + nb + room_type + min_nights, data = listings)
par(mfrow = c(2,2))
plot(mod2, which = c(1,2,5))

However, in this model minimum nights was no longer a significant variable. We decided to use leaps to select the best model, observing both AIC and BIC. The result was that the best model (with the lowest AIC and BIC) was the following:

Price^-0.4^ ~ availability + neighborhood + room type + number of reviews + reviews per month.

With the following summary statistics:

mod3 <- lm(price.bc ~ avail + num_reviews + rpm + nb + room_type, data = listings)
summary(mod3)

Due to the transformation of price most of the interpretability of the coefficients on price is gone, but we can still interpret their effect on the transformed price. We can see, for example, that of all the Neighborhood dummy variables, San Marco has the most negative coefficient. This means that on average San Marco has the lowest transformed price, if the remaining variables remain constant. Since the transformation on price is the reciprocal of price, this result translates to a Neighborhood with higher prices on average. For example, Lido has the most positive coefficient, so we would expect it, on average, to have the lowest prices.

This observation is also backed up by the ANOVA presented in the previous section. Furthermore, we can observe this affect on neighborhood in the scatterplots of the predicted price vs. an other covariate. In particular, a scatterplot of Predicted Price vs. Reviews Per Month we can see a clear stratification of the prices by neighborhood:

listings <- listings %>% 
  mutate(preds = fitted(mod3)^(-1/0.4))

ggplot(listings, aes(x = rpm, y = preds)) +
  geom_jitter(aes(color = nb), alpha = 0.5) +
  ylab('Predicted Price') +
  xlab('Reviews Per Month') +
  ggthemes::theme_hc()

San Marco in blue has a very high price independent of the co-variate, while Lido in green has a lower price.

It is also important to note that the highest price estimated by the model is of \$r round(max(fitted.values(mod3)^(-1/0.4)),2), which is lower than r round(mean(listings$price > max(fitted.values(mod3)^(-1/0.4)))*100,2)% of the original prices. In other words, the model is clearly underestimating the prices although not to a great extent.



moecampos/425-project documentation built on May 14, 2019, 2:03 a.m.