In Kiwi-Random-House/R-Projects: Building Analytic Apps with R

A Running Example {#example}

We demonstrate the ideas in this book by emulating an analytic project that includes the development and deployment of a machine learning system. This chapter presents the background, requirements, and deliverables of that project.

The example is based on an entry-level Kaggle competition for predicting house prices[^kaggle-competition]. Using the same dataset, we build an end-to-end machine learning system that solves a pseudo-real-life problem. The project includes two parts: training a prediction model on historical data, and making predictions on unseen data.

# Figure 1 - Predict the house sale price
knitr::include_graphics('./images/for-sale-ad.jpg', dpi = NA)

In its original form, the house prices competition serves as a playground for data scientists to hone their skills. The goal of the competition is to predict sale prices for 1,459 houses. The competition features a dataset with 81 columns, of which 73 are identifiable house attributes, termed amenities. Each participant submits, i.e., uploads to Kaggle, a table with 1,459 rows and two columns: Id and SalePrice. Kaggle evaluates the accuracy of each submission using the root mean squared error (RMSE) and ranks participants on the leaderboard by their submission scores.
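
As a minimal sketch of the evaluation step described above, the snippet below computes a plain RMSE and assembles a table in the two-column submission layout. The `rmse()` helper and all numbers are hypothetical and are not taken from the competition data; the competition's evaluation page remains the authority on the exact variant of the metric (for instance, whether prices are log-transformed first).

```r
# Root mean squared error between observed and predicted values.
# `rmse` is an illustrative helper, not part of the book's code base.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical toy values, invented for this example.
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 330000)
rmse(actual, predicted)  # 10000

# A submission has one row per house and two columns: Id and SalePrice.
submission <- data.frame(Id = 1:3, SalePrice = predicted)
```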

See the full dataset description in the appendix.

tables$report_salient_amenities()

To emulate a real-world analysis project, we transmute the competition setup into a business case. The deliverable of the business case is an automated valuation model (AVM). The AVM provides house price predictions for other tools used by the real estate agency, such as a website or a real estate management system. Major differences between the needs of the original data science competition and those of the business case include:

- An essential feature of an AVM is the ability to quantify the error variance. Given the variance and a point estimate of a sale price, one may calculate the estimated price range, i.e., a prediction interval.
- Another desired feature of an AVM is the ability to explain the relationships between the estimated sale prices and house amenities. Thus, the evaluation metric should measure both the accuracy and the complexity of the prediction model.
- Among the 73 available amenities, some must be included in the model regardless of their predictive power. The required amenities are dictated by appraisal standards and by house buyers' common choices, such as the number of rooms and bathrooms.
- A house might be sold more than once.

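
To make the first difference concrete, here is a minimal sketch of turning a point estimate and an error variance into a price range. It assumes approximately normal prediction errors, which the text does not state; `price_range()` and its arguments are hypothetical names introduced for illustration.

```r
# Price range from a point estimate and an error variance.
# Assumption: prediction errors are approximately normal.
price_range <- function(point_estimate, error_variance, level = 0.95) {
  stopifnot(error_variance >= 0, level > 0, level < 1)
  z <- qnorm(1 - (1 - level) / 2)          # two-sided normal quantile
  half_width <- z * sqrt(error_variance)   # quantile times standard deviation
  c(lower = point_estimate - half_width,
    upper = point_estimate + half_width)
}

# Hypothetical numbers: a 200,000 estimate with an error
# standard deviation of 10,000 (variance 1e8).
price_range(200000, 1e8)
```

A wider interval signals a less certain valuation, which is information a single predicted price cannot convey.
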
In contrast to the competition, the solution is not a one-time prediction job. Instead, the analytic application has to be re-trainable and re-runnable on demand.

Consider the following factors:

1. The housing market changes over time; and
2. The agency's database changes over time, e.g., new listings are added to the database.

Both of these factors require the analytic application to be re-runnable when the need arises. The first factor involves a trigger that periodically, say once a month, calls for a price update of all active listings. The second factor triggers a call when a new listing is to be added to the agency's database.
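
The two triggers can be sketched as a single dispatch function that returns the listings to be priced on each call. This is only an illustration: the `listings` table, its column names, and `ids_to_price()` are hypothetical and do not describe the agency's actual database or the application's interface.

```r
# Hypothetical listing table; column names are illustrative only.
listings <- data.frame(
  id     = 1:3,
  active = c(TRUE, TRUE, FALSE)
)

# Return the ids of the listings that a given trigger asks to price.
ids_to_price <- function(listings,
                         trigger = c("periodic", "new_listing"),
                         new_ids = integer()) {
  trigger <- match.arg(trigger)
  switch(trigger,
    # Periodic trigger (say, monthly): update all active listings.
    periodic    = listings$id[listings$active],
    # New-listing trigger: price only the listings just added.
    new_listing = intersect(new_ids, listings$id)
  )
}

ids_to_price(listings, "periodic")
ids_to_price(listings, "new_listing", new_ids = 3L)
```

Keeping both triggers behind one entry point means that a scheduler and a database hook can call the same code path.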

Similarly to the competition, the solution is iterative.

[^kaggle-competition]: You can read more about it on Kaggle.

Kiwi-Random-House/R-Projects documentation built on Dec. 31, 2020, 2:10 p.m.
