A Running Example {#example}

We demonstrate the ideas in this book through a running example: an analytics project that includes the development and deployment of a machine learning system. This chapter presents the background, requirements, and deliverables of that project.

The example is based on an entry-level Kaggle competition for predicting house prices[^kaggle-competition]. Using the same dataset, we build an end-to-end machine learning system that solves a pseudo-real-life problem. The project includes two parts: training a prediction model on historical data, and making predictions on unseen data.
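The two parts of the project can be sketched in a few lines of R. This is only a minimal illustration under stated assumptions: the data values are made up, `GrLivArea` stands in for the full set of house attributes, and a linear regression (`lm`) stands in for whatever model the project eventually uses.

```r
# Toy historical data. The column names mimic the competition dataset,
# but the values are invented for illustration only.
historical <- data.frame(
  GrLivArea = c(850, 1200, 1500, 1800, 2100, 2400),
  SalePrice = c(105000, 140000, 172000, 201000, 236000, 265000)
)

# Part 1: train a prediction model on historical data.
# A simple linear regression is an assumption, not the book's model.
model <- lm(SalePrice ~ GrLivArea, data = historical)

# Part 2: make predictions on unseen data (houses without a known SalePrice).
unseen <- data.frame(Id = c(101, 102), GrLivArea = c(1000, 2000))
unseen$SalePrice <- as.numeric(predict(model, newdata = unseen))

print(unseen)
```

Keeping training and prediction as separate steps is what later allows the prediction step to be re-run on its own.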

# Figure 1 - Predict the house sale price
knitr::include_graphics('./images/for-sale-ad.jpg', dpi = NA)

In its original form, the house prices competition serves as a playground for data scientists to hone their skills. The goal of the competition is to predict sale prices for 1,459 houses. The competition dataset has 81 columns, of which 73 are identifiable house attributes, termed amenities. Each participant submits (that is, uploads to Kaggle) a table with 1,459 rows and two columns: Id and SalePrice. Kaggle evaluates the accuracy of each submission using the root mean squared error (RMSE) and ranks the participants on the leaderboard by their scores.
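As a rough sketch of the submission format and the evaluation metric, the snippet below builds a two-column table and scores it with RMSE. The helper `rmse()` and all values are hypothetical; the true sale prices of the competition houses are held by Kaggle and are not available to participants.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# A submission is a table with two columns: Id and SalePrice.
# Three made-up rows stand in for the 1,459 rows of a real submission.
submission <- data.frame(
  Id = c(1, 2, 3),
  SalePrice = c(200000, 150000, 310000)
)

# Made-up "true" prices, used here only to demonstrate the scoring.
actual <- c(210000, 150000, 300000)
score <- rmse(actual, submission$SalePrice)

# Write the table in CSV form, ready to be uploaded.
submission_file <- tempfile(fileext = ".csv")
write.csv(submission, submission_file, row.names = FALSE)
```

The sketch applies RMSE to the raw prices. If the score is computed on transformed prices, for example their logarithm, the same function applies to the transformed values.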

See the full dataset description in the appendix.

tables$report_salient_amenities()

To emulate a real-world analysis project, we recast the competition setup as a business case. The deliverable of the business case is an automated valuation model (AVM). The AVM provides house price predictions for other real estate agency tools, such as a website or a real estate management system. The needs of the business case differ from those of the original data science competition. Two major factors are:

  1. The housing market changes over time; and
  2. The agency's database changes over time, e.g., new listings are added to the database.

Both of these factors require the analytic application to be re-runnable when the need arises. The first factor involves a trigger that periodically, say once a month, calls for a price update of all active listings. The second factor triggers a call when a new listing is to be added to the agency's database.
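The two triggers could be sketched as two small functions that share one fitted model. Everything here is an assumption made for illustration: the function names `update_active_listings()` and `add_listing()`, the `Status` and `PredictedPrice` columns, the toy data, and the use of `lm` as the model.

```r
# Toy model fitted on invented historical data.
historical <- data.frame(
  GrLivArea = c(850, 1200, 1500, 1800, 2100),
  SalePrice = c(105000, 140000, 172000, 201000, 236000)
)
model <- lm(SalePrice ~ GrLivArea, data = historical)

# Trigger 1 (periodic, e.g. monthly): update prices of all active listings.
update_active_listings <- function(model, database) {
  active <- database$Status == "active"
  database$PredictedPrice[active] <- as.numeric(
    predict(model, newdata = database[active, , drop = FALSE])
  )
  database
}

# Trigger 2 (event-driven): price a new listing and add it to the database.
add_listing <- function(model, database, new_listing) {
  new_listing$PredictedPrice <- as.numeric(predict(model, newdata = new_listing))
  rbind(database, new_listing)
}

# A hypothetical agency database with no predictions yet.
database <- data.frame(
  Id = c(1, 2, 3),
  GrLivArea = c(900, 1600, 2000),
  Status = c("active", "sold", "active"),
  PredictedPrice = NA_real_,
  stringsAsFactors = FALSE
)

database <- update_active_listings(model, database)
database <- add_listing(
  model, database,
  data.frame(Id = 4, GrLivArea = 1300, Status = "active", stringsAsFactors = FALSE)
)
```

Passing the model as an argument keeps both functions re-runnable: when the market changes, a refreshed model can be supplied without touching the trigger code.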

Similar to the competition, the solution is iterative.

[^kaggle-competition]: You can read more about it on Kaggle.



Kiwi-Random-House/R-Projects documentation built on Dec. 31, 2020, 2:10 p.m.