We demonstrate the knowledge in this book by emulating an analytic project that includes the development and deployment of a machine learning system. This chapter presents the background, requirements, and deliverables of the analytic project.
The example is based on an entry-level Kaggle competition for predicting house prices[^kaggle-competition]. Using the same dataset, we build an end-to-end machine learning system that solves a pseudo-real-life problem. The project includes two parts: training a prediction model on historical data, and making predictions on unseen data.
# Figure 1 - Predict the house sale price knitr::include_graphics('./images/for-sale-ad.jpg', dpi = NA)
Originally, the house prices competition serves as a playground for data scientists to hone their skills. The goal of the competition is to predict sales prices for 1,459 houses. The competition features a dataset with 81 columns in which 73 are identifiable house attributes, termed amenities. Each participant submits, i.e., uploads to Kaggle, a table with 1,459 rows and two columns: Id and SalePrice. Kaggle evaluates the accuracy of each solution, based on RMSE, and ranks each participant on the leader-board in comparison to other competitors' submissions scores.
See full dataset description at the appendix
tables$report_salient_amenities()
To emulate a real-work analysis project, we transmute the competition setup to a business case setup. The deliverable of the business case is an automated valuation model (AVM). The AVM provides house prices predictions for other real estate agencies tools, such as a website or a real estate management system. Some major differences in needs between the original setting of the data science competition and the business case include:
Consider the following factors:
Both of these factors require the analytic application to be re-runnable when the need arises. The first factor involves a trigger that periodically, say once a month, calls for a price update of all active listings. The second factor triggers a call when a new listing is to be added to the agency's database.
Similarly, to the competition, the solution is iterative.
[^kaggle-competition]: You can read more about it on Kaggle.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.