knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

As an R user, I feel that there are so many great packages that could make R a complete platform for microsimulation, and yet, to my knowledge, no one has fully taken advantage of them. The stages of microsimulation modelling are no different from the stages of any other modelling exercise, at least at a high level. They can be broken down as follows:

  1. Data preparation: synthesise microdata and estimate models.
  2. Microsimulation: microsimulate agents in the microdata using the estimated models.
  3. Validation: check the simulation results against real-world observed statistics.
  4. Calibration: slightly nudge some parameters of the estimated models to have a better validation result.
  5. Sensitivity analysis: test how the results respond to changes in inputs and assumptions, and apply the model to policy analysis.

These stages are often carried out in different programming environments, which makes it extremely difficult to keep track of the scripts, the dependencies, the software licences involved, and the data produced during the development phase. Inevitably, these issues cause models to be short-lived. All of these issues, as well as every stage of microsimulation modelling, can be handled effectively with packages that are freely available in R.

Data preparation

In the first stage, packages for data manipulation, such as data.table, dplyr, and the recent dplyr-like packages that use data.table as their backend (dtplyr and tidydt), as well as sf for working with geographical data, are immensely useful.
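
For example, a typical preparation step might summarise a person-level survey file and read in zone geometries for later spatial joins. Below is a minimal sketch using data.table and sf; the file names and column names are hypothetical.

```r
# A minimal sketch of a data preparation step; file and column names are hypothetical.
library(data.table)
library(sf)

# Read a person-level survey file and count persons per household.
persons <- fread("survey_persons.csv")
household_sizes <- persons[, .(hh_size = .N), by = household_id]

# Read zone geometries so the microdata can later be joined and mapped spatially.
zones <- st_read("zones.gpkg")
```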

Synthesise microdata

To represent the initial state of the simulation we need microdata, and often the microdata we need is only available as a small sample of the true population, in order to prevent survey participants from being identified. Hence, we often need to synthesise/reconstruct a population that resembles the true population using the sample microdata and some known margins of the true population. Luckily, as an R user you have many well-developed packages to choose from for synthesising data, such as MultiLevelIPF, simPop, and mipfp. There is even a book on this topic, Spatial Microsimulation with R by Robin Lovelace and Morgane Dumont!
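
To give a flavour of what population synthesis involves, here is a minimal sketch of iterative proportional fitting with mipfp; the seed table and target margins are made-up numbers for illustration.

```r
# A minimal sketch of iterative proportional fitting with mipfp.
# The seed table and margins below are made-up numbers.
library(mipfp)

# Seed: a cross-tabulation of age group by sex from a small sample.
seed <- array(c(80, 60, 70, 90), dim = c(2, 2),
              dimnames = list(age = c("young", "old"), sex = c("male", "female")))

# Known margins of the true population.
target_age <- c(young = 5200, old = 4800)
target_sex <- c(male = 4900, female = 5100)

# Adjust the seed so that it matches both margins.
fit <- Ipfp(seed, target.list = list(1, 2),
            target.data = list(target_age, target_sex))

fit$x.hat  # adjusted joint distribution consistent with both margins
```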

Imputing variables

Sometimes you may find that your variables of interest exist in another dataset that is similar to your base microdata. You can use data fusion techniques to transfer those variables to your base microdata. StatMatch can take care of all your imputation needs; it is very easy to use and comes with a well-written vignette.

Let's assume that your dataset, A, consists of records with X variables, and another dataset, B, consists of the same X variables plus a Y variable. To transfer the Y variable, find a probable Y value for each record of A by comparing the X variables of A and B, then copy the Y value of the most similar record in B to A. This, in a nutshell, is data fusion.
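
A minimal sketch of this idea using StatMatch's nearest-neighbour hot deck functions is shown below; the data frames A and B, and the variable names, are hypothetical.

```r
# A minimal sketch of nearest-neighbour hot deck matching with StatMatch.
# The data frames A and B are simulated for illustration.
library(StatMatch)

set.seed(1)
A <- data.frame(age = sample(20:60, 100, replace = TRUE),
                income = rnorm(100, 50000, 10000))
B <- data.frame(age = sample(20:60, 200, replace = TRUE),
                income = rnorm(200, 50000, 10000),
                car_ownership = sample(c("yes", "no"), 200, replace = TRUE))

# For each record in A, find the most similar record in B on the shared X variables.
matches <- NND.hotdeck(data.rec = A, data.don = B,
                       match.vars = c("age", "income"))

# Transfer the Y variable (car_ownership) from B to A.
A_fused <- create.fused(data.rec = A, data.don = B,
                        mtc.ids = matches$mtc.ids,
                        z.vars = "car_ownership")
head(A_fused)
```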

Estimate models

Statistical modelling is one of the main strengths of the R ecosystem. My very first impression of R was how easy it was to fit various models, from an OLS regression to a binary logit model, using just one function, stats::glm(). Since most microsimulation models rely on probabilistic models (as well as rule-based models) to simulate the decisions made by their agents, there is a vast number of packages that can facilitate model fitting. stats, caret, mlr, mlogit and more can be used for this purpose. Even someone who only knows the caret package has access to 238 different model implementations, from basic econometric models to advanced machine-learning models. This provides an extensive playground for microsimulation modellers to be more experimental and go beyond the traditional knowledge of the field.
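
For instance, a binary logit model of a hypothetical agent decision can be fitted with stats::glm() and then used to simulate choices, as in the sketch below; the data are simulated for illustration.

```r
# A minimal sketch of fitting a binary logit model with stats::glm().
# The data frame and variables here are hypothetical.
set.seed(1)
persons <- data.frame(
  age = sample(18:80, 500, replace = TRUE),
  income = rnorm(500, 50000, 15000)
)
persons$owns_car <- rbinom(500, 1, plogis(-3 + 0.02 * persons$age + 0.00003 * persons$income))

# Fit a binary logit model of car ownership.
car_model <- glm(owns_car ~ age + income, data = persons,
                 family = binomial(link = "logit"))
summary(car_model)

# In a microsimulation, the fitted model can be used to simulate agent decisions.
p_own <- predict(car_model, newdata = persons, type = "response")
simulated_choice <- rbinom(nrow(persons), 1, p_own)
```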

Microsimulation

There are very few R packages for microsimulation. The few packages that are available are somewhat lacking in features and not meant for a large-scale microsimulation model that is flexible to the addition of new behaviours and easy to maintain. Hence, this is where we, the authors, see a place for dymiumCore in the R package ecosystem.

Validation

A microsimulation model should be validated visually and numerically. For visualisation, most R users probably turn to ggplot2 and its extensions. Spatial data can be visualised using ggmap, leaflet, mapview and more. For numerical comparison, Metrics provides many commonly used evaluation metrics for regression and classification problems.
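
For example, simulated counts can be compared against observed counts with a couple of functions from Metrics; the numbers below are made up for illustration.

```r
# A minimal sketch of a numerical validation check with the Metrics package.
# The observed and simulated counts below are made-up numbers.
library(Metrics)

observed  <- c(1200, 950, 780, 460)  # e.g. observed household counts by zone
simulated <- c(1180, 990, 750, 500)  # counts produced by the microsimulation

rmse(observed, simulated)  # root mean squared error
mae(observed, simulated)   # mean absolute error
```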

Calibration

Calibration is often used in large-scale models (systems of many models), which many microsimulation models are, to adjust for their errors and ultimately improve the accuracy of the submodels and the overall result. As you may know, calibration is a complex process, especially for large-scale models, which are often treated as black boxes, because it can be very expensive in time and computing resources. However, if you wish to calibrate your microsimulation model, mlrMBO is here for you.

(A tutorial on using mlrMBO for calibration of a dymium microsimulation model is being drafted and will be released soon.)
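
In the meantime, the sketch below shows the general shape of a model-based optimisation run with mlrMBO. It assumes a hypothetical run_simulation() function that runs the microsimulation with a given parameter value and returns an error score against observed statistics.

```r
# A minimal sketch of calibrating one model parameter with mlrMBO.
# run_simulation() is a hypothetical wrapper around your own microsimulation.
library(mlrMBO)
library(smoof)
library(ParamHelpers)

# Wrap the microsimulation in a single-objective function of its parameter.
obj_fun <- makeSingleObjectiveFunction(
  name = "calibrate-microsim",
  fn = function(x) run_simulation(scaling_factor = x),  # returns an error score
  par.set = makeParamSet(
    makeNumericParam("scaling_factor", lower = 0.5, upper = 1.5)
  )
)

# Model-based optimisation: fit a surrogate model and propose new parameter values.
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10)
result <- mbo(obj_fun, control = ctrl)
result$x  # the best parameter value found
```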

Reproducibility

More general issues, such as the reproducibility and maintainability of the programming environments used, can be easily solved using drake and renv. drake helps you manage your workflow in an organised and reproducible way, while renv takes care of your dependencies.
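
A minimal sketch of what such a workflow might look like is shown below; the functions inside the plan are hypothetical placeholders for your own project code.

```r
# A minimal sketch of a drake workflow for a microsimulation project.
# The functions used in the plan are hypothetical placeholders.
library(drake)

plan <- drake_plan(
  microdata = synthesise_population(raw_survey, census_margins),
  models    = estimate_models(microdata),
  results   = run_microsimulation(microdata, models),
  report    = validate_results(results, observed_statistics)
)

make(plan)  # drake only re-runs targets whose inputs have changed

# Snapshot package versions with renv so the environment can be restored later:
# renv::init(); renv::snapshot()
```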


