+
library(knitr) # Set basic options. You usually do not want your code, messages, warnings etc. # to show in your actual manuscript however for the first run or two these will # be set on. opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, cache = TRUE, include = TRUE, results = 'hide', dev = 'pdf') ## load necessary packages library(rpart) library(rpart.plot) library(Amelia) library(ggplot2) library(ggdendro) library(GGally) library(grid) library(gridExtra) library(mi) library(xtable) library(mice) library(devtools) ## load MCAR.test function source('~/Dropbox/ALL THE THINGS/PhD/code/R/mex/R/mcar_test.R') ## source code from elith et al. source("~/Dropbox/ALL THE THINGS/PhD/Guides/Elith_BRT_Guide_2_of_2/brt.functions.R") ## start dev mode dev_mode() ## launch mex library("mex", lib.loc="~/R-dev") ## this dataset has random missingness as well as missingness induced when the variable ## C1 is higher than 55. data(sim.dat)
Everything is still very much in a draft phase, and is in a state of flux - the things you see written here are not written in stone (more like a map drawn in the dirt with pointy sticks).
If you have missing data, then you need to explore reasons for missingness. With mex, we aim to provide:
shiny
and ggvis
to enhance exploration. The main functions in mex are:
In the explore step, you want to know how much missing data there is, and if there is a possibility of bias. This exploration step utilises:
Before searching for structured missingness in the data, it is useful to ask whether the missingness is prevalent enough for us to need an investigation, and determine whether the data may be missing completely at random (MCAR)
We can do this by:
I have written an R function MCAR.test
which allows the user perform this test. The function outputs a table giving the results of the t-test and the chi2 test. This function still currently not working for the dataset sim.dat.csv
, due to bugs.
sim.dat$F1 <- as.factor(sim.dat$F1) sim.dat$F2 <- as.factor(sim.dat$F2) sim.dat.2 <- sim.dat[,-c(1,7:9)] head(sim.dat.2) factor.list <- c("F1", "F2") MCAR.test(data = sim.dat, y = "C1", factor.list = factor.list)
For some reason in the code below I need to specify the full directory - even though this runs fine when I feed it into the console, it doesn't really work when I try to run a command like
simdat <- read.csv("sim.dat.csv")
. So the code below specifies the rull directory.
A simulated missing dataset was created to explore the missing data methods. This dataset sim.dat.csv
is in the Github repo. This dataset contained five variables, two categorical and three continuous, with 1,000 observations in each. The two categorical factors, F1 and F2, ranged uniformly across categories nominally labelled 1-7, and 1-10, respectively. The three continuous variables, named C1, C2 and C3, were normally distributed with means and standard deviations of 50 and 10, 90 and 10, and 30 and 3, respectively.
The code I wrote to create this data is in GitHub, under /demo/simulated_data.R :
I have written a function sim_miss_data.R
to create missingness in data. In our case, I have created missingness in the data sim1
, and called it sim.dat
. However, the sim_miss_data.R
function currently saves the data with the induced missingness into a directory. One of the changes that I haven't gotten around to is changing the code so I can save it to an object or .csv, rather than to a directory, as it currently is.
## depict the missingness. missing.pattern.plot(sim.dat, gray.scale = TRUE, clustered = TRUE)
R Datasets with missingness Only a few base R datasets seem to have missingness airquality
and attenu
, which can be seen below.
Other datasets come from the mice
package:
missing.pattern.plot(airquality, gray.scale = TRUE, clustered = TRUE) missing.pattern.plot(attenu, gray.scale = TRUE, clustered = TRUE) missing.pattern.plot(boys, gray.scale = TRUE, clustered = TRUE) missing.pattern.plot(fdd, gray.scale = TRUE, clustered = TRUE)
so we are currently just using the sim.dat
dataset.
In the model step, you model possible mechanisms for missingness, using the rpart
, gbm
, and (unsure of which precise clustering method to use) hclust
.
To assist us in detecting missingness structure, we use CART models to predict the proportion of missing data in a row. For the CART model, I used the rpart
and rpart.plot
packages.
miss_perc
, the proportion of missing data in a row, using the appropriate independent variables.## this creates the rpart tree. cart.small <- rpart(miss_perc ~ ID + C1 + C2 + C3 + F1 + F2, data = sim.dat, na.action = na.rpart, method = "anova") ## this actually plots the rpart tree. prp(cart.small, extra = 1, type = 4, prefix = "Prop. Miss = ")
For the BRT model, I use the gbm
package, and the source code from elith et al. (2008) found in supplementary file 2.
I'm not sure if I can just upload this code to GitHub and then reference it - or if I should provide the link, so that they, the authors, can get the altmetrics from it?
Specifically, I used the functions:
- gbm.step
:
fits a gbm model to one or more response variables, using cross-validation to estimate the optimal number of trees.
gbm.plot
:Plots the partial dependence of the response on one or more predictors.
There are a few other components that can be used from the elith et al (2008) supplementary materials - gbm.fixed
, gbm.holdout
, gbm.simplify
, gbm.plot.fits
, gbm.interactions
, gbm.perspec
, gbm.predict.grids
.
A tricky part of fitting the BRT is understanding the right levels for each of the parameters:
tree complexity
learning rate
bag fraction
And the family
of distributions.
I followed the examples out of the Elith et al guide (2008), which was fiddly. It would be nice to make this model fitting process easier, perhaps this could be done using ggvis
, to help control the parameters and get an update in real time. It is also the case that the gbm.step function from Elith et al(2008) is slowing down everything. I wonder if it would be possible to look at what they have done and re-write it to be faster. It's annoying when each model takes 5 minutes+ to run, and you want to test 20 combinations of parameters.
If we are to do any graphing, I would really like it if we made all of the output plots in ggplot.That is, if we are going to produce automated plots, we should use ggplot. I'd also really like to fix the gbm.plot() code the Elith et al. have written.
brt.mod <- gbm.step(data=sim.dat, gbm.x = c(2:6), gbm.y = 10, family = "gaussian", tree.complexity = 2, learning.rate = 0.01, bag.fraction = 0.5) gbm.plot(brt.mod, n.plots = 5, write.title=F, plot.layout = c(2,3))
note: This vignette needs to change the gbm.plot code- as this code- that I have modified in 'gbm.plot' - for the window() command - that I changed to X11() - it now plots everything after that in the X11 device, which I don't want, as it is slow.
In the confirm step, you can use the models created to predict the missingness structure of your own dataset, allowing for a degree of comparison or 'model fit'.
One feature could be to plot the model fit of the proportion of missing data, and the predicted amount from the models.
#obtain predicted values cart.pred <- predict(object = cart.small, newdata = sim.dat) gg.cart.pred.plot <- ggplot(data = NULL, aes(x = sim.dat$miss_perc, y = cart.pred)) + geom_point(size = 2) + labs(title = "", x = "CART Observed Values", y = "CART Predicted Values") + theme(axis.text=element_text(size=10), axis.title=element_text(size=12)) ## print(gg.cart.pred.plot)
## obtain predictions brt.pred <- predict.gbm(brt.mod, n.trees= 600) ## make a scatter plot of observed vs predicted gg.brt.pred.plot <- ggplot(data = sim.dat, aes(x = sim.dat$miss_perc, brt.pred)) + geom_point(size = 2) + labs(title = "", x = "BRT Observed Values", y = "BRT Predicted Values") + theme(axis.text=element_text(size=10), axis.title=element_text(size=12)) ## print(gg.brt.pred.plot)
## gridExtra allows for easy combining of ggplots... grid.arrange(gg.cart.pred.plot, gg.brt.pred.plot, ncol=2)
mex
compare to other existing solutions?Current solutions such MissingDataGUI
, VIM
, missmap (in Amelia)
, missing.pattern.plot (in mi)
usually focus on the user needing to visually search and find the trends. Whilst humans are very good at finding patterns, having a model behind the output has more potential for really helping researchers explore their missing data problems in a precise way. So whilst it is possible for people to use the methods provided in mex
, it isn't necessarily easy, and straightforward
MissingDataGUI
VIM
missmap
(from the AMELIA II
package)gives a visual depiction of the missingness in the dataset
missmap(sim.dat, rank.order = FALSE)
missing.pattern.plot
Another missingness map tool that exists from the im
package - this allows you to specify a "clustered" option, which groups data with similar missingness patterns together.
missing.pattern.plot(sim.dat, gray.scale = TRUE)
Damjan has made a great point, that the CART and BRT models may neglect useful information from the data's correlation structure. An approach is to use hierarchical clustering on a binary (present/absent) dataset, and then apply the CART or BRT to predict membership in a particular cluster, using the values of the dataset.
So the code might look something like this.
First run the hierarhical clustering, using hclust()
## run the clustering on a dataset made of the hclust.fit <- hclust(dist(is.na.data.frame(sim.dat))) ## plot the hclust plot(hclust.fit) ## cut the hclust into 4 pieces. c.id <- as.factor(cutree(hclust.fit, 4)) ## add the 4 pieces into a dataframe sim.clust <- data.frame(sim.dat, c.id)
Now let's look at the percent of missing data in each cluster
## plot it on ggplot, this is the percent of missing data in each cluster ggplot(data = sim.clust, aes(x = c.id, y = miss_perc)) + geom_point()
Now let's use the rpart method.
## get the CART model to predict membership in each cluster. rpart.clust <- rpart(c.id ~ C1 + C2 + C3 + F1 + F2, data = sim.clust, na.action = na.rpart, control = rpart.control(xval = 100)) prp(rpart.clust, extra = 100, type = 4, prefix = "Clust ID = ")
Now let's use the brt method (currently not working!)
brt.clust <- gbm.step(data=sim.clust, gbm.x = c(3:7), gbm.y = 12, family = "gaussian", tree.complexity = 2, learning.rate = 0.01, bag.fraction = 0.5) gbm.plot(brt.clust, n.plots = 5, write.title=F, plot.layout = c(2,3))
Bundle these functions together into a package, with an intuitive grammar for the functions, making it easy for users to:
These plots and diagnostics could utilie ggplot, ggvis, and shiny, so users can interactively explore their missing data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.