
library(knitr)
  # Set basic options. You usually do not want your code, messages, warnings,
  # etc. to show in your actual manuscript; however, for the first run or two
  # these are set to show.
opts_chunk$set(echo=TRUE, 
               warning=FALSE, 
               message=FALSE,
               cache = TRUE,
               include = TRUE,
               results = 'hide',
               dev = 'pdf')

## load necessary packages
library(rpart)
library(rpart.plot)
library(Amelia)
library(ggplot2)
library(ggdendro)
library(GGally)
library(grid)
library(gridExtra)
library(mi)
library(xtable)
library(mice)
library(devtools)

## load MCAR.test function
  source('~/Dropbox/ALL THE THINGS/PhD/code/R/mex/R/mcar_test.R')

## source code from Elith et al.
  source("~/Dropbox/ALL THE THINGS/PhD/Guides/Elith_BRT_Guide_2_of_2/brt.functions.R")

## start dev mode
dev_mode()

## launch mex
library("mex", lib.loc="~/R-dev")


## this dataset has random missingness as well as missingness induced when the variable
## C1 is higher than 55.
data(sim.dat)

mex: the (m)issingness (ex)plorer.

Everything is still very much in a draft phase, and is in a state of flux - the things you see written here are not written in stone (more like a map drawn in the dirt with pointy sticks).

Why should someone use mex?

If you have missing data, then you need to explore reasons for missingness. With mex, we aim to provide:

Functions.

The main functions in mex are:

Explore

In the explore step, you want to know how much missing data there is, and whether there is a possibility of bias. This exploration step utilises:

MCAR.test (aka, explore?)

Before searching for structured missingness in the data, it is useful to ask whether the missingness is prevalent enough to need investigation, and to determine whether the data may be missing completely at random (MCAR).

We can do this with the MCAR.test function.

I have written an R function, MCAR.test, which allows the user to perform this test. The function outputs a table giving the results of the t-test and the chi-squared test. This function is currently not working for the dataset sim.dat.csv, due to bugs.

## convert the categorical variables to factors
sim.dat$F1 <- as.factor(sim.dat$F1)
sim.dat$F2 <- as.factor(sim.dat$F2)

## drop the first column (ID) and columns 7 to 9
sim.dat.2 <- sim.dat[,-c(1,7:9)]

head(sim.dat.2)

## name the factor variables for MCAR.test
factor.list <- c("F1",
                 "F2")

## (note: this passes the full sim.dat, not the subsetted sim.dat.2)
MCAR.test(data = sim.dat,
          y = "C1",
          factor.list = factor.list)
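
For intuition, here is a rough sketch of the idea behind such a test. This is not the mex implementation: for each other variable, it splits the rows by whether y is missing, then compares the two groups with a t-test (continuous variables) or a chi-squared test (factors). It assumes y actually contains missing values.

## a minimal sketch of the MCAR-test idea, not the actual mex code:
## compare rows where y is missing against rows where it is observed.
mcar_sketch <- function(data, y, factor.list) {
  miss <- is.na(data[[y]])
  vars <- setdiff(names(data), y)
  out <- lapply(vars, function(v) {
    if (v %in% factor.list) {
      ## factors: test for association between category and missingness
      p <- chisq.test(table(data[[v]], miss))$p.value
      test <- "chi-squared"
    } else {
      ## continuous: compare the means of the two groups
      p <- t.test(data[[v]] ~ miss)$p.value
      test <- "t-test"
    }
    data.frame(variable = v, test = test, p.value = p)
  })
  do.call(rbind, out)
}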

For some reason, in the code below I need to specify the full directory. Even though it runs fine when I feed it into the console, it doesn't work when I try to run a command like simdat <- read.csv("sim.dat.csv"). So the code below specifies the full directory.

Datasets to use

A simulated missing dataset was created to explore the missing data methods. This dataset, sim.dat.csv, is in the GitHub repo. It contained five variables, two categorical and three continuous, with 1,000 observations each. The two categorical factors, F1 and F2, ranged uniformly across categories nominally labelled 1-7 and 1-10, respectively. The three continuous variables, named C1, C2, and C3, were normally distributed with means and standard deviations of 50 and 10, 90 and 10, and 30 and 3, respectively.

The code I wrote to create this data is in GitHub, under /demo/simulated_data.R:
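
As a sketch of what that script does, reconstructed from the description above (the seed and exact code are assumptions; see /demo/simulated_data.R for the real thing):

## a sketch of generating sim1, based on the description above
## (not the actual /demo/simulated_data.R script)
set.seed(1234)  # arbitrary seed, for reproducibility

n <- 1000

sim1 <- data.frame(ID = 1:n,  # row identifier, used later in the CART model
                   F1 = sample(1:7, n, replace = TRUE),   # uniform over 1-7
                   F2 = sample(1:10, n, replace = TRUE),  # uniform over 1-10
                   C1 = rnorm(n, mean = 50, sd = 10),
                   C2 = rnorm(n, mean = 90, sd = 10),
                   C3 = rnorm(n, mean = 30, sd = 3))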

I have written a function, sim_miss_data.R, to create missingness in data. In our case, I have created missingness in the data sim1, and called the result sim.dat. However, sim_miss_data.R currently saves the data with the induced missingness into a directory. One change I haven't gotten around to is making it return an object or a .csv, rather than writing to a directory as it currently does.
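
A sketch of the kind of missingness being induced, based on the earlier note that sim.dat has random missingness plus missingness induced when C1 is higher than 55 (the proportion, the affected columns, and the miss_perc calculation are assumptions, not the actual sim_miss_data.R code):

## a sketch of inducing missingness; not the actual sim_miss_data.R
induce_miss <- function(data, prop = 0.1) {
  ## random (MCAR) missingness: blank out a proportion of some columns
  ## (C1 is left complete so the condition below stays well defined)
  for (v in c("F1", "F2", "C2", "C3")) {
    idx <- sample(nrow(data), size = floor(prop * nrow(data)))
    data[idx, v] <- NA
  }
  ## structured missingness when C1 is higher than 55
  ## (which column goes missing under this condition is an assumption)
  data$C2[data$C1 > 55] <- NA
  ## proportion of missing values in each row, as modelled later
  data$miss_perc <- rowMeans(is.na(data[, c("F1", "F2", "C1", "C2", "C3")]))
  data
}

sim.dat <- induce_miss(sim1)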

## depict the missingness.
missing.pattern.plot(sim.dat,
                     gray.scale = TRUE,
                     clustered = TRUE)

R Datasets with missingness

Only a few base R datasets seem to have missingness: airquality and attenu, which can be seen below.

Other datasets come from the mice package:

missing.pattern.plot(airquality, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(attenu, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(boys, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(fdd, 
                     gray.scale = TRUE,
                     clustered = TRUE)

So we are currently just using the sim.dat dataset.

Model

In the model step, you model possible mechanisms for missingness, using rpart, gbm, and hclust (I am as yet unsure which precise clustering method to use).

CART model.

To assist us in detecting missingness structure, we use CART models to predict the proportion of missing data in a row. For the CART model, I used the rpart and rpart.plot packages.

The process for running the CART model using rpart:
## this creates the rpart tree.
cart.small <- rpart(miss_perc ~ ID + C1 + C2 + C3 + F1 + F2,
                    data = sim.dat,
                    na.action = na.rpart, 
                    method = "anova")


## this actually plots the rpart tree.
prp(cart.small, 
    extra = 1, 
    type = 4, 
    prefix = "Prop. Miss = ")

The BRT

For the BRT model, I use the gbm package, and the source code from Elith et al. (2008), found in their second supplementary file.

I'm not sure if I can just upload this code to GitHub and then reference it, or if I should provide the link so that the authors can get the altmetrics from it.

Specifically, I used the function gbm.step, which fits a gbm model to one or more response variables, using cross-validation to estimate the optimal number of trees.

There are a few other components that can be used from the Elith et al. (2008) supplementary materials: gbm.fixed, gbm.holdout, gbm.simplify, gbm.plot.fits, gbm.interactions, gbm.perspec, and gbm.predict.grids.

A tricky part of fitting the BRT is choosing the right levels for each of the parameters (tree complexity, learning rate, and bag fraction), and the right family of distributions.

I followed the examples from the Elith et al. (2008) guide, which was fiddly. It would be nice to make this model-fitting process easier; perhaps this could be done using ggvis, to help control the parameters and get an update in real time. The gbm.step function from Elith et al. (2008) is also slowing everything down. I wonder if it would be possible to look at what they have done and rewrite it to be faster. It's annoying when each model takes 5+ minutes to run and you want to test 20 combinations of parameters.

If we are to do any graphing, I would really like it if we made all of the output plots in ggplot; that is, if we are going to produce automated plots, we should use ggplot. I'd also really like to fix the gbm.plot() code that Elith et al. have written.

## fit the BRT with gbm.step, using cross-validation to choose
## the number of trees
brt.mod <- gbm.step(data = sim.dat, 
                    gbm.x = c(2:6),  # predictor columns
                    gbm.y = 10,      # response column (presumably miss_perc)
                    family = "gaussian",
                    tree.complexity = 2,
                    learning.rate = 0.01,
                    bag.fraction = 0.5)

## plot the fitted functions for the five predictors
gbm.plot(brt.mod, 
         n.plots = 5, 
         write.title = FALSE,
         plot.layout = c(2, 3))

Note: this vignette needs the gbm.plot code to change. In my modified gbm.plot I replaced the window() command with X11(), but everything after that now plots in the X11 device, which I don't want, as it is slow.
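
Returning to the earlier point about testing 20 combinations of parameters, one way to organise those runs is a small grid loop (a sketch; the grid values here are arbitrary):

## loop gbm.step over a small grid of parameter combinations
params <- expand.grid(tc = c(2, 3, 5),
                      lr = c(0.01, 0.005))

brt.mods <- vector("list", nrow(params))

for (i in seq_len(nrow(params))) {
  brt.mods[[i]] <- gbm.step(data = sim.dat,
                            gbm.x = c(2:6),
                            gbm.y = 10,
                            family = "gaussian",
                            tree.complexity = params$tc[i],
                            learning.rate = params$lr[i],
                            bag.fraction = 0.5)
}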

Confirm

In the confirm step, you can use the models created to predict the missingness structure of your own dataset, allowing for a degree of comparison or 'model fit'.

One feature could be to plot the observed proportion of missing data against the amount predicted by the models.

## obtain predicted values from the CART model
cart.pred <- predict(object = cart.small,
                     newdata = sim.dat)

## scatter plot of observed vs predicted proportions of missingness
gg.cart.pred.plot <- ggplot(data = sim.dat, 
                            aes(x = miss_perc, 
                                y = cart.pred)) + 
                       geom_point(size = 2) + 
                       labs(title = "",
                            x = "CART Observed Values",
                            y = "CART Predicted Values") + 
                       theme(axis.text = element_text(size = 10),
                             axis.title = element_text(size = 12))

## print(gg.cart.pred.plot)

## obtain predictions from the BRT model
brt.pred <- predict.gbm(brt.mod, 
                        newdata = sim.dat,
                        n.trees = 600)
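
Rather than hard-coding 600 trees, it may be safer to use the cross-validated tree count that gbm.step stores on the fitted model. I am assuming here that the Elith et al. code records it as gbm.call$best.trees, as the dismo port of gbm.step does:

## use the CV-selected number of trees rather than a hard-coded value
## (gbm.call$best.trees is an assumption about the fitted object)
brt.pred <- predict.gbm(brt.mod,
                        newdata = sim.dat,
                        n.trees = brt.mod$gbm.call$best.trees)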

## make a scatter plot of observed vs predicted

gg.brt.pred.plot <- ggplot(data = sim.dat,
                           aes(x = miss_perc, 
                               y = brt.pred)) + 
  geom_point(size = 2) + 
  labs(title = "",
       x = "BRT Observed Values",
       y = "BRT Predicted Values") + 
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12))

## print(gg.brt.pred.plot)
## gridExtra allows for easy combining of ggplots...

grid.arrange(gg.cart.pred.plot,
             gg.brt.pred.plot, 
             ncol=2)

How does mex compare to other existing solutions?

Current solutions such as MissingDataGUI, VIM, missmap (in Amelia), and missing.pattern.plot (in mi) usually require the user to visually search for and find the trends. Whilst humans are very good at finding patterns, having a model behind the output has more potential for really helping researchers explore their missing data problems in a precise way. So whilst it is possible for people to explore missingness with the existing methods, it isn't necessarily easy or straightforward.

MissingDataGUI

VIM

missmap (from the AMELIA II package)

missmap gives a visual depiction of the missingness in the dataset.

missmap(sim.dat,
        rank.order = FALSE)

missing.pattern.plot

Another missingness map tool comes from the mi package - this one allows you to specify a "clustered" option, which groups data with similar missingness patterns together.

missing.pattern.plot(sim.dat,
                     gray.scale = TRUE)

Future Work

Damjan has made a great point: the CART and BRT models may neglect useful information in the data's correlation structure. One approach is to use hierarchical clustering on a binary (present/absent) version of the dataset, and then apply CART or BRT to predict membership in a particular cluster, using the values of the dataset.

So the code might look something like this.

First, run the hierarchical clustering using hclust():

## run the clustering on a dataset made of the missingness
## indicators (TRUE/FALSE) of sim.dat
hclust.fit <- hclust(dist(is.na(sim.dat)))

## plot the hclust
plot(hclust.fit)

## cut the hclust into 4 pieces.
c.id <- as.factor(cutree(hclust.fit, 4))

## add the 4 pieces into a dataframe
sim.clust <- data.frame(sim.dat, 
                        c.id)

Now let's look at the percent of missing data in each cluster:

## plot, with ggplot, the percent of missing data in each cluster
ggplot(data = sim.clust,
       aes(x = c.id,
           y = miss_perc)) + 
  geom_point()

Now let's use the rpart method.

## get the CART model to predict membership in each cluster.
rpart.clust <- rpart(c.id ~ C1 + C2 + C3 + F1 + F2,
                     data = sim.clust,
                     na.action = na.rpart,
                     control = rpart.control(xval = 100))

prp(rpart.clust, 
    extra = 100, 
    type = 4, 
    prefix = "Clust ID = ")

Now let's use the BRT method (currently not working!).

## note: c.id is a factor, so family = "gaussian" is likely wrong here;
## a classification family may be needed, which may be why this fails.
brt.clust <- gbm.step(data = sim.clust,
                      gbm.x = c(3:7), 
                      gbm.y = 12,
                      family = "gaussian",
                      tree.complexity = 2,
                      learning.rate = 0.01,
                      bag.fraction = 0.5)

gbm.plot(brt.clust, 
         n.plots = 5, 
         write.title = FALSE,
         plot.layout = c(2, 3))

Future Goals.

Bundle these functions together into a package, with an intuitive grammar for the functions, making it easy for users to:

These plots and diagnostics could utilise ggplot, ggvis, and shiny, so users can interactively explore their missing data.


