In njtierney/mex: A Tool To More Precisely Explore Missing Data

library(knitr)
  # Set basic options. You usually do not want your code, messages, warnings etc.
  # to show in your actual manuscript however for the first run or two these will 
  # be set on.
opts_chunk$set(echo=TRUE, 
               warning=FALSE, 
               message=FALSE,
               cache = TRUE,
               include = TRUE,
               results = 'hide',
               dev = 'pdf')

## load necessary packages
library(rpart)
library(rpart.plot)
library(Amelia)
library(ggplot2)
library(ggdendro)
library(GGally)
library(grid)
library(gridExtra)
library(mi)
library(xtable)
library(mice)
library(devtools)

## load MCAR.test function
  source('~/Dropbox/ALL THE THINGS/PhD/code/R/mex/R/mcar_test.R')

## source code from elith et al.
  source("~/Dropbox/ALL THE THINGS/PhD/Guides/Elith_BRT_Guide_2_of_2/brt.functions.R")

## start dev mode
dev_mode()

## launch mex
library("mex", lib.loc="~/R-dev")


## this dataset has random missingness as well as missingness induced when the variable
## C1 is higher than 55.
data(sim.dat)

mex: the (m)issingness (ex)plorer.

Everything is still very much in a draft phase, and is in a state of flux - the things you see written here are not written in stone (more like a map drawn in the dirt with pointy sticks).

Why should someone use mex?

If you have missing data, then you need to explore reasons for missingness. With mex, we aim to provide:

a simple model-assisted approach to elucidating missingness structure.
nice visualisations of model output, incorporating the use of shiny and ggvis to enhance exploration.

Functions.

The main functions in mex are:

explore
model
confirm

Explore

In the explore step, you want to know how much missing data there is, and if there is a possibility of bias. This exploration step utilises:

visual plots to show how much data there is missing,
t-test and $\chi^2$ to explore whether the mean of expected count is affected by missingness

MCAR.test (aka, explore?)

Before searching for structured missingness in the data, it is useful to ask whether the missingness is prevalent enough for us to need an investigation, and determine whether the data may be missing completely at random (MCAR)

We can do this by:

splitting the data into two groups according to the presence or absence of a selected dependent variable, and
applying a t-test if the independent variables are continuous or
a chi-square test if they are discrete, This allows us to determine the equality of means or category probabilities, respectively. A Bonferroni adjustment or similar method can be used to allow for multiple tests.

I have written an R function MCAR.test which allows the user perform this test. The function outputs a table giving the results of the t-test and the chi2 test. This function still currently not working for the dataset sim.dat.csv, due to bugs.

sim.dat$F1 <- as.factor(sim.dat$F1)

sim.dat$F2 <- as.factor(sim.dat$F2)

sim.dat.2 <- sim.dat[,-c(1,7:9)]

head(sim.dat.2)

factor.list <- c("F1",
                 "F2")

MCAR.test(data = sim.dat,
          y = "C1",
          factor.list = factor.list)

For some reason in the code below I need to specify the full directory - even though this runs fine when I feed it into the console, it doesn't really work when I try to run a command like simdat <- read.csv("sim.dat.csv"). So the code below specifies the rull directory.

Datasets to use

A simulated missing dataset was created to explore the missing data methods. This dataset sim.dat.csv is in the Github repo. This dataset contained five variables, two categorical and three continuous, with 1,000 observations in each. The two categorical factors, F1 and F2, ranged uniformly across categories nominally labelled 1-7, and 1-10, respectively. The three continuous variables, named C1, C2 and C3, were normally distributed with means and standard deviations of 50 and 10, 90 and 10, and 30 and 3, respectively.

The code I wrote to create this data is in GitHub, under /demo/simulated_data.R :

I have written a function sim_miss_data.R to create missingness in data. In our case, I have created missingness in the data sim1, and called it sim.dat. However, the sim_miss_data.R function currently saves the data with the induced missingness into a directory. One of the changes that I haven't gotten around to is changing the code so I can save it to an object or .csv, rather than to a directory, as it currently is.

## depict the missingness.
missing.pattern.plot(sim.dat,
                     gray.scale = TRUE,
                     clustered = TRUE)

R Datasets with missingness Only a few base R datasets seem to have missingness airquality and attenu, which can be seen below.

Other datasets come from the mice package:

boys
fdd

missing.pattern.plot(airquality, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(attenu, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(boys, 
                     gray.scale = TRUE,
                     clustered = TRUE)

missing.pattern.plot(fdd, 
                     gray.scale = TRUE,
                     clustered = TRUE)

so we are currently just using the sim.dat dataset.

Model

In the model step, you model possible mechanisms for missingness, using the rpart, gbm, and (unsure of which precise clustering method to use) hclust.

CART model.

To assist us in detecting missingness structure, we use CART models to predict the proportion of missing data in a row. For the CART model, I used the rpart and rpart.plot packages.

The process for running the CART model using rpart.

Predict miss_perc, the proportion of missing data in a row, using the appropriate independent variables.
plot the data appropriately.
my common mistakes:
- make sure there aren't strings (or relabel as factors)
- make sure those variables that are supposed to be factors, are factors.

## this creates the rpart tree.
cart.small <- rpart(miss_perc ~ ID + C1 + C2 + C3 + F1 + F2,
                    data = sim.dat,
                    na.action = na.rpart, 
                    method = "anova")


## this actually plots the rpart tree.
prp(cart.small, 
    extra = 1, 
    type = 4, 
    prefix = "Prop. Miss = ")

The BRT

For the BRT model, I use the gbm package, and the source code from elith et al. (2008) found in supplementary file 2.

I'm not sure if I can just upload this code to GitHub and then reference it - or if I should provide the link, so that they, the authors, can get the altmetrics from it?

Specifically, I used the functions: - gbm.step:

fits a gbm model to one or more response variables, using cross-validation to estimate the optimal number of trees.

gbm.plot:

Plots the partial dependence of the response on one or more predictors.

There are a few other components that can be used from the elith et al (2008) supplementary materials - gbm.fixed, gbm.holdout, gbm.simplify, gbm.plot.fits, gbm.interactions, gbm.perspec, gbm.predict.grids.

A tricky part of fitting the BRT is understanding the right levels for each of the parameters:

tree complexity
learning rate
bag fraction

And the family of distributions.

I followed the examples out of the Elith et al guide (2008), which was fiddly. It would be nice to make this model fitting process easier, perhaps this could be done using ggvis, to help control the parameters and get an update in real time. It is also the case that the gbm.step function from Elith et al(2008) is slowing down everything. I wonder if it would be possible to look at what they have done and re-write it to be faster. It's annoying when each model takes 5 minutes+ to run, and you want to test 20 combinations of parameters.

If we are to do any graphing, I would really like it if we made all of the output plots in ggplot.That is, if we are going to produce automated plots, we should use ggplot. I'd also really like to fix the gbm.plot() code the Elith et al. have written.

brt.mod <- gbm.step(data=sim.dat, 
                    gbm.x = c(2:6), 
                    gbm.y = 10,
                    family = "gaussian",
                    tree.complexity = 2,
                    learning.rate = 0.01,
                    bag.fraction = 0.5)

gbm.plot(brt.mod, 
         n.plots = 5, 
         write.title=F,
         plot.layout = c(2,3))

note: This vignette needs to change the gbm.plot code- as this code- that I have modified in 'gbm.plot' - for the window() command - that I changed to X11() - it now plots everything after that in the X11 device, which I don't want, as it is slow.

Confirm

In the confirm step, you can use the models created to predict the missingness structure of your own dataset, allowing for a degree of comparison or 'model fit'.

One feature could be to plot the model fit of the proportion of missing data, and the predicted amount from the models.

#obtain predicted values
cart.pred <- predict(object = cart.small,
                     newdata = sim.dat)

gg.cart.pred.plot <- ggplot(data = NULL, 
                         aes(x = sim.dat$miss_perc, 
                             y = cart.pred)) + 
                        geom_point(size = 2) + 
                        labs(title = "",
                             x = "CART Observed Values",
                             y = "CART Predicted Values") + 
                        theme(axis.text=element_text(size=10),
                              axis.title=element_text(size=12))

## print(gg.cart.pred.plot)

## obtain predictions
  brt.pred <- predict.gbm(brt.mod, 
                          n.trees= 600)

## make a scatter plot of observed vs predicted

gg.brt.pred.plot <- ggplot(data = sim.dat,
                       aes(x = sim.dat$miss_perc, 
                           brt.pred)) + 
  geom_point(size = 2) + 
  labs(title = "",
       x = "BRT Observed Values",
       y = "BRT Predicted Values") + 
  theme(axis.text=element_text(size=10),
        axis.title=element_text(size=12))

## print(gg.brt.pred.plot)

## gridExtra allows for easy combining of ggplots...

grid.arrange(gg.cart.pred.plot,
             gg.brt.pred.plot, 
             ncol=2)

How does `mex` compare to other existing solutions?

Current solutions such MissingDataGUI, VIM, missmap (in Amelia), missing.pattern.plot (in mi) usually focus on the user needing to visually search and find the trends. Whilst humans are very good at finding patterns, having a model behind the output has more potential for really helping researchers explore their missing data problems in a precise way. So whilst it is possible for people to use the methods provided in mex, it isn't necessarily easy, and straightforward

`MissingDataGUI`

`VIM`

`missmap` (from the `AMELIA II` package)

gives a visual depiction of the missingness in the dataset

missmap(sim.dat,
        rank.order = FALSE)

`missing.pattern.plot`

Another missingness map tool that exists from the im package - this allows you to specify a "clustered" option, which groups data with similar missingness patterns together.

missing.pattern.plot(sim.dat,
                     gray.scale = TRUE)

Future Work

Damjan has made a great point, that the CART and BRT models may neglect useful information from the data's correlation structure. An approach is to use hierarchical clustering on a binary (present/absent) dataset, and then apply the CART or BRT to predict membership in a particular cluster, using the values of the dataset.

So the code might look something like this.

First run the hierarhical clustering, using hclust()

## run the clustering on a dataset made of the 
hclust.fit <- hclust(dist(is.na.data.frame(sim.dat)))

## plot the hclust
plot(hclust.fit)

## cut the hclust into 4 pieces.
c.id <- as.factor(cutree(hclust.fit, 4))

## add the 4 pieces into a dataframe
sim.clust <- data.frame(sim.dat, 
                        c.id)

Now let's look at the percent of missing data in each cluster

## plot it on ggplot, this is the percent of missing data in each cluster
ggplot(data = sim.clust,
       aes(x = c.id,
           y = miss_perc)) + 
  geom_point()

Now let's use the rpart method.

## get the CART model to predict membership in each cluster.
rpart.clust <- rpart(c.id ~ C1 + C2 + C3 + F1 + F2,
                    data = sim.clust,
                    na.action = na.rpart,
                    control = rpart.control(xval = 100))

prp(rpart.clust, 
    extra = 100, 
    type = 4, 
    prefix = "Clust ID = ")

Now let's use the brt method (currently not working!)

brt.clust <- gbm.step(data=sim.clust,
                    gbm.x = c(3:7), 
                    gbm.y = 12,
                    family = "gaussian",
                    tree.complexity = 2,
                    learning.rate = 0.01,
                    bag.fraction = 0.5)

gbm.plot(brt.clust, 
         n.plots = 5, 
         write.title=F,
         plot.layout = c(2,3))

Future Goals.

Bundle these functions together into a package, with an intuitive grammar for the functions, making it easy for users to:

explore their missing data
model their missing data
confirm possible missingness patterns
create plots and diagnostics.

These plots and diagnostics could utilie ggplot, ggvis, and shiny, so users can interactively explore their missing data.

njtierney/mex documentation built on May 23, 2019, 8:22 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

njtierney/mex
A Tool To More Precisely Explore Missing Data

In njtierney/mex: A Tool To More Precisely Explore Missing Data

mex: the (m)issingness (ex)plorer.

Why should someone use mex?

Functions.

Explore

MCAR.test (aka, explore?)

Datasets to use

Model

CART model.

The process for running the CART model using rpart.

The BRT

Confirm

How does `mex` compare to other existing solutions?

`MissingDataGUI`

`VIM`

`missmap` (from the `AMELIA II` package)

`missing.pattern.plot`

Future Work

Future Goals.

R Package Documentation

Browse R Packages

We want your feedback!

njtierney/mex A Tool To More Precisely Explore Missing Data

In njtierney/mex: A Tool To More Precisely Explore Missing Data

mex: the (m)issingness (ex)plorer.

Why should someone use mex?

Functions.

Explore

MCAR.test (aka, explore?)

Datasets to use

Model

CART model.

The process for running the CART model using rpart.

The BRT

Confirm

How does mex compare to other existing solutions?

MissingDataGUI

VIM

missmap (from the AMELIA II package)

missing.pattern.plot

Future Work

Future Goals.

R Package Documentation

Browse R Packages

We want your feedback!

njtierney/mex
A Tool To More Precisely Explore Missing Data

How does `mex` compare to other existing solutions?

`MissingDataGUI`

`VIM`

`missmap` (from the `AMELIA II` package)

`missing.pattern.plot`