README.md

midamix

Travis build
status

Overview

midamix is an R package dedicated to flexible, tidy handling of missing data. For numeric and/or ordinal datasets, midamix uses a transformation model coupled with a truncated Dirichlet process mixture model to generate multiple completed datasets. It contains the following easy-to-use methods for performing principled multiple imputation.

Development version

To use the development version, you can install midamix from GitHub.

# install.packages("devtools")
devtools::install_github("burrisk/midamix")

Usage

The following code enables you to obtain multiple completed datasets, stored in a nested tibble.

library(midamix)

data(diamonds)
print(diamonds)
#> # A tibble: 1,000 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  1.24 NA        F     SI2      60    NA    5000  7.03  7     4.21
#>  2  1    Good      NA    SI1      62.7  60      NA  6.24  6.3   3.93
#>  3  1.06 Premium   NA    SI1      59.6  58    4815  6.67  6.73  3.99
#>  4 NA    NA        E     SI1      60.6  58    1440 NA    NA     3.2 
#>  5  0.33 Ideal     E     VS2      NA    54     928  4.47  4.42 NA   
#>  6  0.56 Ideal     H     NA       62.2  55.3    NA  5.26  5.31  3.28
#>  7  1.64 NA        NA    NA       60.7  59    9632  7.61  7.65 NA   
#>  8  0.27 Very Good NA    VVS2     61.9  NA      NA NA    NA     2.59
#>  9  0.4  Very Good D     VS1      62.9  NA    1123  4.67  4.71 NA   
#> 10  0.5  Premium   NA    NA       61.4  61    1971  5.14  5.03  3.12
#> # ... with 990 more rows

diamonds_imputed <- impute(diamonds, imputations = 5, burnin = 100, n_iter = 1000,
                           seed = 314)
print(diamonds_imputed)
#> # A tibble: 5 x 2
#>   imputation data                 
#>        <int> <list>               
#> 1          1 <tibble [1,000 × 10]>
#> 2          2 <tibble [1,000 × 10]>
#> 3          3 <tibble [1,000 × 10]>
#> 4          4 <tibble [1,000 × 10]>
#> 5          5 <tibble [1,000 × 10]>

To extract the second imputed dataset, for example, you could run

print(diamonds_imputed$data[[2]])
#> # A tibble: 1,000 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  1.24 Premium   F     SI2      60    60    5000  7.03  7     4.21
#>  2  1    Good      G     SI1      62.7  60    3713  6.24  6.3   3.93
#>  3  1.06 Premium   G     SI1      59.6  58    4815  6.67  6.73  3.99
#>  4  0.51 Very Good E     SI1      60.6  58    1440  5.44  5.3   3.2 
#>  5  0.33 Ideal     E     VS2      62.9  54     928  4.47  4.42  2.82
#>  6  0.56 Ideal     H     VVS1     62.2  55.3  1846  5.26  5.31  3.28
#>  7  1.64 Very Good I     SI2      60.7  59    9632  7.61  7.65  4.6 
#>  8  0.27 Very Good F     VVS2     61.9  60     499  4.16  4.15  2.59
#>  9  0.4  Very Good D     VS1      62.9  56    1123  4.67  4.71  3   
#> 10  0.5  Premium   E     VS1      61.4  61    1971  5.14  5.03  3.12
#> # ... with 990 more rows

You can fit a given model to each of these datasets in a streamlined way. First, define a function that takes in a dataset and returns a model object. Then, apply the function fit_model() to your imputed datasets.

diamonds_linear_model <- function(data){
  lm(log(price) ~ carat + depth + table, data = data)
}

model_fits <- fit_model(diamonds_imputed, diamonds_linear_model)
print(model_fits)
#> # A tibble: 20 x 6
#>    imputation term        estimate std.error statistic  p.value
#>         <int> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#>  1          1 (Intercept)  10.4      0.818       12.7  2.06e-34
#>  2          1 carat         1.95     0.0277      70.5  0.      
#>  3          1 depth        -0.0450   0.0100      -4.49 7.88e- 6
#>  4          1 table        -0.0240   0.00641     -3.75 1.87e- 4
#>  5          2 (Intercept)   9.31     0.827       11.3  8.61e-28
#>  6          2 carat         1.95     0.0275      70.8  0.      
#>  7          2 depth        -0.0364   0.0102      -3.59 3.53e- 4
#>  8          2 table        -0.0142   0.00628     -2.27 2.37e- 2
#>  9          3 (Intercept)   9.71     0.835       11.6  1.93e-29
#> 10          3 carat         1.93     0.0279      69.1  0.      
#> 11          3 depth        -0.0387   0.0102      -3.80 1.54e- 4
#> 12          3 table        -0.0186   0.00644     -2.88 4.02e- 3
#> 13          4 (Intercept)   8.68     0.809       10.7  1.73e-25
#> 14          4 carat         1.95     0.0275      70.8  0.      
#> 15          4 depth        -0.0258   0.00997     -2.59 9.69e- 3
#> 16          4 table        -0.0147   0.00634     -2.32 2.04e- 2
#> 17          5 (Intercept)   9.67     0.851       11.4  3.38e-28
#> 18          5 carat         1.94     0.0280      69.3  0.      
#> 19          5 depth        -0.0391   0.0105      -3.72 2.12e- 4
#> 20          5 table        -0.0174   0.00646     -2.70 7.14e- 3

To aggregate these results for model inference, we use the pool_inferences function:

model_summary <- pool_inferences(model_fits)
print(model_summary)
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)   9.55     1.07         8.89 4.20e-  9
#> 2 carat         1.94     0.0298      65.3  1.09e-149
#> 3 depth        -0.0370   0.0127      -2.90 6.81e-  3
#> 4 table        -0.0178   0.00771     -2.31 2.61e-  2

Getting help

If you encounter a clear bug, please file a minimal reproducible example on github.



burrisk/midamix documentation built on June 1, 2019, 12:49 p.m.