README.md
In burrisk/midamix: Bayesian multiple imputation via mixtures

midamix

midamix is an R package dedicated to flexible, tidy handling of missing data. For numeric and/or ordinal datasets, midamix uses a transformation model coupled with a truncated Dirichlet process mixture model to generate multiple completed datasets. It contains the following easy-to-use methods for performing principled multiple imputation.

impute() generates multiply imputed datasets
fit_model() fits a model to each of the multiply imputed datasets.
pool_inferences() combines inferences for linear models or generalized linear models fit on each of the datasets via the combining rules of Rubin (1987).

To use the development version, you can install midamix from GitHub.

# install.packages("devtools")
devtools::install_github("burrisk/midamix")

The following code enables you to obtain multiple completed datasets, stored in a nested tibble.

library(midamix)

data(diamonds)
print(diamonds)
#> # A tibble: 1,000 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  1.24 NA        F     SI2      60    NA    5000  7.03  7     4.21
#>  2  1    Good      NA    SI1      62.7  60      NA  6.24  6.3   3.93
#>  3  1.06 Premium   NA    SI1      59.6  58    4815  6.67  6.73  3.99
#>  4 NA    NA        E     SI1      60.6  58    1440 NA    NA     3.2 
#>  5  0.33 Ideal     E     VS2      NA    54     928  4.47  4.42 NA   
#>  6  0.56 Ideal     H     NA       62.2  55.3    NA  5.26  5.31  3.28
#>  7  1.64 NA        NA    NA       60.7  59    9632  7.61  7.65 NA   
#>  8  0.27 Very Good NA    VVS2     61.9  NA      NA NA    NA     2.59
#>  9  0.4  Very Good D     VS1      62.9  NA    1123  4.67  4.71 NA   
#> 10  0.5  Premium   NA    NA       61.4  61    1971  5.14  5.03  3.12
#> # ... with 990 more rows

diamonds_imputed <- impute(diamonds, imputations = 5, burnin = 100, n_iter = 1000,
                           seed = 314)
print(diamonds_imputed)
#> # A tibble: 5 x 2
#>   imputation data                 
#>        <int> <list>               
#> 1          1 <tibble [1,000 × 10]>
#> 2          2 <tibble [1,000 × 10]>
#> 3          3 <tibble [1,000 × 10]>
#> 4          4 <tibble [1,000 × 10]>
#> 5          5 <tibble [1,000 × 10]>

To extract the second imputed dataset, for example, you could run

print(diamonds_imputed$data[[2]])
#> # A tibble: 1,000 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1  1.24 Premium   F     SI2      60    60    5000  7.03  7     4.21
#>  2  1    Good      G     SI1      62.7  60    3713  6.24  6.3   3.93
#>  3  1.06 Premium   G     SI1      59.6  58    4815  6.67  6.73  3.99
#>  4  0.51 Very Good E     SI1      60.6  58    1440  5.44  5.3   3.2 
#>  5  0.33 Ideal     E     VS2      62.9  54     928  4.47  4.42  2.82
#>  6  0.56 Ideal     H     VVS1     62.2  55.3  1846  5.26  5.31  3.28
#>  7  1.64 Very Good I     SI2      60.7  59    9632  7.61  7.65  4.6 
#>  8  0.27 Very Good F     VVS2     61.9  60     499  4.16  4.15  2.59
#>  9  0.4  Very Good D     VS1      62.9  56    1123  4.67  4.71  3   
#> 10  0.5  Premium   E     VS1      61.4  61    1971  5.14  5.03  3.12
#> # ... with 990 more rows

You can fit a given model to each of these datasets in a streamlined way. First, define a function that takes in a dataset and returns a model object. Then, apply the function fit_model() to your imputed datasets.

diamonds_linear_model <- function(data){
  lm(log(price) ~ carat + depth + table, data = data)
}

model_fits <- fit_model(diamonds_imputed, diamonds_linear_model)
print(model_fits)
#> # A tibble: 20 x 6
#>    imputation term        estimate std.error statistic  p.value
#>         <int> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#>  1          1 (Intercept)  10.4      0.818       12.7  2.06e-34
#>  2          1 carat         1.95     0.0277      70.5  0.      
#>  3          1 depth        -0.0450   0.0100      -4.49 7.88e- 6
#>  4          1 table        -0.0240   0.00641     -3.75 1.87e- 4
#>  5          2 (Intercept)   9.31     0.827       11.3  8.61e-28
#>  6          2 carat         1.95     0.0275      70.8  0.      
#>  7          2 depth        -0.0364   0.0102      -3.59 3.53e- 4
#>  8          2 table        -0.0142   0.00628     -2.27 2.37e- 2
#>  9          3 (Intercept)   9.71     0.835       11.6  1.93e-29
#> 10          3 carat         1.93     0.0279      69.1  0.      
#> 11          3 depth        -0.0387   0.0102      -3.80 1.54e- 4
#> 12          3 table        -0.0186   0.00644     -2.88 4.02e- 3
#> 13          4 (Intercept)   8.68     0.809       10.7  1.73e-25
#> 14          4 carat         1.95     0.0275      70.8  0.      
#> 15          4 depth        -0.0258   0.00997     -2.59 9.69e- 3
#> 16          4 table        -0.0147   0.00634     -2.32 2.04e- 2
#> 17          5 (Intercept)   9.67     0.851       11.4  3.38e-28
#> 18          5 carat         1.94     0.0280      69.3  0.      
#> 19          5 depth        -0.0391   0.0105      -3.72 2.12e- 4
#> 20          5 table        -0.0174   0.00646     -2.70 7.14e- 3

To aggregate these results for model inference, we use the pool_inferences function:

model_summary <- pool_inferences(model_fits)
print(model_summary)
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)   9.55     1.07         8.89 4.20e-  9
#> 2 carat         1.94     0.0298      65.3  1.09e-149
#> 3 depth        -0.0370   0.0127      -2.90 6.81e-  3
#> 4 table        -0.0178   0.00771     -2.31 2.61e-  2

If you encounter a clear bug, please file a minimal reproducible example on github.

burrisk/midamix documentation built on June 1, 2019, 12:49 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com