mice.impute.ds.pmm: Imputation by predictive mean matching for DataSHIELD
In stefvanbuuren/dsMiceClient: Distributed Multiple Imputations


Calculates imputations for univariate missing data by predictive mean matching.

mice.impute.ds.pmm(y, ry, x, wy = NULL, donors = 5L, matchtype = 1L, ridge = 1e-05, ...)

`y`	Vector to be imputed
`ry`	Logical vector of length `length(y)` indicating the subset `y[ry]` of elements in `y` to which the imputation model is fitted. In general, `ry` distinguishes the observed (`TRUE`) from the missing (`FALSE`) values in `y`.
`x`	Numeric design matrix with `length(y)` rows containing the predictors for `y`. Matrix `x` must not contain missing values.
`wy`	Logical vector of length `length(y)`. A `TRUE` value indicates locations in `y` for which imputations are created.
`donors`	The size of the donor pool among which a draw is made. The default is `donors = 5L`. Setting `donors = 1L` always selects the closest match, but is not recommended. Values between 3L and 10L provide the best results in most cases (Morris et al, 2015).
`matchtype`	Type of matching distance. The default choice (`matchtype = 1L`) calculates the distance between the predicted value of `yobs` and the drawn values of `ymis` (called type-1 matching). Other choices are `matchtype = 0L` (distance between predicted values) and `matchtype = 2L` (distance between drawn values).
`ridge`	The ridge penalty used in `.ds.norm.draw()` to prevent problems with multicollinearity. The default is `ridge = 1e-05`, which means that 0.001 percent of the diagonal is added to the cross-product. Larger ridges may result in more biased estimates. For highly noisy data (e.g. many junk variables), set `ridge = 1e-06` or even lower to reduce bias. For highly collinear data, set `ridge = 1e-04` or higher.
`...`	Other named arguments.
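
To make the `ridge` arithmetic concrete, here is a minimal base-R sketch that adds `ridge` times the diagonal to a cross-product matrix before inverting it. It does not call `.ds.norm.draw()` and may differ from its internals; the toy matrix `x` and the names `S`, `pen` and `V` are assumptions made for illustration.

```r
# Illustrative only: mimics the ridge adjustment described for `ridge`,
# not the actual internals of .ds.norm.draw().
set.seed(1)
x <- cbind(1, rnorm(20), rnorm(20))            # toy design matrix with intercept
ridge <- 1e-05

S   <- crossprod(x)                            # cross-product X'X
pen <- diag(diag(S) * ridge, nrow = ncol(x))   # ridge times the diagonal of S
V   <- solve(S + pen)                          # inverse of the penalized cross-product

# Relative inflation of each diagonal element, in percent (equals 100 * ridge)
inflation_pct <- 100 * diag(pen) / diag(S)
inflation_pct
```

Only the diagonal is inflated, so the off-diagonal cross-products are unchanged; a larger `ridge` stabilizes the inverse at the cost of shrinking the coefficients.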

Imputation of y by predictive mean matching, based on van Buuren (2012, p. 73). The procedure is as follows:

1. Calculate the cross-product matrix $S = X_{obs}'X_{obs}$.
2. Calculate $V = (S + \mathrm{diag}(S)\kappa)^{-1}$, with some small ridge parameter $\kappa$.
3. Calculate regression weights $\hat\beta = VX_{obs}'y_{obs}$.
4. Draw $q$ independent $N(0,1)$ variates in vector $\dot z_1$.
5. Calculate $V^{1/2}$ by Cholesky decomposition.
6. Calculate $\dot\beta = \hat\beta + \dot\sigma\dot z_1 V^{1/2}$.
7. Calculate $\dot\eta(i,j) = |X_{obs,[i]}\hat\beta - X_{mis,[j]}\dot\beta|$ with $i = 1,\dots,n_1$ and $j = 1,\dots,n_0$.
8. Construct $n_0$ sets $Z_j$, each containing $d$ candidate donors, from $Y_{obs}$ such that $\sum_d \dot\eta(i,j)$ is minimum for all $j = 1,\dots,n_0$. Break ties randomly.
9. Draw one donor $i_j$ from $Z_j$ randomly for $j = 1,\dots,n_0$.
10. Calculate imputations $\dot y_j = y_{i_j}$ for $j = 1,\dots,n_0$.
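
The procedure above can be sketched in plain base R on a single, non-distributed data set. This is a minimal illustration under stated assumptions, not the code of `mice.impute.ds.pmm()`: the function name `pmm_sketch`, the added intercept column, and the chi-square draw used for the residual standard deviation (which the listed steps leave undefined) are all assumptions. The matching compares predicted values for the observed cases with drawn values for the missing cases.

```r
# Hypothetical helper: a single-site sketch of predictive mean matching.
pmm_sketch <- function(y, ry, x, donors = 5L, ridge = 1e-05) {
  x    <- cbind(1, as.matrix(x))               # assumption: add an intercept
  xobs <- x[ry, , drop = FALSE]
  xmis <- x[!ry, , drop = FALSE]
  yobs <- y[ry]

  S <- crossprod(xobs)                                    # cross-product matrix
  V <- solve(S + diag(diag(S) * ridge, nrow = ncol(x)))   # ridge-penalized inverse
  beta_hat <- V %*% crossprod(xobs, yobs)                 # regression weights

  # Assumption: residual SD drawn via a chi-square variate
  df        <- max(length(yobs) - ncol(x), 1)
  resid     <- yobs - xobs %*% beta_hat
  sigma_dot <- sqrt(sum(resid^2) / rchisq(1, df))

  z        <- rnorm(ncol(x))                   # q independent N(0,1) variates
  V_half   <- t(chol((V + t(V)) / 2))          # Cholesky factor of symmetrized V
  beta_dot <- beta_hat + sigma_dot * (V_half %*% z)       # drawn weights

  yhat_obs <- as.vector(xobs %*% beta_hat)     # predicted values, observed cases
  yhat_mis <- as.vector(xmis %*% beta_dot)     # drawn values, missing cases

  d <- min(donors, length(yobs))
  vapply(yhat_mis, function(eta) {
    dist <- abs(yhat_obs - eta)                          # matching distance
    pool <- order(dist, runif(length(dist)))[seq_len(d)] # d closest, random ties
    yobs[pool[sample.int(d, 1)]]                         # draw one donor
  }, numeric(1))
}

# Toy usage with simulated data (not the boys data)
set.seed(123)
n <- 100
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- 1 + 2 * x$a - x$b + rnorm(n)
y[sample(n, 20)] <- NA
ry <- !is.na(y)
imp <- pmm_sketch(y, ry, x)
length(imp)            # one imputation per missing value
all(imp %in% y[ry])    # every imputation is an observed donor value
```

Because each imputation is copied from an observed case, imputed values always stay within the range of the observed data, which is the main practical appeal of the method.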

The name predictive mean matching was proposed by Little (1988).

Vector with imputed data, same type as `y`, and of length `sum(wy)`.

Stef van Buuren, Karin Groothuis-Oudshoorn

Little, R.J.A. (1988). Missing data adjustments in large surveys (with discussion). Journal of Business and Economic Statistics, 6, 287–301.

Morris, T.P., White, I.R., Royston, P. (2015). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Med Res Methodol, 14:75.

Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC. Boca Raton, FL.

Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/

Other univariate imputation functions: mice.impute.ds.mean, mice.impute.ds.norm

# We normally call mice.impute.ds.pmm() from within mice()
# But we may call it directly as follows (not recommended)

set.seed(53177)
data(boys, package = 'mice')  # the boys data set ships with the mice package
xname <- c('age', 'hgt', 'wgt')
r <- stats::complete.cases(boys[, xname])
x <- boys[r, xname]
y <- boys[r, 'tv']
ry <- !is.na(y)
table(ry)

# proportion of missing data in tv
sum(!ry) / length(ry)

# Impute missing tv data
yimp <- mice.impute.ds.pmm(y, ry, x)
length(yimp)
hist(yimp, xlab = 'Imputed missing tv')

# Impute all tv data
yimp <- mice.impute.ds.pmm(y, ry, x, wy = rep(TRUE, length(y)))
length(yimp)
hist(yimp, xlab = 'Imputed missing and observed tv')
plot(jitter(y), jitter(yimp),
    main = 'Predictive mean matching on age, height and weight',
    xlab = 'Observed tv (n = 224)',
    ylab = 'Imputed tv (n = 224)')
abline(0, 1)
cor(y, yimp, use = 'pair')