quickpred: Quick selection of predictors from the data


View source: R/quickpred.R

Description

Selects predictors according to simple statistics.

Usage

quickpred(
  data,
  mincor = 0.1,
  minpuc = 0,
  include = "",
  exclude = "",
  method = "pearson"
)

Arguments

data

Matrix or data frame with incomplete data.

mincor

A scalar, numeric vector (of size ncol(data)) or numeric matrix (square, of size ncol(data)) specifying the minimum threshold(s) against which the absolute correlation in the data is compared.

minpuc

A scalar, vector (of size ncol(data)) or matrix (square, of size ncol(data)) specifying the minimum threshold(s) for the proportion of usable cases.

include

A string or a vector of strings containing one or more variable names from names(data). The specified variables are always included as predictors.

exclude

A string or a vector of strings containing one or more variable names from names(data). The specified variables are always excluded as predictors.

method

A string specifying the type of correlation. Use 'pearson' (default), 'kendall' or 'spearman'. Can be abbreviated.

Details

This function creates a predictor matrix using the variable selection procedure described in Van Buuren et al. (1999, pp. 687–688). The function is designed to aid in setting up a good imputation model for data with many variables.

Basic workings: The procedure calculates for each variable pair (i.e. target-predictor pair) two correlations using all available cases per pair. The first correlation uses the values of the target and the predictor directly. The second correlation uses the (binary) response indicator of the target and the values of the predictor. If the largest (in absolute value) of these correlations exceeds mincor, the predictor will be added to the imputation set. The default value for mincor is 0.1.
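The two correlations can be sketched in base R as follows. This is an illustration under assumptions, not the package's own code: it uses the airquality data set that ships with R in place of the user's data, and the names x, r, v, u, maxc and pred are chosen here for the example.

```r
# Illustrative sketch only; object names are assumptions for this example.
x <- data.matrix(airquality)   # Ozone and Solar.R contain missing values
r <- !is.na(x)                 # response indicators: TRUE where observed

# First correlation: values of the target against values of the predictor,
# using all available cases per pair.
v <- abs(cor(x, use = "pairwise.complete.obs"))

# Second correlation: response indicator of the target (rows) against
# values of the predictor (columns).
u <- suppressWarnings(abs(cor(x = r, y = x, use = "pairwise.complete.obs")))
u[is.na(u)] <- 0               # the indicator of a fully observed variable is constant

# Keep a predictor when the larger absolute correlation exceeds the threshold.
maxc <- pmax(v, u)
pred <- (maxc > 0.1) * 1
diag(pred) <- 0                # a variable does not predict itself
pred
```

Rows are read as targets and columns as predictors, which matches the orientation used for the vector thresholds described under the advanced topic.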

In addition, the procedure eliminates predictors whose proportion of usable cases fails to meet the minimum specified by minpuc. The default value is 0, so predictors are retained even if they have no usable case.
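As a rough illustration of the usable-cases idea, the sketch below computes, for each target (row) and predictor (column), the share of cases with a missing target value in which the predictor is observed. That reading of "proportion of usable cases" is an assumption made for this example, as are the airquality data, the 0.3 cut-off and the names m, r, mr, mm and puc.

```r
# Illustrative sketch only; object names are assumptions for this example.
x <- data.matrix(airquality)
m <- is.na(x) * 1              # 1 where missing
r <- 1 - m                     # 1 where observed

mr <- crossprod(m, r)          # counts: target missing, predictor observed
mm <- crossprod(m, m)          # counts: target missing, predictor missing
puc <- mr / (mr + mm)          # NaN for targets without missing values

# Drop predictors that fall below a minimum proportion, here 0.3.
pred <- matrix(1, ncol(x), ncol(x), dimnames = dimnames(puc))
pred[which(puc < 0.3)] <- 0
round(puc, 2)
```

A fully observed predictor gets a proportion of 1 for every incomplete target, while a variable is never usable for imputing itself.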

Finally, the procedure includes any predictors named in the include argument (which is useful for background variables like age and sex) and eliminates any predictor named in the exclude argument. If a variable is listed in both include and exclude arguments, the include argument takes precedence.
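One way to picture this step is as column-wise overrides on the predictor matrix, with the include override applied after the exclude override so that it wins when a variable appears in both. The sketch assumes that columns represent predictors; the matrix and its values are invented for illustration and are not taken from quickpred() internals.

```r
# Illustrative sketch only; the matrix and its values are made up.
vars <- c("age", "bmi", "hyp", "chl")
pred <- matrix(0, 4, 4, dimnames = list(vars, vars))
pred["bmi", "chl"] <- 1        # pretend the thresholds selected chl for bmi

include <- c("age", "chl")
exclude <- "chl"

pred[, exclude] <- 0           # excluded variables predict nothing
pred[, include] <- 1           # applied last, so include takes precedence
diag(pred) <- 0                # a variable does not predict itself
pred
```

Here chl is named in both arguments and ends up as a predictor for every other variable.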

Advanced topic: mincor and minpuc are typically specified as scalars, but vectors and square matrices of appropriate size will also work. Each element of the vector corresponds to a row of the predictor matrix, so the procedure can effectively differentiate between different target variables. Setting high thresholds can be useful for auxiliary, less important variables, since the set of predictors for those variables can then remain relatively small. Using a square matrix extends the idea to the columns, so that one can also apply cellwise thresholds.
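The row-wise effect of a vector threshold follows from how R recycles a vector when it is compared with a matrix. The sketch below uses made-up correlations (a constant 0.3) and hypothetical object names to show the mechanism; it is not the package's code.

```r
# Illustrative sketch only; the correlations below are made-up numbers.
vars <- c("age", "bmi", "hyp", "chl")
maxc <- matrix(0.3, 4, 4, dimnames = list(vars, vars))

# One threshold per target: lenient for age and bmi, strict for hyp and chl.
mincor <- c(0, 0.1, 0.5, 0.5)

# R recycles the vector down each column, so element i is compared
# against every cell in row i.
pred <- (maxc > mincor) * 1
diag(pred) <- 0
pred
```

With these numbers the rows for age and bmi keep all other variables as predictors, and the rows for hyp and chl keep none. Passing a 4 x 4 matrix instead of the vector would set a separate threshold for every cell.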

Value

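A square binary matrix of size ncol(data). Rows correspond to target variables and columns to predictors; a 1 marks a predictor selected for that target.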

Note

quickpred() uses data.matrix to convert factors to numbers through their internal codes. Especially for unordered factors the resulting quantification may not make sense.

Author(s)

Stef van Buuren, Aug 2009

References

van Buuren, S., Boshuizen, H.C., Knook, D.L. (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681–694.

van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/

See Also

mice, mids

Examples

library(mice)

# default: include all predictors with absolute correlation over 0.1
quickpred(nhanes)

# all predictors with absolute correlation over 0.4
quickpred(nhanes, mincor = 0.4)

# include age and bmi, exclude chl
quickpred(nhanes, mincor = 0.4, include = c("age", "bmi"), exclude = "chl")

# only include predictors with at least 30% usable cases
quickpred(nhanes, minpuc = 0.3)

# use low threshold for bmi, and high thresholds for hyp and chl
pred <- quickpred(nhanes, mincor = c(0, 0.1, 0.5, 0.5))
pred

# use it directly from mice
imp <- mice(nhanes, pred = quickpred(nhanes, minpuc = 0.25, include = "age"))

Example output

Attaching package: 'mice'

The following object is masked from 'package:stats':

    filter

The following objects are masked from 'package:base':

    cbind, rbind

    age bmi hyp chl
age   0   0   0   0
bmi   1   0   1   1
hyp   1   0   0   1
chl   1   1   1   0
    age bmi hyp chl
age   0   0   0   0
bmi   0   0   0   0
hyp   1   0   0   1
chl   1   0   1   0
    age bmi hyp chl
age   0   0   0   0
bmi   1   0   0   0
hyp   1   1   0   0
chl   1   1   1   0
    age bmi hyp chl
age   0   0   0   0
bmi   1   0   0   0
hyp   1   0   0   0
chl   1   1   1   0
    age bmi hyp chl
age   0   0   0   0
bmi   1   0   1   1
hyp   1   0   0   0
chl   1   0   0   0

 iter imp variable
  1   1  bmi  hyp  chl
  1   2  bmi  hyp  chl
  1   3  bmi  hyp  chl
  1   4  bmi  hyp  chl
  1   5  bmi  hyp  chl
  2   1  bmi  hyp  chl
  2   2  bmi  hyp  chl
  2   3  bmi  hyp  chl
  2   4  bmi  hyp  chl
  2   5  bmi  hyp  chl
  3   1  bmi  hyp  chl
  3   2  bmi  hyp  chl
  3   3  bmi  hyp  chl
  3   4  bmi  hyp  chl
  3   5  bmi  hyp  chl
  4   1  bmi  hyp  chl
  4   2  bmi  hyp  chl
  4   3  bmi  hyp  chl
  4   4  bmi  hyp  chl
  4   5  bmi  hyp  chl
  5   1  bmi  hyp  chl
  5   2  bmi  hyp  chl
  5   3  bmi  hyp  chl
  5   4  bmi  hyp  chl
  5   5  bmi  hyp  chl

mice documentation built on Jan. 27, 2021, 5:10 p.m.
