fill.NAs | R Documentation |

Given a `data.frame`

or `formula`

and data,
`fill.NAs()`

returns an expanded data frame, including a new
missingness flag for each variable with missing values and
replacing each missing entry with a value representing a reasonable
default for missing values in its column. Functions in the formula
are supported, with transformations happening before `NA`

replacement. The expanded data frame is useful for propensity
modeling and balance checking when there are covariates with
missing values.

fill.NAs(x, data = NULL, all.covs = FALSE, contrasts.arg = NULL)

`x` |
Can be either a data frame (in which case the data
argument should be |

`data` |
If x is a formula, this must be a data.frame. Otherwise it will be ignored. |

`all.covs` |
Should the response variable be imputed? For
formula |

`contrasts.arg` |
(from |

`fill.NAs`

prepares data for use in a model or matching
procedure by filling in missing values with minimally invasive
substitutes. Fill-in is performed column-wise, with each column
being treated individually. For each column that is missing, a new
column is created of the form “ColumnName.NA” with
indicators for each observation that is missing a value for
“ColumnName”. Rosenbaum and Rubin (1984, Sec. 2.4 and
Appendix B) discuss propensity score models using this data
structure.

The replacement value used to fill in a missing value is simple
mean replacement. For transformations of variables, e.g. ```
y ~
x1 * x2
```

, the transformation occurs first. The transformation
column will be `NA`

if any of the base columns are
`NA`

. Fill-in occurs next, replacing all missing values with
the observed column mean. This includes transformation columns.

Data can be passed to `fill.NAs`

in two ways. First, you can
simply pass a `data.frame`

object and `fill.NAs`

will
fill every column. Alternatively, you can pass a `formula`

and
a `data.frame`

. Fill-in will only be applied to columns
specifically used in the formula. Prior to fill-in, any functions
in the formula will be expanded. If any arguments to the functions
are `NA`

, the function value will also be `NA`

and
subject to fill-in.

By default, `fill.NAs`

does not impute the response
variable. This is to encourage more sophisticated imputation
schemes when the response is a treatment indicator in a matching
problem. This behavior can be overridden by setting ```
all.covs
= TRUE
```

.

A `data.frame`

with all `NA`

values replaced with
mean values and additional indicator columns for each column
including missing values. Suitable for directly passing to
`lm`

or other model building functions to build
propensity scores.

Mark M. Fredrickson and Jake Bowers

Rosenbaum, Paul R. and Rubin, Donald B. (1984) ‘Reducing
Bias in Observational Studies using Subclassification on the
Propensity Score,’ *Journal of the American Statistical
Association*, **79**, 516 – 524.

Von Hipple, Paul T. (2009) ‘How to impute interactions,
squares, and other transformed variables,’ *Sociological
Methodology*, **39**(1), 265 – 291.

`match_on`

, `lm`

data(nuclearplants) ### Extract some representative covariates: np.missing <- nuclearplants[c('t1', 't2', 'ne', 'ct', 'cum.n')] ### create some missingness in the covariates n <- dim(np.missing)[1] k <- dim(np.missing)[2] for (i in 1:n) { missing <- rbinom(1, prob = .1, size = k) if (missing > 0) { np.missing[i, sample(k, missing)] <- NA } } ### Restore outcome and treatment variables: np.missing <- data.frame(nuclearplants[c('cost', 'pr')], np.missing) ### Fit a propensity score but with missing covariate data flagged ### and filled in, as in Rosenbaum and Rubin (1984, Appendix): np.filled <- fill.NAs(pr ~ t1 * t2, np.missing) # Look at np.filled to establish what missingness flags were created head(np.filled) (np.glm <- glm(pr ~ ., family=binomial, data=np.filled)) (glm(pr ~ t1 + t2 + `t1:t2` + t1.NA + t2.NA, family=binomial, data=np.filled)) # In a non-interactive session, the following may help, as long as # the formula passed to `fill.NAs` (plus any missingness flags) is # the desired formula for the glm. (glm(formula(terms(np.filled)), family=binomial, data=np.filled)) ### produce a matrix of propensity distances based on the propensity model ### with fill-in and flagging. Then perform pair matching on it: pairmatch(match_on(np.glm, data=np.filled), data=np.filled) ## fill NAs without using treatment contrasts by making a list of contrasts for ## each factor ## following hints from https://stackoverflow.com/a/4569239/161808 np.missing$t1F<-factor(np.missing$t1) cov.factors <- sapply(np.missing[,c("t1F","t2")],is.factor) cov.contrasts <- lapply( np.missing[,names(cov.factors)[cov.factors],drop=FALSE], contrasts, contrasts = FALSE) ## make a data frame filling the missing covariate values, but without ## excluding any levels of any factors np.noNA2<-fill.NAs(pr~t1F+t2,data=np.missing,contrasts.arg=cov.contrasts)

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.