fill.NAs | R Documentation |
Given a data.frame
or formula
and data,
fill.NAs()
returns an expanded data frame, including a new
missingness flag for each variable with missing values and
replacing each missing entry with a value representing a reasonable
default for missing values in its column. Functions in the formula
are supported, with transformations happening before NA
replacement. The expanded data frame is useful for propensity
modeling and balance checking when there are covariates with
missing values.
fill.NAs(x, data = NULL, all.covs = FALSE, contrasts.arg = NULL)
x |
Can be either a data frame (in which case the data
argument should be |
data |
If x is a formula, this must be a data.frame. Otherwise it will be ignored. |
all.covs |
Should the response variable be imputed? For
formula |
contrasts.arg |
(from |
fill.NAs
prepares data for use in a model or matching
procedure by filling in missing values with minimally invasive
substitutes. Fill-in is performed column-wise, with each column
being treated individually. For each column that is missing, a new
column is created of the form “ColumnName.NA” with
indicators for each observation that is missing a value for
“ColumnName”. Rosenbaum and Rubin (1984, Sec. 2.4 and
Appendix B) discuss propensity score models using this data
structure.
The replacement value used to fill in a missing value is simple
mean replacement. For transformations of variables, e.g. y ~
x1 * x2
, the transformation occurs first. The transformation
column will be NA
if any of the base columns are
NA
. Fill-in occurs next, replacing all missing values with
the observed column mean. This includes transformation columns.
Data can be passed to fill.NAs
in two ways. First, you can
simply pass a data.frame
object and fill.NAs
will
fill every column. Alternatively, you can pass a formula
and
a data.frame
. Fill-in will only be applied to columns
specifically used in the formula. Prior to fill-in, any functions
in the formula will be expanded. If any arguments to the functions
are NA
, the function value will also be NA
and
subject to fill-in.
By default, fill.NAs
does not impute the response
variable. This is to encourage more sophisticated imputation
schemes when the response is a treatment indicator in a matching
problem. This behavior can be overridden by setting all.covs
= TRUE
.
A data.frame
with all NA
values replaced with
mean values and additional indicator columns for each column
including missing values. Suitable for directly passing to
lm
or other model building functions to build
propensity scores.
Mark M. Fredrickson and Jake Bowers
Rosenbaum, Paul R. and Rubin, Donald B. (1984) ‘Reducing Bias in Observational Studies using Subclassification on the Propensity Score,’ Journal of the American Statistical Association, 79, 516 – 524.
Von Hipple, Paul T. (2009) ‘How to impute interactions, squares, and other transformed variables,’ Sociological Methodology, 39(1), 265 – 291.
match_on
, lm
data(nuclearplants)
### Extract some representative covariates:
np.missing <- nuclearplants[c('t1', 't2', 'ne', 'ct', 'cum.n')]
### create some missingness in the covariates
n <- dim(np.missing)[1]
k <- dim(np.missing)[2]
for (i in 1:n) {
missing <- rbinom(1, prob = .1, size = k)
if (missing > 0) {
np.missing[i, sample(k, missing)] <- NA
}
}
### Restore outcome and treatment variables:
np.missing <- data.frame(nuclearplants[c('cost', 'pr')], np.missing)
### Fit a propensity score but with missing covariate data flagged
### and filled in, as in Rosenbaum and Rubin (1984, Appendix):
np.filled <- fill.NAs(pr ~ t1 * t2, np.missing)
# Look at np.filled to establish what missingness flags were created
head(np.filled)
(np.glm <- glm(pr ~ ., family=binomial, data=np.filled))
(glm(pr ~ t1 + t2 + `t1:t2` + t1.NA + t2.NA,
family=binomial, data=np.filled))
# In a non-interactive session, the following may help, as long as
# the formula passed to `fill.NAs` (plus any missingness flags) is
# the desired formula for the glm.
(glm(formula(terms(np.filled)), family=binomial, data=np.filled))
### produce a matrix of propensity distances based on the propensity model
### with fill-in and flagging. Then perform pair matching on it:
pairmatch(match_on(np.glm, data=np.filled), data=np.filled)
## fill NAs without using treatment contrasts by making a list of contrasts for
## each factor ## following hints from https://stackoverflow.com/a/4569239/161808
np.missing$t1F<-factor(np.missing$t1)
cov.factors <- sapply(np.missing[,c("t1F","t2")],is.factor)
cov.contrasts <- lapply(
np.missing[,names(cov.factors)[cov.factors],drop=FALSE],
contrasts, contrasts = FALSE)
## make a data frame filling the missing covariate values, but without
## excluding any levels of any factors
np.noNA2<-fill.NAs(pr~t1F+t2,data=np.missing,contrasts.arg=cov.contrasts)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.