transcan  R Documentation 
transcan is a nonlinear additive transformation and imputation function, and there are several functions for using and operating on its results. transcan automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables. There is also an option to use a substitute criterion: maximum correlation with the first principal component of the other variables. Continuous variables are expanded as restricted cubic splines and categorical variables are expanded as contrasts (e.g., dummy variables). By default, the first canonical variate is used to find optimum linear combinations of component columns. This function is similar to ace except that transformations for continuous variables are fitted using restricted cubic splines, monotonicity restrictions are not allowed, and NAs are allowed. When a variable has any NAs, transformed scores for that variable are imputed using least squares multiple regression incorporating optimum transformations, or NAs are optionally set to constants. Shrinkage can be used to safeguard against overfitting when imputing. Optionally, imputed values on the original scale are also computed and returned. For this purpose, recursive partitioning or multinomial logistic models can optionally be used to impute categorical variables, using what is predicted to be the most probable category.
By default, transcan imputes NAs with “best guess” expected values of transformed variables, back transformed to the original scale. Values thus imputed are most like conditional medians assuming the transformations make variables' distributions symmetric (imputed values are similar to conditional modes for categorical variables). By instead specifying n.impute, transcan does approximate multiple imputation from the distribution of each variable conditional on all other variables. This is done by sampling n.impute residuals from the transformed variable, with replacement (a la bootstrapping), or by default, using Rubin's approximate Bayesian bootstrap, where a sample of size n with replacement is selected from the residuals on n nonmissing values of the target variable, and then a sample of size m with replacement is chosen from this sample, where m is the number of missing values needing imputation for the current multiple imputation repetition. Neither of these bootstrap procedures assumes normality or even symmetry of residuals. For sometimes-missing categorical variables, optimal scores are computed by adding the “best guess” predicted mean score to random residuals off this score. Then categories having scores closest to these predicted scores are taken as the random multiple imputations (impcat = "rpart" is not currently allowed with n.impute). The literature recommends using n.impute = 5 or greater. transcan provides only an approximation to multiple imputation, especially since it “freezes” the imputation model before drawing the multiple imputations rather than using different estimates of regression coefficients for each imputation. For multiple imputation, the aregImpute function provides a much better approximation to the full Bayesian approach while still not requiring linearity assumptions.
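The two-stage resampling behind the approximate Bayesian bootstrap can be sketched in base R. This is only an illustration of the sampling scheme described above, not transcan's internal code; the function name abb_draw and the stand-in residual vector res are hypothetical.

```r
# Two-stage approximate Bayesian bootstrap draw (illustrative sketch).
# res: residuals on the n nonmissing values of the target variable
# m:   number of missing values needing imputation in this repetition
abb_draw <- function(res, m) {
  n <- length(res)
  # Stage 1: sample n residuals with replacement from the observed residuals
  stage1 <- res[sample.int(n, n, replace = TRUE)]
  # Stage 2: sample m residuals with replacement from the stage-1 sample
  stage1[sample.int(n, m, replace = TRUE)]
}

set.seed(1)
res   <- rnorm(50)          # stand-in residuals, for illustration only
draws <- abb_draw(res, m = 8)
length(draws)               # one residual per missing value
all(draws %in% res)         # every draw is one of the observed residuals
```

The simple bootstrap (boot.method='simple') would correspond to skipping stage 1 and sampling directly from res.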
When you specify n.impute to transcan you can use fit.mult.impute to refit any model n.impute times based on n.impute completed datasets (if there are any sometimes-missing variables not specified to transcan, some observations will still be dropped from these fits). After fitting n.impute models, fit.mult.impute will return the fit object from the last imputation, with coefficients replaced by the average of the n.impute coefficient vectors and with a component var equal to the imputation-corrected variance-covariance matrix using Rubin's rule. fit.mult.impute can also use the object created by the mice function in the mice library to draw the multiple imputations, as well as objects created by aregImpute. The following components of fit objects are also replaced with averages over the n.impute model fits: linear.predictors, fitted.values, stats, means, icoef, scale, center, y.imputed.
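As a rough illustration of Rubin's rule, the pooling of coefficient vectors and covariance matrices can be written in a few lines of base R. This is a sketch of the standard textbook formula (average within-imputation variance plus (1 + 1/M) times the between-imputation variance), not the code used inside fit.mult.impute; pool_rubin and the toy inputs are hypothetical.

```r
# Pool M coefficient vectors and M covariance matrices with Rubin's rule.
# coefs: list of M coefficient vectors; vars: list of M covariance matrices
pool_rubin <- function(coefs, vars) {
  M    <- length(coefs)
  B    <- do.call(cbind, coefs)           # p x M matrix of estimates
  bbar <- rowMeans(B)                     # averaged coefficients
  W    <- Reduce(`+`, vars) / M           # average within-imputation variance
  Bv   <- tcrossprod(B - bbar) / (M - 1)  # between-imputation variance
  list(coefficients = bbar, var = W + (1 + 1 / M) * Bv)
}

# Toy input: 3 imputations, 2 coefficients (made-up numbers)
coefs  <- list(c(1, 2), c(1.2, 1.8), c(0.8, 2.2))
vars   <- replicate(3, diag(c(0.04, 0.09)), simplify = FALSE)
pooled <- pool_rubin(coefs, vars)
pooled$coefficients
pooled$var
```

The ratio of the diagonal of the between-imputation part to the diagonal of W is the kind of quantity reported as a variance inflation due to imputation.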
By specifying fun to fit.mult.impute you can run any function on the fit objects from completed datasets, with the results saved in an element named funresults. This facilitates running bootstrap or cross-validation separately on each completed dataset and storing all these results in a list for later processing, e.g., with the rms package processMI function. Note that for rms-type validation you will need to specify fitargs=list(x=TRUE,y=TRUE) to fit.mult.impute and to use special names for fun result components, such as validate and calibrate, so that the result can be processed with processMI. When simultaneously running multiple imputation and resampling model validation you may not need values for n.impute or B (number of bootstraps) as high as usual, as the total number of repetitions will be n.impute * B.
fit.mult.impute can incorporate robust sandwich variance estimates into Rubin's rule if robust=TRUE.
For ols models fitted by fit.mult.impute with stacking, the R^2 measure in the stacked model fit is OK, and print.ols computes adjusted R^2 using the real sample size so it is also OK, because fit.mult.impute corrects the stacked error degrees of freedom in the stacked fit object to reflect the real sample size.
The summary method for transcan prints the function call, R^2 achieved in transforming each variable, and for each variable the coefficients of all other transformed variables that are used to estimate the transformation of the initial variable. If imputed=TRUE was used in the call to transcan, summary also uses the describe function to print a summary of imputed values. If long = TRUE, it also prints all imputed values with observation identifiers. There is also a simple function print.transcan which merely prints the transformation matrix and the function call. It has an optional argument long, which if set to TRUE causes detailed parameters to be printed. Instead of plotting while transcan is running, you can plot the final transformations after the fact using plot.transcan or ggplot.transcan, if the option trantab = TRUE was specified to transcan. If in addition the option imputed = TRUE was specified to transcan, plot and ggplot will show the location of imputed values (including multiples) along the axes. For ggplot, imputed values are shown as red plus signs.
The impute method for transcan does imputations for a selected original data variable, on the original scale (if imputed=TRUE was given to transcan). If you do not specify a variable to impute, it will do imputations for all variables given to transcan which had at least one missing value. This assumes that the original variables are accessible (i.e., they have been attached) and that you want the imputed variables to have the same names as the original variables. If n.impute was specified to transcan you must tell impute which imputation to use. Results are stored in .GlobalEnv when list.out is not specified (it is recommended to use list.out=TRUE).
The predict method for transcan computes predicted variables and imputed values from a matrix of new data. This matrix should have the same column variables as the original matrix used with transcan, and in the same order (unless a formula was used with transcan).
The Function function is a generic function generator. Function.transcan creates R functions to transform variables using transformations created by transcan. These functions are useful for getting predicted values with predictors set to values on the original scale.
The vcov methods are defined here so that imputation-corrected variance-covariance matrices are readily extracted from fit.mult.impute objects, and so that fit.mult.impute can easily compute traditional covariance matrices for individual completed datasets.
The subscript method for transcan preserves attributes.
The invertTabulated function either does inverse linear interpolation or samples qualifying x-values having y-values near the desired values. The latter is used to get inverse values having a reasonable distribution (e.g., no floor or ceiling effects) when the transformation has a flat or nearly flat segment, resulting in a many-to-one transformation in that region. Sampling weights are a combination of the frequency of occurrence of x-values that are within tolInverse times the range of y and the squared distance between the associated y-values and the target y-value (aty).
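The difference between the two inversion strategies can be illustrated with base R alone. The sketch below is not invertTabulated's implementation: the interpolation uses approx with the roles of x and y swapped, and the sampling part is a simplification that assumes exact matches only and weights proportional to freq, ignoring the distance component of the weights.

```r
# Tabulated transformation with a flat segment: y = 5 for x = 5, 6, 7, 8
x    <- 1:10
y    <- c(1, 2, 3, 4, 5, 5, 5, 5, 9, 10)
freq <- c(1, 1, 1, 1, 1, 2, 3, 4, 1, 1)

# Inverse linear interpolation: tied y-values are collapsed to the mean of
# their x-values, so every request for y = 5 maps to one single value
inv_interp <- approx(y, x, xout = 5, ties = mean, rule = 2)$y
inv_interp   # 6.5

# Simplified sampling alternative: draw among the x-values on the flat
# segment with probability proportional to freq
set.seed(1)
flat       <- which(y == 5)
inv_sample <- x[flat][sample.int(length(flat), 1000, replace = TRUE,
                                 prob = freq[flat])]
table(inv_sample)
```

Interpolation piles all inverses onto one point, whereas sampling spreads them over the whole flat region, which is the motivation for inverse='sample'.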
transcan(x, method=c("canonical","pc"),
categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
boot.method=c('approximate bayesian', 'simple'),
trantab=FALSE, transformed=FALSE,
impcat=c("score", "multinom", "rpart"),
mincut=40,
inverse=c('linearInterp','sample'), tolInverse=.05,
pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
imputed.actual=c('none','datadensity','hist','qq','ecdf'),
iter.max=50, eps=.1, curtail=TRUE,
imp.con=FALSE, shrink=FALSE, init.cat="mode",
nres=if(boot.method=='simple')200 else 400,
data, subset, na.action, treeinfo=FALSE,
rhsImp=c('mean','random'), details.impcat='', ...)
## S3 method for class 'transcan'
summary(object, long=FALSE, digits=6, ...)
## S3 method for class 'transcan'
print(x, long=FALSE, ...)
## S3 method for class 'transcan'
plot(x, ...)
## S3 method for class 'transcan'
ggplot(data, mapping, scale=FALSE, ..., environment)
## S3 method for class 'transcan'
impute(x, var, imputation, name, pos.in, data,
list.out=FALSE, pr=TRUE, check=TRUE, ...)
fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
dtrans, derived, fun, vcovOpts=NULL,
robust=FALSE, cluster, robmethod=c('huber', 'efron'),
method=c('ordinary', 'stack', 'only stack'),
funstack=TRUE, lrt=FALSE,
pr=TRUE, subset, fitargs)
## S3 method for class 'transcan'
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
type=c("transformed","original"),
inverse, tolInverse, check=FALSE, ...)
Function(object, ...)
## S3 method for class 'transcan'
Function(object, prefix=".", suffix="", pos=1, ...)
invertTabulated(x, y, freq=rep(1,length(x)),
aty, name='value',
inverse=c('linearInterp','sample'),
tolInverse=0.05, rule=2)
## Default S3 method:
vcov(object, regcoef.only=FALSE, ...)
## S3 method for class 'fit.mult.impute'
vcov(object, regcoef.only=TRUE,
intercepts='mid', ...)
x 
a matrix containing continuous variable values and codes for
categorical variables. The matrix must have column names
( 
formula 
any R model formula 
fitter 
any R, 
xtrans 
an object created by 
method 
use 
categorical 
a character vector of names of variables in 
asis 
a character vector of names of variables that are not to be
transformed. For these variables, the guts of

nk 
number of knots to use in expanding each continuous variable (not
listed in 
imputed 
Set to 
n.impute 
number of multiple imputations. If omitted, single predicted
expected value imputation is used. 
boot.method 
default is to use the approximate Bayesian bootstrap (sample with
replacement from sample with replacement of the vector of residuals).
You can also specify 
trantab 
Set to 
transformed 
set to 
impcat 
This argument tells how to impute categorical variables on the
original scale. The default is 
mincut 
If 
inverse 
By default, imputed values are backsolved on the original scale
using inverse linear interpolation on the fitted tabulated
transformed values. This will cause distorted distributions of
imputed values (e.g., floor and ceiling effects) when the estimated
transformation has a flat or nearly flat section. To instead use
the 
tolInverse 
the multiplier of the range of transformed values, weighted by

pr 
For 
pl 
Set to 
allpl 
Set to 
show.na 
Set to 
imputed.actual 
The default is ‘"none"’ to suppress plotting of actual
vs. imputed values for all variables having any 
iter.max 
maximum number of iterations to perform for 
eps 
convergence criterion for 
curtail 
for 
imp.con 
for 
shrink 
default is 
init.cat 
method for initializing scorings of categorical variables. Default is ‘"mode"’ to use a dummy variable set to 1 if the value is the most frequent value. Use ‘"random"’ to use a random 0-1 variable. Set to ‘"asis"’ to use the original integer codes as starting scores. 
nres 
number of residuals to store if 
data 
Data frame used to fill the formula. For 
subset 
an integer or logical vector specifying the subset of observations to fit 
na.action 
These may be used if 
treeinfo 
Set to 
rhsImp 
Set to ‘"random"’ to use random draw imputation when a
sometimes missing variable is moved to be a predictor of other
sometimes missing variables. Default is 
details.impcat 
set to a character scalar that is the name of a category variable to
include in the resulting 
... 
arguments passed to 
long 
for 
digits 
number of significant digits for printing values by

scale 
for 
mapping , environment 
not used; needed because of rules about generics 
var 
For 
imputation 
specifies which of the multiple imputations to use for filling in

name 
name of variable to impute, for 
pos.in 
location as defined by 
list.out 
If 
check 
set to 
newdata 
a new data matrix for which to compute transformed
variables. Categorical variables must use the same integer codes as
were used in the call to 
fit.reps 
set to 
dtrans 
provides an approach to creating derived variables from a single
filled-in dataset. The function specified as 
derived 
an expression containing R expressions for computing derived
variables that are used in the model formula. This is useful when
multiple imputations are done for component variables but the actual
model uses combinations of these (e.g., ratios or other
derivations). For a single derived variable you can specify for
example 
fun 
a function of a fit made on one of the completed datasets.
Typical uses are bootstrap model validations. The result of

vcovOpts 
a list of named additional arguments to pass to the

robust 
set to 
cluster 
a vector of cluster IDs whose length equals the number
of rows in the dataset being analyzed. When specified, 
robmethod 
see the 
funstack 
set to 
lrt 
set to 
fitargs 
a list of extra arguments to pass to 
type 
By default, the matrix of transformed variables is returned, with
imputed values on the transformed scale. If you had specified

object 
an object created by 
prefix , suffix 
When creating separate R functions for each variable in 
pos 
position as in 
y 
a vector corresponding to 
freq 
a vector of frequencies corresponding to cross-classified 
aty 
vector of transformed values at which inverses are desired 
rule 
see 
regcoef.only 
set to 
intercepts 
this is primarily for 
The starting approximation to the transformation for each variable is taken to be the original coding of the variable. The initial approximation for each missing value is taken to be the median of the nonmissing values for the variable (for continuous ones) or the most frequent category (for categorical ones). Instead, if imp.con is a vector, its values are used for imputing NA values. When using each variable as a dependent variable, NA values on that variable cause all observations to be temporarily deleted. Once a new working transformation is found for the variable, along with a model to predict that transformation from all the other variables, that latter model is used to impute NA values in the selected dependent variable if imp.con is not specified.
When that variable is used to predict a new dependent variable, the current working imputed values are inserted. Transformations are updated after each variable becomes a dependent variable, so the order of variables on x could conceivably make a difference in the final estimates. For obtaining out-of-sample predictions/transformations, predict uses the same iterative procedure as transcan for imputation, with the same starting values for fill-ins as were used by transcan. It also (by default) uses a conservative approach of curtailing transformed variables to be within the range of the original ones. Even when method = "pc" is specified, canonical variables are used for imputing missing values.
Note that fitted transformations, when evaluated at imputed variable values (on the original scale), will not precisely match the transformed imputed values returned in xt. This is because transcan uses an approximate method based on linear interpolation to back-solve for imputed values on the original scale.
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to Copas, 1983). The shrinkage factor is
\frac{1 - \frac{(1 - R2)(n - 1)}{n - k - 1}}{R2}
where R2 is the apparent R^2 for predicting the variable, n is the number of nonmissing values, and k is the effective number of degrees of freedom (aside from intercepts). A heuristic estimate is used for k:
A - 1 + sum(max(0, Bi - 1))/m + m
where A is the number of d.f. required to represent the variable being predicted, the Bi are the number of columns required to represent all the other variables, and m is the number of all other variables. Division by m is done because the transformations for the other variables are fixed at their current transformations the last time they were being predicted. The + m term comes from the number of coefficients estimated on the right hand side, whether by least squares or canonical variates. If a shrinkage factor is negative, it is set to 0. The shrinkage factor is the ratio of the adjusted R^2 to the ordinary R^2. The adjusted R^2 is
1 - \frac{(1 - R2)(n - 1)}{n - k - 1}
which is also set to zero if it is negative. If shrink=FALSE and the adjusted R^2s are much smaller than the ordinary R^2s, you may want to run transcan with shrink=TRUE.
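The shrinkage calculation can be sketched as two small R functions. This is an illustration of the formulas in the text, not transcan's internal code; the function names and the example inputs (A = 4, B = c(4, 2, 1), R2 = 0.5, n = 100) are made up for the illustration.

```r
# Heuristic effective d.f.: A - 1 + sum(max(0, Bi - 1))/m + m
# A: d.f. needed to represent the variable being predicted
# B: vector with the number of columns for each of the m other variables
k_heuristic <- function(A, B) {
  m <- length(B)
  A - 1 + sum(pmax(0, B - 1)) / m + m
}

# Shrinkage factor: adjusted R^2 (floored at 0) divided by ordinary R^2
shrinkage_factor <- function(R2, n, k) {
  adjR2 <- max(0, 1 - (1 - R2) * (n - 1) / (n - k - 1))
  adjR2 / R2
}

k <- k_heuristic(A = 4, B = c(4, 2, 1))     # 3 + 4/3 + 3 = 7.333...
s <- shrinkage_factor(R2 = 0.5, n = 100, k = k)
s                                           # 0.92

# With a weak fit and a small sample the adjusted R^2 is negative,
# so the factor is set to 0
shrinkage_factor(R2 = 0.05, n = 20, k = k)  # 0
```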
Canonical variates are scaled to have variance of 1.0, by multiplying canonical coefficients from cancor by
\sqrt{n - 1}
.
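A quick base R check shows why the factor sqrt(n - 1) is needed: the coefficients returned by stats::cancor yield canonical variates with unit sum of squares, not unit variance. The simulated data below are arbitrary and serve only to illustrate the rescaling.

```r
set.seed(2)
n <- 200
x <- cbind(a = rnorm(n), b = rnorm(n))
y <- cbind(x[, "a"] + rnorm(n), x[, "b"] - rnorm(n))

cc <- cancor(x, y)                        # centers x and y by default
xc <- scale(x, center = cc$xcenter, scale = FALSE)

v_raw    <- drop(xc %*% cc$xcoef[, 1])    # first canonical variate, as returned
v_scaled <- v_raw * sqrt(n - 1)           # rescaled variate

sum(v_raw^2)    # unit sum of squares
var(v_scaled)   # unit variance after rescaling
```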
When specifying a non-rms library fitting function to fit.mult.impute (e.g., lm, glm), running the result of fit.mult.impute through that fit's summary method will not use the imputation-adjusted variances. You may obtain the new variances using fit$var or vcov(fit).
When you specify an rms function to fit.mult.impute (e.g., lrm, ols, cph, psm, bj, Rq, Gls, Glm), automatically computed transformation parameters (e.g., knot locations for rcs) that are estimated for the first imputation are used for all other imputations. This ensures that knot locations will not vary, which would change the meaning of the regression coefficients.
Warning: even though fit.mult.impute takes imputation into account when estimating variances of regression coefficients, it does not take into account the variation that results from estimation of the shapes and regression coefficients of the customized imputation equations. Specifying shrink=TRUE solves a small part of this problem. To fully account for all sources of variation you should consider putting the transcan invocation inside a bootstrap or other loop, if execution time allows. Better still, use aregImpute or a package such as mice that uses real Bayesian posterior realizations to multiply impute missing values correctly.
It is strongly recommended that you use the Hmisc naclus function to determine whether there is a good basis for imputation. naclus will tell you, for example, if systolic blood pressure is missing whenever diastolic blood pressure is missing. If the only variable that is well correlated with diastolic bp is systolic bp, there is no basis for imputing diastolic bp in this case.
At present, predict does not work with multiple imputation.
When calling fit.mult.impute with glm as the fitter argument, if you need to pass a family argument to glm do it by quoting the family, e.g., family="binomial".
fit.mult.impute will not work with proportional odds models when regression imputation was used (as opposed to predictive mean matching). That's because regression imputation will create values of the response variable that did not exist in the dataset, altering the intercept terms in the model.
You should be able to use a variable in the formula given to fit.mult.impute as a numeric variable in the regression model even though it was a factor variable in the invocation of transcan. Use for example fit.mult.impute(y ~ codes(x), lrm, trans) (thanks to Trevor Thompson trevor@hp5.eushc.org).
Here is an outline of the steps necessary to impute baseline variables using the dtrans argument, when the analysis to be repeated by fit.mult.impute is a longitudinal analysis (using e.g. Gls).
1. Create a one row per subject data frame containing baseline variables plus follow-up variables that are assigned to windows. For example, you may have dozens of repeated measurements over years but you capture the measurements at the times measured closest to 1, 2, and 3 years after study entry.
2. Make sure the dataset contains the subject ID.
3. This dataset becomes the one passed to aregImpute as data=. You will be imputing missing baseline variables from follow-up measurements defined at fixed times.
4. Have another dataset with all the nonmissing follow-up values on it, one record per measurement time per subject. This dataset should not have the baseline variables on it, and the follow-up measurements should not be named the same as the baseline variable(s); the subject ID must also appear.
5. Add the dtrans argument to fit.mult.impute to define a function with one argument representing the one record per subject dataset with missing values filled in from the current imputation. This function merges the above 2 datasets; the returned value of this function is the merged data frame.
6. This merged-on-the-fly dataset is the one handed by fit.mult.impute to your fitting function, so variable names in the formula given to fit.mult.impute must match the names created by the merge.
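The merging function described in the steps above might look like the following base R sketch. All data frames, variable names (id, age, sbp0, year, sbp) and the function name merge_baseline are hypothetical; only the merge step itself is shown, not the call to aregImpute or fit.mult.impute.

```r
# One row per subject: baseline variables as filled in for one imputation
baseline <- data.frame(id   = 1:3,
                       age  = c(50, 61, 47),
                       sbp0 = c(120, 135, 128))

# One row per measurement time per subject; no baseline variables, and the
# follow-up measurement is named differently from the baseline one
followup <- data.frame(id   = c(1L, 1L, 2L, 3L, 3L, 3L),
                       year = c(1, 2, 1, 1, 2, 3),
                       sbp  = c(118, 121, 140, 125, 127, 130))

# Function of one argument (the completed one-row-per-subject dataset)
# that returns the merged long-format data frame
merge_baseline <- function(completed) merge(completed, followup, by = "id")

merged <- merge_baseline(baseline)
nrow(merged)    # one row per follow-up record
names(merged)
```

Under these assumptions the function would be supplied as dtrans=merge_baseline, and the model formula would refer to the merged names (here age, sbp0, year, sbp).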
For transcan, a list of class ‘transcan’ with elements
call 
(with the function call) 
iter 
(number of iterations done) 
rsq , rsq.adj 
containing the 
categorical 
the values supplied for 
asis 
the values supplied for 
coef 
the within-variable coefficients used to compute the first canonical variate 
xcoef 
the (possibly shrunk) across-variables coefficients of the first canonical variate that predicts each variable in turn. 
parms 
the parameters of the transformation (knots for splines, contrast matrix for categorical variables) 
fillin 
the initial estimates for missing values ( 
ranges 
the matrix of ranges of the transformed variables (min and max in first and second row) 
scale 
a vector of scales used to determine convergence for a transformation. 
formula 
the formula (if 
The list optionally also contains a vector of shrinkage factors used for predicting each variable from the others. For asis variables, the scale is the average absolute difference about the median. For other variables it is unity, since canonical variables are standardized. For xcoef, row i has the coefficients to predict transformed variable i, with the column for the coefficient of variable i set to NA. If imputed=TRUE was given, an optional element imputed also appears. This is a list with the vector of imputed values (on the original scale) for each variable containing NAs. Matrices rather than vectors are returned if n.impute is given. If trantab=TRUE, the trantab element also appears, as described above. If n.impute > 0, transcan also returns a list residuals that can be used for future multiple imputation.
impute returns a vector (the same length as var) of class ‘impute’ with NA values imputed.
predict returns a matrix with the same number of columns or variables as were in x.
fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation. The var matrix in the fit object has the imputation-corrected variance-covariance matrix. coefficients is the average (over imputations) of the coefficient vectors, variance.inflation.impute is a vector containing the ratios of the diagonals of the between-imputation variance matrix to the diagonals of the average apparent (within-imputation) variance matrix. missingInfo is Rubin's rate of missing information and dfmi is Rubin's degrees of freedom for a t-statistic for testing a single parameter. The last two objects are vectors corresponding to the diagonal of the variance matrix. The class "fit.mult.impute" is prepended to the other classes produced by the fitting function.
When method is not 'ordinary', i.e., stacking is used, fit.mult.impute returns a modified fit object that is computed on all completed datasets combined, with almost all statistics that are functions of the sample size corrected to the real sample size. Elements in the fit such as residuals will have length equal to the real sample size times the number of imputations.
fit.mult.impute stores intercepts attributes in the coefficient matrix and in var for orm fits.
prints, plots, and impute.transcan creates new variables.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 8:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.
aregImpute, impute, naclus, naplot, ace, avas, cancor, prcomp, rcspline.eval, lsfit, approx, datadensity, mice, ggplot, processMI
## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans) #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed) #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age <- impute(x.trans, age)
disease <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH) #uses summary.impute to tell about imputations
#and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans <- predict(x.trans, xnew)
w <- predict(x.trans, xnew, type="original")
age <- w[,"age"] #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans) #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
                             blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code), # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
ggplot(z, scale=TRUE)
plot(z$transformed)
## End(Not run)
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
f <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had the rms ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances
# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of nonmissing values of the categorical variable
## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
                  list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
## End(Not run)
# Do single imputation and create a filled-in data frame
z <- transcan(~x1 + I(x2), data=d, imputed=TRUE)
imputed <- as.data.frame(impute(z, data=d, list.out=TRUE))
# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)
# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame. The following is based on the fact
# that the default output location for impute.transcan is
# given by the global environment
## Not run:
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)
# Example of using invertTabulated outside transcan
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1) # so can reproduce
for(inverse in c('linearInterp','sample'))
print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))
# Test inverse='sample' when the estimated transformation is
# flat on the right. First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}
## Not run:
# While running multiple imputation for a logistic regression model
# Run the rms package validate and calibrate functions and save the
# results in w$funresults
a <- aregImpute(~ x1 + x2 + y, data=d, n.impute=10)
require(rms)
g <- function(fit)
  list(validate=validate(fit, B=50), calibrate=calibrate(fit, B=75))
w <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, fun=g,
                     fitargs=list(x=TRUE, y=TRUE))
# Get all validate results in its own list of length 10
r <- w$funresults
val <- lapply(r, function(x) x$validate)
cal <- lapply(r, function(x) x$calibrate)
# See rms processMI and https://hbiostat.org/rmsc/validate.html#secvalmival
## End(Not run)
## Not run:
# Account for within-subject correlation using the robust cluster sandwich
# covariance estimate in conjunction with Rubin's rule for multiple imputation
# rms package must be installed
a <- aregImpute(..., data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=30, data=d, cluster=d$id)
# Get likelihood ratio chi-square tests accounting for missingness
a <- aregImpute(..., data=d)
h <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=40, data=d, lrt=TRUE)
processMI(h, which='anova') # processMI is in rms
## End(Not run)