Computes the root mean square distance between predicted and corresponding
observed values in an orthogonal multivariate space. This value is the mean
Mahalanobis distance between observed and imputed values in a space defined by
observations and variables where observed and predicted values are defined.
The statistic provides a way to compare imputation (or prediction) results.
While it is designed to work with imputation, the function can be used with objects
that inherit from
lm or with matrices and data frames that
follow the column naming convention described in the details.
...: objects created by any combination of the object types this function
accepts (see the details).
ancillaryData: a data frame that defines variables, passed to impute.yai.
vars: a list of variable names you want to include; if NULL all available
variables are included (note that if impute.yai is called, the
Y-variables are returned when vars is NULL).
wts: a vector of weights used to compute the mean distances, see details below.
rtnVectors: if TRUE, the vectors of individual distances are returned (see Value) rather than the mean Mahalanobis distance.
This function is designed to compute the root mean square distance between observed
and predicted observations over several variables at once. It is the Mahalanobis
distance between observed and predicted, but the name emphasizes the similarities
to root mean square difference (or error, see rmsd).
Here are some notable characteristics.
In the univariate case this function returns the same value as
rmsd with scale=TRUE. In that case
the root mean square difference is computed after
scale has been called on the variable.
grmsd is zero if the imputed values are
exactly the same as the observed values over all variables.
grmsd is ~1.0 when the mean of each
variable is imputed in place of a near neighbor (it would be exactly 1.0 if
the maximum likelihood estimate of the covariance were used rather than
the unbiased estimate – it approaches 1 as n gets large.)
This situation corresponds to regression where the slope is zero.
It indicates that the imputation error is, overall, the same as it
would be if the means of the variables were imputed rather than near
neighbors (it does not signal that the means were imputed).
As with rmsd, values of grmsd > 1.0 indicate that, on average,
the errors in the imputation are greater than they would be if the mean
of the corresponding variables were imputed for each observation.
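The properties above can be checked with a short base R sketch. The helper grmsd_sketch is hypothetical and is not the package's source code. It assumes the statistic is the square root of the mean squared Mahalanobis distance between observed and predicted rows, divided by the number of variables, using the unbiased covariance estimate of the observed values.

```r
# Hypothetical helper, not the yaImpute implementation: root mean square of
# the Mahalanobis distances between observed and predicted rows, per variable.
grmsd_sketch <- function(obs, pred) {
  obs  <- as.matrix(obs)
  pred <- as.matrix(pred)
  # Squared Mahalanobis length of each row of differences, using the
  # unbiased covariance estimate of the observed values.
  d2 <- mahalanobis(obs - pred, center = rep(0, ncol(obs)), cov = cov(obs))
  sqrt(mean(d2) / ncol(obs))
}

obs <- iris[, c("Petal.Length", "Petal.Width")]
n   <- nrow(obs)

# Perfect predictions give zero.
zero_case <- grmsd_sketch(obs, obs)

# Predicting the column means gives sqrt((n - 1) / n), which approaches 1
# as n gets large.
means     <- matrix(colMeans(obs), n, ncol(obs), byrow = TRUE)
mean_case <- grmsd_sketch(obs, means)

# Univariate case: the same as the root mean square difference divided by
# the standard deviation of the observed values.
o <- iris$Petal.Length
p <- fitted(lm(Petal.Length ~ Sepal.Length, data = iris))
uni_case    <- grmsd_sketch(o, p)
scaled_rmsd <- sqrt(mean((o - p)^2)) / sd(o)
```

Under these assumptions, values above 1 can only arise when the predictions are, on average, farther from the observations than the column means are.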
Note that individual
rmsd values can be computed even when
the variance of the variable is zero. In contrast, grmsd can
only be computed in the situation where the observed data matrix is full rank.
Rank is determined using
qr and columns are removed from the
analysis to create this condition if necessary (with a warning).
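The rank check can be illustrated with base R alone. This is a minimal sketch with made-up data. The text does not say which columns the package drops, so choosing them from the pivot order of qr is an assumption, and message stands in for the warning.

```r
# Made-up observed data in which column "c" is an exact multiple of "a".
obs <- cbind(a = c(1, 2, 3, 4, 5),
             c = c(2, 4, 6, 8, 10),
             b = c(2, 1, 4, 3, 6))

decomp <- qr(obs)
rank   <- decomp$rank

# qr() moves columns it finds linearly dependent to the end of the pivot
# vector, so the first `rank` pivots index an independent set of columns.
keep    <- sort(decomp$pivot[seq_len(rank)])
dropped <- setdiff(colnames(obs), colnames(obs)[keep])

if (rank < ncol(obs)) {
  message("columns removed to obtain full rank: ",
          paste(dropped, collapse = ", "))
}
obs_full <- obs[, keep, drop = FALSE]
```

The reduced matrix obs_full has a non-singular covariance matrix, which is what a Mahalanobis distance requires.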
Observed and predicted are matched using the column names. Column names
that end in ".o" hold observed values and are matched to the columns with the
same name but without ".o", which hold imputed (predicted) values. Columns that
do not have matching observed and imputed (predicted) values are ignored.
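The naming convention can be sketched in base R as follows. The data frame is made up for illustration, and treating ".o" strictly as a suffix is an assumption based on the names used in the examples (for instance Petal.Length.o).

```r
# Made-up data: two observed/predicted pairs plus one unmatched column.
dat <- data.frame(
  Petal.Length.o = c(1.4, 4.7, 6.0),
  Petal.Length   = c(1.5, 4.5, 5.8),
  Petal.Width.o  = c(0.2, 1.4, 2.5),
  Petal.Width    = c(0.3, 1.5, 2.1),
  Sepal.Length   = c(5.1, 7.0, 6.3)   # no ".o" partner, so it is ignored
)

cols      <- colnames(dat)
observed  <- cols[grepl("\\.o$", cols)]   # names ending in ".o"
predicted <- sub("\\.o$", "", observed)   # the same names without ".o"

# Keep only the pairs for which both columns are present.
ok        <- predicted %in% cols
observed  <- observed[ok]
predicted <- predicted[ok]

obs  <- as.matrix(dat[, observed,  drop = FALSE])
pred <- as.matrix(dat[, predicted, drop = FALSE])
```

A matrix or data frame laid out like dat is the kind of object the text says can be passed directly.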
Several objects may be passed as "...". The function impute.yai is
called for any objects that were created by yai; ancillaryData and
vars are passed to impute.yai
when it is used.
When objects inherit from
lm, a suitable matrix of observed and predicted values is formed
from the fitted model.
Factors, if found, are removed (with a warning).
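For objects that inherit from lm, one way to form such a matrix is sketched below. The text does not say which extractor functions the package calls, so the use of fitted and residuals here is an assumption, as is the choice of model.

```r
# Multivariate linear model with two response variables.
fit <- lm(cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width,
          data = iris)

pred <- fitted(fit)             # predicted values, one column per response
obs  <- pred + residuals(fit)   # observed = fitted + residual
colnames(obs) <- paste0(colnames(pred), ".o")

# Observed (".o") and predicted columns side by side.
both <- cbind(obs, pred)
```

The result follows the same ".o" naming convention as the matrices and data frames described above.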
When wts is defined, there must be one value for each pair of
observed and predicted variables. If the values are named (preferred), then
the names are matched to the names of predicted variables (no ".o" suffix).
The matched values effectively scale the axes in which distances are computed.
When this is done, the resulting distances are not Mahalanobis distances.
When rtnVectors=FALSE, a sorted named vector of mean distances
is returned; the names are taken from the arguments.
When rtnVectors=TRUE, the function returns vectors of distances, sorted and
named as done when this argument is FALSE.
Nicholas L. Crookston [email protected]
require(yaImpute)

data(iris)
set.seed(12345)

# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compute the average distances between observed and imputed (predicted)
grmsd(msn,mal,lmFit=lm(as.matrix(y) ~ ., data=x[refs,]))

# use all the variables and observations in iris
# Species is a factor and is automatically deleted with a warning
grmsd(msn,mal,ancillaryData=iris)

# here is an example using lm, and another using column
# means as predictions.
impMean <- y
colnames(impMean) <- paste0(colnames(impMean),".o")
impMean <- cbind(impMean,y)
# set the predictions to the means of the variables
impMean[,"Petal.Length"] <- mean(impMean[,"Petal.Length"])
impMean[,"Petal.Width"]  <- mean(impMean[,"Petal.Width"])

grmsd(msn, mal, lmFit=lm(as.matrix(y) ~ ., data=x[refs,]), impMean )

# compare to using function rmsd (values match):
msnimp <- na.omit(impute(msn))
grmsd(msnimp[,c("Petal.Length","Petal.Length.o")])
rmsd(msnimp[,c("Petal.Length","Petal.Length.o")],scale=TRUE)

# these are multivariate cases and they don't match
# because the covariance of the two variables is > 0.
grmsd(msnimp)
colSums(rmsd(msnimp,scale=TRUE))/2

# get the vectors and make a boxplot, identify outliers
stats <- boxplot(grmsd(msn,mal,ancillaryData=iris[,-5],rtnVectors=TRUE),
                 ylab="Mahalanobis distance")
stats$out
#     118      132
#2.231373 1.990961
    lmFit       mal       msn
0.7208731 1.0072846 1.2464372
      mal       msn
0.8645804 1.1095280
Warning messages:
1: In grmsd(msn, mal, ancillaryData = iris) :
  factor(s) have been removed from msn: Species
2: In grmsd(msn, mal, ancillaryData = iris) :
  factor(s) have been removed from mal: Species
    lmFit   impMean       mal       msn
0.7208731 0.9899495 1.0072846 1.2464372
msnimp[, c("Petal.Length", "Petal.Length.o")]
                                    0.5196872
                 rmsdS
Petal.Length 0.5196872
  msnimp
1.246437
    rmsdS
0.7030801
     118      132
2.231373 1.990961