grmsd: Generalized Root Mean Square Distance Between Observed and... In yaImpute: Nearest Neighbor Observation Imputation and Evaluation Tools

Description

Computes the root mean square distance between predicted and corresponding observed values in an orthogonal multivariate space. This value is the mean Mahalanobis distance between observed and imputed values in a space defined by the observations and variables where observed and predicted values are defined. The statistic provides a way to compare imputation (or prediction) results. While it is designed to work with imputation, the function can be used with objects that inherit from `lm` or with matrices and data frames that follow the column naming convention described in the details.

Usage

```
grmsd(..., ancillaryData=NULL, vars=NULL, wts=NULL, rtnVectors=FALSE)
```

Arguments

`...` objects created by any combination of `yai`, `impute.yai`, `ensembleImpute`, `buildConsensus`, `lm` and data frames or matrices that follow the column naming convention described in the details below. If an object is of class `yai`, a call to `impute.yai` is generated internally.

`ancillaryData` a data frame that defines variables, passed to `impute.yai`.

`vars` a list of variable names you want to include; if NULL, all available variables are included (note that when `impute.yai` is called, the Y-variables are returned when `vars=NULL`).

`wts` a vector of weights used to compute the mean distances, see details below.

`rtnVectors` if TRUE, the vectors of individual distances are returned (see Value) rather than the mean Mahalanobis distance.

Details

This function is designed to compute the root mean square distance between observed and predicted observations over several variables at once. It is the Mahalanobis distance between observed and predicted but the name emphasizes the similarities to root mean square difference (or error, see `rmsd`). Here are some notable characteristics.

1. In the univariate case this function returns the same value as `rmsd` with `scale=TRUE`. In that case the root mean square difference is computed after `scale` has been called on the variable.

2. Like `rmsd`, `grmsd` is zero if the imputed values are exactly the same as the observed values over all variables.

3. Like `rmsd`, `grmsd` is ~1.0 when the mean of each variable is imputed in place of a near neighbor (it would be exactly 1.0 if the maximum likelihood estimate of the covariance were used rather than the unbiased estimate; it approaches 1 as n gets large). This situation corresponds to regression where the slope is zero. A value near 1.0 indicates that the imputation error is, overall, the same as it would be if the means of the variables were imputed rather than near neighbors (it does not signal that the means were imputed).

4. Like `rmsd`, values of `grmsd` > 1.0 indicate that, on average, the errors in the imputation are greater than they would be if the mean of the corresponding variables were imputed for each observation.

5. Note that individual `rmsd` values can be computed even when the variance of the variable is zero. In contrast, `grmsd` can only be computed in the situation where the observed data matrix is full rank. Rank is determined using `qr` and columns are removed from the analysis to create this condition if necessary (with a warning).
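The characteristics above can be checked with a small base R sketch. The helper `grmsd_sketch` is hypothetical and is not the package's implementation: it assumes the statistic is the root of the mean squared difference, taken over all observations and variables, after both matrices are transformed so that the observed values have an identity covariance matrix. The univariate check assumes that `rmsd` with `scale=TRUE` amounts to dividing the root mean square difference by the standard deviation of the observed values.

```r
# Hypothetical helper for illustration; this is not yaImpute's grmsd().
grmsd_sketch <- function(obs, pred) {
  obs  <- as.matrix(obs)
  pred <- as.matrix(pred)
  # Whitening transform built from the unbiased covariance of the observed
  # values: cov(obs %*% p) is the identity matrix.
  p <- solve(chol(cov(obs)))
  # Squared differences in the whitened space, averaged over all
  # observations and variables.
  sqrt(mean(((obs - pred) %*% p)^2))
}

obs <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
n   <- nrow(obs)

# Characteristic 2: predictions identical to the observations give zero.
zero_case <- grmsd_sketch(obs, obs)

# Characteristic 3: predicting the column means gives sqrt((n - 1) / n),
# which is slightly below 1 and approaches 1 as n grows.
means     <- matrix(colMeans(obs), nrow = n, ncol = ncol(obs), byrow = TRUE)
mean_case <- grmsd_sketch(obs, means)

# Characteristic 1: with a single variable the value reduces to the root
# mean square difference divided by the standard deviation of the observed.
o1 <- iris$Petal.Length
p1 <- rev(o1)  # arbitrary stand-in predictions
uni_case  <- grmsd_sketch(o1, p1)
uni_check <- sqrt(mean((o1 - p1)^2)) / sd(o1)
```

Under these assumptions the mean-imputation value depends only on the number of observations. With the 50 reference observations used in the Examples section, sqrt(49/50) is about 0.98995, which agrees with the `impMean` entry in the example output.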

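Observed and predicted values are matched using the column names. Columns whose names end in "`.o`" hold observed values and are matched to the columns with the same name without the suffix, which hold imputed (predicted) values. Columns that do not have matching observed and imputed (predicted) values are ignored.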
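The naming convention can be illustrated with a toy data frame in base R. This is only a sketch of the matching rule as described here; the data values and the variable names `obs_cols`, `pred_cols`, and `matched` are made up for the illustration and do not come from the package.

```r
# Toy data frame: two matched pairs plus two columns without a partner.
df <- data.frame(
  Petal.Length.o = c(1.4, 4.7, 6.0),  # observed
  Petal.Length   = c(1.5, 4.5, 5.8),  # predicted
  Petal.Width.o  = c(0.2, 1.4, 2.5),  # observed
  Petal.Width    = c(0.3, 1.5, 2.1),  # predicted
  Sepal.Length   = c(5.1, 7.0, 6.3),  # predicted only, no ".o" partner
  Sepal.Width.o  = c(3.5, 3.2, 3.3)   # observed only, no predicted partner
)

# Observed columns carry the ".o" suffix; strip it to find the predicted name.
obs_cols  <- grep("\\.o$", colnames(df), value = TRUE)
pred_cols <- sub("\\.o$", "", obs_cols)

# Keep only the variables that have both an observed and a predicted column.
matched <- pred_cols[pred_cols %in% colnames(df)]
```

Here `matched` contains `Petal.Length` and `Petal.Width`; `Sepal.Length` and `Sepal.Width.o` have no partner and would be ignored.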

Several objects may be passed as "...". Function `impute.yai` is called for any objects that were created by `yai`; `ancillaryData` and `vars` are passed to `impute.yai` when it is used.

When objects inherit from `lm`, a suitable matrix is formed by calling the `predict` and `resid` functions.
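As a rough sketch of what such a matrix could look like, the fitted values from `predict` can serve as the predicted columns, and the observed values can be recovered as fitted plus residual. How `grmsd` assembles the matrix internally is an assumption here; the object names `fit`, `pred`, `obs`, and `m` are illustrative only.

```r
# Multivariate linear model fitted to the iris data (base R only).
fit <- lm(cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width,
          data = iris)

# Predicted values keep the response names (no suffix).
pred <- predict(fit)

# Observed values are fitted + residual; mark them with the ".o" suffix.
obs <- pred + resid(fit)
colnames(obs) <- paste0(colnames(obs), ".o")

# One matrix holding predicted and observed columns side by side.
m <- cbind(pred, obs)
```

The resulting matrix `m` follows the column naming convention, so it has the same shape as a matrix or data frame that could be passed to `grmsd` directly.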

Factors, if found, are removed (with a warning).

When argument `wts` is defined there must be one value for each pair of observed and predicted variables. If the values are named (preferred), then the names are matched to the names of predicted variables (no `.o` suffix). The matched values effectively scale the axes in which distances are computed. When this is done, the resulting distances are not Mahalanobis distances.

Value

When `rtnVectors=FALSE`, a sorted named vector of mean distances is returned; the names are taken from the arguments.

When `rtnVectors=TRUE`, the function returns vectors of distances, sorted and named as they are when this argument is `FALSE`.

Author(s)

Nicholas L. Crookston

See Also

`yai`, `impute.yai`, `rmsd.yai`, `notablyDifferent`

Examples

```
require(yaImpute)

data(iris)
set.seed(12345)

# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compute the average distances between observed and imputed (predicted)
grmsd(msn,mal,lmFit=lm(as.matrix(y) ~ ., data=x[refs,]))

# use the all variables and observations in iris
# Species is a factor and is automatically deleted with a warning
grmsd(msn,mal,ancillaryData=iris)

# here is an example using lm, and another using column
# means as predictions.

impMean <- y
colnames(impMean) <- paste0(colnames(impMean),".o")
impMean <- cbind(impMean,y)
# set the predictions to the mean's of the variables
impMean[,"Petal.Length"] <- mean(impMean[,"Petal.Length"])
impMean[,"Petal.Width"]  <- mean(impMean[,"Petal.Width"])

grmsd(msn, mal, lmFit=lm(as.matrix(y) ~ ., data=x[refs,]), impMean )

# compare to using function rmsd (values match):
msnimp <- na.omit(impute(msn))
grmsd(msnimp[,c("Petal.Length","Petal.Length.o")])
rmsd(msnimp[,c("Petal.Length","Petal.Length.o")],scale=TRUE)

# these are multivariate cases and they don't match
# because the covariance of the two variables is > 0.
grmsd(msnimp)
colSums(rmsd(msnimp,scale=TRUE))/2

# get the vectors and make a boxplot, identify outliers
stats <- boxplot(grmsd(msn,mal,ancillaryData=iris[,-5],rtnVectors=TRUE),
                 ylab="Mahalanobis distance")
stats$out
#      118      132
# 2.231373 1.990961
```

Example output

```
    lmFit       mal       msn
0.7208731 1.0072846 1.2464372
      mal       msn
0.8645804 1.1095280
Warning messages:
1: In grmsd(msn, mal, ancillaryData = iris) :
  factor(s) have been removed from msn: Species
2: In grmsd(msn, mal, ancillaryData = iris) :
  factor(s) have been removed from mal: Species
    lmFit   impMean       mal       msn
0.7208731 0.9899495 1.0072846 1.2464372
msnimp[, c("Petal.Length", "Petal.Length.o")]
                                    0.5196872
                 rmsdS
Petal.Length 0.5196872
  msnimp
1.246437
    rmsdS
0.7030801
     118      132
2.231373 1.990961
```

yaImpute documentation built on May 2, 2019, 4:44 p.m.