Measuring goodness-of-fit for principal objects.

Description

These functions compute the ‘coverage coefficient’ R_c for local principal curves, local principal points (i.e., kernel density estimates obtained through iterated mean shift), and other principal objects.

Usage

Rc(x,...)

## S3 method for class 'lpc'
Rc(x,...)
## S3 method for class 'lpc.spline'
Rc(x,...)
## S3 method for class 'ms'
Rc(x,...)

base.Rc(data,  closest.coords, type="curve")

Arguments

x

An object used to select a method.

...

Further arguments passed to or from other methods (not needed yet).

data

A data matrix.

closest.coords

A matrix of coordinates of the projected data.

type

For principal curves, leave at the default "curve". For principal points, set to "points".

Details

Rc computes the coverage coefficient R_c, a quantity which estimates the goodness-of-fit of a fitted principal object. This quantity can be interpreted similarly to the coefficient of determination in regression analysis: values close to 1 indicate a good fit, while values close to 0 indicate a ‘bad’ fit (corresponding to linear PCA).

For objects of class lpc, lpc.spline, and ms, S3 methods are available through the generic function Rc. Each method, in turn, calls the base function base.Rc, which can also be used manually if the fitted object is of another class. In principle, base.Rc can be used to assess the goodness-of-fit of any principal object, provided that the coordinates (closest.coords) of the projected data are available. For instance, for HS principal curves fitted via princurve, this information is contained in component $s, and for a k-means object, say fitk, it can be obtained via fitk$centers[fitk$cluster,]. Set type="points" in the latter case.
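The manual route described above can be sketched as follows. This is only an illustration: the choice of three cluster centres and the seed are arbitrary assumptions, the princurve part runs only if that package is installed, and `principal_curve` is the function name used by recent princurve versions.

```r
library(LPCM)

data(calspeedflow)
X <- as.matrix(calspeedflow[, 3:4])

# k-means: each observation is "projected" onto its cluster centre.
set.seed(1)                      # arbitrary seed, for reproducibility only
fitk <- kmeans(X, centers = 3)   # 3 centres is an assumption for illustration
rc_kmeans <- base.Rc(X, fitk$centers[fitk$cluster, ], type = "points")

# HS principal curve: the projected coordinates are stored in component $s.
if (requireNamespace("princurve", quietly = TRUE)) {
  fitp <- princurve::principal_curve(X)
  rc_hs <- base.Rc(X, fitp$s)    # type = "curve" is the default
}
```

Both calls pass the data and the projections on the same (unscaled) scale, so no scaling adjustment is involved.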

The function Rc attempts to compute all missing information, so the less information the given object x contains, the longer the computation takes. Note also that Rc looks up the option scaled in the fitted object and accounts for the scaling automatically. Important: If the data were scaled, do NOT unscale the results by hand in order to feed the unscaled version into base.Rc; this will give a wrong result.

In terms of methodology, these functions compute R_c directly through the mean reduction of absolute residual length, rather than through the area above the coverage curve.
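To make the idea of a "mean reduction of absolute residual length" concrete, here is a rough base-R sketch for the curve case. It is not the package's implementation: the function name `rc_sketch` is hypothetical, and it assumes that the reference fit is the first principal component line (in line with the remark that values close to 0 correspond to linear PCA) and that residual length is the Euclidean distance between an observation and its projection.

```r
# Hypothetical helper, for illustration only; not part of LPCM.
rc_sketch <- function(data, closest.coords) {
  data <- as.matrix(data)
  closest.coords <- as.matrix(closest.coords)

  # Residual lengths with respect to the fitted principal object.
  res_fit <- sqrt(rowSums((data - closest.coords)^2))

  # Assumed reference: orthogonal projection onto the first
  # principal component line through the data mean.
  centred <- sweep(data, 2, colMeans(data))
  v1 <- prcomp(data)$rotation[, 1]
  proj <- (centred %*% v1) %*% t(v1)
  res_pca <- sqrt(rowSums((centred - proj)^2))

  # Relative reduction of the mean absolute residual length.
  1 - mean(res_fit) / mean(res_pca)
}
```

Under these assumptions, a fit that reproduces every observation exactly gives a value of 1, and a fit that coincides with the first principal component line gives 0.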

These functions currently do not account for observation weights, i.e., R_c is computed through the unweighted mean reduction in absolute residual length (even if weights were used for the curve fitting).

Acknowledgements

Contributions (in the form of pieces of code or useful suggestions for improvements) by Jo Dwyer, Mohammad Zayed, and Ben Oakley are gratefully acknowledged.

Author(s)

J. Einbeck and L. Evers.

References

Einbeck, Tutz, and Evers (2005). Local principal curves. Statistics and Computing 15, 301-313.

Einbeck (2011). Bandwidth selection for nonparametric unsupervised learning techniques – a unified approach via self-coverage. Journal of Pattern Recognition Research 6, 175-192.

See Also

lpc.spline, ms, coverage.

Examples

data(calspeedflow)
lpc1 <- lpc.spline(lpc(calspeedflow[,3:4]), project=TRUE)
Rc(lpc1)
# is the same as:
base.Rc(lpc1$lpcobject$data, lpc1$closest.coords)

ms1 <- ms(calspeedflow[,3:4],plotms=0)
Rc(ms1)
# is the same as:
base.Rc(ms1$data, ms1$cluster.center[ms1$closest.label,], type="points")