subsDiag: Apply two types of diagnostics to clustered data


Description

Calculate diagnostics on the subspace identified by cluster analysis

Usage

subsDiag(X, ncl, clustMethod = "hc", nSim = 2000, sigLvl = 0.05, status = TRUE)

Arguments

X

The data.

ncl

The structure of the data, obtained using a clustering statistic (e.g. ddwtGap) or from some other hypothesis.

clustMethod

Is the cluster definition obtained using hierarchical clustering ("hc") or k-means ("km")? See the details in ddwtGap or the dedicated help pages.

nSim

The number of simulations used for Monte Carlo estimates of significance.

sigLvl

The significance level for the chi-squared test of whether observations are significantly influential on the structure of the data.

status

Report the status of the functions?

Details

Model diagnostics assess the validity of particular assumptions. Applying the model diagnostics requires at least two individuals within each well-separated group, yet the cluster identification algorithms can identify isolated individuals as whole groups. Depending upon the circumstances, it might be reasonable to consider such individuals suspicious. The diagnostics aim to identify individuals that (a) are extreme in measurement and (b) significantly affect the definition of the data structure.

Brooks (1994) calculated the influence of each data point by jack-knifing, i.e. by comparing the dominant eigenvalues of the data with and without a focal observation. A large difference in dominant eigenvalues implies that the focal observation exerts large influence in the sample; the significance of that influence can be assessed using Monte Carlo estimates. If variables are not normally distributed, reference data sets can be generated using singular-value decomposition.
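The jack-knife comparison described above can be sketched in base R. This is a minimal illustration for a single group, assuming the covariance matrix is the one being decomposed; jackknifeInfluence is a hypothetical helper and not a function of the splits package, and the Monte Carlo significance step is not shown.

```r
# Illustrative sketch only: leave-one-out change in the dominant eigenvalue.
# `jackknifeInfluence` is a hypothetical helper, not part of the splits package.
jackknifeInfluence <- function(X) {
  X <- as.matrix(X)
  # Dominant eigenvalue of the covariance matrix for the full sample
  fullEig <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values[1]
  # Recompute it with each observation removed in turn
  vapply(seq_len(nrow(X)), function(i) {
    looEig <- eigen(cov(X[-i, , drop = FALSE]), symmetric = TRUE,
                    only.values = TRUE)$values[1]
    fullEig - looEig
  }, numeric(1))
}

infl <- jackknifeInfluence(iris[, 1:4])
# Observations whose removal changes the dominant eigenvalue the most
head(order(abs(infl), decreasing = TRUE))
```

Large absolute values mark observations that pull the dominant axis of variation; whether they are significant would still need a reference distribution such as the simulated one controlled by nSim.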

Fung (1999) devised a method to identify extreme observations outside the expected range of a particular sample.
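As a rough illustration of flagging extreme observations within one group, the sketch below compares squared Mahalanobis distances with a chi-squared cut-off. This is a generic stand-in chosen for brevity, not necessarily the statistic of Fung (1999) or the one computed by subsDiag; flagExtreme is a hypothetical helper.

```r
# Illustrative sketch only: flag extreme observations within one group using
# squared Mahalanobis distances and a chi-squared cut-off. This is a generic
# stand-in, not necessarily the statistic of Fung (1999) or of subsDiag.
flagExtreme <- function(X, sigLvl = 0.05) {
  X <- as.matrix(X)
  # Squared distance of each row from the group centroid
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
  # Under multivariate normality, d2 is approximately chi-squared with p df
  cutoff <- qchisq(1 - sigLvl, df = ncol(X))
  unname(which(d2 > cutoff))
}

setosa <- iris[iris$Species == "setosa", 1:4]
flagExtreme(setosa)
```

The returned values are row indices within the group that was passed in, so they would need mapping back to the full data when several groups are examined.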

Note that data need not be questionable or unusual to exert large influence.

Value

A list containing:

both

The index of observations that are both influential and extreme.

influence

The index of influential observations.

distance

The index of extreme observations.
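A short sketch of how the returned indices might be used to subset the data. The list res below is filled with made-up indices purely for illustration, assuming each element is an integer vector of row numbers; a real call to subsDiag would supply them.

```r
# Made-up indices purely for illustration; NOT output from subsDiag.
res <- list(both = 16L, influence = c(16L, 42L), distance = c(16L, 107L))

X <- as.matrix(iris[, 1:4])

# Rows flagged as both influential and extreme
X[res$both, , drop = FALSE]

# Influential but not extreme: influence alone need not mean unusual data
influentialOnly <- setdiff(res$influence, res$both)

# Data without the doubly flagged rows; setdiff() on the row numbers stays
# correct even when res$both is empty, unlike negative indexing
Xclean <- X[setdiff(seq_len(nrow(X)), res$both), , drop = FALSE]
```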

Author(s)

Thomas H.G. Ezard tomezard [at] gmail [dot] com

References

Brooks, S.P. 1994. Diagnostics for Principal Components: Influence Functions as Diagnostic Tools. The Statistician 43: 483-494.

Ezard, T.H.G., Pearson, P.N. & Purvis, A. 2010. Algorithmic Approaches to Delimit Species in Multidimensional Morphospace. BMC Evol. Biol. 10: 175, doi:10.1186/1471-2148-10-175.

Fung, W.-K. 1999. Outlier Diagnostics in Several Multivariate Samples. The Statistician 48: 73-84.

See Also

dimReduct, ddwtGap

Examples

## following the example in ddwtGap
library(splits)
data(iris)
## diagnostics for the four measurements, assuming three clusters
subsDiag(as.matrix(iris[, 1:4]), 3)

splits documentation built on July 16, 2021, 3 p.m.