# PP3many: Main function to carry out three-dimensional projection... In PP3: Three-Dimensional Exploratory Projection Pursuit

## Description

Given a multivariate data set this function applies exploratory projection pursuit to find an interesting three-dimensional projection of the input data.

## Usage

 `1` ```PP3many(xm, nrandstarts = 100, lapplyfn = lapply, action = 0, limit = 3, text = 0) ```

## Arguments

 `xm` Data matrix. Note: ROWS correspond to variables, and so the input matrix here is (probably) the transpose of what you might expect (and is accepted by principal components functions). This might change in the future. `nrandstarts` Number of random starts. `lapplyfn` By default this is lapply. If you can use the parallel package then you can replace this with mclapply to get a speed up. The latter is faster if you know how to use the parallel package and you often need to set the `options(mc.cores=x)` option to a value of x giving the number of codes you wish to use. `action` Integer taking the value 0, 1, 2 or 3. This controls how multivariate outliers are treated. If 0, then no action is taken, the multivariate set is supplied unmolested to the projection pursuit routine. If 1, then outliers are removed. If 2,3 , then outliers are moved towards the centre according to log or square root trimming (the precise formulae are described in the trimsu FORTRAN code, and stem from Jones and Sibson (1987). The outliers are only removed, or trimmed, for the purposes of index calculation. They are retained in the final solution and projected according to the projection direction decided by the projection pursuit algorithm computed on the non-outlier (majority) portion of the data. `limit` This argument is only used if `action` is 1, 2 or 3 and outliers need to be dealt with. If observations are greater in distance than `limit` then they are considered to be outliers. `text` Integer. Has no effect if set to zero. If set to one the FORTRAN code will print out informational messages.

## Details

Exploratory projection pursuit is a method introduced by Friedman and Tukey (1974) which projects multivariate K-dimensional data down onto a L-dimensional subspace, using a projection A (an LxK-dimensional matrix). A projection index, I, is devised to measure how interesting the projection is and hence usually I=I(A). A numerical optimiser is then employed to find the projection A that maximises the index of interestingness.

A goal of exploratory projection pursuit is to find projections that highlight interesting structure and clustering. Often, this method can discover interesting structures that existing methods miss, as it uses a different measure of interestingness.

As with many other optimisation problems of this sort the numerical optimiser does not find the global maximum, but a local maximum. In any case, the global maximum might not correspond to the only interesting projection direction, so it is worth exploring many projections that result in large projection indices. One way of obtaining these is to run projection pursuit from several random starting directions. Although this does not guarantee to find all interesting local optima, it will find many of them.

This code implements the three-dimensional version of the moment projection index proposed by Jones and Sibson, (1987). Description of the three-dimensional version can be found in Nason (1995). This version first centres the data (removes its mean), and then spheres it (transforms its covariance matrix to the identity). The aim of sphering is to ensure that any structure discovered by projection pursuit is not related to anything that could be found by principal components analysis (because principal components "looks for structure" contained in the covariance matrix, and there will not be any if the covariance matrix is the identity). The moment index is an approximation to the entropy index, which looks for departures in the empirical distribution of the projected data from standard normality. The heuristic is that anything that is far from normal is â€˜interestingâ€™. Hence, clustered, bi-, tri- or multimodal is not normal and hence interesting and so clustered data is deemed to be interesting by the moment index. The moment index has been criticised for being sensitive to outliers. More discussion of that, and designs of robust indices can be found in Nason (2001). However, recent numerical experimentation shows that the moment index indeed does find projections with outliers, but these can be easily and quickly discounted; and the method does routinely find interesting projections, when they are present.

## Value

A `PP3` class object, which is a list with the following components.

 `ix3` A vector containing the optimised projection index correspond to each of the `nrandstarts` random starts. `info` A list of `nrandstarts` lists. Each item in the list contains a list with information obtained during the optimisation process. The components of each list are `par` the final optimisation parameters corresponding to the optimal projection direction (if an optimum was found); `value` the associated optimal projection index; `counts` the number of evaluations of the projection index and, separately, for the gradient; `convergence` a code indicating the outcome of the optimisation; `message` a message that describes the convergence outcome. The last three components are information produced by the R `optim` function and are described more fully there. `pdata.list` The data projected according to the optimal projection direction resulting from each random start. A list containing `nrandstarts` matrices, each one corresponding to the optimisation resulting from each random start. Each matrix has three rows (corresponding to the three projection directions) and a number of columns equal to the number of multivariate cases (or observations, the number of columns of the input matrix). The `plot.PP3` function uses this information to produce plots of the data projected according projection solutions, when its `number` argument is specified. `pseudp.vals` It can be hard to evaluate a given optimal projection index value and know whether it can be judged as large. One way of assessing it is to compare it to a set of projection indices computed (without optimisation) on another set of random projection directions. It is known that most arbitrary projections result in uninteresting projections, so a collection of `pseudp.vals` acts as a set of projection index control values (NOT p-values) to which the real `ix3` values can be compared to. The `plot.PP3` functoin uses the `pseudp.vals` to draw a red density estimate of the control values (and also plots median, upper quartile, 0.9 quantile, maximum) along with a histogram of the optimised `ix3` values, and one can easily see what should be considered to be large such values. `origvarnames` The names of the original variables (names of rows) from the original data matrix. This is taken from the `dimnames` component of the data matrix.

## Note

Nearly all of PP3 was written by Guy Nason for his PhD (1992). Nason is grateful too, and worked under the guidance of the late, great, Robin Sibson. Robin also wrote a carefully written eigendecomposition subroutine, which is part of this package.

G. P. Nason

## References

Friedman, J.H. and Tukey, J.W. (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput., 23, 881-890.

Jones, M.C. and Sibson, R. (1987) What is projection pursuit? (with discussion) J. R. Statist. Soc. A, 150, 1-36.

Nason, G. P. (1995) Three-dimensional projection pursuit. J. R. Statist. Soc. C, 44, 411-430.

Nason, G. P. (2001) Robust projection indices. J. R. Statist. Soc. B, 63, 551-567.

`getPP3index`,`getPP3loadings`, `getPP3projdata`,`plot.PP3`, `print.PP3`, `summary.PP3`
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101``` ```# # The flea beetle data # data(beetle) # # Run projection pursuit with 100 random starts (normally, you'd use MANY # more random starts, e.g. 1000 or more. Here, we keep the number small to # help CRAN # # # N.b. I am going to set.seed here, so results match what you might see # when trying THESE functions, but, in general, you can ignore set.seed # or set it to your favourite value # set.seed(1) beetle.PP3 <- PP3many(t(beetle), nrandstarts=100) # # Look at the output # beetle.PP3 #Class 'PP3' : Three-dimensional Projection Pursuit Object: # ~~~ : List with 5 components with names # ix3 info pdata.list pseudp.vals origvarnames # #Number of random start(s): 100 #Maximum projection index is 22.02255 achieved by 1 random start(s). #(Partial) list of those starts achieving max are: 90 # #summary(.): #---------- # Summary statistics of projection index # Min. 1st Qu. Median Mean 3rd Qu. Max. # 11.51 14.81 16.18 16.34 17.89 22.02 # Summary statistics of pseudo p-values # Min. 1st Qu. Median Mean 3rd Qu. Max. # 8.592 10.885 12.361 12.466 13.992 18.210 # # The print out shows that 100 random starts were executed and the max # projection index was 22.02255 and only one of those random starts found # this (sometimes more than one random start converges to the same maximum). # # The index number of the run which found the maximum was 90 (this number # can be useful later to access the maximum). # # The summary gives the summary statistics of the 100 projection index # values found. The max is the same as above, but also the distribution # can be discerned. # # The distribution of the pseudo-p-values (NOT actual p-values) is presented # after that. These are the projection indices computed purely on random # directions, not the optimised versions and so you can think of them as # null values to compare the earlier optimised values. E.g. the maximum of # the pseudo-projection indices is 18.21, so any actual optimised projection # index larger than this might be interesting. # # Now produce a plot (using all projection index info on 100 runs): # ## Not run: plot(beetle.PP3) # # This produces (a) a histogram of the projection indices (b) a red density # estimate of the pseudo-projection indices and (c) the median, upper quartile, # 0.9 quantile and maximum of the pseudos as red dotted vertical lines. The red # information corresponds to a kind of null, and so projection indices larger # than these values might be interesting. The plot also produces some text: # #Big Projection Indices #Maximum Psuedo p-value: 18.2103 #Index Number and associated projection indices # 90 74 4 87 75 54 13 6 #22.02255 21.36578 21.29397 20.86531 19.59663 19.42427 19.34596 19.26520 # 23 60 #19.22810 19.16459 # # This is a list of the 10 biggest projection indices and their respective # identity numbers (which one of the random starts generated it). These # can be used in the plot function with a number argument to generate further # information/plots about the projection solution. Note, the number of # biggest projection indices can be controlled with the nbig argument of # plot.PP3. # # Now suppose we wanted to look at the projection solution 74, which had the # second-biggest projection index. We can plot the projected data with the # following command: # ## Not run: plot(beetle.PP3, number=74, colvec=dimnames(beetle)[[1]]) # # The colvec supplies the group structure so the different species can # be coloured differently. The label argument permits you to put text # labels there, per point as well as colours. # # You can extract information from the beetle.PP3 object using the # extractor functions, getPP3index, getPP3projdata and the variable loadings # using getPP3loading. For example, # getPP3index(beetle.PP3, 74) # 74 #21.36578 # # gets the second largest projection index. The third-largest can be obtained # by replacing 74 by 4, etc. ```