The Philosophy of Graphics in rrr

The graphical display of multivariate data is not a new problem. John Tukey's PRIM-9^[Friedman, Jerome H., and Werner Stuetzle. “John W. Tukey's Work on Interactive Graphics.” The Annals of Statistics, vol. 30, no. 6, 2002, pp. 1629–1639.] was a program to plot multivariate data in up to 9 dimensions. It allowed the user to picture, rotate, isolate, and mask data -- it was truly a revolutionary step in interactive graphics.^[The 1974 documentary film, courtesy of Bin-88 Productions, about PRIM-9 featuring John Tukey at the Stanford Linear Accelerator Center can be viewed at the ASA Video Library online at http://stat-graphics.org/movies/prim9.html] While there are contemporary versions of that vision of interactive multivariate graphics, the author of this package believes that 9 dimensions is too many dimensions. People have an intuitive understanding of three-dimensional space (four if you include time), but there are no easy ways to intuitively understand higher dimensions.

Since the techniques of reduced-rank regression -- especially principal component analysis and canonical variate analysis -- are designed to reduce dimensionality, they may offer a compromise: project high-dimensional data onto a few dimensions in which patterns can be seen intuitively.

Take, for instance, principal component analysis. If its goal is to reduce the problem to a few dimensions -- say 2 or 3 -- then that goal amounts to being able to plot the principal components in 2- or 3-dimensional space.
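As a minimal sketch of this idea, using only base R rather than rrr, the first two principal component scores of the built-in iris measurements can be computed with prcomp() and plotted directly. The object names below are illustrative and do not come from the package.

```r
# Base-R sketch (not rrr): project the four iris measurements onto two
# principal components so they can be drawn in an ordinary scatter plot.
features <- iris[, 1:4]
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Scores on the first two components: one row per flower, two columns.
scores <- pca$x[, 1:2]

# Share of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Two-dimensional view of four-dimensional data, colored by species.
plot(scores, col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Inspecting var_explained shows how much information the two plotted dimensions retain, which is the trade-off the plotting functions in this section are meant to make visible.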

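# library() stops with an error if a package is missing; require() only warns.
library(dplyr)
library(ggplot2)
library(rrr)  # plotting functions and example data sets used below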

### LOAD DATA

### TOBACCO DATA SET
data(tobacco)

tobacco_x <- tobacco %>%
    select(starts_with("X"))

tobacco_y <- tobacco %>%
    select(starts_with("Y"))

### PENDIGITS DATA SET
data(pendigits)
digits <- as_tibble(pendigits)  # as_data_frame() is deprecated

digits_class <- digits %>% select(V35)
digits_features <- digits %>% select(-V35, -V36)

### IRIS DATA SET
data(iris)
iris <- as_tibble(iris)  # as_data_frame() is deprecated

iris_features <- iris %>%
    select(-Species)

iris_class <- iris %>%
    select(Species)

### COMBO-17 DATA SET
data(COMBO17)
galaxy <- as_tibble(COMBO17) %>%  # as_data_frame() is deprecated
       select(-starts_with("e."), -Nr, -UFS:-IFD) %>%
       na.omit()

galaxy_x <- galaxy %>%
    select(-Rmag:-chi2red)

galaxy_y <- galaxy %>%
    select(Rmag:chi2red)

Rank Trace Plots

rank_trace() can draw rank trace plots for reduced-rank regression, principal component analysis, and canonical variate analysis by setting type = "identity" (the default), type = "pca", or type = "cva", respectively.

rank_trace(tobacco_x, tobacco_y)
rank_trace(digits_features, digits_features, type = "pca")
rank_trace(galaxy_x, galaxy_y, type = "cva")
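To make concrete what a rank trace displays, the following base-R sketch computes rank-trace coordinates by hand for the reduced-rank regression case with an identity weight matrix. It assumes the usual definition: for each rank, the relative change in the coefficient matrix and in the residual covariance matrix compared with the full-rank fit. It is not the package's internal code, and the mtcars variables are an arbitrary illustrative choice.

```r
# Hand-rolled rank-trace coordinates (illustrative sketch; not rrr's code).
x <- scale(as.matrix(mtcars[, c("wt", "qsec", "drat", "cyl")]))  # predictors (assumed choice)
y <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp")]))          # responses (assumed choice)

s_xx <- cov(x)
s_yx <- cov(y, x)
s_yy <- cov(y)

# Full-rank coefficient matrix and eigenvectors of S_yx S_xx^{-1} S_xy.
c_full <- s_yx %*% solve(s_xx)
v <- eigen(c_full %*% t(s_yx), symmetric = TRUE)$vectors

frob <- function(m) sqrt(sum(m^2))  # Frobenius norm

# Residual covariance implied by a given coefficient matrix.
resid_cov <- function(cc) {
  s_yy - cc %*% t(s_yx) - s_yx %*% t(cc) + cc %*% s_xx %*% t(cc)
}
see_full <- resid_cov(c_full)

# One row per candidate rank, from 0 up to the number of responses.
rank_trace_coords <- t(sapply(0:ncol(y), function(t_rank) {
  v_t <- v[, seq_len(t_rank), drop = FALSE]
  c_t <- v_t %*% t(v_t) %*% c_full  # rank-t coefficient matrix
  c(rank = t_rank,
    delta_c = frob(c_full - c_t) / frob(c_full),
    delta_ee = frob(see_full - resid_cov(c_t)) / frob(see_full - s_yy))
}))

plot(rank_trace_coords[, "delta_c"], rank_trace_coords[, "delta_ee"],
     type = "b", xlab = "Change in coefficients", ylab = "Change in residual covariance")
text(rank_trace_coords[, "delta_c"], rank_trace_coords[, "delta_ee"],
     labels = rank_trace_coords[, "rank"], pos = 3)
```

Under these definitions, rank 0 sits at (1, 1) and the full-rank fit at (0, 0). A candidate rank is typically read off as the smallest rank whose point already lies close to the origin.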

Rank trace plots can be made interactive with the argument interactive = TRUE. Hovering over a point on the rank trace shows its rank-trace coordinates. When rank-trace points are clustered tightly together, the user can zoom in on the plot to investigate differences in rank -- though, of course, ranks whose points lie close enough together to require zooming are unlikely to be good estimates of $t$.

rank_trace(digits_features, digits_features, type = "pca", interactive = TRUE)

Residual Plots

Residual plots can be created for reduced-rank regression and canonical variate analysis.

residuals(tobacco_x, tobacco_y, rank = 1)
galaxy_residuals <- residuals(galaxy_x, galaxy_y, type = "cva")
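For readers who want to see where such residuals come from, here is a base-R sketch of the residuals from a rank-t reduced-rank regression with an identity weight matrix. The helper rrr_residuals() is hypothetical, written only for this illustration, and is not part of the package. The mtcars variables are again an arbitrary choice.

```r
# Base-R sketch (not rrr): residuals from a rank-t reduced-rank regression.
x <- scale(as.matrix(mtcars[, c("wt", "qsec", "drat", "cyl")]))  # predictors (assumed choice)
y <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp")]))          # responses (assumed choice)

# Hypothetical helper: x and y are assumed to be centered already.
rrr_residuals <- function(x, y, rank) {
  s_xx <- cov(x)
  s_yx <- cov(y, x)
  c_full <- s_yx %*% solve(s_xx)
  v <- eigen(c_full %*% t(s_yx), symmetric = TRUE)$vectors
  v_t <- v[, seq_len(rank), drop = FALSE]
  c_t <- v_t %*% t(v_t) %*% c_full  # rank-constrained coefficients
  y - x %*% t(c_t)
}

res1 <- rrr_residuals(x, y, rank = 1)

# Serial plot of the residuals for the first response.
plot(res1[, 1], type = "b", xlab = "Observation", ylab = "Residual (mpg)")
```

At full rank the constraint is inactive, so the residuals coincide with those of an ordinary multivariate least-squares fit. That gives a quick sanity check for any hand-written version.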

Specific elements of the plot matrix can be extracted the same way one would index a matrix in R. Since these plots are ggplot objects, the user can augment them using ggplot2 syntax.

galaxy_residuals[2,1] + ggplot2::ggtitle("Serial Residuals for CV1")

galaxy_residuals[2,2] + ggplot2::ggtitle("Density Plot of CV1 Residuals")

Pairwise Plots

Pairwise plots can be created for PCA, CVA, and LDA by setting type = "pca", type = "cva", and type = "lda", respectively.

pairwise_plot(digits_features, digits_class, pair_x = 1, pair_y = 3)
pairwise_plot(galaxy_x, galaxy_y, type = "cva", pair_x = 2)
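As a rough illustration of the canonical variate case, the sketch below computes canonical variate scores with stats::cancor() and plots one pair against each other. It assumes that pair_x = j refers to the j-th pair of canonical variates, which the text above does not spell out. The code uses base R only, not rrr, and the mtcars variables are an arbitrary choice.

```r
# Base-R sketch (not rrr): canonical variate scores via stats::cancor().
x <- as.matrix(mtcars[, c("wt", "qsec", "drat", "cyl")])  # predictors (assumed choice)
y <- as.matrix(mtcars[, c("mpg", "disp", "hp")])          # responses (assumed choice)
cc <- cancor(x, y)

# Center with the means cancor() used, then project onto its coefficients.
xi <- scale(x, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
omega <- scale(y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef

# Plot the j-th pair of canonical variates against each other.
j <- 2
plot(xi[, j], omega[, j],
     xlab = paste0("xi", j), ylab = paste0("omega", j))
```

The correlation within each plotted pair equals the corresponding entry of cc$cor, so a tight linear cloud indicates a strong canonical relationship.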

Pairwise plots can be turned into interactive plotly graphs by setting interactive = TRUE. The user can use this to mask data -- if class labels are used -- or to zoom in and hover over points.

pairwise_plot(digits_features, digits_class, type = "pca", pair_x = 1, pair_y = 3, interactive = TRUE)

Allpairs Plots

Allpairs plots can be created for principal component analysis. They show several principal components at once by arranging every pairwise plot in a plot matrix.

allpairs_plot(digits_features, digits_class, type = "pca", rank = 3)

3D Plots

Three-dimensional plots allow the user to zoom and rotate and, when classes are labeled, to mask data.

Principal component scores can be plotted in three dimensions by setting type = "pca".

threewise_plot(digits_features, digits_class, type = "pca")

For canonical variate analysis, threewise_plot() draws three-dimensional residual plots when type = "cva".

threewise_plot(galaxy_x, galaxy_y, type = "cva")

Linear discriminant scores can be plotted in three dimensions by setting type = "lda".

threewise_plot(digits_features, digits_class, type = "lda", k = 0.001)


chrisaddy/rrr documentation built on May 13, 2019, 6:21 p.m.