knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE,
                      message = FALSE,
                      fig.align = "center", 
                      fig.width = 6, 
                      fig.height = 5,
                      out.width = "60%", 
                      collapse = TRUE,
                      comment = "#>",
                      tidy.opts = list(width.cutoff = 65),
                      tidy = FALSE)
library(knitr)
library(magrittr)
library(loon.tourr, quietly = TRUE)
library(tidyverse, quietly = TRUE)
library(class, quietly = TRUE)
set.seed(12314159)
imageDirectory <- "./images/classification"
dataDirectory <- "./data/classification"
path_concat <- function(path1, ..., sep="/") {
  # The "/" is standard unix directory separator and so will
  # work on Macs and Linux.
  # In windows the separator might have to be sep = "\" or 
  # even sep = "\\" or possibly something else. 
  paste(path1, ..., sep = sep)
}
library(RDRToolbox)

A grand tour "method" is an algorithm that generates a sequence of projections of the data onto lower-dimensional spaces. Once the original multivariate data set has been projected onto some "interesting" plane, a natural question arises: "what's next?"

One possible use is "classification". Rather than feeding the original data set into a classifier, could an interesting projection improve the prediction performance, with the other hyper-parameters held fixed? This vignette explores that question.
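To make the idea concrete, here is a minimal sketch (not part of the vignette's analysis) of what a single tour projection is: a random orthonormal basis applied to the data by matrix multiplication. It uses the built-in iris measurements and assumes the tourr package is installed.

# a random p x d orthonormal basis, applied to the data by matrix multiplication
X <- as.matrix(iris[, 1:4])                  # p = 4 numeric measurements
proj <- tourr::basis_random(ncol(X), d = 2)  # one random orthonormal 2D basis
lowX <- X %*% proj                           # the data projected onto that plane
head(lowX)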

Data

The data set olive records the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, etc.) found in the lipid fraction of 572 Italian olive oils. The oils were sampled from three Italian regions, with a varying number of areas within each region. The regions and their areas are shown in the following table [@loon]:

data.frame(
  Region = c("North", "South", "Sardinia"),
  Area = c("North-Apulia, South-Apulia, Calabria, Sicily",
           "East-Liguria, West-Liguria, Umbria", 
           "Coastal-Sardinia, Inland-Sardinia")
) %>% 
  kable()

Task

Task: can we project the data onto a lower-dimensional space in which each observation can still be classified well by Region?

Step 1: randomly pick 80% of the data set as the training data and keep the remaining 20% as the test data.

set.seed(123)
N <- nrow(olive)
trainId <- sample(seq(N), 
                  size = floor(0.8 * N))
testId <- setdiff(seq(N), trainId)
acids <- setdiff(colnames(olive), c("region", "area"))
trainX <- olive[trainId, acids]
testX <- olive[testId, acids]
trainY <- olive[trainId, "region"]
testY <- olive[testId, "region"]

Step 2: scale the data set. Since the magnitudes of the variables differ considerably, a scaling step should be applied first so that no single variable dominates the projection. Here we use the variable scaling method, which rescales each column to the range zero to one (a detailed description of the available scalings can be found in help("l_tour")).

row.names(trainX) <- NULL
kable(head(trainX))
scalingTrainX <- loon::l_getScaledData(trainX, scaling = "variable")
scalingTestX <- loon::l_getScaledData(testX, scaling = "variable")
kable(head(scalingTrainX), digits = 2)
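As a check of what variable scaling does, a manual min-max rescaling of each column should (if the description above is accurate) agree with the output of l_getScaledData:

# manual zero-one rescaling of each column; a sketch of "variable" scaling
manualScale <- apply(trainX, 2, function(x) (x - min(x)) / (max(x) - min(x)))
all.equal(unname(manualScale), unname(as.matrix(scalingTrainX)))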

Projection Methods

We use one of the most basic (but efficient) classifiers, knn [@altman1992introduction], to evaluate all projection methods.

knn_pred <- function(trainX, trainY, testX, testY, k = c(5, 10, 20)) {
  # knn prediction accuracy on the test set, one value per k
  len_test <- length(testY)
  vapply(k,
         function(num) {
           yhat <- class::knn(trainX, testX, trainY, k = num)
           sum(yhat == testY)/len_test
         }, numeric(1L))
}

low_dim_knn_pred <- function(dims = 2:5, fun, 
                             trainX, trainY, testX, testY, 
                             k = c(5, 10, 20), setNames = TRUE) {
  # reduce the data to each dimension in `dims` via `fun`, then compute the
  # knn accuracies; one column per dimension, one row per value of k
  tab <- lapply(dims, 
                function(d) {
                  knn_pred(
                    fun(trainX, d), trainY,
                    fun(testX, d), testY,
                    k = k
                  )
                }) %>% 
    as.data.frame() %>%
    as_tibble()

  if(setNames) {
    tab <- tab %>%
      setNames(nm = paste0("d = ", dims))
  }

  rownames(tab) <- paste0("k = ", k)
  tab
}
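As a reference point (not part of the comparison below), the helper can also be applied to all eight scaled variables, i.e. with no dimension reduction at all:

# baseline: knn accuracy on the full set of scaled variables
baseline <- knn_pred(scalingTrainX, trainY, scalingTestX, testY)
names(baseline) <- paste0("k = ", c(5, 10, 20))
baseline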

Method I: p choose d

The most basic projection is to simply choose $d$ of the $p$ variables (where $d \leq p$; no linear combinations) and map the $p$ dimensional space onto the space spanned by those $d$ variables. Since we have 8 variables, for $d = 2$ there are ${8 \choose 2} = 28$ combinations. To simplify the process, for each $d$ we report only the combination with the highest prediction accuracy.
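The number of candidate subsets can be checked directly:

choose(8, 2)       # 28 subsets of size 2
ncol(combn(8, 2))  # combn() enumerates the same 28 subsets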

# the candidate projection dimensions d
dims <- 2:5
var_names <- colnames(scalingTrainX)
low_dim_names <- c()
K <- ncol(trainX)
pChooseD <- lapply(dims, 
                   function(d) {
                     com <- combn(K, d)
                     pred <- apply(com, 2, 
                                   function(pair) {
                                     knn_pred(trainX[, pair], trainY, testX[, pair], testY)
                                   })
                     mean_pred <- apply(pred, 2, mean)
                     id <- which.max(mean_pred)
                     low_dim_names <<- c(low_dim_names, 
                                         paste(var_names[com[, id]], 
                                               collapse = ":"))
                     pred[, id]
                   }) %>% 
  as.data.frame() %>%
  as_tibble() %>%
  setNames(nm = paste0("d = ", dims))
rownames(pChooseD) <- paste0("k = ", c(5, 10, 20))

The names of the best variable subsets are

names(low_dim_names) <- paste0("d = ", dims)
low_dim_names

The table of prediction accuracies is

kable(pChooseD, row.names = TRUE,
      digits = 3)

Method II: PCA

PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance of the data under a scalar projection lies along the first coordinate (the first principal component, corresponding to the largest eigenvalue), the second greatest variance along the second coordinate (the second largest eigenvalue), and so on.

The standard deviations of the principal components (the square roots of the eigenvalues) for our training data are

trainXPCA <- princomp(scalingTrainX)
testXPCA <- princomp(scalingTestX)
round(trainXPCA$sdev, 2)

The first 5 components are kept, as together they account for more than 85\% of the total variance.
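This can be checked from the cumulative proportion of variance explained by the components:

# cumulative proportion of variance explained by the principal components
round(cumsum(trainXPCA$sdev^2) / sum(trainXPCA$sdev^2), 3)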

PCA <- low_dim_knn_pred(2:5, 
                        fun = function(princomp, d) {princomp$scores[, seq(d)]},
                        trainXPCA,
                        trainY,
                        testXPCA,
                        testY)
kable(PCA, row.names = TRUE,
      digits = 3)

Method III: LLE

LLE (Locally Linear Embedding) [@roweis2000nonlinear] begins by finding the nearest neighbors of each point, then computes a set of weights for each point that best describes the point as a linear combination of its neighbors. Finally, it uses an eigenvector-based optimization to find the low-dimensional embedding of the points.
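For reference, a direct call to the LLE function from RDRToolbox (the same interface wrapped inside low_dim_knn_pred below) embeds the scaled training data into 2 dimensions using 5 neighbors:

# direct LLE embedding of the scaled training data into 2D
emb2 <- RDRToolbox::LLE(as.matrix(scalingTrainX), dim = 2, k = 5)
head(emb2)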

lle <- low_dim_knn_pred(2:5, 
                        fun = function(data, d) {
                          LLE(data, dim = d, k = 5)
                        },
                        scalingTrainX, trainY,
                        scalingTestX, testY) 
kable(lle, row.names = TRUE,
      digits = 3)
# load the pre-computed (cached) LLE results
lle <- readRDS(path_concat(dataDirectory, "lle.RDS"))
kable(lle, row.names = TRUE,
      digits = 3)

Method IV: Random Tour

A simple call to l_tour:

p2 <- l_tour(scalingTrainX, color = trainY)
l <- l_layer_hull(p2, group = trainY)

Here, each group is assigned a different color. In addition, a convex hull is drawn around each group (l_layer_hull) so that the separation of the groups is easier to see. As we scroll the bar, we can find a random projection that splits the groups well (no intersections among the hulls).

include_graphics(path_concat(imageDirectory, "proj2D.PNG"))
proj2D <- readRDS(path_concat(dataDirectory, "proj2D.RDS")) %>% 
  as.matrix()

The matrix of projection vectors is

proj2D <- p2["projection"]
kable(as.data.frame(proj2D, row.names = colnames(trainX)), 
      digits = 2)
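Because a grand tour projection is an orthonormal basis, the projection matrix can be sanity-checked: the cross-product of its columns should be close to the identity.

# check orthonormality of the 2D projection matrix
round(crossprod(as.matrix(proj2D)), 3)  # should be close to the 2 x 2 identity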

Then we create 3, 4 and 5 dimensional tour paths (by modifying tour_path). An "interesting" projection here is one where, on at least one axis, the three groups are well separated. For example, in this 3D projection, on axis V1 the "gray" group is distinguished from the rest; on axis V2 the "pink" group is clearly different from the others; and on axis V3 the "blue" group stands out. The same rule can be applied to the 4D and 5D projections as well.

p3 <- l_tour(scalingTrainX, 
             tour_path = tourr::grand_tour(3),
             color = trainY,
             axesLayout = "parallel")
proj3D <- p3["projection"]
include_graphics(path_concat(imageDirectory, "proj3D.PNG"))
proj3D <- readRDS(path_concat(dataDirectory, "proj3D.RDS")) %>% 
  as.matrix()
p4 <- l_tour(scalingTrainX,
             tour_path = tourr::grand_tour(4),
             color = trainY,
             axesLayout = "parallel")
proj4D <- p4["projection"]
include_graphics(path_concat(imageDirectory, "proj4D.PNG"))
proj4D <- readRDS(path_concat(dataDirectory, "proj4D.RDS")) %>% 
  as.matrix()
p5 <- l_tour(scalingTrainX, 
             tour_path = tourr::grand_tour(5),
             color = trainY,
             axesLayout = "parallel")
proj5D <- p5["projection"]
include_graphics(path_concat(imageDirectory, "proj5D.PNG"))
proj5D <- readRDS(path_concat(dataDirectory, "proj5D.RDS")) %>% 
  as.matrix()
tour <- low_dim_knn_pred(list(proj2D, proj3D, 
                              proj4D, proj5D), 
                         fun = function(data, proj) {
                           data %*% as.matrix(proj)
                         },
                         scalingTrainX, trainY,
                         scalingTestX, testY,
                         setNames = FALSE) 
colnames(tour) <- paste0("d = ", 2:5)
kable(tour, row.names = TRUE,
      digits = 3)

Graphical Summaries

rbind(tour, lle, PCA, pChooseD) %>% 
  mutate(k = rep(c(5, 10, 20), 4),
         method = rep(c("tour", "LLE", "PCA", "pChooseD"), each = 3)) %>% 
  pivot_longer(cols = -c(k, method),
               names_to = "Dimensions",
               values_to = "Accuracy") %>%
  mutate(Dimensions = parse_number(Dimensions)) %>%
  ggplot(mapping = aes(x = Dimensions, 
                       y = Accuracy,
                       colour = method)) + 
  geom_path() + 
  facet_wrap(~k) + 
  ggtitle("Facet by the number of neibourhoods")

From this chart we can compare how each method's accuracy changes with the projection dimension and with the number of neighbors k; overall, the tour projections perform best (see the Conclusion below).

Conclusion

Pros

On this data set, the tour gives the best performance. Even in a two dimensional space, the accuracy can be as high as 98.3\%. The process is also very intuitive.

loon.tourr also provides several scaling methods: data (scale to zero one based on the whole data set), variable (scale to zero one per column), observation (scale to zero one per row), and sphere (PCA sphering). Additionally, users can supply their own custom scaling methods.
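For instance (a sketch, assuming l_getScaledData accepts the scaling names listed above, as described in help("l_tour")), the other scalings can be obtained the same way as the variable scaling used earlier:

head(loon::l_getScaledData(trainX, scaling = "data"), 2)         # zero-one over the whole data set
head(loon::l_getScaledData(trainX, scaling = "observation"), 2)  # zero-one per row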

Cons

References


