About

Include cpca package

library(cpca)

Include other packages

library(plyr)
library(ggplot2)
library(gridExtra)

library(pander)

library(pls)
library(MASS)

library(mclust)
library(ElemStatLearn)

Settings:

theme_set(theme_linedraw())
panderOptions('table.split.table', Inf)
panderOptions('knitr.auto.asis', FALSE)

Examples

Dataset iris

Loading the dataset:

data(iris)

Preparing inputs to the CPCA model:

X1 <- as.matrix(iris[, -5])
Y1 <- iris[, 5]

Running the CPCA model:

m1 <- comprcomp(X1, Y1) 

Plotting scores:

varplot(m1, X1, Y1)
scoreplot(m1, X1, Y1)

Bank notes dataset

Flury studied an interesting dataset known as the Swiss banknotes dataset, which is available from an R package mclust. The dataset consists of six measurements made on 200 banknotes, where 100 are genuine and 100 are forged. The variables are:

Loading the dataset:

data(banknote, package = "mclust")
pander(head(banknote))

Using all 6 variables

Preparing inputs to the CPCA model:

X2 <- as.matrix(banknote[, -1])
Y2 <- banknote[, 1]

Running the CPCA model:

m2 <- comprcomp(X2, Y2) 

Plotting scores:

varplot(m2, X2, Y2)
scoreplot(m2, X2, Y2)

The second component (CPC2, y-axis) on scoreplot captures little variance (5.7\%). Let us see proportions of captured variance across all CPCs (expressed in percentage).

pander(compvar(m2, X2, Y2, perc = TRUE, sorted = TRUE))
pander(compvar(m2, X2, Y2, grouping = TRUE, perc = TRUE))

Looking higher-order components

scoreplot(m2, X2, Y2, comp = c(2, 3))
scoreplot(m2, X2, Y2, comp = c(3, 4))

Scoring the components

pander(compscore(m2, X2, sorted = TRUE))
scoreplot(m2, X2, Y2, comp = c(4, 3))

PCA

m21 <- prcomp(X2)

scoreplot(m21, Y = Y2)

Using 4/6 variables

Preparing inputs to the CPCA model:

X3 <- as.matrix(banknote[, 3:6])
Y3 <- banknote[, 1]

Running the CPCA model:

m3 <- comprcomp(X3, Y3) 

Exploring captured variance from two perspectives:

pander(compvar(m3, X3, Y3, perc = TRUE))
pander(compvar(m3, X3, Y3, grouping = TRUE, perc = TRUE))

Plotting scores:

scoreplot(m3, X3, Y3)

PCA:

m31 <- prcomp(X3)

scoreplot(m31, Y = Y3)

Dataset vowel

Data description at UCI ML repository:

The problem is specified by the accompanying data file, "vowel.data". This consists of a three dimensional array: voweldata [speaker, vowel, input]. The speakers are indexed by integers 0-89. (Actually, there are fifteen individual speakers, each saying each vowel six times.) The vowels are indexed by integers 0-10. For each utterance, there are ten floating-point input values, with array indices 0-9.

The problem is to train the network as well as possible using only on data from "speakers" 0-47, and then to test the network on speakers 48-89, reporting the number of correct classifications in the test set.

The data set vowel.train, as well as vowel.test, is available within the R package ElemStatLearn.

Loading the dataset:

data(vowel.train, package = "ElemStatLearn")
vowel <- vowel.train

words <- c("heed", "hid", "head", "had", "hard", "hud", "hod", "hoard", "hood",
  "whod", "heard")
vowels <- c("hid", "hId", "hEd", "hAd", "hYd", "had", "hOd", "hod", "hUd", 
  "hud", "hed")

vowel <- within(vowel, {
  y <- factor(y, levels = 1:11, labels = vowels)
})
pander(table(vowel$y))
pander(head(vowel))

Preparing inputs to the CPCA model:

X4 <- as.matrix(vowel[, -1])
Y4 <- vowel[, 1]

Running the CPCA model:

m4 <- comprcomp(X4, Y4) 

Plotting scores:

varplot(m4, X4, Y4)
scoreplot(m4, X4, Y4)

Subset of dataset vowel

subvowel <- subset(vowel, y %in% c("hid", "hId", "hOd", "hod"))
X5 <- as.matrix(subvowel[, -1])
Y5 <- subvowel[, 1]

Y5 <- droplevels(Y5)

Running the CPCA model:

m5 <- comprcomp(X5, Y5) 

Plotting scores:

varplot(m5, X5, Y5)
scoreplot(m5, X5, Y5)

Computing the discrimination scores of CPCs:

pander(compscore(m5, X5, sorted = TRUE))

Now we can say something about the components in terms of their discrimination properties.

Ploting with scoreplot using two pairs CPC3 vs. CPC5 and CPC3 vs. CPC9 should clarify the role of each of the three (most dicriminative) components.

grid.arrange(
  scoreplot(m5, X5, Y5, comp = c(3, 5)),
  scoreplot(m5, X5, Y5, comp = c(3, 9)),
  nrow = 1)

It is clear that CPC3 performs reasonably well the main job on separation between two groups: hid / hId and hOd / hod pairs of classes. The further fine-grained job on separation e.g., between hid and hId, is more comlicated. The CPCA technique can offer to the user a solution of low quality: (1) CPC5 for separation between hid and hId (the left panel of the figure above), and (2) CPC9 for separation betwen hOd and hod (the right panel).

Interestingly, PCA applied to the same data has a similar group separation for all four classes, but the first two PC1 and PC2 are not able to separate two hid / hId and hOd / hod groups linearly, as CPCA did.

m51 <- prcomp(X5)
scoreplot(m51, Y = Y5)

Let us close the loop and shows the results of LDA.

m52 <- lda(X5, Y5)
scoreplot(m52, Y = Y5)

It seems that results of all CPCA, PCA and LDA are similar. None of the techniques was able to cope with a cluster of red points (hid class), which tend to be closer to the hOd / hod group.



variani/cpca documentation built on May 3, 2019, 4:34 p.m.