library(cpca)
library(plyr) library(ggplot2) library(gridExtra) library(pander) library(pls) library(MASS) library(mclust) library(ElemStatLearn)
Settings:
theme_set(theme_linedraw()) panderOptions('table.split.table', Inf) panderOptions('knitr.auto.asis', FALSE)
Loading the dataset:
data(iris)
Preparing inputs to the CPCA model:
X1 <- as.matrix(iris[, -5]) Y1 <- iris[, 5]
Running the CPCA model:
m1 <- comprcomp(X1, Y1)
Plotting scores:
varplot(m1, X1, Y1)
scoreplot(m1, X1, Y1)
Flury studied an interesting dataset known as the Swiss banknotes dataset,
which is available from an R package mclust
.
The dataset consists of six measurements made on 200 banknotes,
where 100 are genuine and 100 are forged.
The variables are:
Loading the dataset:
data(banknote, package = "mclust")
pander(head(banknote))
Preparing inputs to the CPCA model:
X2 <- as.matrix(banknote[, -1]) Y2 <- banknote[, 1]
Running the CPCA model:
m2 <- comprcomp(X2, Y2)
Plotting scores:
varplot(m2, X2, Y2)
scoreplot(m2, X2, Y2)
The second component (CPC2, y-axis) on scoreplot
captures little variance (5.7\%).
Let us see proportions of captured variance across all CPCs (expressed in percentage).
pander(compvar(m2, X2, Y2, perc = TRUE, sorted = TRUE))
pander(compvar(m2, X2, Y2, grouping = TRUE, perc = TRUE))
scoreplot(m2, X2, Y2, comp = c(2, 3))
scoreplot(m2, X2, Y2, comp = c(3, 4))
pander(compscore(m2, X2, sorted = TRUE))
scoreplot(m2, X2, Y2, comp = c(4, 3))
m21 <- prcomp(X2) scoreplot(m21, Y = Y2)
Preparing inputs to the CPCA model:
X3 <- as.matrix(banknote[, 3:6]) Y3 <- banknote[, 1]
Running the CPCA model:
m3 <- comprcomp(X3, Y3)
Exploring captured variance from two perspectives:
pander(compvar(m3, X3, Y3, perc = TRUE))
pander(compvar(m3, X3, Y3, grouping = TRUE, perc = TRUE))
Plotting scores:
scoreplot(m3, X3, Y3)
PCA:
m31 <- prcomp(X3) scoreplot(m31, Y = Y3)
Data description at UCI ML repository:
The problem is specified by the accompanying data file, "vowel.data". This consists of a three dimensional array: voweldata [speaker, vowel, input]. The speakers are indexed by integers 0-89. (Actually, there are fifteen individual speakers, each saying each vowel six times.) The vowels are indexed by integers 0-10. For each utterance, there are ten floating-point input values, with array indices 0-9.
The problem is to train the network as well as possible using only on data from "speakers" 0-47, and then to test the network on speakers 48-89, reporting the number of correct classifications in the test set.
The data set vowel.train
, as well as vowel.test
,
is available
within the R package ElemStatLearn
.
Loading the dataset:
data(vowel.train, package = "ElemStatLearn") vowel <- vowel.train words <- c("heed", "hid", "head", "had", "hard", "hud", "hod", "hoard", "hood", "whod", "heard") vowels <- c("hid", "hId", "hEd", "hAd", "hYd", "had", "hOd", "hod", "hUd", "hud", "hed") vowel <- within(vowel, { y <- factor(y, levels = 1:11, labels = vowels) })
pander(table(vowel$y))
pander(head(vowel))
Preparing inputs to the CPCA model:
X4 <- as.matrix(vowel[, -1]) Y4 <- vowel[, 1]
Running the CPCA model:
m4 <- comprcomp(X4, Y4)
Plotting scores:
varplot(m4, X4, Y4)
scoreplot(m4, X4, Y4)
subvowel <- subset(vowel, y %in% c("hid", "hId", "hOd", "hod")) X5 <- as.matrix(subvowel[, -1]) Y5 <- subvowel[, 1] Y5 <- droplevels(Y5)
Running the CPCA model:
m5 <- comprcomp(X5, Y5)
Plotting scores:
varplot(m5, X5, Y5)
scoreplot(m5, X5, Y5)
Computing the discrimination scores of CPCs:
pander(compscore(m5, X5, sorted = TRUE))
Now we can say something about the components in terms of their discrimination properties.
scoreplot
confirms this conclusion.Ploting with scoreplot
using two pairs CPC3 vs. CPC5 and CPC3 vs. CPC9 should
clarify the role of each of the three (most dicriminative) components.
grid.arrange( scoreplot(m5, X5, Y5, comp = c(3, 5)), scoreplot(m5, X5, Y5, comp = c(3, 9)), nrow = 1)
It is clear that CPC3 performs reasonably well the main job on separation between two groups: hid / hId and hOd / hod pairs of classes. The further fine-grained job on separation e.g., between hid and hId, is more comlicated. The CPCA technique can offer to the user a solution of low quality: (1) CPC5 for separation between hid and hId (the left panel of the figure above), and (2) CPC9 for separation betwen hOd and hod (the right panel).
Interestingly, PCA applied to the same data has a similar group separation for all four classes, but the first two PC1 and PC2 are not able to separate two hid / hId and hOd / hod groups linearly, as CPCA did.
m51 <- prcomp(X5)
scoreplot(m51, Y = Y5)
Let us close the loop and shows the results of LDA.
m52 <- lda(X5, Y5)
scoreplot(m52, Y = Y5)
It seems that results of all CPCA, PCA and LDA are similar. None of the techniques was able to cope with a cluster of red points (hid class), which tend to be closer to the hOd / hod group.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.