In variani/cpca: Methods to perform Common Principal Component Analysis (CPCA)

About

Include cpca package

library(cpca)

Include other packages

library(plyr)
library(ggplot2)
library(gridExtra)

library(pander)

library(pls)
library(MASS)

library(mclust)
library(ElemStatLearn)

Settings:

theme_set(theme_linedraw())
panderOptions('table.split.table', Inf)
panderOptions('knitr.auto.asis', FALSE)

Examples

Dataset iris

Loading the dataset:

data(iris)

Preparing inputs to the CPCA model:

X1 <- as.matrix(iris[, -5])
Y1 <- iris[, 5]

Running the CPCA model:

m1 <- comprcomp(X1, Y1)

Plotting scores:

varplot(m1, X1, Y1)

scoreplot(m1, X1, Y1)

Bank notes dataset

Flury studied an interesting dataset known as the Swiss banknotes dataset, which is available from an R package mclust. The dataset consists of six measurements made on 200 banknotes, where 100 are genuine and 100 are forged. The variables are:

width of the banknote;
height of the left-hand side of the banknote;
height of the right-hand side of the banknote;
distance between the top of the inner box to the upper border;
distance between the bottom of the inner box to the lower border;
diagonal of the inner box.

Loading the dataset:

data(banknote, package = "mclust")

pander(head(banknote))

Using all 6 variables

Preparing inputs to the CPCA model:

X2 <- as.matrix(banknote[, -1])
Y2 <- banknote[, 1]

Running the CPCA model:

m2 <- comprcomp(X2, Y2)

Plotting scores:

varplot(m2, X2, Y2)

scoreplot(m2, X2, Y2)

The second component (CPC2, y-axis) on scoreplot captures little variance (5.7\%). Let us see proportions of captured variance across all CPCs (expressed in percentage).

pander(compvar(m2, X2, Y2, perc = TRUE, sorted = TRUE))

pander(compvar(m2, X2, Y2, grouping = TRUE, perc = TRUE))

Looking higher-order components

scoreplot(m2, X2, Y2, comp = c(2, 3))

scoreplot(m2, X2, Y2, comp = c(3, 4))

Scoring the components

pander(compscore(m2, X2, sorted = TRUE))

scoreplot(m2, X2, Y2, comp = c(4, 3))

PCA

m21 <- prcomp(X2)

scoreplot(m21, Y = Y2)

Using 4/6 variables

Preparing inputs to the CPCA model:

X3 <- as.matrix(banknote[, 3:6])
Y3 <- banknote[, 1]

Running the CPCA model:

m3 <- comprcomp(X3, Y3)

Exploring captured variance from two perspectives:

pander(compvar(m3, X3, Y3, perc = TRUE))

pander(compvar(m3, X3, Y3, grouping = TRUE, perc = TRUE))

Plotting scores:

scoreplot(m3, X3, Y3)

PCA:

m31 <- prcomp(X3)

scoreplot(m31, Y = Y3)

Dataset vowel

Data description at UCI ML repository:

The problem is specified by the accompanying data file, "vowel.data". This consists of a three dimensional array: voweldata [speaker, vowel, input]. The speakers are indexed by integers 0-89. (Actually, there are fifteen individual speakers, each saying each vowel six times.) The vowels are indexed by integers 0-10. For each utterance, there are ten floating-point input values, with array indices 0-9.

The problem is to train the network as well as possible using only on data from "speakers" 0-47, and then to test the network on speakers 48-89, reporting the number of correct classifications in the test set.

The data set vowel.train, as well as vowel.test, is available within the R package ElemStatLearn.

Loading the dataset:

data(vowel.train, package = "ElemStatLearn")
vowel <- vowel.train

words <- c("heed", "hid", "head", "had", "hard", "hud", "hod", "hoard", "hood",
  "whod", "heard")
vowels <- c("hid", "hId", "hEd", "hAd", "hYd", "had", "hOd", "hod", "hUd", 
  "hud", "hed")

vowel <- within(vowel, {
  y <- factor(y, levels = 1:11, labels = vowels)
})

pander(table(vowel$y))

pander(head(vowel))

Preparing inputs to the CPCA model:

X4 <- as.matrix(vowel[, -1])
Y4 <- vowel[, 1]

Running the CPCA model:

m4 <- comprcomp(X4, Y4)

Plotting scores:

varplot(m4, X4, Y4)

scoreplot(m4, X4, Y4)

Subset of dataset vowel

subvowel <- subset(vowel, y %in% c("hid", "hId", "hOd", "hod"))
X5 <- as.matrix(subvowel[, -1])
Y5 <- subvowel[, 1]

Y5 <- droplevels(Y5)

Running the CPCA model:

m5 <- comprcomp(X5, Y5)

Plotting scores:

varplot(m5, X5, Y5)

scoreplot(m5, X5, Y5)

Computing the discrimination scores of CPCs:

pander(compscore(m5, X5, sorted = TRUE))

Now we can say something about the components in terms of their discrimination properties.

The first component CPC1 has the lowest score among the others. CPC1 is likely to capture the common variance among the four classes. The previous figure produced by scoreplot confirms this conclusion.
The component CPC3 has the highest score, and it is substantially higher in comparison with the next components CPC5, CPC9, etc.

Ploting with scoreplot using two pairs CPC3 vs. CPC5 and CPC3 vs. CPC9 should clarify the role of each of the three (most dicriminative) components.

grid.arrange(
  scoreplot(m5, X5, Y5, comp = c(3, 5)),
  scoreplot(m5, X5, Y5, comp = c(3, 9)),
  nrow = 1)

It is clear that CPC3 performs reasonably well the main job on separation between two groups: hid / hId and hOd / hod pairs of classes. The further fine-grained job on separation e.g., between hid and hId, is more comlicated. The CPCA technique can offer to the user a solution of low quality: (1) CPC5 for separation between hid and hId (the left panel of the figure above), and (2) CPC9 for separation betwen hOd and hod (the right panel).

Interestingly, PCA applied to the same data has a similar group separation for all four classes, but the first two PC1 and PC2 are not able to separate two hid / hId and hOd / hod groups linearly, as CPCA did.

m51 <- prcomp(X5)

scoreplot(m51, Y = Y5)

Let us close the loop and shows the results of LDA.

m52 <- lda(X5, Y5)

scoreplot(m52, Y = Y5)

It seems that results of all CPCA, PCA and LDA are similar. None of the techniques was able to cope with a cluster of red points (hid class), which tend to be closer to the hOd / hod group.

variani/cpca documentation built on May 3, 2019, 4:34 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

variani/cpca
Methods to perform Common Principal Component Analysis (CPCA)

In variani/cpca: Methods to perform Common Principal Component Analysis (CPCA)

About

Include cpca package

Include other packages

Examples

Dataset iris

Bank notes dataset

Using all 6 variables

Looking higher-order components

Scoring the components

PCA

Using 4/6 variables

Dataset vowel

Subset of dataset vowel

R Package Documentation

Browse R Packages

We want your feedback!

variani/cpca Methods to perform Common Principal Component Analysis (CPCA)

In variani/cpca: Methods to perform Common Principal Component Analysis (CPCA)

About

Include cpca package

Include other packages

Examples

Dataset iris

Bank notes dataset

Using all 6 variables

Looking higher-order components

Scoring the components

PCA

Using 4/6 variables

Dataset vowel

Subset of dataset vowel

R Package Documentation

Browse R Packages

We want your feedback!

variani/cpca
Methods to perform Common Principal Component Analysis (CPCA)