In mascaretti/moxier: An Introduction To Statistics With R

library(knitr)
library(learnr)
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, exercise = FALSE)

Example 2: Principal Component Analysis of the records dataset.

Reading the data

First of all, we read the data and we convert values in minutes to seconds.

raw_data <- moxier::record_mod

# Make the units of measure homogeneous across the variables
df <- transform(raw_data,
                m800 = m800 * 60,
                m1500 = m1500 * 60,
                m3000 = m3000 *60,
                Marathon = Marathon * 60)
df

Exploratory Data Analysis

We make a boxplot to explore the data.

boxplot(df)

We observe that the variability increases non-linearly for increasing lengths: the record times of the marathon have much more variability than that of the other disciplines. This could significantly influence the PCA.

Principal Component Analysis

We perform the PCA

(pc_df <- princomp(df, scores = T))

and also visualise a summary of it.

summary(pc_df)

We also notice inspect the standard deviation of the components,

knitr::kable(pc_df$sd)

the proportion of variance explained by each PC

knitr::kable(pc_df$sd^2/sum(pc_df$sd^2))

and the cumulative proportion of explained variance.

knitr::kable(cumsum(pc_df$sd^2)/sum(pc_df$sd^2))

We can visualise it by means of a screeplot.

screeplot(pc_df)

Loadings

We call loadings coefficients of the linear combination of the original variables that defines each principal component)

load.rec <- pc_df$loadings

We can graphically represent the loadings of the first six principal components.

par(mar = c(1,4,0,2), mfrow = c(6,1))
for(i in 1:6) barplot(load.rec[,i], ylim = c(-1, 1))

As for the interpretation, we notice that the first PCs are connected to longer distance disciplines, whereas the last ones are connected to shorter distance disciplines. The loadings reflect the previous observations: the first PC is represented by the variable Marathon, the second by the long distances, etc. The short distances appear in the last PCs.

Explained Variance

We make a plot to visualise the explained variance.

layout(matrix(c(2, 3, 1, 3), 2, byrow = T))
plot(pc_df, las = 2, main = "Principal components", ylim = c(0, 3.5e6))
barplot(sapply(df, sd)^2, las = 2, main = "Original Variables", ylim = c(0, 3.5e6), ylab = "Variances")
plot(cumsum(pc_df$sd^2) / sum(pc_df$sd^2),
  type = "b", axes = F, xlab = "number of components",
  ylab = "contribution to the total variance", ylim = c(0, 1)
)
abline(h = 1, col = "blue")
abline(h = 0.8, lty = 2, col = "blue")
box()
axis(2, at = 0:10 / 10, labels = 0:10 / 10)
axis(1, at = 1:ncol(df), labels = 1:ncol(df), las = 2)

Scores

We now consider scores.

scores.df <- pc_df$scores
knitr::kable(scores.df)

We plot the results.

layout(matrix(c(1,2),2))
boxplot(df, las=2, col='red', main='Original variables')
scores.df <- data.frame(scores.df)
boxplot(scores.df, las=2, col='red', main='Principal components')

We finally make a biplot.

biplot(pc_df, scale=0, cex=.7)

PCA with standardised variables

We now compute the standardised variables,

runrec.sd <- scale(df)
runrec.sd <- data.frame(runrec.sd)

and the PCA

pc.runrec <- princomp(runrec.sd, scores=T)
pc.runrec
summary(pc.runrec)

Explained Variance

We take a look at the explained variance.

layout(matrix(c(2,3,1,3),2,byrow=T))
plot(pc.runrec, las=2, main='Principal components', ylim=c(0,6))
abline(h=1, col='blue')
barplot(sapply(runrec.sd,sd)^2, las=2, main='Original variables', ylim=c(0,6), ylab='Variances')
plot(cumsum(pc.runrec$sde^2)/sum(pc.runrec$sde^2), type='b', axes=F, xlab='Number of components', ylab='Contribution to the total variance', ylim=c(0,1))
box()
axis(2,at=0:10/10,labels=0:10/10)
axis(1,at=1:ncol(runrec.sd),labels=1:ncol(runrec.sd),las=2)

We see that keeping one or two PCs would probably do.

Loadings

We compute the loadings

(load.rec <- pc.runrec$loadings)

and plot them

par(mar = c(2,2,2,1), mfrow=c(3,1))
for(i in 1:3)barplot(load.rec[,i], ylim = c(-1, 1), main=paste('Loadings PC ',i,sep=''))

In this case, the first PC represents an average of times of all the disciplines, taken with very similar (negative) weights. The second PC contrasts the short distances (m100, m200, m400) with long distances (m800, m1500, m3000, Marathon)

High PC1: long times in all the disciplines
Low PC1: short times in all the disciplines
High PC2: long times in short distances, short times in long distances
Low PC2: long times in short distances, short times in long distances

Scores

We now turn our attentions to the scores.

(scores.runrec <- pc.runrec$scores)

We plot them

layout(matrix(c(1,2),2))
boxplot(runrec.sd, las=2, col='red', main='Original variables')
scores.runrec <- data.frame(scores.runrec)
boxplot(scores.runrec, las=2, col='red', main='Principal components')

layout(matrix(c(1,2),1))
plot(runrec.sd[,'m100'],runrec.sd[,'Marathon'],type="n",xlab="m100",ylab="Marathon", asp=1)
text(runrec.sd[,'m100'],runrec.sd[,'Marathon'],dimnames(runrec.sd)[[1]],cex=.7)
plot(-scores.runrec[,1],-scores.runrec[,2],type="n",xlab="-pc1",ylab="-pc2", asp=1)
text(-scores.runrec[,1],-scores.runrec[,2],dimnames(runrec.sd)[[1]],cex=.7)