library(knitr) library(learnr) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, exercise = FALSE)
First of all, we read the data and we convert values in minutes to seconds.
raw_data <- moxier::record_mod # Make the units of measure homogeneous across the variables df <- transform(raw_data, m800 = m800 * 60, m1500 = m1500 * 60, m3000 = m3000 *60, Marathon = Marathon * 60) df
We make a boxplot to explore the data.
boxplot(df)
We observe that the variability increases non-linearly for increasing lengths: the record times of the marathon have much more variability than that of the other disciplines. This could significantly influence the PCA.
We perform the PCA
(pc_df <- princomp(df, scores = T))
and also visualise a summary of it.
summary(pc_df)
We also notice inspect the standard deviation of the components,
knitr::kable(pc_df$sd)
the proportion of variance explained by each PC
knitr::kable(pc_df$sd^2/sum(pc_df$sd^2))
and the cumulative proportion of explained variance.
knitr::kable(cumsum(pc_df$sd^2)/sum(pc_df$sd^2))
We can visualise it by means of a screeplot.
screeplot(pc_df)
We call loadings coefficients of the linear combination of the original variables that defines each principal component)
load.rec <- pc_df$loadings
We can graphically represent the loadings of the first six principal components.
par(mar = c(1,4,0,2), mfrow = c(6,1)) for(i in 1:6) barplot(load.rec[,i], ylim = c(-1, 1))
As for the interpretation, we notice that the first PCs are connected to longer distance disciplines, whereas the last ones are connected to shorter distance disciplines.
The loadings reflect the previous observations: the first PC is represented by the variable Marathon
, the second by the long distances, etc. The short distances appear in the last PCs.
We make a plot to visualise the explained variance.
layout(matrix(c(2, 3, 1, 3), 2, byrow = T)) plot(pc_df, las = 2, main = "Principal components", ylim = c(0, 3.5e6)) barplot(sapply(df, sd)^2, las = 2, main = "Original Variables", ylim = c(0, 3.5e6), ylab = "Variances") plot(cumsum(pc_df$sd^2) / sum(pc_df$sd^2), type = "b", axes = F, xlab = "number of components", ylab = "contribution to the total variance", ylim = c(0, 1) ) abline(h = 1, col = "blue") abline(h = 0.8, lty = 2, col = "blue") box() axis(2, at = 0:10 / 10, labels = 0:10 / 10) axis(1, at = 1:ncol(df), labels = 1:ncol(df), las = 2)
We now consider scores.
scores.df <- pc_df$scores knitr::kable(scores.df)
We plot the results.
layout(matrix(c(1,2),2)) boxplot(df, las=2, col='red', main='Original variables') scores.df <- data.frame(scores.df) boxplot(scores.df, las=2, col='red', main='Principal components')
We finally make a biplot.
biplot(pc_df, scale=0, cex=.7)
We now compute the standardised variables,
runrec.sd <- scale(df) runrec.sd <- data.frame(runrec.sd)
and the PCA
pc.runrec <- princomp(runrec.sd, scores=T) pc.runrec summary(pc.runrec)
We take a look at the explained variance.
layout(matrix(c(2,3,1,3),2,byrow=T)) plot(pc.runrec, las=2, main='Principal components', ylim=c(0,6)) abline(h=1, col='blue') barplot(sapply(runrec.sd,sd)^2, las=2, main='Original variables', ylim=c(0,6), ylab='Variances') plot(cumsum(pc.runrec$sde^2)/sum(pc.runrec$sde^2), type='b', axes=F, xlab='Number of components', ylab='Contribution to the total variance', ylim=c(0,1)) box() axis(2,at=0:10/10,labels=0:10/10) axis(1,at=1:ncol(runrec.sd),labels=1:ncol(runrec.sd),las=2)
We see that keeping one or two PCs would probably do.
We compute the loadings
(load.rec <- pc.runrec$loadings)
and plot them
par(mar = c(2,2,2,1), mfrow=c(3,1)) for(i in 1:3)barplot(load.rec[,i], ylim = c(-1, 1), main=paste('Loadings PC ',i,sep=''))
In this case, the first PC represents an average of times of all the disciplines, taken with very similar (negative) weights. The second PC contrasts the short distances (m100, m200, m400) with long distances (m800, m1500, m3000, Marathon)
We now turn our attentions to the scores.
(scores.runrec <- pc.runrec$scores)
We plot them
layout(matrix(c(1,2),2)) boxplot(runrec.sd, las=2, col='red', main='Original variables') scores.runrec <- data.frame(scores.runrec) boxplot(scores.runrec, las=2, col='red', main='Principal components')
layout(matrix(c(1,2),1)) plot(runrec.sd[,'m100'],runrec.sd[,'Marathon'],type="n",xlab="m100",ylab="Marathon", asp=1) text(runrec.sd[,'m100'],runrec.sd[,'Marathon'],dimnames(runrec.sd)[[1]],cex=.7) plot(-scores.runrec[,1],-scores.runrec[,2],type="n",xlab="-pc1",ylab="-pc2", asp=1) text(-scores.runrec[,1],-scores.runrec[,2],dimnames(runrec.sd)[[1]],cex=.7)
biplot(pc.runrec)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.