Overview

This vignette explores basic aspects of PCA in bivariate and 5-dimensional data. It concludes with some remarks about "eigengenes", which can be verified using the computations shown in the simpler cases.

Make a bivariate dataset with positive correlation and heterogeneous variance

First we create a covariance matrix with greater variance for the second variable of our pair.

library(MASS)
set.seed(1234)
options(digits=3)
cm = matrix(c(1,1,1,4), nrow=2)
cm

Then we generate a 20000 x 2 matrix of bivariate normal deviates.

sim1 = mvrnorm(20000, c(0,0), cm)
cov(sim1)
cor(sim1)

The data in the original units is easy to visualize:

plot(sim1,xlim=c(-10,10), ylim=c(-10,10))

Now we will perform a PCA. We don't need to reduce dimensions here, but the two-dimensional case gives us a handle on how the components are formed and interpreted.

prc = prcomp(sim1, center=FALSE)
plot(prc$x, xlim=c(-10,10), ylim=c(-10,10))

Each entry of PC1 is produced by taking a linear combination of the values in the corresponding row of sim1; equivalently, PC1 is a linear combination of the columns of sim1.

We'll illustrate the linear combination concept. The data vector for the first row may be written $(x_1, x_2)$, and a linear combination has the form $ax_1 + bx_2$ for some coefficients $a$ and $b$.

The coefficients are derived from the rotation component of the PCA.

prc$rotation
c11 = prc$rotation[1,1]  # weight of variable 1 in PC1
c21 = prc$rotation[2,1]  # weight of variable 2 in PC1
sim1[1,1]*c11 + sim1[1,2]*c21  # PC1 score for the first observation
prc$x[1,1]  # agrees with the score computed by prcomp

This can be done wholesale using the matrix multiplication operator %*%:

(sim1 %*% prc$rotation)[1:5,]
prc$x[1:5,]
all.equal(prc$x, sim1 %*% prc$rotation)

Exercises. Run the following code and compare the four panels:

par(mfrow=c(2,2), mar=c(4,4,3,1))
plot(sim1, xlim=c(-10,10), ylim=c(-10,10), main="raw data",
  xlab="data column 1", ylab="data col. 2")
plot(sim1 %*% prc$rotation, xlim=c(-10,10), ylim=c(-10,10),
  main="data %*% prc$rotation", xlab="PC1 via rotation",
   ylab="PC2 via rotation")
plot(prc$x, xlim=c(-10,10), ylim=c(-10,10), main="x from prcomp")
plot(sim1 %*% prc$rotation %*% t(prc$rotation),
  xlim=c(-10,10), ylim=c(-10,10),
  main="data %*% rot %*% t(rot)", xlab="data %*% VVt (col 1)",
  ylab="data %*% VVt (col 2)")

The rotation has been "undone". Letting $V$ denote the 'rotation' component of the PCA, this shows that the matrix product $VV^t = I$, where $I$ is the identity matrix (ones on the diagonal, zeros elsewhere). More background on the underlying computations can be gleaned from the Wikipedia entry on singular value decomposition.
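
As a quick check, we can verify the orthogonality claim directly, and glimpse the connection to the SVD (V and s below are just local names for the rotation and the SVD of sim1):

V = prc$rotation
# V %*% t(V) should be the 2 x 2 identity, up to floating-point error
all.equal(V %*% t(V), diag(2), check.attributes=FALSE)
# with center=FALSE, the rotation agrees with the right singular
# vectors of the data, up to column signs
s = svd(sim1)
all.equal(abs(s$v), abs(V), check.attributes=FALSE)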

A larger covariance matrix

Here we have a 5-dimensional dataset. We set up the covariance matrix so that columns 1 and 2 have negative correlation, columns 3 and 4 have positive correlation, column 2 has greatest overall variance, and columns 1 and 3 have elevated variance.

cm = diag(5)
cm[3,4] = cm[4,3] = .8    # positive correlation between columns 3 and 4
cm[1,2] = cm[2,1] = -.6   # negative correlation between columns 1 and 2
A = diag(5)
A[1,1] = 2
A[2,2] = 3                # column 2 gets the greatest variance
A[3,3] = 2
covm = A %*% cm %*% A     # rescale the correlation matrix to a covariance
myd = mvrnorm(2000, rep(0,5), covm)

The pairs plot shows the data in original units.

pairs(myd, xlim=c(-10,10), ylim=c(-10,10))

We verify the multivariate structure:

cor(myd)
cov(myd)
dim(myd)

Compute PCA

pp = prcomp(myd)
pairs(pp$x)

The biplot shows the projection onto PC1 and PC2, how the different variables are related, and how they drive the projection.

par(lwd=2)
biplot(pp, xlabs=rep(".", 2000), expand=.8)

Exercise: Explain the configuration of arrows in the biplot.

"Eigengenes" derived from PCA

When the rows are samples and the columns are genes, the x components of prcomp's output are linear combinations of all genes; these score vectors are what are often called "eigengenes". The coefficients of the combination are derived from the PCA rotation matrix, which is constructed so that PC1 captures the greatest possible variance, with each succeeding component capturing the greatest remaining variance subject to being orthogonal to the preceding components.
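
As a small sketch of this idea (the matrix expr below is simulated purely for illustration; any samples-by-genes matrix would do):

# hypothetical expression matrix: 50 samples, 10 genes
expr = matrix(rnorm(50*10), nrow=50,
              dimnames=list(NULL, paste0("gene", 1:10)))
pe = prcomp(expr, center=FALSE)
pe$rotation[,1]                        # each gene's weight in the first "eigengene"
eigengene1 = expr %*% pe$rotation[,1]  # per-sample eigengene values
all.equal(as.numeric(eigengene1), as.numeric(pe$x[,1]))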

Note that the simple reconstructions above required that prcomp be run with center=FALSE. By default, prcomp centers each column, and the scores are then linear combinations of the centered data.
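
A brief sketch of what the default does instead, reusing sim1 from above:

prc2 = prcomp(sim1)   # center=TRUE is the default
# with centering, the scores reconstruct the column-centered data,
# not the raw data
ctr = scale(sim1, center=TRUE, scale=FALSE)
all.equal(prc2$x, ctr %*% prc2$rotation, check.attributes=FALSE)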


