knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7, fig.height = 4
)
devtools::load_all(".")

Intro

To better understand what a Principal Component Analysis (PCA) does, consider the following example in which 2 features (e.g. gene expression abundances) are measured across 100 samples (different experiments).

set.seed(strtoi("0x10", base = 16L)) # i.e. set.seed(16); without base = 16L strtoi returns NA
x <- rnorm(n = 100, mean = 50, sd = 10)
y <- x + rnorm(n = 100, mean = 0, sd = 5)
dat <- data.frame(FeatureX = x, FeatureY = y)

# Add some fake row names
sapply(1:100, function(i) {
  paste0(c(sample(LETTERS, 1, replace = TRUE),
           tolower(sample(LETTERS, 4, replace = TRUE)),
           sample(1:9, 1, replace = TRUE)), collapse = "")
  }) -> rownames(dat)

# Add some extra columns to make a better visualisation
dat$Ratio <- log2(dat$FeatureX * dat$FeatureY)
dat$Label <- (dat$Ratio > quantile(dat$Ratio, 0.95) | dat$Ratio < quantile(dat$Ratio, 0.05))

t(head(dat))

We can visualise the relationship between the 2 features by plotting them in 2D.

library(ggplot2)
library(ggrepel)
ggplot(data = dat, aes(FeatureX, FeatureY, fill = Ratio)) +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 100)) +
  geom_vline(xintercept = 50) +
  geom_hline(yintercept = 50) +
  geom_point(shape = 21, size = 2) +
  geom_text_repel(data = subset(dat, Label),
                  label = rownames(subset(dat, Label)) ) +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", 
                       midpoint = log2(50*50), high = "firebrick3", 
                       guide = "none") +
  theme_classic()

It is clear that the 2 features are very similar across all 100 samples. In fact the Pearson correlation between FeatureX and FeatureY is r signif(cor(dat$FeatureX, dat$FeatureY, method = "pearson"), 2).

Here we need 2 dimensions (X and Y) to describe the relations between all 100 samples. It would be convenient if there was a way to reduce the number of dimensions needed to visualise the relations between the samples. In this very simple example we can reduce from 2 dimensions to only 1, but more generally PCA is most useful with multidimensional data that have many features. In R the function prcomp() can be used to transform the data:

mat <- as.matrix( dat[, c("FeatureX", "FeatureY")] )
pca_data <- stats::prcomp(x = mat, retx = TRUE, center = TRUE, scale. = FALSE)
num_components <- max(as.numeric(gsub("PC", "", colnames(pca_data$x))))
pca_df <- as.data.frame(pca_data$x, stringsAsFactors = FALSE)
colnames(pca_df) <- paste0("PC", 1:num_components)
pca_df$Ratio <- dat$Ratio
pca_df$Label <- dat$Label

Visualise the same data after the Principal Component Analysis.

ggplot(data = pca_df, aes(PC1, PC2, fill = Ratio)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(shape = 21, size = 2) +
  coord_cartesian(xlim = c(-50, 50), ylim = c(-50, 50)) +
  geom_text_repel(data = subset(pca_df, Label ),
                  label = rownames(subset(pca_df, Label) ) ) +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", 
                       midpoint = log2(50*50), high = "firebrick3", 
                       guide = "none") +
  theme_classic()

The plot above shows how principal component 1 (PC1) separates all the samples in the same way as FeatureX and FeatureY did in the previous plot. Using the sample names (text next to the dots) and the colours in the plot you can follow how the samples moved between the two plots. PC1 now captures most of the variance in the data set. As you can see, the transformation onto the principal components is linear. In the plot you also see principal component 2 (PC2), which does not explain much variance between the samples. This is because the principal components are uncorrelated and ordered, so the first few PCs contain most of the variance of the original dataset.
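Both properties are easy to verify from the prcomp object:

# Percentage of the total variance captured by each principal component
round(100 * pca_data$sdev^2 / sum(pca_data$sdev^2), 1)

# PC1 and PC2 are uncorrelated by construction
round(cor(pca_df$PC1, pca_df$PC2), 3)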

To calculate a PCA manually, without using the base R function, we first need to centre the matrix:

scld_mat <- scale(mat, center = TRUE, scale = FALSE)

In this way each feature (X and Y) will have a mean of zero across all samples:

round(mean(scld_mat[, "FeatureX"]), 3) 
round(mean(scld_mat[, "FeatureY"]), 3)

As we saw above, the PCA transformation consists in rotating the axes which, mathematically speaking, is a matrix multiplication step. If we want to rotate by an angle $\alpha$, the rotation matrix is: $$ R = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}$$
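As a quick illustration, the rotation matrix for $\alpha = 45^\circ$ can be built directly in R:

# Rotation matrix for alpha = 45 degrees (pi/4 radians);
# matrix() fills column-wise, so the first column is (cos, sin)
alpha <- pi / 4
R45 <- matrix(c(cos(alpha), sin(alpha), -sin(alpha), cos(alpha)), nrow = 2)
round(R45, 3)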

In the simple case above it is fair to assume that $\alpha = 45^\circ$. In general, the right rotation to transform the data has been shown to be given by the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue: the eigenvector with the largest eigenvalue defines the first principal component. Some people prefer to use the correlation matrix rather than the covariance matrix. For the following example I'll use the covariance, cov(), as it is the same one used by prcomp().

cov_mat <- cov(scld_mat)
eig <- eigen(cov_mat)
R <- eig$vectors
R

In R the matrix multiplication operator is %*%:

rotated_mat <- scld_mat %*% R
rotated_df <- as.data.frame(rotated_mat)
colnames(rotated_df) <- c("PC1", "PC2")
rotated_df$Ratio <- dat$Ratio
rotated_df$Label <- dat$Label


ggplot(data = rotated_df, aes(PC1, PC2, fill = Ratio)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(shape = 21, size = 2) +
  coord_cartesian(xlim = c(-50, 50), ylim = c(-50, 50)) +
  geom_text_repel(data = subset(rotated_df, Label ),
                  label = rownames(subset(rotated_df, Label) ) ) +
  scale_fill_gradient2(low = "dodgerblue", mid = "white", 
                       midpoint = log2(50*50), high = "firebrick3", 
                       guide = "none") +
  theme_classic()

The sign of the rotation (the direction of the transformation) is arbitrary and can differ between PCA implementations.
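We can verify that, apart from possible sign flips per component, the manual rotation reproduces the scores computed earlier by prcomp():

# Up to the arbitrary sign of each component, the manually rotated
# coordinates match the scores returned by prcomp()
all.equal(unname(abs(rotated_mat)), unname(abs(pca_data$x)))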

We can see that the entries of the eigenvector matrix R are around 0.65 and 0.75, not exactly the values for a $45^\circ$ (or $\pi/4$) rotation, which would both be r round(sin(pi/4), 3) (since $\sin(\pi/4) = \cos(\pi/4)$). Knowing the eigenvector we can calculate the rotation angle with:

cos_alpha <-  R[1,1]
round(180 * acos(cos_alpha) / pi, 3)

Which is still very close to $45^\circ$.

Some theory with formulas

As we saw, the first step of the PCA consists in calculating the covariance between X and Y, which is defined as: $$ Cov(X,Y) = \frac{ \sum_i (x_i - \overline{x}) \times (y_i - \overline{y}) } { N - 1}$$
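We can check this definition against R's cov() using the centred matrix from before:

# Since scld_mat is already centred, (x_i - mean(x)) is just x_i
N <- nrow(scld_mat)
manual_cov <- sum(scld_mat[, "FeatureX"] * scld_mat[, "FeatureY"]) / (N - 1)
all.equal(manual_cov, cov_mat["FeatureX", "FeatureY"])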

Then the new coordinates $T$ of the centred data matrix $X$ are obtained by multiplying it with the rotation matrix $R$, whose columns are the eigenvectors $v_i$ of the covariance matrix $C$:

$$ T = X \cdot R $$

Each eigenvector $v_i$ comes with an eigenvalue $\lambda_i$, defined by:

$$ C \, v_i = \lambda_i \, v_i $$

The eigenvalues $\lambda_i$ correspond to the variance explained by each principal component.
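These relations can be checked directly with the objects from the manual calculation above:

# Each column v_i of eig$vectors satisfies cov_mat %*% v_i = lambda_i * v_i
all.equal(unname(cov_mat %*% eig$vectors), eig$vectors %*% diag(eig$values))

# The eigenvalues equal the variances of the rotated coordinates (the PCs)
all.equal(eig$values, unname(apply(rotated_mat, 2, var)))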

(To be finished)

My function

The inspiration for writing the showme_PCA2D() function comes from the pheatmap() function (from the R package pheatmap), which plots nice-looking heatmaps with minimal effort.

library(niar)
showme_PCA2D(mat)

With this single line the function performs the whole PCA analysis and plotting in the background. On top of that, the function comes with lots of other parameters that allow for filtering, labelling, adding metadata, exploring the loadings and more.

Example with dummy data

To showcase most of the capabilities of this function, let's build a matrix of 10 samples with 15000 features that follow a negative binomial distribution (rnbinom). Let's assume that these features are genes and give them random names.

set.seed(16)
n <- 15000
# Generate random gene-like IDs; with 15000 draws a few duplicates can
# occur, so only the unique IDs are used below
sapply(1:n, function(i) {
  paste0(c(sample(LETTERS, 1, replace = TRUE),
           tolower(sample(LETTERS, 4, replace = TRUE)),
           sample(1:9, 1, replace = TRUE)), collapse = "")
  }) -> rows_IDs

j <- length(unique(rows_IDs))

Let's now generate a count matrix with 10 mRNA-seq samples.

data.frame(Sample1 = rnbinom(n = j, mu = 400, size = 4)*rnorm(n = 1, mean = 2, sd = 0.5),
           Sample2 = rnbinom(n = j, mu = 395, size = 4)*rnorm(n = 1, mean = 2, sd = 0.5),
           Sample3 = rnbinom(n = j, mu = 396, size = 4)*rnorm(n = 1, mean = 2, sd = 0.5),
           Sample4 = rnbinom(n = j, mu = 397, size = 4)*rnorm(n = 1, mean = 2, sd = 0.5),
           Sample5 = rnbinom(n = j, mu = 430, size = 5)*rnorm(n = 1, mean = 2, sd = 0.5), 
           Sample6 = rnbinom(n = j, mu = 450, size = 5)*rnorm(n = 1, mean = 2, sd = 0.5), 
           Sample7 = rnbinom(n = j, mu = 420, size = 5)*rnorm(n = 1, mean = 2, sd = 0.5), 
           Sample8 = rnbinom(n = j, mu = 410, size = 6)*rnorm(n = 1, mean = 2, sd = 0.5), 
           Sample9 = rnbinom(n = j, mu = 400, size = 6)*rnorm(n = 1, mean = 2, sd = 0.5), 
           Sample0 = rnbinom(n = j, mu = 390, size = 6)*rnorm(n = 1, mean = 2, sd = 0.5), 
           stringsAsFactors = FALSE, 
           row.names = unique(rows_IDs) ) -> df
mat <- as.matrix(df)

To simplify things we can log2-transform the matrix

mat <- log2(mat)

Now doing a PCA is super easy!

showme_PCA2D(mat)

Show extra plots, also easy!

showme_PCA2D(mat = mat, show_variance = TRUE, show_stats = TRUE)

Checking loadings real quick? Very easy!

showme_PCA2D(mat = mat, n_loadings = 12)

Want to show extra info on the plot? Make a metadata dataframe.

data.frame(sample_name = paste0("Sample", 0:9),
           replicate = c(rep(c(1:3), 3), 1), 
           condition = c(rep("A", 5), rep("Z", 5)),
           stringsAsFactors = FALSE) -> mt

Now specify the metadata dataframe with mt and the metadata column that matches the mat column names with mcol; the extra info can then be used to colour and label the points.

showme_PCA2D(mat = mat, mt = mt, mcol = "sample_name", x = 1, y = 3,
             m_label = "replicate", m_fill = "condition")

Select only the most variable rows to use for the PCA.

showme_PCA2D(mat = mat, mt = mt, mcol = "sample_name", x = 1, y = 3,
             n_top_var = 40, m_fill = "replicate", m_label = FALSE)

Do not use the realistic aspect ratio.

showme_PCA2D(mat = mat, mt = mt, mcol = "sample_name", 
             m_fill = "replicate", x = 1, y = 3,
             show_stats = FALSE, m_label = FALSE, 
             real_aspect_ratio = FALSE)

Example with pasta ripiena dataset

Taking the data from this publication

Evolution of the Italian pasta ripiena: the first steps toward a scientific classification. (2024) Discover Food. Nazari, et al.

I recreate in R the dataset presented in Table 1.

pasta <- structure(list(Num = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37), 
               Name = c("Gyoza", "Manti", "Pierogi", "Pelmeni", "Maultaschen", "Kreplach", "Khinkali", "Momo", "Wonton", "Agnolotti_al_plin", "Anolini", "Balanzoni", 
                         "Cappellacci", "Cappelletti", "Caramelle", "Casoncelli", "Casunziei", "Ciaronciè", "Cjalzòns", "Cjarsons", "Culurgiones", "Fagottini", 
                         "Marubini", "Minestra_imbottita", "Pansoti", "Puligioni", "Ravioli", "Scarpinocc", "Schlutzkrapfen", "Tordelli", "Tortel_Dols", "Tortelli", 
                         "Tortelli_alla_lastra", "Tortellini", "Tortelloni", "Turtèl_sguasaròt", "Zembi_d_arzillo"), 
           Provenance = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
           Italian_region = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 1, 0, 0.5, 1, 0, 0, 0, 0), 
           Flour = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
           Eggs = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1), 
           Size = c(1.5, 0.5, 1, 1, 3, 2.5, 2, 1, 1.5, 0, 0, 1.5, 1.5, 0.5, 1, 1, 1, 1, 1, 1, 1.5, 0.5, 0.5, 0, 1, 1.5, 1, 1, 1, 1, 1.5, 1, 3, 0, 2, 3, 1), 
           Flatness = c(1, 1, 0, 1, 0, 0.5, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0.5, 1, 1, 0.5, 0, 1, 0, 0, 1, 0, 0, 0, 0.5, 0, 1, 1, 0, 0), 
           Folding = c(2, 2.5, 0, 4, 0, 2.33, 8, 8, 7.5, 6, 1, 4, 4.5, 4, 7, 3.5, 0, 0, 0, 0, 2, 3, 3, 1, 4, 1, 0, 7, 0, 0, 0, 2, 1, 5, 5, 0, 1), 
           Shape = c(0, 3, 0, 4.5, 1, 1.5, 5, 2.5, 5.5, 1, 3, 6, 6, 6, 4, 4, 0, 1, 0, 3, 5, 5, 4.5, 1, 6, 1, 1, 4, 0, 0, 1, 3.5, 1, 6, 6, 1, 2), 
           Veggie = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0.5, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1), 
           Dairy = c(0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0), 
           Meat = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1), 
           Cooking = c(1, 0, 0.5, 0, 0, 0.5, 0, 3, 0, 0, 0, 0, 0, 0, 0.5, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 2, 0, 0, 0.5, 0), 
           Sauce = c(2, 2, 2, 2, 3, 3, 0, 0, 3, 0, 3, 0, 1, 3, 2, 2, 0, 0, 2, 2, 2, 0, 3, 3, 2, 2, 2, 2, 2, 1, 2, 0, 0, 3, 2, 2, 1), 
           Serving = c(1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 2, 1, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 0, 0, 2, 1, 1), 
           Filling = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 1, 0.5, 0, 0, 0, 0, 0)), 
          class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), 
          row.names = c(NA, -37L), 
          spec = structure(list(cols = list(Num = structure(list(), class = c("collector_double", "collector")), Name = structure(list(), class = c("collector_character", "collector")), Provenance = structure(list(), class = c("collector_double", "collector")), Italian_region = structure(list(), class = c("collector_double", "collector")), Flour = structure(list(), class = c("collector_double", "collector")), Eggs = structure(list(), class = c("collector_double", "collector")), Size = structure(list(), class = c("collector_double", "collector")), Flatness = structure(list(), class = c("collector_double", "collector")), Folding = structure(list(), class = c("collector_double", "collector")), Shape = structure(list(), class = c("collector_double", "collector")), Veggie = structure(list(), class = c("collector_double", "collector")), Dairy = structure(list(), class = c("collector_double", "collector")), Meat = structure(list(), class = c("collector_double", "collector")), Cooking = structure(list(), class = c("collector_double", "collector")), Sauce = structure(list(), class = c("collector_double", "collector")), Serving = structure(list(), class = c("collector_double", "collector")), Filling = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))

I create an optional small metadata dataframe (loading dplyr and tibble for the data wrangling):

library(dplyr)
library(tibble)

pasta |> select(Name, Italian_region) |>
  mutate(Italian_region = factor(Italian_region)) -> mt

I turn the data into a matrix

pasta_mat <- pasta |> 
  select(-Num) |>
  column_to_rownames('Name') |> 
  as.matrix() |>
  t() 

Then I'm ready to explore the data!

showme_PCA2D(pasta_mat, mt = mt, mcol = 'Name', m_label = 'Name', m_fill = 'Italian_region')

PC1 separates by folding and PC2 by origin. You can check this with the loadings.

showme_PCA2D(pasta_mat, mt = mt, mcol = 'Name', m_label = 'Name', m_fill = 'Italian_region', n_loadings = 10)

By specifying x = 2, the loadings of PC2 are plotted instead of those of PC1.

showme_PCA2D(x = 2, pasta_mat, mt = mt, mcol = 'Name', m_label = 'Name', m_fill = 'Italian_region', n_loadings = 10)

One can explore another component with y = 3 and turn off the realistic aspect ratio to better display the names in the plot.

showme_PCA2D(pasta_mat, mt = mt, mcol = 'Name', m_label = 'Name', m_fill = 'Italian_region', y = 3, real_aspect_ratio = FALSE)

And even more!

showme_PCA2D(pasta_mat, mt = mt, mcol = 'Name', m_label = 'Name', m_fill = 'Italian_region', show_variance = TRUE, show_stats = TRUE)

FAQ

Why is the underlying function used to calculate the PCA prcomp() and not princomp()?

The princomp() function uses eigen() to carry out the analysis on the covariance or correlation matrix. Instead, prcomp() uses a technique called singular value decomposition (SVD), which has greater numerical accuracy. However, princomp() is useful when you don't have access to the original data (mat) but you do have a covariance or correlation matrix.
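As a minimal sketch of that last use case, princomp() accepts a covariance matrix through its covmat argument (here reusing cov_mat from the manual PCA section above; without the raw data no scores are returned, but the loadings and standard deviations are available):

# PCA from a covariance matrix alone, without access to the raw data
pc_cov <- princomp(covmat = cov_mat)
pc_cov$loadings
pc_cov$sdev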


