knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE
)
library(ggplot2)

First things first

We will need the packages corrplot, FactoMineR and factoextra:

library(corrplot)
library(FactoMineR)
library(factoextra)

We also need to load dplyr, tidyr and the "fruits" data, :

library(dplyr)
library(tidyr)
data("fruits", package = "ReMUSE")

Unsupervised/Exploratory vs Supervised Analysis

The two exploratory methods we are interested in...

Program for this morning

Covariance and correlation :

Principal Component Analysis:

Covariance and correlation

A little bit of history {.columns-2}

{width="100%"}

{width="100%"}

On fruits {.columns-2}

library(ggplot2)
ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point() + theme_bw()

\newcolumn

ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()

Exercise {.columns-2}

Make a graph as similar as the one the right!

Try to focus on the essential.

ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point(color = "white", shape  = 2) + theme_dark() + scale_y_log10() + scale_x_log10() + theme(panel.grid = element_line(color = "lightgrey"))

Barycenter (reminder)

ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point(alpha = 0.5) + 
  annotate("point", 
           x = mean(fruits$Phosphore), 
           y = mean(fruits$Potassium),
           shape = 17, size = 5, col = 2) +
  annotate("label", 
         x = mean(fruits$Phosphore) + 10, 
         y = mean(fruits$Potassium) + 50,
         label = "Barycenter",
         size = 5, col = 2) +
  annotate("text", 
       x = mean(fruits$Phosphore) + 6, 
       y = 50,
       label = "Mean of x",
       col = 2) +
  annotate("text", 
     x = 6.5, 
     y = mean(fruits$Potassium) + 50,
     label = "Mean of y",
     col = 2) +
  annotate("segment", 
       x = mean(fruits$Phosphore), 
       xend = 0,
       y = mean(fruits$Potassium),
       yend = mean(fruits$Potassium),
       col = 2 , linetype = "dashed") +
  annotate("segment", 
     x = mean(fruits$Phosphore), 
     xend = mean(fruits$Phosphore),
     y = mean(fruits$Potassium),
     yend = 0,
     col = 2 , linetype = "dashed") +
  scale_x_log10() + scale_y_log10() +
  theme_bw()

Covariance (reminder)

How far away is a "dot" from the barycenter ? Individual rectangle area :

[ \left(x_i - \bar x\right) \times \left(y_i - \bar y\right) ]

The covariance is (almost) the mean area:

[ \operatorname{cov} (x, y) = \frac{1}{n-1} \sum_{i=1}^n\left(x_i - \bar x\right) \times \left(y_i - \bar y\right) ]

Correlation coefficient

Covariance can vary between $-\infty$ and $+\infty$.

Correlation is, by definition, a measure of linear relationship between -1 and +1:

Pearson's correlation coefficient

[ \operatorname{cor}(x, y) = \frac{\displaystyle\sum_{i = 1}^n (x_i - \bar x)(y_i - \bar y)}{\displaystyle\sqrt{\sum_{i = 1}^n (x_i - \bar x)^2}\sqrt{\sum_{i = 1}^n (y_i - \bar y)^2}} ]

... in short...

[ \operatorname{cor}(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x) \operatorname{sd}(y) } ]

"Exercises" {.columns-2}

Covariance between two variables with R:

cov(fruits$Potassium, 
    fruits$Phosphore)

Correlation between two variables with R:

cor(fruits$Potassium, 
    fruits$Phosphore)

Exercise / challenge !

Compute the correlation between fruits$Potassium and fruits$Phosphore.

Constraints!

Use only the following functions/operations:

Spearman's correlation coefficient

Often noted $\rho$. Same as Pearson's, but on the ranks! Let:

[ \rho(x, y) = \operatorname{cor}(r_x, r_y) ]

Key properties:

"Exercises" {.columns-2}

With the argument method set to "spearman:

cor(fruits$Potassium, 
    fruits$Phosphore, 
    method = "spearman")

Same, but with the rank function:

cor(rank(fruits$Potassium),
    rank(fruits$Phosphore))

Kendall's correlation coefficient

Pairs of points: select two samples $i$ and $j$.

[ \tau(x, y) = \displaystyle \frac{n_C - n_D}{n_0}, ] where $n_C$ is the number of concordant pairs, $n_D$ the number of discordant pairs and $n_0$ total number of pairs.

"Exercise": compare and comment {.columns-2}

Pearson: nice linear relationship.

cor(fruits$Potassium, 
    fruits$Phosphore, 
    method = "pearson")

Spearman: relationship is non-linear but monotonous.

cor(fruits$Potassium, 
    fruits$Phosphore, 
    method = "spearman")

Kendall: ex-aequos.

cor(fruits$Potassium, 
    fruits$Phosphore, 
    method = "kendall")

Beware of naked numbers!

Datasaurus

Plot the correlation {.columns-2}

To compute all correlations:

cormat <- cor(fruits[, -(1:2)])

To make a "correlogram":

corrplot(cormat)

corrplot(cormat)

Exercise

Make a correlogram on the fruit data and change the color of the labels.

PCA on a two-dimensional data-set

Recall the first graph {.columns-2}

The real data:

ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()

To make my life easier

xy <- data.frame(
  scale(
    cbind(
      x = log10(fruits$Phosphore),
      y = log10(fruits$Potassium))))

ggplot(xy, aes(x)) + 
  geom_point(aes(y = y)) + 
  theme_bw() + coord_equal()

Exercise / discussion: What did I do? {.columns-2}

ggplot(fruits, aes(Phosphore, Potassium)) + 
  geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()

xy <- data.frame(
  scale(
    cbind(
      x = log10(fruits$Phosphore),
      y = log10(fruits$Potassium))))

ggplot(xy, aes(x)) + 
  geom_point(aes(y = y)) + 
  theme_bw() + coord_equal()

Linear Regression, aka Linear Model {.columns-2}

res.lm <- lm(y ~ x, data = xy)
ab <- coef(res.lm)

xy <- data.frame(
  xy, 
  yhat = predict(res.lm))

ggplot(xy, aes(x)) + 
  geom_point(aes(y = y)) + 
  geom_abline(intercept = ab[1], slope = ab[2], color = 3) + 
  geom_point(aes(y = yhat), color = 3, shape = 4) + 
  geom_segment(aes(y = y, xend = x, yend = yhat), color = 3) + 
  theme_bw() + coord_equal()

First principal component {.columns-2}

res.pc <- prcomp(xy[, 1:2], scale. = FALSE)
pcslope <- res.pc$rotation[2, 1] / res.pc$rotation[1, 1]
pcintercept <- 0
ROT1 <- res.pc$rotation[, 1] %*% t(res.pc$rotation[, 1])
pchatx <- (as.matrix(xy[, 1:2]) %*% ROT1)[, 1]
pchaty <- (as.matrix(xy[, 1:2]) %*% ROT1)[, 2]

xy <- data.frame(
  xy, 
  pchatx = pchatx,
  pchaty = pchaty)

dat.ell <- data.frame(ellipse::ellipse(cov(xy[, 1:2])))  

ggplot(xy) + 
  geom_point(aes(x = x, y = y)) + 
  geom_abline(intercept = pcintercept, slope = pcslope, color = 2) + 
  geom_point(aes(x = pchatx, y = pchaty), color = 2, shape = 3) + 
  geom_segment(aes(x = x, y = y, xend = pchatx, yend = pchaty), color = 2) +
  geom_path(mapping = aes(x, y), data = dat.ell, color = 2, alpha = 0.5) +
  theme_bw() + coord_equal()

Both on the same plot

ggplot(xy) + 
  geom_point(aes(x = x, y = y)) + 
  geom_abline(intercept = pcintercept, slope = pcslope, color = 2) + 
  geom_abline(intercept = ab[1], slope = ab[2], color = 3) + 
  theme_bw() + coord_equal()

Switching to another set of slides...



vguillemot/ReMUSE documentation built on Dec. 23, 2021, 3:09 p.m.