knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE ) library(ggplot2)
We will need the packages corrplot
, FactoMineR
and factoextra
:
corrplot
, FactoMineR
and factoextra
are installedlibrary(corrplot) library(FactoMineR) library(factoextra)
We also need to load dplyr
, tidyr
and the "fruits" data, :
library(dplyr) library(tidyr) data("fruits", package = "ReMUSE")
Principal Component Analysis, whose goal is to extract the most "important" sources of variability in the quantitative data
Hierachical Agglomerative Clustering, whose goal is to define clusters of samples
Covariance and correlation :
Principal Component Analysis:
FactoMineR
and visualization with factoextra
{width="100%"}
{width="100%"}
library(ggplot2) ggplot(fruits, aes(Phosphore, Potassium)) + geom_point() + theme_bw()
\newcolumn
ggplot(fruits, aes(Phosphore, Potassium)) + geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()
Make a graph as similar as the one the right!
Try to focus on the essential.
ggplot(fruits, aes(Phosphore, Potassium)) + geom_point(color = "white", shape = 2) + theme_dark() + scale_y_log10() + scale_x_log10() + theme(panel.grid = element_line(color = "lightgrey"))
ggplot(fruits, aes(Phosphore, Potassium)) + geom_point(alpha = 0.5) + annotate("point", x = mean(fruits$Phosphore), y = mean(fruits$Potassium), shape = 17, size = 5, col = 2) + annotate("label", x = mean(fruits$Phosphore) + 10, y = mean(fruits$Potassium) + 50, label = "Barycenter", size = 5, col = 2) + annotate("text", x = mean(fruits$Phosphore) + 6, y = 50, label = "Mean of x", col = 2) + annotate("text", x = 6.5, y = mean(fruits$Potassium) + 50, label = "Mean of y", col = 2) + annotate("segment", x = mean(fruits$Phosphore), xend = 0, y = mean(fruits$Potassium), yend = mean(fruits$Potassium), col = 2 , linetype = "dashed") + annotate("segment", x = mean(fruits$Phosphore), xend = mean(fruits$Phosphore), y = mean(fruits$Potassium), yend = 0, col = 2 , linetype = "dashed") + scale_x_log10() + scale_y_log10() + theme_bw()
How far away is a "dot" from the barycenter ? Individual rectangle area :
[ \left(x_i - \bar x\right) \times \left(y_i - \bar y\right) ]
The covariance is (almost) the mean area:
[ \operatorname{cov} (x, y) = \frac{1}{n-1} \sum_{i=1}^n\left(x_i - \bar x\right) \times \left(y_i - \bar y\right) ]
Covariance can vary between $-\infty$ and $+\infty$.
Correlation is, by definition, a measure of linear relationship between -1 and +1:
[ \operatorname{cor}(x, y) = \frac{\displaystyle\sum_{i = 1}^n (x_i - \bar x)(y_i - \bar y)}{\displaystyle\sqrt{\sum_{i = 1}^n (x_i - \bar x)^2}\sqrt{\sum_{i = 1}^n (y_i - \bar y)^2}} ]
... in short...
[ \operatorname{cor}(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x) \operatorname{sd}(y) } ]
Covariance between two variables with R:
cov(fruits$Potassium, fruits$Phosphore)
Correlation between two variables with R:
cor(fruits$Potassium, fruits$Phosphore)
Compute the correlation between fruits$Potassium
and fruits$Phosphore
.
Use only the following functions/operations:
length
to compute $n$,*
to multiply, /
to divide, -
to substract,mean
to compute the mean,sd
to compute the standard-deviation,sum
for the "$\sum_{i=1}^n$"Often noted $\rho$. Same as Pearson's, but on the ranks! Let:
[ \rho(x, y) = \operatorname{cor}(r_x, r_y) ]
Key properties:
With the argument method
set to "spearman
:
cor(fruits$Potassium, fruits$Phosphore, method = "spearman")
Same, but with the rank
function:
cor(rank(fruits$Potassium), rank(fruits$Phosphore))
Pairs of points: select two samples $i$ and $j$.
[ \tau(x, y) = \displaystyle \frac{n_C - n_D}{n_0}, ] where $n_C$ is the number of concordant pairs, $n_D$ the number of discordant pairs and $n_0$ total number of pairs.
Pearson: nice linear relationship.
cor(fruits$Potassium, fruits$Phosphore, method = "pearson")
Spearman: relationship is non-linear but monotonous.
cor(fruits$Potassium, fruits$Phosphore, method = "spearman")
Kendall: ex-aequos.
cor(fruits$Potassium, fruits$Phosphore, method = "kendall")
To compute all correlations:
cormat <- cor(fruits[, -(1:2)])
To make a "correlogram":
corrplot(cormat)
corrplot(cormat)
Make a correlogram on the fruit data and change the color of the labels.
The real data:
ggplot(fruits, aes(Phosphore, Potassium)) + geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()
To make my life easier
xy <- data.frame( scale( cbind( x = log10(fruits$Phosphore), y = log10(fruits$Potassium)))) ggplot(xy, aes(x)) + geom_point(aes(y = y)) + theme_bw() + coord_equal()
ggplot(fruits, aes(Phosphore, Potassium)) + geom_point() + theme_bw() + scale_y_log10() + scale_x_log10()
xy <- data.frame( scale( cbind( x = log10(fruits$Phosphore), y = log10(fruits$Potassium)))) ggplot(xy, aes(x)) + geom_point(aes(y = y)) + theme_bw() + coord_equal()
res.lm <- lm(y ~ x, data = xy) ab <- coef(res.lm) xy <- data.frame( xy, yhat = predict(res.lm)) ggplot(xy, aes(x)) + geom_point(aes(y = y)) + geom_abline(intercept = ab[1], slope = ab[2], color = 3) + geom_point(aes(y = yhat), color = 3, shape = 4) + geom_segment(aes(y = y, xend = x, yend = yhat), color = 3) + theme_bw() + coord_equal()
res.pc <- prcomp(xy[, 1:2], scale. = FALSE) pcslope <- res.pc$rotation[2, 1] / res.pc$rotation[1, 1] pcintercept <- 0 ROT1 <- res.pc$rotation[, 1] %*% t(res.pc$rotation[, 1]) pchatx <- (as.matrix(xy[, 1:2]) %*% ROT1)[, 1] pchaty <- (as.matrix(xy[, 1:2]) %*% ROT1)[, 2] xy <- data.frame( xy, pchatx = pchatx, pchaty = pchaty) dat.ell <- data.frame(ellipse::ellipse(cov(xy[, 1:2]))) ggplot(xy) + geom_point(aes(x = x, y = y)) + geom_abline(intercept = pcintercept, slope = pcslope, color = 2) + geom_point(aes(x = pchatx, y = pchaty), color = 2, shape = 3) + geom_segment(aes(x = x, y = y, xend = pchatx, yend = pchaty), color = 2) + geom_path(mapping = aes(x, y), data = dat.ell, color = 2, alpha = 0.5) + theme_bw() + coord_equal()
ggplot(xy) + geom_point(aes(x = x, y = y)) + geom_abline(intercept = pcintercept, slope = pcslope, color = 2) + geom_abline(intercept = ab[1], slope = ab[2], color = 3) + theme_bw() + coord_equal()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.