library(learnr) library(knitr) library(moxier) library(printr) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, exercise = FALSE)
First of all, we load the required libraries to proceed.
library(MASS) library(plotrix) library(lattice) library(moments) library(tidyverse)
Our goal is to analyse the data contained within the magnesium.txt
file.
The file contains the measurements of the quantity of magnesium [mmol/l] in the blood of 140 healthy subjects.
Let us load the data.
magnesio <- moxier::magnesio attach(magnesio)
Let us double-check the length of the dataset.
(n <- length(magnesio$Magnesio))
We compute some location and shape indices. Namely, the minimum,
min(Magnesio)
the maximum,
max(Magnesio)
the mean,
mean(Magnesio)
the standard deviation,
sqrt(var(Magnesio)) sd(Magnesio)
the range,
range(Magnesio)
the quartiles and interquartile range,
quantile(Magnesio, 0.25) quantile(Magnesio, 0.5) quantile(Magnesio, 0.75) quantile(Magnesio) # if no specification are provided, all the quartiles are computed IQR(Magnesio)
(moments can also be computed like this:)
(mom <- all.moments(magnesio, central = TRUE, order.max = 4))
skewness (notice that the skewnes of symmetric data is $0$),
skewness(magnesio) (sum((Magnesio - mean(Magnesio))^3) / n) / (sum((Magnesio - mean(Magnesio))^2) / n)^(3 / 2) mom[3 + 1] / mom[2 + 1]^(3 / 2)
kurtosis (and notice that the kurtosis of the normal distribution is $3$.
kurtosis(magnesio) (sum((Magnesio - mean(Magnesio))^4) / n) / (sum((Magnesio - mean(Magnesio))^2) / n)^2 mom[4 + 1] / mom[2 + 1]^2
We detach the data.
detach(magnesio)
Note what happens to the skeweness and Kurtosis indices when distributions are symmetrical or not. Since skewness is a measure of symmetry (more precisely, of the lack of symmetry), data sets with high kurtosis tend to have heavy tails, or outliers. Since Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution, data sets with low kurtosis tend to have light tails, or lack of outliers.
We now check this against a Normal distrubtion (which is symmetric)
dat <- rnorm(100000, mean = 0, sd = 1) summary(dat)
The skewness is almost $0$, like in all symmetric distributions.
skewness(dat)
Kurtosis is close to $3$: light tails and no tendency to outlier with respect to the Normal distribution.
kurtosis(dat)
We make the same analysis simulating an exponential distribution. We set the rate $\lambda = 2$. The theoretical mean is equal to $\frac{1}{\lambda}$.
dat <- rexp(100000, rate = 2) # the theoretical mean is equal to 1/rate summary(dat)
We make an histogram.
hist(dat)
The positive skewness means that the right tail is long with respect to the left tail.
skewness(dat)
The high kurtosis implies a heavy tail and a tendency to outliers with respect to the Normal distribution.
kurtosis(dat)
We now consider the cavendish.txt
data file, from Stigler. It consists of $29$ measurements of earth density, carried out by Henry Cavendish in 1798.
The goal is to analyse Cavendish data and detect any outlier.
caven <- moxier::caven attach(caven)
We compute some location and shape indices. Namely, the minimum,
min(Density)
the maximum,
max(Density)
the mean,
mean(Density)
the standard deviation,
sqrt(var(Density)) sd(Density)
the range,
range(Density)
the quartiles and interquartile range,
(q1 <- quantile(Density, 0.25)) quantile(Density, 0.5) (q3 <- quantile(Density, 0.75)) quantile(Density) (iqr <- IQR(Density))
(moments can also be computed like this:)
(mom <- all.moments(Density, central = TRUE, order.max = 4))
skewness (notice that the skewnes of symmetric data is $0$),
skewness(Density) (sum((Density - mean(Density))^3) / n) / (sum((Density - mean(Density))^2) / n)^(3 / 2) mom[3 + 1] / mom[2 + 1]^(3 / 2)
kurtosis (and notice that the kurtosis of the normal distribution is $3$.
kurtosis(Density) (sum((Density - mean(Density))^4) / n) / (sum((Density - mean(Density))^2) / n)^2 mom[4 + 1] / mom[2 + 1]^2
We make some graphical representations of the continuous variable Density
.
histogram <- hist(Density, plot = TRUE, breaks = seq(4, 6, .25))
boxplot(Density, horizontal = FALSE, main = "Boxplot of Density", ylab = "earth density")
Notice the plot below the lower whisker.
We try to identify potential outliers.
Density[which(Density < q1 - 1.5 * iqr)] Density[which(Density > q3 + 1.5 * iqr)]
We find the point below the whisker! We remove the outlier.
new_density <- Density[-which(Density < q1 - 1.5 * iqr)] new_density
Check the lengths!
length(Density) length(new_density)
We compare location and shape indices before and after the removal of the outlier. Robust indices do not change a lot.
knitr::kable(rbind(quantile(Density), quantile(new_density)))
The mean and (in particular) the standard deviation are more heavily affected.
knitr::kable(rbind(cbind(mean(Density), sd(Density)), cbind(mean(new_density), sd(new_density))), col.names = c("Mean", "Standard Deviation"))
We also make a boxplot of the data without the outlier.
boxplot(new_density, horizontal = FALSE, main = "Boxplot Density", ylab = "earth density")
We are finally ready to detach the data set.
detach(caven)
Repeat the previous analysis on the dataset serum.txt
:
Data are taken from [Bland]. They contain the triglyceride concentration in blood serum [mmol/l] of the umbilical cord in 282 subjects.
serum <- moxier::serum
We now turn our attention to a third dataset, as found in Mackowiak and Wasserman. The dataset comprises 130 tri-variate observations:
The goals of the analysis are to:
Temperatura
through suitable indices.temp <- moxier::temp attach(temp) dim(temp) dimnames(temp) temp n <- length(Temperatura)
The mean is:
mean(Temperatura)
The variance is:
var(Temperatura)
The population variance is obtained donig $$\frac{n - 1}{n} \mathrm{Var}\left(\mathrm{Temperatura}\right).$$
The standard deviation is:
sd(Temperatura)
The population standard deviation is obtained donig $$\frac{n - 1}{n} \mathrm{sd}\left(\mathrm{Temperatura}\right).$$
The minimum and the maximum are:
min(Temperatura) max(Temperatura)
The range:
range(Temperatura) diff(range(Temperatura))
The quantiles:
quantile(Temperatura) Q1 <- quantile(Temperatura, 0.25) Q3 <- quantile(Temperatura, 0.75) IQR(Temperatura)
We make a histogram
hist(Temperatura, prob = TRUE, main = "Histogram", xlab = "body temperature [°F]", ylab = "density")
and a boxplot.
boxplot(Temperatura, horizontal = FALSE, main = "Boxplot", ylab = "body temperature [°F]", ylim = c(96, 101))
To aid us with the analysis, we use the tapply()
function.
It takes three main arguments:
We compute the mean.
tapply(Temperatura, Sesso, mean)
Exercise What do the following mean:
tapply(Temperatura, Sesso, var)
tapply(Temperatura, Sesso, sd)
tapply(Temperatura, Sesso, min)
tapply(Temperatura, Sesso, max)
We now compute the range.
tapply(Temperatura, Sesso, range)
To compute the range within the categories, we resort to the diff()
function.
diff(tapply(Temperatura, Sesso, range)$D) diff(tapply(Temperatura, Sesso, range)$U)
Let us investigate the quantiles. We print the IQR of women and men, respectively, noting that men's IQE is wider than women's IQR.
(Q <- tapply(Temperatura, Sesso, quantile)) Q1 <- c(Q$D[2], Q$U[2]) Q3 <- c(Q$D[4], Q$U[4]) Q3 - Q1
Exercise
Compute the $90^{\mathrm{th}}$ quantile for the variable Temperatura
using the tapply()
command.
Q_90 <- tapply(Temperatura, Sesso, quantile, probs = 0.9) Q_90
We now make some histrograms.
par(mfrow = c(1, 2)) # to make partition of the graph window hist(Temperatura[Sesso == "D"], prob = TRUE, main = "Histogram Women", xlab = "body temperature [°F]", ylab = "Density", col = "pink", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1) ) hist(Temperatura[Sesso == "U"], prob = TRUE, main = "Histogram Men", xlab = "body temperature [°F]", ylab = "Density", col = "lightblue", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1) )
par(mfrow = c(1, 2)) hist(Temperatura[Sesso == "D"], prob = TRUE, main = "Histogram Men Women", xlab = "body temperature [°F]", ylab = "Density", col = "pink", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1) ) hist(Temperatura[Sesso == "U"], prob = TRUE, xlab = "body temperature [°F]", ylab = "Densit?", col = "lightblue", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1), add = TRUE ) hist(Temperatura[Sesso == "U"], prob = TRUE, main = "Histogram Women Men", xlab = "body temperature [°F]", ylab = "Density", col = "lightblue", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1) ) hist(Temperatura[Sesso == "D"], prob = TRUE, xlab = "body temperature [°F]", ylab = "Densit?", col = "pink", xlim = range(Temperatura), breaks = seq(96, 101, .25), ylim = c(0, 1), add = TRUE )
temp %>% tibble() %>% ggplot() + geom_density(mapping = aes(Temperatura, fill = Sesso), alpha = 0.2)
We now turn our attention to boxplots.
boxplot(Temperatura ~ Sesso, data = temp, horizontal = FALSE, main = "Boxplot", names = c("Donne", "Uomini"), col = c("pink", "lightblue"), ylab = "body temperature [°F]", ylim = c(94, 102) )
temp %>% tibble() %>% ggplot() + geom_boxplot(mapping = aes(x = Sesso, y = Temperatura, fill = Sesso)) + scale_x_discrete(name = "Gender") + scale_y_continuous(name = "Temperature [°F]")
To conclude, we detach the data.
detach(temp)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.