library(learnr)
library(dplyr)
library(ggplot2)
knitr::opts_chunk$set(echo = FALSE)
## Variables shared by all exercices
set.seed(1)
large_jar <- sample(x = c(1, 3, 9), size = 200, replace = TRUE)

Welcome to this class

Learning goals

The goal of this class is to provide a numerical illustration of random variables and their properties. Unless specified otherwise, all the artworks used here are by the awesome @allison_horst whose work you can find here. We'll review basic concepts and show how you can compute them using R. Specifically, you’ll learn about:

Random variables

Overview

You can think of a random variable as a variable whose values depend on outcomes of a random phenomenon.

In the next section, we'll classify the different types of random variables

Discrete versus continuous

knitr::include_graphics(path = "images/continuous_discrete.png")

In many examples, even if a variables is not strictly continuous, it makes sense to sense to consider it as continuous.

For example, in the previous example about weight, the weight of a human at a given time can only take $\sim$ 5 billions values are there only $\sim$ 5 billions living adults. Yet, it is a lot more convenient to think of it as continuous.

Note also that a variable can be discrete but still take infinitely many values. If you note $X$ the number of throws of a single coin it takes to observe a head, $X$ can take any value in $\mathbb{N}^\star$.

Binary, Ordinal, Nominal

Discrete variables don't need to be numeric, as illustrated below.

knitr::include_graphics(path = "images/nominal_ordinal_binary.png")

Ordinal values are ordered and can often be recoded naturally using a numeric scale:

You can also recode nominal nominal variables on a numeric scale but this is quite arbitrary

Binary values are a special case as they can always be recoded to $0$ and $1$:

In that last case, we don't need to think of 1 as being greater than 0, it's just not the same.

question("What data structure should you use to encode a continuous variable?", 
         answer("logical vector"),
         answer("numeric/double vector", correct = TRUE), 
         answer("character vector"), 
         answer("factor vector"),
         answer("integer vector", message = "Your random variable may take non-integer values"), 
         allow_retry = TRUE
         )
question("What data structure should you use to encode a nominal variable?", 
         answer("logical vector"),
         answer("numeric/double vector"), 
         answer("character vector"), 
         answer("factor vector", correct = TRUE, message = "Indeed, the levels (potential values) are not ordered and the factor is a perfect way to capture that fact."),
         answer("integer vector", message = "You may use a integer vector but the encoding is arbitrary"), 
         allow_retry = TRUE
         )
question("What data structure should you use to encode a ordinal variable?", 
         answer("logical vector"),
         answer("numeric/double vector", correct = TRUE, message = "This is less memory efficient than integers but works nicely."),
         answer("character vector"), 
         answer("factor vector", message = "If you don't order the levels correctly, you lose the fact that your variable is ordinal."),
         answer("integer vector", correct = TRUE, message = "Indeed, that's the best way. Can you think of another one that works."), 
         allow_retry = TRUE
         )
question("What data structure should you use to encode a binary variable?", 
         answer("logical vector", correct = TRUE, message = "Indeed, that's the most effeicient way."),
         answer("numeric/double vector", correct = TRUE, message = "This is less memory efficient than booleans but works."),
         answer("character vector"), 
         answer("factor vector", correct = TRUE, message = "A nice solution if you want to keep informative labels."),
         answer("integer vector", correct = TRUE, message = "This is less memory efficient than booleans but works."), 
         allow_retry = TRUE
         )

Coding random variables

We've seen previously that different random variables can be encoded using different datatypes. In practice, we're only going to use numeric and factor:

Sampling

Overview

Let's move back to our dice example. If the dice is fair, when we throw the dice once, we're equally likely to get $1, 2, \dots, 6$. A simple way to mimick this in R is with the sample() function.

Look at the help of sample() using ?sample() (possibly in another session) and complete the following code to throw a dice in your computer. Repeat multiple times and try to understand what happens.

sample(, size = 1)
sample(x = 1:6, size = 1) ## remember that 1:6 is a shortcut for c(1, 2, 3, 4, 5, 6)

Multiple throws

Imagine now that you want to mimick mutiple throws.

This is called sampling with replacement.

Complete the code to generate a sample of 10 dice throws using sample().

sample()
"Look at the size argument. What value should it take?"
"Look at the replace argument. What value should it take?"
sample(x = 1:6, size = 10, replace = TRUE)

Diamonds in a jar

Let's consider a different example. Imagine a jar filled with diamonds of different sizes. The jar contains:

Using the function rep() (look at the help to see how it works), create a vector jar that encodes the content of the jar. Each element of jar should be a diamond and jar[i] should be the size of the $i$-th diamond.

One solution would be jar <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 8) but we want to do less typing.

jar <- rep()
"Look at the examples and the \"times\" arguments"
jar <- rep(c(1, 4, 8), times = )
jar <- rep(c(1, 4, 8), times = c(10, 5, 1))

Complete the code to draw one diamond from the jar.

jar <- rep()
first_diamond <- sample(x = , size = )
jar <- rep(c(1, 4, 8), times = c(10, 5, 1))
first_diamond <- sample(x = jar, size = 1)

Imagine that you were lucky and picked the big diamond (the one of size 8) during your first draw.

Write the code to draw a second diamond from the jar.

jar <- rep()
second_diamond <- sample(x = , size = )
"Think carefully about the current content of the jar"
jar <- rep(c(1, 4), times = c(10, 5))
second_diamond <- sample(x = , size = )
jar <- rep(c(1, 4), times = c(10, 5))
second_diamond <- sample(x = jar, size = 1)

Diamonds in a jar (II)

The previous example illsutrated sampling without replacement: the first draw modifies the content of the jar and hence, the second draw.

We drew two diamonds (first_diamond and second_diamond) sequentially but you can do it all at once using sample(). Try to do it by completing the following code:

jar <- rep(c(1, 4, 8), times = c(10, 5, 1))
first_and_second_diamonds <- sample(x = jar, )
"Look at the `size` and the `replace` arguments"
jar <- rep(c(1, 4, 8), times = c(10, 5, 1))
first_and_second_diamonds <- sample(x = jar, size = )
jar <- rep(c(1, 4, 8), times = c(10, 5, 1))
first_and_second_diamonds <- sample(x = jar, size = 2, replace = FALSE)

Finite versus infinite populations

The biggest difference between finite populations (diamonds) and infinite populations (dice rolls) is that you can roll a dice infinitely many times but you run out of diamonds at some point.

In particular:

Futhermore, whenever the random variable is associated to a finite population, it is discrete. Indeed if the population has size $N$, the random variables takes at most $N$ different values (and often less than that, as in our diamond example).

question("Note X the size of a diamond picked at random from the previous jar. How many different values can X take?", 
         answer("16", message = "There are indeed 16 diamonds but are there 16 different sizes?"), 
         answer("1"), 
         answer("3", TRUE), 
         answer("8"), 
         allow_retry = TRUE)

Descriptors of random variables

Overview

We're going to illustrate the notions of

on finite populations using our diamonds in a jar example. Those concepts can be extended to infinite populations (or equivalently random variables that can take infinitely many values) but we'll leave that for later.

Hereafter, we note $X$ (the random variable corresponding to) the size of a diamond picked at random from the jar.

Probability distribution

Theory

question("Select all possible values for X:", 
         answer("1", TRUE), 
         answer("4", TRUE), 
         answer("5", FALSE, message = "No diamond has size 5."), 
         answer("8", TRUE), 
         answer("10", FALSE, message = "No diamond has size 10."), 
         allow_retry = TRUE)

The set of all possible values for $X$ (i.e. diamond sizes) is noted $\Omega = {x_1, x_2, x_3}$.

The probability distribution function (or pdf) of $X$ if the function: $$ x \in \Omega \mapsto \mathbb{P}(X = x) $$ which associates to each diamond size the probability of picking a diamond of that size.

quiz(
  question("How much is $\\mathbb{P}(X = 0)$", 
           answer("0", TRUE), 
           answer("5/8"), 
           answer("5/16"), 
           answer("1/16"), 
           answer("1")), 
  question("How much is $\\mathbb{P}(X = 8)$", 
           answer("0"), 
           answer("1/8"), 
           answer("1/16", TRUE), 
           answer("1"))
)
question("Select all possible values for $\\mathbb{P}(X = x)$:", 
         answer("0", TRUE),
         answer("1/16", TRUE),
         answer("1/8"),
         answer("1/4"),
         answer("5/16", TRUE),
         answer("5/8", TRUE),
         answer("1"), 
         allow_retry = TRUE)

There is no straighforward way to build the pdf of $X$ (there is for standard random variables). You can however build a related quantity: the frequency of different diamond sizes in the full populations using table()

jar <- rep(c(1, 4, 8), times = c(10, 5, 1))
table(jar)

We consider a larger jar, whose content is stored in large_jar and note $Y$ the size of a diamond picked at random from the large jar.

large_jar
table(large_jar)

Use table() in the previous chunck to answer the following questions:

quiz(caption = "Large diamond jar", 
  question("How many diamonds are stored in the large jar?", 
           answer("16"), 
           answer("66"),
           answer("200", TRUE)),
  question("Select all diamond sizes in the large jar", 
           answer("1", TRUE), 
           answer("3", TRUE), 
           answer("4"), 
           answer("8"), 
           answer("9", TRUE), 
           allow_retry = TRUE),
  question("How much is $\\mathbb{P}(Y = 1)$ ?", 
           answer("5/8"), 
           answer("1/66"),
           answer("33/100", TRUE)) 
)

Expectation

The expectation of a random variable $X$, noted $\mathbb{E}(X)$, is its average value. Assume that $X$ is associated to a population of size $N$ and that individual $i$ has value $x_i$, so that

You can compute it using a formula based either on the full set of values $(x_i){i = 1\dots N}$ : $$ \mathbb{E}(X) = \frac{1}{N}\sum{i = 1}^N x_i $$ or the probability distribution: $$ \mathbb{E}(X) = \sum_{x \in \Omega} x \mathbb{P}(X = x) $$

In our (small) jar example, the first formula gives: $$ \mathbb{E}(X) = \frac{1}{16} \left( \underbrace{1 + \dots + 1}{\times 10} + \underbrace{4 + \dots + 4}{\times 5} + \underbrace{8}_{\times 1} \right) = 2.375 $$

You can of course simplify it to the second formula $$ \mathbb{E}(X) = 1 \times \underbrace{\frac{10}{16}}{\mathbb{P}(X = 1)} + 4 \times \underbrace{\frac{5}{16}}{\mathbb{P}(X = 4)} + 8 \times \underbrace{\frac{1}{16}}_{\mathbb{P}(X = 8)} = 2.375 $$ and obtain the same result.

Graphically we have:

ggplot(data.frame(id = 1:length(jar), size = jar), 
       aes(x = size)) + 
  geom_bar() + 
  scale_y_continuous(breaks = seq(0, 10, 2)) + 
  geom_vline(xintercept = mean(jar), color = "red") +
  annotate("text", x = mean(jar), y = 9, color = "red", 
           hjust = -0.1, vjust = 1, label = "E(X): Average\ndiamond size ") + 
  theme_bw() + 
  labs(x = "Size", y = "Number of diamonds")

Look at the functions sum() and length() and compute $\mathbb{E}(Y)$ using the first formula

sum_of_y <- 
length_of_y <- 
expectation_of_y <- 
expectation_of_y  
sum_of_y <- sum(y)
length_of_y <- 
expectation_of_y <- 
expectation_of_y  
sum_of_y <- sum(y)
length_of_y <- length(y)
expectation_of_y <- 
expectation_of_y  
sum_of_y <- sum(y)
length_of_y <- length(y)
expectation_of_y <- sum_of_y / length_of_y
expectation_of_y  

Now look at mean() and suggest an alternative way of computing $\mathbb{E}(Y)$.

expectation_of_y <- 
expectation_of_y
expectation_of_y <- mean(large_jar)
expectation_of_y

Computing $\mathbb{E}(Y)$ from the second formula is slightly more involved. The following code allows you to do it but is rarely used in practice.

## Find values of Y 
values_of_y <- as.numeric(names(table(large_jar)))
values_of_y
## Compute pdf of Y
pdf_of_y <- table(large_jar) / length(large_jar)
pdf_of_y
expectation_of_y <- sum(values_of_y * pdf_of_y)

Variance and Standard deviation

The variance of $X$, noted $\mathbb{V}(X)$, is a measure of dispersion around the mean defined as the quadratic deviation around the mean.

Formally: $$ \mathbb{V}(X) = \mathbb{E}( \left[ X - \mathbb{E}(X) \right]^2) = \frac{1}{N} \sum_{i = 1}^N (x_i - \mathbb{E}(X))^2 $$ A famous equality shows that the variance can be computed using the much simpler formula $$ \mathbb{V}(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2 = \frac{1}{N} \sum_{i=1}^N x_i^2 - \left(\frac{1}{N} \sum_{i=1}^N x_i \right)^2 $$ Caution Be very careful with the position of the exponents $^2$ and the parenthesis.

In our example:

plot_data <- data.frame(size = jar) %>% 
  count(size, name = "count") %>% 
  mutate(prob              = count / sum(count),
         mean              = mean(jar), 
         deviation         = size - mean, 
         squared_deviation = deviation^2, 
         squared_size      = size^2)

And we can compute the variance easily as: $$ \begin{align} \mathbb{E}(X^2) & = 1^2 \times \frac{10}{16} + 4^2 \times \frac{5}{16} + 8^2 \times \frac{1}{16} = 9.625 \ \mathbb{E}(X)^2 & = \left( 1 \times \frac{10}{16} + 4 \times \frac{5}{16} + 8 \times \frac{1}{16} \right)^2 = 5.640625 \ \mathbb{V}(X) & = \mathbb{E}(X^2) - \mathbb{E}(X)^2 = 3.984375 \end{align} $$

ggplot(plot_data, aes(x = squared_deviation, y = count)) + 
  geom_bar(stat = "identity") + 
  scale_y_continuous(breaks = seq(0, 10, 2)) + 
  geom_vline(xintercept = var(jar), color = "red") +
  annotate("text", x = var(jar), y = 9, color = "red", 
           hjust = -0.1, vjust = 1, label = "V(X): Average\nquadratic deviation ") + 
  theme_bw() + 
  labs(x = "Quadratic deviation to the mean", y = "Number of diamonds")

Using mean() and the power function ^, compute the variance of $Y$.

var_of_y <- 
var_of_y
var_of_y <- mean(large_jar^2) - mean(large_jar)^2
var_of_y

R provides the built-in function var() to compute the variance. Compare your result with one provided by var()

manu_var <- ## using the manual computation
auto_var <- ## using var
auto_var - manu_var
manu_var <- mean(large_jar^2) - mean(large_jar)^2
auto_var <- var(large_jar)
auto_var - manu_var

It seems that there is a small but non null difference between the two quantities. That's normal. For reasons you'll understand later, the variance $\mathbb{V}_R(X)$ computed by var() (and by the same function in most computer languages) is slightly different from the one defined above. The two are related by the relation

$$ \mathbb{V}R(X) = \frac{N}{N-1} \mathbb{V}(X) = \frac{1}{N-1} \sum{i=1}^N \left( x_i - \mathbb{E}(X) \right)^2 $$

As you can verify here

N <- length(large_jar)
manu_var <- 
auto_var <- 
auto_var - N * manu_var / (N - 1)

Cumulative distribution function

The cumulative distribution function (or cdf) of a numeric random variable $X$ is defined by: $$ F_X: x \in \mathbb{R} \mapsto \mathbb{P}(X \leq x) $$ Or in plain English, $\text{cdf}(x)$ is the probability that $X$ is lower than or equal to $x$.

This function makes a lot more sense for continuous variables but is still defined for discrete variables on finite populations. Indeed, we can rewrite $\mathbb{P}(X \leq x)$ as $$ F_X(x) = \mathbb{P}(X \leq x) = \frac{1}{N} \sum_{i=1}^N 1_{x_i \leq x} $$ where $1_{x_i \leq x}$ equals $1$ if $x_i \leq x$ and $0$ otherwise.

quiz(caption = "Cumulative distribution function", 
     question("How much is $F_X(4)$?", 
              answer("0"), 
              answer("5/16"),
              answer("5/8"),
              answer("15/16", TRUE),
              answer("1"), 
              allow_retry = TRUE
         ), 
          question("How much is $F_X(0.99)$?", 
              answer("0", TRUE), 
              answer("5/16"),
              answer("5/8"),
              answer("15/16"),
              answer("1"), 
              allow_retry = TRUE
         )
)

Using the tools you learned last lesson, compute $F_Y(3)$.


"Select only values from large_jar that are smaller 3 and count how many there are."
## Maybe start with 
large_jar[large_jar <= 3]
## Maybe start with 
length(large_jar[large_jar <= 3])
"You just need to divide the number of values smaller than 3 by the total number of values"
length(large_jar[large_jar <= 3]) / length(large_jar)

The cumulative distribution of $X$ is a bit weird:

FX <- ecdf(jar)
curve(FX, from = -1, to = 10, n = 10000, xlab = "Size", ylab = expression(F[X]))

It is flat almost everywhere, starts at $0$, jumps from $0$ to $5/8$ at $1$, then again from $5/8$ to $15/16$ at $4$ and finally to $1$ at $x = 8$.

That's essentially because

Quantiles

The quantiles of a random variable $X$ are computed by inverting the cumulative distribution function $F_X$. The quantile $q_\alpha$ of order $\alpha$ satisfies: $$ F_X(q_\alpha^-) \leq \alpha \text{ and } F_X(q_\alpha) \geq \alpha $$ or in terms of probabilities $$ \mathbb{P}(X < q_\alpha) \leq \alpha \text{ and } \mathbb{P}(X \leq q_\alpha) \geq \alpha $$ That definition can unfortunately not be simplified in general. However, for the very important special case of continuous random variables $X$, the quantile $q_\alpha$ of order $\alpha$ (with $\alpha \in [0, 1]$) is defined by the much simpler condition $$ F_X(q_\alpha) = \alpha $$ and quantiles are really found by inverting the cdf. For continuous random variables, $q_\alpha$ has the following simple interpretation: a fraction $\alpha$ of the values of $X$ are lower than or equal to $q_\alpha$.

Let's have a look at the graphical definition of the quantile $q_0.7$ of $X$:

FX <- ecdf(jar)
curve(FX, from = -1, to = 10, n = 10000, xlab = "Size", ylab = expression(F[X]))
alpha <- 0.7
qal <- quantile(jar, alpha)
segments(-1, alpha, qal, alpha, col = "red", lty = 2)
segments(qal, alpha, qal, 0, col = "red", lty = 2)
points(qal, 0, pch = 19, col = "red")

The graph suggests that $4$ is a quantile or order 70\% (or 0.7) of $X$. We can check indeed that: $$ \begin{align} F_X(4^-) & = \frac{5}{8} \leq 0.7 \ F_X(4) & = \frac{15}{16} \geq 0.7 \end{align} $$ And therefore that $4$ is the quantile or order 70\%. However, because $X$ is discrete we can't say 70% of the diamond sizes are smaller than 4. In fact, if you look carefully at the graph (and the definition), you can see that $4$ is the quantile of order 70\% but also of order 90\% and in fact or any order between $\frac{5}{8}$ (= 62.5\%) and $\frac{15}{16}$ (= 93.75\%)

question("What is the quantile of order 50% of the sequence $\\{0, \\dots, 100, \\}$?", 
         answer("1"),
         answer("50", correct = TRUE), 
         answer("51"), 
         answer("100"), 
         allow_retry = TRUE)

You can compute several quantiles of a random variable using the quantile() function.

quantile(x = jar, probs = c(0.7, 0.9))

Compute the quantile of order 50\% (also called the median) of $Y$:


"You should use the quantile function"
quantile(x = , probs = )
quantile(x = large_jar, probs = 0.5)


mahendra-mariadassou/jc2bimStatsTutorials documentation built on June 11, 2024, 7:52 p.m.