In emeyers/SDS100: Tools for the class Introductory Statistics

knitr::opts_chunk$set(echo = TRUE)

SDS100::download_data("gapminder_2007.Rda")

$\$

Load and visualize the data

library(SDS100)


# Get the gapminder data

load("gapminder_2007.Rda")

lifeExp <- gapminder_2007$lifeExp 


# Visualize the data
hist(lifeExp,
     xlab = "Life Expectancy",
     main = "Life Expectancy in different countries")

$\$

Sample the data and calculate a statistic

# We can use sample(data_vec, n) to draw a random sample of size n (without replacement):
curr_sample <- sample(lifeExp, 10)


# How can we get x-bar from this sample in R? 
mean(curr_sample)

$\$

Sampling distributions

Q: How can we get a full sampling distribution?

A: Repeat the sample-and-calculate step many times to approximate the sampling distribution. If we store each x-bar in a vector, we can then plot that vector as a histogram.

$\$

# we can repeat a process many times using the SDS100 do_it() function

do_it(100)  *  {

  2 + 3

}
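For readers without SDS100 installed, base R's `replicate()` plays a roughly similar role: it evaluates an expression a given number of times and collects the results. The sketch below is not the SDS100 implementation, and `fake_data` is a hypothetical stand-in for a data vector such as `lifeExp`.

```r
# Base R sketch: replicate(n, expr) evaluates expr n times and collects the results.
set.seed(1)  # make the random draws reproducible

# Hypothetical stand-in for a data vector such as lifeExp (made-up values)
fake_data <- c(43.5, 52.9, 59.4, 65.5, 70.2, 72.4, 74.1, 76.4, 78.7, 80.9, 82.6)

# Evaluate 2 + 3 one hundred times
results <- replicate(100, 2 + 3)
length(results)
unique(results)

# Repeat "draw a sample of size 5 and take its mean" one hundred times
x_bars <- replicate(100, mean(sample(fake_data, 5)))
head(x_bars)
```

Because `replicate()` returns a plain numeric vector here, the result can be passed directly to `hist()`, `mean()`, or `sd()`.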

$\$

Create a sampling distribution

# Let's create a sampling distribution in R

sampling_dist <- do_it(10000)  *  {

  curr_sample <- sample(lifeExp, 10)  
  mean(curr_sample)

}

hist(sampling_dist)


# these are approximately equal, which suggests x-bar is an unbiased estimate of mu
mean(sampling_dist)
mean(lifeExp)     

abline(v = mean(sampling_dist), col = "blue")  # add a vertical line at the mean of the sampling distribution
abline(v = mean(lifeExp), col = "red")  # add a vertical line at mu


# calculate the standard error
sd(sampling_dist)
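As a rough cross-check on a simulated standard error, sampling theory gives SE = (s / sqrt(n)) * sqrt(1 - n / N) for the mean of a simple random sample of size n drawn without replacement from a finite population of size N, where s is the `sd()` of the population. The sketch below assumes a simulated population `pop` (hypothetical values, not the gapminder data) and uses base R's `replicate()` in place of `do_it()`.

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical population of 150 values (an assumption, not the gapminder data)
pop <- rnorm(150, mean = 67, sd = 12)
N <- length(pop)
n <- 10

# Simulated sampling distribution of the sample mean
sim_dist <- replicate(10000, mean(sample(pop, n)))

# Standard error estimated from the simulation
se_sim <- sd(sim_dist)

# Theoretical standard error with the finite population correction
se_theory <- sd(pop) / sqrt(n) * sqrt(1 - n / N)

c(simulated = se_sim, theoretical = se_theory)
```

The two values should agree closely; any remaining gap comes from using a finite number of simulated samples.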

$\$

Changing the sample size

What happens to the sampling distribution as we change n? Experiment with n = 1, 5, 10, and 20.

$\$

sampling_dist <- do_it(10000)  *  {

  curr_sample <- sample(lifeExp, 20)  
  mean(curr_sample)  

}

hist(sampling_dist, breaks = 100)

abline(v = mean(lifeExp), col = "red")  # add a vertical line at mu

# calculate the standard error
sd(sampling_dist)
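Rather than editing n by hand for each run, the experiment can be sketched as a loop over the sample sizes. This is a minimal illustration, assuming a simulated population `pop` (hypothetical values, not the gapminder data) and base R's `replicate()` in place of `do_it()`.

```r
set.seed(7)  # make the simulation reproducible

# Hypothetical population (an assumption, not the gapminder data)
pop <- rnorm(150, mean = 67, sd = 12)

sample_sizes <- c(1, 5, 10, 20)

# For each n, build a sampling distribution of x-bar and record its standard error
se_by_n <- sapply(sample_sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})
names(se_by_n) <- paste0("n = ", sample_sizes)

se_by_n
```

The standard error shrinks as n grows, so the histogram of the sampling distribution becomes narrower around mu.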