$\$

# install.packages("latex2exp")

library(latex2exp)

knitr::opts_chunk$set(echo = TRUE)

set.seed(123)
# download the okcupid profile data again
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

$\$

Overview

$\$

Part 0: The central limit theorem

The central limit theorem (CLT) establishes that, in most situations, when independent random variables are added, their (normalized) sum converges in distribution to a normal distribution, even if the variables themselves are not normally distributed.

Put another way, if we define the average of a random (i.i.d.) sample {$X_1$, $X_2$, ..., $X_n$} of size $n$ as:

$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$

then the CLT tells us that:

$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$

where $\mu$ and $\sigma^{2}$ are the (finite) mean and variance of each $X_i$.

You will explore this more through simulations on homework 2.
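As a small preview, here is a minimal simulation sketch of the statement above. The exponential population, the sample size `n`, and the number of simulations `n_sims` are illustrative assumptions, not part of the class materials.

```r
# Sketch: the CLT for sample means from a skewed (exponential) population
set.seed(1)

n      <- 50     # sample size (illustrative choice)
n_sims <- 5000   # number of simulated samples (illustrative choice)
mu     <- 1      # mean of an Exponential(rate = 1) population
sigma  <- 1      # standard deviation of the same population

# S_n (the sample mean) for each simulated sample
sample_means <- replicate(n_sims, mean(rexp(n, rate = 1)))

# the normalized quantity sqrt(n) * (S_n - mu)
z <- sqrt(n) * (sample_means - mu)

# if the normal approximation is good, mean(z) is near 0 and sd(z) is near sigma
c(mean = mean(z), sd = sd(z))

# the histogram should look roughly bell-shaped even though the population is skewed
hist(z, breaks = 40, main = "sqrt(n) * (S_n - mu)", xlab = "z")
```

Trying smaller values of `n` (say 2 or 5) shows how the skew of the population fades as the sample size grows.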

$\$

Part 1: Confidence intervals

$\$

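Confidence intervals are ranges of values that capture a parameter a fixed proportion of the time. For example, a method that creates 95% confidence intervals produces intervals that contain the parameter for about 95% of random samples.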
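The "proportion of the time" idea can be checked by simulation when the parameter is known. The sketch below assumes a hypothetical normal population with mean `mu` and standard deviation `sigma`, and uses the standard formula-based standard error `sd(x) / sqrt(n)` (not the bootstrap) to build each interval.

```r
# Sketch: how often do "mean +/- 2 SE" intervals capture a known parameter?
set.seed(2)

mu     <- 30    # hypothetical population mean (known only because we simulate)
sigma  <- 10    # hypothetical population standard deviation
n      <- 40    # sample size (illustrative choice)
n_sims <- 2000  # number of simulated samples (illustrative choice)

captured <- logical(n_sims)
for (i in seq_len(n_sims)) {
  x  <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)   # formula-based standard error of the mean
  captured[i] <- (mean(x) - 2 * se <= mu) && (mu <= mean(x) + 2 * se)
}

# proportion of intervals that contain mu; this should be close to 0.95
mean(captured)
```

In practice we only get one sample, so we never know whether our particular interval is one of the intervals that captured the parameter.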

$\$

Part 2: The bootstrap

The bootstrap is a method that can be used to create confidence intervals for a large range of parameters.

The central concept behind the bootstrap is the "plug-in principle", where we treat our sample of data as if it were the population. We then sample with replacement from our sample to create a bootstrap distribution, which is a proxy for (the spread of) the sampling distribution.
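To see why the bootstrap distribution is a proxy for the spread (and not the center) of the sampling distribution, here is a minimal sketch that compares the two. The gamma-distributed `population` is a hypothetical stand-in; with real data we would not have access to the population at all.

```r
# Sketch: sampling distribution vs. bootstrap distribution of the mean
set.seed(3)

population <- rgamma(100000, shape = 2, rate = 0.1)  # hypothetical population
n <- 100                                             # sample size (illustrative)

# sampling distribution: means of many different samples from the population
samp_dist <- replicate(2000, mean(sample(population, n)))

# bootstrap distribution: means of many resamples from ONE sample
one_sample <- sample(population, n)
boot_dist  <- replicate(2000, mean(sample(one_sample, n, replace = TRUE)))

# the two spreads should be similar
c(SE_sampling = sd(samp_dist), SE_boot = sd(boot_dist))

# the centers differ: the bootstrap distribution is centered near the sample
# mean, while the sampling distribution is centered near the population mean
c(population_mean = mean(population),
  sample_mean     = mean(one_sample),
  boot_center     = mean(boot_dist))
```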

$\$

Part 2.1: Creating a bootstrap distribution in R

To sample data in R we can use the sample(the_data) function. To sample data with replacement we use the replace = TRUE argument, i.e., sample(the_data, replace = TRUE).
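A quick sketch of the difference between the two calls, using a small made-up vector (`the_data` here is hypothetical, not the OkCupid data):

```r
# Sketch: sampling without and with replacement
set.seed(4)

the_data <- c(21, 25, 30, 34, 42)  # hypothetical values

# without replacement (the default): a random reordering of the data,
# so every value appears exactly once
shuffled <- sample(the_data)
shuffled

# with replacement: same length as the data, but values can repeat
# and some values may not appear at all
resampled <- sample(the_data, replace = TRUE)
resampled
```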

Below we calculate the bootstrap distribution for mean age of OkCupid users using just the first 20 OkCupid users in the data set.

# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")



# get the ages from the first 20 OkCupid profiles
ages <- profiles$age

ages_sample <- ages[1:20]


# create the bootstrap distribution: 1000 means of resamples drawn with replacement

boot_dist <- numeric(1000)
for (i in 1:1000) {

  boot_sample <- sample(ages_sample, 20, replace = TRUE)
  boot_dist[i] <- mean(boot_sample)

}

# draw one more bootstrap sample and look at its mean (a single bootstrap statistic)
boot_sample <- sample(ages_sample, 20, replace = TRUE)
mean(boot_sample)


# plot the bootstrap distribution to make sure it looks normal
hist(boot_dist)

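# the bootstrap standard error SE* is the standard deviation of the bootstrap distribution
(SE_boot <- sd(boot_dist))

# the sample mean is the center of the confidence interval
(sample_mean <- mean(ages_sample))

# 95% confidence interval: the sample mean plus or minus 2 SE*
CI_upper <- sample_mean + 2 * SE_boot
CI_lower <- sample_mean - 2 * SE_boot

c(CI_lower, CI_upper)

# for comparison: the standard deviation of the sample itself (not a standard error)
sd(ages_sample)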

$\$

Part 2.2: Calculating the bootstrap standard error SE*

The standard deviation of the bootstrap distribution is usually a good approximation of the standard deviation of the sampling distribution, i.e., it is a good approximation of the standard error SE.

When our bootstrap distribution is relatively normal, we can use the fact that about 95% of the values in a normal distribution fall within two standard deviations of the mean to calculate 95% confidence intervals as:

$CI_{95} = [\text{stat} - 2 \cdot SE^*, \text{stat} + 2 \cdot SE^*]$

For example, for our bootstrap distribution, a 95% confidence interval for the mean $\mu$ is:

$CI_{95} = [\bar{x} - 2 \cdot SE^*, \bar{x} + 2 \cdot SE^*]$

# calculate the bootstrap standard error SE* as the standard deviation of the bootstrap distribution 



# calculate the 95% CI using SE*

Above, we are using the bootstrap to create a 95% confidence interval. Intervals created this way should capture the population mean age $\mu_{age}$ about 95% of the time.
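The steps in this part can also be collected into one small function. This is only a sketch: `boot_ci_mean()` is a hypothetical helper name, and `hypothetical_ages` is made-up data standing in for the 20 OkCupid ages, so the block runs without the downloaded file.

```r
# Sketch: a 95% bootstrap confidence interval for the mean using 2 * SE*
# (boot_ci_mean is a hypothetical helper, not a function from any package;
#  it assumes x is a numeric vector with more than one value)
boot_ci_mean <- function(x, n_boot = 1000) {

  # bootstrap distribution: means of resamples drawn with replacement
  boot_dist <- replicate(n_boot, mean(sample(x, length(x), replace = TRUE)))

  # SE* is the standard deviation of the bootstrap distribution
  se_boot <- sd(boot_dist)

  # the interval is centered at the statistic from the original sample
  c(lower = mean(x) - 2 * se_boot, upper = mean(x) + 2 * se_boot)
}

set.seed(5)

# made-up ages, for illustration only
hypothetical_ages <- c(22, 35, 38, 23, 29, 29, 32, 31, 24, 37,
                       35, 28, 24, 30, 29, 39, 33, 26, 27, 31)

ci <- boot_ci_mean(hypothetical_ages)
ci
```

As in the text, the 2 * SE* interval is only appropriate when the bootstrap distribution looks roughly normal, so it is still worth plotting the bootstrap distribution before trusting the interval.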



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.