In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

# install.packages("latex2exp")

library(latex2exp)

knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

# download the okcupid profile data again
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

$\$

Overview

The central limit theorem
Confidence intervals
The bootstrap
Using the bootstrap to calculate confidence intervals in R

$\$

Part 0: The central limit theorem

The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges to a normal distribution.

Put another way, if we define the average of a random (i.i.d.) sample {$X_1$, $X_2$, ..., $X_n$} of size n as:

$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$

then the CLT tells us that:

$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$

where $\mu$ and $\sigma^{2}$ are the mean and variance of each $X_i$.

You will explore this more through simulations on homework 2.
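As a small preview, here is a minimal simulation sketch of the statement above. The choice of an Exponential(1) population (so $\mu = 1$ and $\sigma = 1$), the sample size `n = 50`, and the names `n_sims`, `sample_means`, and `z` are all assumptions made for illustration.

```r
# Minimal CLT sketch: averages of skewed (exponential) data look normal.
# Assumed population: Exponential(rate = 1), so mu = 1 and sigma = 1.
set.seed(123)

n      <- 50      # size of each random sample (illustrative choice)
n_sims <- 10000   # number of simulated samples (illustrative choice)
mu     <- 1       # mean of an Exponential(1) random variable
sigma  <- 1       # standard deviation of an Exponential(1) random variable

# S_n for each simulated sample
sample_means <- replicate(n_sims, mean(rexp(n, rate = 1)))

# the normalized quantity sqrt(n) * (S_n - mu) from the CLT statement
z <- sqrt(n) * (sample_means - mu)

# these should be close to 0 and sigma, respectively
mean(z)
sd(z)

# compare the histogram of z to the N(0, sigma^2) density
hist(z, breaks = 50, freq = FALSE,
     main = "sqrt(n) * (S_n - mu)", xlab = "z")
curve(dnorm(x, mean = 0, sd = sigma), add = TRUE, col = "red")
```

Even though individual exponential values are strongly right-skewed, the histogram of `z` should sit close to the red normal curve.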

$\$

Part 1: Confidence intervals

$\$

Confidence intervals are ranges of values that capture a parameter a fixed proportion of the time.
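The "fixed proportion of the time" idea can be checked by simulation. The sketch below assumes a normal population with a known mean (`mu = 10`) and uses the t-interval returned by `t.test()` purely as a convenient example of a 95% interval; the population, the interval method, and all variable names are illustrative assumptions, not part of the class material.

```r
# Coverage sketch: how often does a 95% interval capture the true mean?
set.seed(123)

mu     <- 10     # true population mean (known only because we simulate)
sigma  <- 2      # true population standard deviation
n      <- 30     # size of each sample
n_sims <- 5000   # number of repeated samples

covered <- replicate(n_sims, {
  x  <- rnorm(n, mean = mu, sd = sigma)        # draw a new sample
  ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% t-interval for the mean
  ci[1] <= mu && mu <= ci[2]                   # did the interval capture mu?
})

# proportion of intervals that captured mu; should be close to 0.95
mean(covered)
```

Any single interval either contains $\mu$ or it does not; the 95% describes how the procedure behaves over many repeated samples.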

$\$

Part 2: The bootstrap

The bootstrap is a method that can be used to create confidence intervals for a large range of parameters.

The central concept behind the bootstrap is the "plug-in principle", where we treat our sample of data as if it were the population. We then sample with replacement from our sample to create a bootstrap distribution, which is a proxy for (the spread of) the sampling distribution.

$\$

Part 2.1: Creating a bootstrap distribution in R

To sample data in R we can use the sample(the_data) function. To sample data with replacement we use the replace = TRUE argument, i.e., sample(the_data, replace = TRUE).
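To make the difference concrete, here is a short sketch on a tiny vector. The values in `the_data` are made up for illustration and are not taken from the OkCupid data.

```r
# Hypothetical toy data (not from the OkCupid data set)
the_data <- c(21, 25, 30, 34, 42)

set.seed(123)

# without replacement: a random reordering, each value appears exactly once
shuffled <- sample(the_data)
shuffled

# with replacement: same length, but values can repeat and others can be missing
resampled <- sample(the_data, replace = TRUE)
resampled
```

One caveat worth knowing: if `sample()` is given a single number `k >= 1` rather than a longer vector, it samples from `1:k` instead, so this pattern is safest on vectors with more than one element.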

Below we calculate the bootstrap distribution for the mean age of OkCupid users using just the first 20 OkCupid users in the data set.

# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")



# get the ages from the first 20 OkCupid profiles
ages <- profiles$age

ages_sample <- ages[1:20]


# create the bootstrap distribution






# plot the bootstrap distribution to make sure it looks normal




# calculate the standard error
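The steps left blank in the chunk above could be filled in along the following lines. This is one possible sketch, not necessarily the approach used in class: the names `n_boot`, `boot_means`, and `SE_boot` and the choice of 10,000 resamples are assumptions. So that the sketch runs on its own, it falls back to simulated stand-in ages (not the real OkCupid values) when `ages_sample` has not been created yet.

```r
# Use ages_sample from the chunk above if it exists; otherwise fall back to
# simulated stand-in ages (NOT the real OkCupid data) so this runs on its own.
if (!exists("ages_sample")) {
  set.seed(123)
  ages_sample <- sample(18:60, 20, replace = TRUE)
}

# create the bootstrap distribution
n_boot     <- 10000             # number of bootstrap resamples (assumed)
boot_means <- numeric(n_boot)   # storage for the bootstrap statistics

for (i in seq_len(n_boot)) {
  boot_sample   <- sample(ages_sample, replace = TRUE)  # resample the sample
  boot_means[i] <- mean(boot_sample)                    # statistic of interest
}

# plot the bootstrap distribution to check that it looks roughly normal
hist(boot_means, breaks = 50,
     main = "Bootstrap distribution of the mean age",
     xlab = "Bootstrap sample mean")

# calculate the standard error as the sd of the bootstrap distribution
SE_boot <- sd(boot_means)
SE_boot
```

Each bootstrap sample has the same size as the original sample (20 here), which is what lets the spread of `boot_means` stand in for the spread of the sampling distribution.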

$\$

Part 2.2: Calculating the bootstrap standard error SE*

The standard deviation of the bootstrap distribution is usually a good approximation of the standard deviation of the sampling distribution, i.e., it is a good approximation of the standard error SE. We denote this bootstrap estimate SE*.

When our bootstrap distribution is relatively normal, we can use the fact that about 95% of the values in a normal distribution fall within two standard deviations of its mean to calculate 95% confidence intervals as:

$CI_{95} = [stat - 2 \cdot SE^*, stat + 2 \cdot SE^*]$

For example, for our bootstrap distribution we have a 95% confidence interval for the mean $\mu$ as:

$CI_{95} = [\bar{x} - 2 \cdot SE^*, \bar{x} + 2 \cdot SE^*]$

# calculate the bootstrap standard error SE* as the standard deviation of the bootstrap distribution 



# calculate the 95% CI using SE*

Above we are using the bootstrap to create a 95% confidence interval. Intervals constructed this way should capture the mean age $\mu_{age}$ about 95% of the time.
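A minimal sketch of the SE* and confidence interval calculation is given here. The names `boot_means`, `SE_boot`, `x_bar`, and `CI_95` and the use of `replicate()` with 10,000 resamples are assumptions, and the sketch falls back to simulated stand-in ages (not the real OkCupid values) if `ages_sample` is not already defined.

```r
set.seed(123)

# Use ages_sample from Part 2.1 if it exists; otherwise use simulated
# stand-in ages (NOT the real OkCupid data) so this runs on its own.
if (!exists("ages_sample")) {
  ages_sample <- sample(18:60, 20, replace = TRUE)
}

# bootstrap distribution of the sample mean (10,000 resamples, assumed)
boot_means <- replicate(10000, mean(sample(ages_sample, replace = TRUE)))

# calculate the bootstrap standard error SE* as the standard deviation
# of the bootstrap distribution
SE_boot <- sd(boot_means)

# calculate the 95% CI using SE*: [x_bar - 2 * SE*, x_bar + 2 * SE*]
x_bar <- mean(ages_sample)
CI_95 <- c(x_bar - 2 * SE_boot, x_bar + 2 * SE_boot)
CI_95
```

The interval is centered on the original sample mean `x_bar`, not on the mean of the bootstrap distribution; the bootstrap is used only to estimate the spread.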

emeyers/SDS230 documentation built on Feb. 6, 2025, 4:55 p.m.
