$\$

    # get some data and install a package that is needed

    # install.packages("latex2exp")

    # download the okcupid data if you don't have it already
    download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

    knitr::opts_chunk$set(echo = TRUE)
    set.seed(230)

$\$

For loops are useful when you want to repeat a piece of code many times under similar conditions.

Print the numbers from 1 to 50...
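One possible way to do this, as a minimal sketch: the loop variable `i` takes each value in `1:50` in turn, and the body prints it.

```r
# loop over the integers 1 through 50 and print each one
for (i in 1:50) {
  print(i)
}
```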

$\$

For loops are particularly useful in combination with vectors that can store the results.

Create a vector with the squares of the numbers from 1 to 50.

    # create a loop that creates a vector with the squares of the numbers from 1 to 50.
    for (i in 1:50) {

    }

    # plot the results
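One way the exercise could be filled in, as a sketch rather than the only answer. The vector name `squares` is an assumption made for illustration; it does not come from the original notes.

```r
# pre-allocate a numeric vector of length 50 to hold the results
# (the name `squares` is a hypothetical choice)
squares <- numeric(50)

# store the square of i in the i-th slot of the vector
for (i in 1:50) {
  squares[i] <- i^2
}

# plot the results
plot(squares, type = "o", xlab = "i", ylab = "i squared")
```

Pre-allocating with `numeric(50)` avoids growing the vector on every iteration, although starting from `NULL` would also work for a loop this small.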

$\$

Use a for loop to create a vector called `the_results` that holds the multiples of 3 from 3 to 300; i.e., `the_results` should hold the numbers 3, 6, 9, ..., 300.
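A possible solution sketch: there are 100 multiples of 3 between 3 and 300, so the loop can run over 1 to 100 and store `3 * i` at position `i`.

```r
# there are 300 / 3 = 100 multiples of 3 between 3 and 300
the_results <- numeric(100)

# the i-th multiple of 3 is 3 * i
for (i in 1:100) {
  the_results[i] <- 3 * i
}

# look at the first and last few values
head(the_results)
tail(the_results)
```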

$\$

R has built-in functions to generate data from different distributions. All these functions start with the letter `r`.

We can set the random number generator **seed** to always get the same sequence of random numbers.

Let's get a sample of n = 200 random points from the uniform distribution using `runif()`.

    # set the seed to a specific number to always get the same sequence of random numbers
    set.seed(230)

    # generate n = 200 points from U(0, 1) using the runif() function

    # plot a histogram of these random numbers
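As a sketch of how this step might look, using n = 200 as stated in the text. The variable name `rand_unif` and the histogram settings are assumptions chosen for illustration.

```r
# set the seed so the same random numbers are generated each time
set.seed(230)

# generate 200 points from U(0, 1); min = 0 and max = 1 are runif()'s defaults
# (the name `rand_unif` is a hypothetical choice)
rand_unif <- runif(200)

# plot a histogram of these random numbers
hist(rand_unif, breaks = 20,
     main = "200 draws from U(0, 1)", xlab = "value")
```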

There are many other distributions we can get random numbers from including:

- Normal distributions: `rnorm()`
- Uniform distributions: `runif()`
- Exponential distributions: `rexp()`

And many more!

The first argument to all these functions is the number of random points you want to generate (`n`), and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution).

    # generate n = 1000 points from standard normal distribution N(0, 1)

    # plot a histogram of these random numbers
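A minimal sketch of this step. The `mean` and `sd` arguments are the parameters of the normal distribution; they are written out here even though 0 and 1 are the defaults. The name `rand_norm` is an assumption for illustration.

```r
set.seed(230)

# generate 1000 points from N(0, 1); mean and sd are the distribution's parameters
# (the name `rand_norm` is a hypothetical choice)
rand_norm <- rnorm(1000, mean = 0, sd = 1)

# plot a histogram of these random numbers
hist(rand_norm, breaks = 30,
     main = "1000 draws from N(0, 1)", xlab = "value")
```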

$\$

We can sample `n` random values from a vector `v` using the `sample(v, n)` function. By default, the values are sampled without replacement.

We can also set the `replace` argument to `TRUE` to sample values with replacement; e.g., `sample(v, n, replace = TRUE)`.

Let's create a vector of numbers from 1 to 100 and sample 30 of them randomly (i.e., n = 30).

    # set the seed to always get the same results
    # in general, best to just do this once at the top of the RMarkdown document
    set.seed(230)

    # create a vector of values from 1 to 100

    # sample 30 random values

    # plot the values sorted using the sort() function

    # sample 30 random values with replacement

    # plot the values sorted using the sort() function
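One way these steps could be written, as a sketch. The names `samp_no_replace` and `samp_replace` are assumptions for illustration.

```r
set.seed(230)

# create a vector of values from 1 to 100
v <- 1:100

# sample 30 random values (without replacement, so no value can repeat)
samp_no_replace <- sample(v, 30)

# plot the values sorted using the sort() function
plot(sort(samp_no_replace), main = "Without replacement")

# sample 30 random values with replacement (values may repeat)
samp_replace <- sample(v, 30, replace = TRUE)

# plot the values sorted using the sort() function
plot(sort(samp_replace), main = "With replacement")
```

In the sorted plot for the sample drawn with replacement, repeated values would show up as flat steps, which cannot occur in the sample drawn without replacement.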

$\$

A distribution of statistics is called a **sampling distribution**.

Can you generate and plot an approximate sampling distribution for:

- sample means $\bar{x}$
- a sample size of n = 100
- data that come from a uniform distribution

Note the shape of the *sampling distribution* can be quite different from the shape of the data distribution (which is uniform here).

    # create a sampling distribution of the mean using data from a uniform distribution
    set.seed(67)
    sampling_dist <- NULL
    for (i in 1:10000) {

    }

    # plot a histogram of the sampling distribution of these means
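A possible way to fill in the loop, as a sketch: each iteration draws a fresh sample of 100 points from U(0, 1) and stores its mean. The name `rand_sample` is an assumption for illustration.

```r
set.seed(67)
sampling_dist <- NULL

for (i in 1:10000) {
  # draw one sample of n = 100 points from U(0, 1)
  rand_sample <- runif(100)
  # store the mean of this sample as the i-th statistic
  sampling_dist[i] <- mean(rand_sample)
}

# plot a histogram of the sampling distribution of these means
hist(sampling_dist, breaks = 50,
     main = "Approximate sampling distribution of the mean",
     xlab = "sample mean")
```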

$\$

The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?
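As a sketch, the approximate SE is just `sd()` applied to the simulated means. The block below rebuilds the sampling distribution so it can run on its own; the names `SE_approx` and `SE_theory` are assumptions for illustration. For comparison, U(0, 1) has standard deviation $\sqrt{1/12}$, so the theoretical SE of the mean for n = 100 is $\sqrt{1/12}/\sqrt{100} \approx 0.029$.

```r
set.seed(67)

# rebuild the sampling distribution of the mean (n = 100, U(0, 1) data)
sampling_dist <- NULL
for (i in 1:10000) {
  sampling_dist[i] <- mean(runif(100))
}

# the approximate standard error is the standard deviation of the sample means
SE_approx <- sd(sampling_dist)

# theoretical value for comparison: sd of U(0, 1) divided by sqrt(n)
SE_theory <- sqrt(1 / 12) / sqrt(100)

SE_approx
SE_theory
```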

$\$

We can also generate samples from an actual data set we have using the `sample()` function.

Let's start by generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.

    # read in the okcupid data
    profiles <- read.csv("profiles_revised.csv")

    # get the heights for the OkCupid data

    # get one random sample of heights from 100 people

    # get the mean of this sample
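A sketch of how this might look. Two things here are assumptions, not facts from the notes: that the heights are stored in a column named `height`, and that missing values (if any) should be dropped when taking the mean. The helper name `get_sample_mean` is also hypothetical.

```r
# hypothetical helper: draw one random sample of size n and return its mean
get_sample_mean <- function(values, n = 100) {
  one_sample <- sample(values, n)
  # na.rm = TRUE is an assumption in case some values are missing
  mean(one_sample, na.rm = TRUE)
}

# only run on the real data if the file has been downloaded
if (file.exists("profiles_revised.csv")) {
  profiles <- read.csv("profiles_revised.csv")

  # ASSUMPTION: the heights are in a column called `height`
  heights <- profiles$height

  # mean height of one random sample of 100 people
  print(get_sample_mean(heights, n = 100))
}
```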

$\$

We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.

    # repeat the process 1,000 times
    sampling_dist <- NULL

    # plot a histogram of this sampling distribution
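One way the repeated sampling could be sketched. As before, the column name `height` and the use of `na.rm = TRUE` are assumptions, and the helper name `approx_sampling_dist` is hypothetical.

```r
# hypothetical helper: repeat "sample n values, take the mean" reps times
approx_sampling_dist <- function(values, n = 100, reps = 1000) {
  sampling_dist <- NULL
  for (i in 1:reps) {
    curr_sample <- sample(values, n)
    # na.rm = TRUE is an assumption in case some values are missing
    sampling_dist[i] <- mean(curr_sample, na.rm = TRUE)
  }
  sampling_dist
}

# only run on the real data if the file has been downloaded
if (file.exists("profiles_revised.csv")) {
  profiles <- read.csv("profiles_revised.csv")

  # ASSUMPTION: the heights are in a column called `height`
  sampling_dist <- approx_sampling_dist(profiles$height, n = 100, reps = 1000)

  # plot a histogram of this sampling distribution
  hist(sampling_dist, breaks = 30,
       main = "Approximate sampling distribution of mean height",
       xlab = "sample mean height")
}
```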

**Question:** What would have to be true for this to be an actual sampling distribution?

$\$

The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges to a normal distribution.

Put another way, if we define the average of a random (i.i.d.) sample $\{X_1, X_2, \ldots, X_n\}$ of size *n* as:

$S_{n} := \frac{X_{1}+\cdots +X_{n}}{n}$

then the CLT tells us that:

$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$

where $\mu$ and $\sigma^{2}$ are the mean and variance of each $X_i$.

You will explore this more through simulations on homework 2.
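As a small illustration of the statement above (a sketch, not the homework solution), the simulation below uses exponential data with rate 1, for which $\mu = 1$ and $\sigma = 1$. The choice of distribution, the sample size, and the names `clt_vals` and `curr_sample` are all assumptions made for this example.

```r
set.seed(230)

n <- 100     # sample size
mu <- 1      # mean of an Exponential(rate = 1) distribution
sigma <- 1   # standard deviation of an Exponential(rate = 1) distribution

clt_vals <- NULL
for (i in 1:10000) {
  curr_sample <- rexp(n, rate = 1)
  # the normalized quantity sqrt(n) * (S_n - mu) from the CLT statement
  clt_vals[i] <- sqrt(n) * (mean(curr_sample) - mu)
}

# histogram on the density scale, with the N(0, sigma^2) density overlaid
hist(clt_vals, breaks = 50, freq = FALSE,
     main = "sqrt(n) * (S_n - mu) for exponential data",
     xlab = "value")
curve(dnorm(x, mean = 0, sd = sigma), add = TRUE, col = "red")
```

Even though the exponential distribution is strongly skewed, the histogram should look close to the overlaid normal curve for n = 100.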

$\$
