$\$
# get some data and install a package that is needed
#install.packages("latex2exp")

# download the okcupid data if you don't have it already
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")
knitr::opts_chunk$set(echo = TRUE)
set.seed(230)
$\$
For loops are useful when you want to repeat a piece of code many times under similar conditions.
Print the numbers from 1 to 50...
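One possible way to do this, as a minimal sketch: the loop variable `i` takes each value in `1:50` in turn, and `print()` displays it.

```r
# print the numbers from 1 to 50, one per line
for (i in 1:50) {
  print(i)
}
```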
$\$
For loops are particularly useful in combination with vectors that can store the results.
Create a vector with the squares of the numbers from 1 to 50.
# create a loop that creates a vector with the squares of the numbers from 1 to 50.
for (i in 1:50) {

}

# plot the results
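One way the loop body might be filled in is sketched below. The vector name `the_squares` is a hypothetical choice, not something the exercise prescribes.

```r
# pre-allocate a numeric vector to hold the 50 results
# (the name the_squares is hypothetical)
the_squares <- numeric(50)

for (i in 1:50) {
  # store the square of i in the i-th slot
  the_squares[i] <- i^2
}

# plot the results
plot(1:50, the_squares, xlab = "i", ylab = "i squared")
```

Pre-allocating with `numeric(50)` avoids growing the vector on every iteration, although starting from `NULL` also works for a loop this small.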
$\$
Use a for loop to create a vector called the_results that holds the values at multiples of 3 from 3 to 300; i.e., the_results should hold the numbers 3, 6, 9, ..., 300
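A possible solution sketch: loop over the indices 1 to 100 and store 3 times the index at each position.

```r
# start with an empty vector and fill it in one element at a time
the_results <- NULL

for (i in 1:100) {
  # the i-th multiple of 3
  the_results[i] <- 3 * i
}

# look at the first few values
head(the_results)
```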
$\$
R has built-in functions to generate random data from different distributions. All these functions start with the letter r.
We can set the random number generator seed to always get the same sequence of random numbers.
Let's get a sample of n = 100 random points from the uniform distribution using runif().
# set the seed to a specific number to always get the same sequence of random numbers
set.seed(230)

# generate n = 100 points from U(0, 1) using runif() function

# plot a histogram of these random numbers
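The missing steps could look like the sketch below, which follows the chunk's comment in using n = 100. The name `rand_unif` and the choice of 20 histogram bins are assumptions made for illustration.

```r
# set the seed so the same random numbers are generated each time
set.seed(230)

# generate n = 100 points from U(0, 1); rand_unif is a hypothetical name
rand_unif <- runif(100)

# plot a histogram of these random numbers
hist(rand_unif, breaks = 20, main = "100 draws from U(0, 1)", xlab = "value")
```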
There are many other distributions we can get random numbers from including:

* rnorm()
* runif()
* rexp()

And many more!
The first argument to all these functions is the number of random points you want to generate (n), and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution).
# generate n = 1000 points from standard normal distribution N(0, 1)

# plot a histogram of these random numbers
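As a sketch of how the first argument and the parameter arguments fit together: `rnorm()` defaults to `mean = 0` and `sd = 1`, so leaving those out gives the standard normal. The variable names, the seed, and the second example with `mean = 10, sd = 2` are illustrative choices.

```r
set.seed(230)

# n = 1000 points from N(0, 1); mean = 0 and sd = 1 are the defaults
rand_norm <- rnorm(1000)

# plot a histogram of these random numbers
hist(rand_norm, breaks = 30, main = "1000 draws from N(0, 1)", xlab = "value")

# the extra arguments set the parameters of the distribution,
# e.g., a normal distribution centered at 10 with standard deviation 2
rand_shifted <- rnorm(1000, mean = 10, sd = 2)
hist(rand_shifted, breaks = 30, main = "1000 draws from N(10, 2^2)", xlab = "value")
```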
$\$
We can sample n random values from a vector v using the sample(v, n) function. By default, values are sampled without replacement, so no element of v is chosen more than once.
Setting the replace argument to TRUE samples values with replacement; e.g., sample(v, n, replace = TRUE).
Let's create a vector of numbers from 1 to 100 and sample 30 of them randomly (i.e., n = 30).
# set the seed to always get the same results
# in general, best to just do this once at the top of the RMarkdown document
set.seed(230)

# create a vector of values from 1 to 100

# sample 30 random values

# plot the values sorted using the sort() function

# sample 30 random values with replacement

# plot the values sorted using the sort() function
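One way to complete these steps is sketched below. The names `samp_no_replace` and `samp_replace` are hypothetical.

```r
set.seed(230)

# create a vector of values from 1 to 100
v <- 1:100

# sample 30 random values (without replacement, the default)
samp_no_replace <- sample(v, 30)

# plot the values sorted using the sort() function
plot(sort(samp_no_replace), ylab = "sampled value")

# sample 30 random values with replacement
samp_replace <- sample(v, 30, replace = TRUE)

# plot the values sorted; repeated values show up as flat stretches
plot(sort(samp_replace), ylab = "sampled value")
```

Without replacement, every sampled value is distinct; with replacement, the same value can appear more than once.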
$\$
A distribution of statistics is called a sampling distribution.
Can you generate and plot an approximate sampling distribution for:

* sample means $\bar{x}$'s
* sample size n = 100
* data that come from a uniform distribution
Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).
# create a sampling distribution of the mean using data from a uniform distribution
set.seed(67)
sampling_dist <- NULL
for (i in 1:10000) {

}

# plot a histogram of the sampling distribution of these means
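A possible way to fill in the loop, as a sketch: on each iteration draw a fresh sample of 100 uniform values and store its mean. The name `rand_sample` and the number of histogram bins are illustrative choices.

```r
set.seed(67)

sampling_dist <- NULL
for (i in 1:10000) {
  # draw one sample of size n = 100 from U(0, 1)
  rand_sample <- runif(100)
  # store the mean of this sample
  sampling_dist[i] <- mean(rand_sample)
}

# plot a histogram of the sampling distribution of these means
hist(sampling_dist, breaks = 50, main = "Sampling distribution of the mean", xlab = "sample mean")
```

The histogram should look roughly bell-shaped and centered near 0.5, even though each individual sample comes from a flat distribution.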
$\$
The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?
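As a sketch, the standard error can be approximated by taking `sd()` of the simulated sample means. The code below rebuilds the sampling distribution so that it runs on its own, and the names `SE_approx` and `SE_theory` are hypothetical. For comparison, U(0, 1) has standard deviation $\sqrt{1/12}$, so the theoretical standard error of the mean for n = 100 is $\sqrt{1/12}/\sqrt{100}$, about 0.029.

```r
set.seed(67)

# rebuild the sampling distribution of the mean (n = 100, uniform data)
sampling_dist <- NULL
for (i in 1:10000) {
  sampling_dist[i] <- mean(runif(100))
}

# approximate standard error: the standard deviation of the sample means
SE_approx <- sd(sampling_dist)

# theoretical standard error: sigma / sqrt(n), with sigma = sqrt(1/12) for U(0, 1)
SE_theory <- sqrt(1/12) / sqrt(100)

SE_approx
SE_theory
```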
$\$
We can also generate samples from an actual data set using the sample() function.
Let's start by generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.
# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")

# get the heights for the OkCupid data

# get one random sample of heights from 100 people

# get the mean of this sample
$\$
We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.
# repeat the process 1,000 times
sampling_dist <- NULL

# plot a histogram of this sampling distribution
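The single sample and the repeated sampling could be sketched as follows. Two assumptions are made here: the heights are taken to live in a column called `height`, and if the CSV file is not present the code falls back to simulated stand-in values so that it still runs. The stand-in values are not the OkCupid data.

```r
set.seed(230)

if (file.exists("profiles_revised.csv")) {
  profiles <- read.csv("profiles_revised.csv")
  # assumption: the heights are stored in a column named height
  heights <- profiles$height
} else {
  # simulated stand-in values (NOT the real data) so the sketch runs without the file
  heights <- rnorm(5000, mean = 68, sd = 4)
}

# drop any missing values before sampling
heights <- heights[!is.na(heights)]

# get one random sample of heights from 100 people, and its mean
height_sample <- sample(heights, 100)
mean(height_sample)

# repeat the process 1,000 times, storing the mean of each sample
sampling_dist <- NULL
for (i in 1:1000) {
  sampling_dist[i] <- mean(sample(heights, 100))
}

# plot a histogram of this sampling distribution
hist(sampling_dist, breaks = 30, main = "Approximate sampling distribution", xlab = "mean height")
```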
Question: What would have to be true for this to be an actual sampling distribution?
$\$
The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges in distribution to a normal distribution.
Put another way, if we define the average of a random (i.i.d.) sample $\{X_1, X_2, \ldots, X_n\}$ of size n as:
$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$
then the CLT tells us that, if each $X_i$ has mean $\mu$ and finite variance $\sigma^{2}$:
$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$
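A small simulation can make this statement concrete. The sketch below uses exponential data with rate 1 (an illustrative choice, for which $\mu = 1$ and $\sigma = 1$), computes $\sqrt{n}(S_n - \mu)$ many times, and overlays the $N(0, \sigma^2)$ density on the histogram. The sample size, number of repetitions, and variable names are all assumptions made for illustration.

```r
set.seed(230)

n <- 100      # size of each sample
mu <- 1       # mean of Exp(rate = 1)
sigma <- 1    # standard deviation of Exp(rate = 1)

normalized_means <- NULL
for (i in 1:5000) {
  # S_n: the average of an i.i.d. sample of size n
  S_n <- mean(rexp(n, rate = 1))
  # the centered and scaled quantity from the CLT
  normalized_means[i] <- sqrt(n) * (S_n - mu)
}

# histogram on the density scale, with the N(0, sigma^2) density overlaid
hist(normalized_means, breaks = 50, freq = FALSE,
     main = "sqrt(n) * (S_n - mu) for exponential data", xlab = "value")
curve(dnorm(x, mean = 0, sd = sigma), add = TRUE)
```

Even though exponential data are strongly skewed, the histogram should sit close to the normal curve.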
You will explore this more through simulations on homework 2.
$\$