$\$
# get some data and install a package that is needed

#install.packages("latex2exp")

# download the okcupid data if you don't have it already
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

knitr::opts_chunk$set(echo = TRUE)
set.seed(230)
$\$
For loops are useful when you want to repeat a piece of code many times under similar conditions.
Print the numbers from 1 to 50...
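One possible way to do this is sketched below: the loop variable i takes each value in 1:50 in turn, and print() is called once per iteration.

```r
# print the numbers from 1 to 50, one number per iteration
for (i in 1:50) {
  print(i)
}
```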
$\$
For loops are particularly useful in combination with vectors that can store the results.
Create a vector with the squares of the numbers from 1 to 50.
# create a loop that creates a vector with the squares of the numbers from 1 to 50.
for (i in 1:50) {

}

# plot the results
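A minimal sketch of one way to fill in the loop above. The vector name the_squares is an assumption made for illustration; it is not a name used in the original notes.

```r
# pre-allocate a vector to hold the 50 results
# (the_squares is a hypothetical name chosen for this sketch)
the_squares <- numeric(50)

for (i in 1:50) {
  the_squares[i] <- i^2   # store the square of i in the i-th slot
}

# plot the results
plot(1:50, the_squares, xlab = "i", ylab = "i squared")
```

Pre-allocating with numeric(50) avoids growing the vector on every iteration, although starting from NULL and assigning by index also works.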
$\$
Use a for loop to create a vector called the_results that holds the multiples of 3 from 3 to 300; i.e., the_results should hold the numbers 3, 6, 9, ..., 300.
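One possible solution, as a sketch: there are 100 multiples of 3 between 3 and 300, so the loop runs over 1:100 and stores 3 * i at position i.

```r
# 300 / 3 = 100 values are needed
the_results <- numeric(100)

for (i in 1:100) {
  the_results[i] <- 3 * i   # the i-th multiple of 3
}

# look at the first few values
head(the_results)
```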
$\$
R has built-in functions to generate random data from different distributions. All these functions start with the letter r.
We can set the random number generator seed to always get the same sequence of random numbers.
Let's get a sample of n = 100 random points from the uniform distribution using runif().
# set the seed to a specific number to always get the same sequence of random numbers
set.seed(230)

# generate n = 100 points from U(0, 1) using runif() function

# plot a histogram of these random numbers
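A minimal sketch of how the steps above could be completed. The variable name rand_unif and the number of histogram breaks are assumptions made for illustration.

```r
# set the seed so the same random numbers are generated each time
set.seed(230)

# generate n = 100 points from U(0, 1)
# (rand_unif is a hypothetical name chosen for this sketch)
rand_unif <- runif(100)

# plot a histogram of these random numbers
hist(rand_unif, breaks = 20, main = "100 draws from U(0, 1)", xlab = "value")
```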
There are many other distributions we can get random numbers from including:
rnorm()
runif()
rexp()
And many more!
The first argument to all these functions is the number of random points you want to generate (n), and then there are additional arguments that control the shape of the distribution (i.e., that set the "parameters" of the distribution).
# generate n = 1000 points from standard normal distribution N(0, 1)

# plot a histogram of these random numbers
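As a sketch of the point about parameters: rnorm() takes mean and sd, and rexp() takes rate. The variable names and the choice of rate = 2 below are assumptions made for illustration.

```r
# generate n = 1000 points from the standard normal distribution N(0, 1)
# (mean = 0 and sd = 1 are the defaults, written out here for clarity)
rand_norm <- rnorm(1000, mean = 0, sd = 1)

# plot a histogram of these random numbers
hist(rand_norm, breaks = 30, main = "1000 draws from N(0, 1)", xlab = "value")

# the same pattern works for other distributions, e.g., an exponential
# distribution with rate parameter 2 (an arbitrary choice for this sketch)
rand_exp <- rexp(1000, rate = 2)
hist(rand_exp, breaks = 30, main = "1000 draws from Exp(2)", xlab = "value")
```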
$\$
We can sample n random values from a vector v using the sample(v, n) function.
We can also set the replace argument to TRUE to sample values with replacement; e.g., sample(v, n, replace = TRUE).
Let's create a vector of numbers from 1 to 100 and sample 30 of them randomly (i.e., n = 30).
# set the seed to always get the same results
# in general, best to just do this once at the top of the RMarkdown document
set.seed(230)

# create a vector of values from 1 to 100

# sample 30 random values

# plot the values sorted using the sort() function

# sample 30 random values with replacement

# plot the values sorted using the sort() function
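One way these steps could be filled in, as a sketch. The variable names my_vec, samp_no_replace, and samp_replace are assumptions made for illustration.

```r
set.seed(230)

# create a vector of values from 1 to 100
my_vec <- 1:100

# sample 30 random values (without replacement, the default)
samp_no_replace <- sample(my_vec, 30)

# plot the values sorted using the sort() function
plot(sort(samp_no_replace), main = "Sampled without replacement")

# sample 30 random values with replacement
samp_replace <- sample(my_vec, 30, replace = TRUE)

# plot the values sorted using the sort() function
plot(sort(samp_replace), main = "Sampled with replacement")
```

Without replacement every sampled value is distinct, so the sorted plot is strictly increasing; with replacement, repeated values can show up as flat stretches.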
$\$
A distribution of statistics is called a sampling distribution.
Can you generate and plot an approximate sampling distribution for:

* sample means $\bar{x}$'s
* sample size n = 100
* for data that come from a uniform distribution
Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).
# create a sampling distribution of the mean using data from a uniform distribution
set.seed(67)
sampling_dist <- NULL
for (i in 1:10000) {

}

# plot a histogram of the sampling distribution of these means
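A sketch of one way to complete the loop above: each iteration draws a fresh sample of n = 100 points from U(0, 1) and stores its mean. The name rand_sample is an assumption made for illustration.

```r
# create a sampling distribution of the mean using data from a uniform distribution
set.seed(67)
sampling_dist <- NULL

for (i in 1:10000) {
  rand_sample <- runif(100)               # one sample of size n = 100
  sampling_dist[i] <- mean(rand_sample)   # store the statistic for this sample
}

# plot a histogram of the sampling distribution of these means
hist(sampling_dist, breaks = 50,
     main = "Approximate sampling distribution of the mean",
     xlab = "sample mean")
```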
$\$
The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?
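A minimal sketch, assuming the sampling distribution is rebuilt the same way as in the exercise above: the approximate SE is the standard deviation of the simulated means. For comparison, the sketch also computes $\sigma/\sqrt{n}$, using the fact that U(0, 1) has standard deviation $\sqrt{1/12}$. The names SE_approx and SE_theory are assumptions made for illustration.

```r
# rebuild the sampling distribution of the mean (n = 100, uniform data)
set.seed(67)
sampling_dist <- NULL
for (i in 1:10000) {
  sampling_dist[i] <- mean(runif(100))
}

# the approximate standard error is the standard deviation of the sample means
SE_approx <- sd(sampling_dist)

# value given by sigma / sqrt(n), where sigma = sqrt(1/12) for U(0, 1)
SE_theory <- sqrt(1 / 12) / sqrt(100)

SE_approx
SE_theory
```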
$\$
We can also generate samples from an actual data set using the sample() function.
Let's start by generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.
# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")

# get the heights for the OkCupid data

# get one random sample of heights from 100 people

# get the mean of this sample
$\$
We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.
# repeat the process 1,000 times
sampling_dist <- NULL

# plot a histogram of this sampling distribution
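The two steps above (one sample mean, then many) could be sketched as follows. The helper approx_sampling_dist() is a hypothetical function written for this sketch, and the column name height is an assumption about profiles_revised.csv; check names(profiles) if it does not match.

```r
# hypothetical helper: approximate the sampling distribution of the mean by
# repeatedly drawing n values from `values` and storing each sample mean
approx_sampling_dist <- function(values, n = 100, reps = 1000) {
  values <- values[!is.na(values)]   # drop missing values before sampling
  sampling_dist <- numeric(reps)
  for (i in 1:reps) {
    curr_sample <- sample(values, n)        # one random sample of size n
    sampling_dist[i] <- mean(curr_sample)   # the mean of this sample
  }
  sampling_dist
}

# this part only runs if the data file downloaded earlier is present
if (file.exists("profiles_revised.csv")) {
  profiles <- read.csv("profiles_revised.csv")

  # assumption: the heights are stored in a column named `height`
  heights <- profiles$height

  # repeat the sampling process 1,000 times and plot the result
  sampling_dist <- approx_sampling_dist(heights, n = 100, reps = 1000)
  hist(sampling_dist, breaks = 30,
       main = "Approximate sampling distribution of mean height",
       xlab = "sample mean height")
}
```

Calling approx_sampling_dist(heights, n = 100, reps = 1) returns the mean of a single sample, which corresponds to the one-sample step.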
Question: What would have to be true for this to be an actual sampling distribution?
$\$
The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges to a normal distribution.
Put another way, if we define the average of a random (i.i.d.) sample $\{X_1, X_2, \ldots, X_n\}$ of size n, drawn from a distribution with mean $\mu$ and finite variance $\sigma^2$, as:
$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$
then the CLT tells us that:
$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$
You will explore this more through simulations on homework 2.
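As a small preview, here is a sketch of what such a simulation might look like. The choice of exponential data with rate 1 (so $\mu = 1$ and $\sigma = 1$), the sample size, and all variable names are assumptions made for illustration; they are not taken from the homework.

```r
set.seed(230)

n <- 500       # size of each sample
reps <- 5000   # number of simulated samples
mu <- 1        # mean of an exponential distribution with rate 1

# for each simulated sample, compute sqrt(n) * (S_n - mu)
clt_vals <- numeric(reps)
for (i in 1:reps) {
  S_n <- mean(rexp(n, rate = 1))
  clt_vals[i] <- sqrt(n) * (S_n - mu)
}

# compare the simulated values to the N(0, sigma^2) density with sigma = 1
hist(clt_vals, breaks = 50, freq = FALSE,
     main = "sqrt(n) * (S_n - mu) for exponential data", xlab = "value")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "red")
```

Even though exponential data are strongly skewed, the histogram of the normalized means should sit close to the normal curve.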
$\$