$\$

    # get some data and install a package that is needed

    # install.packages("latex2exp")

    # download the okcupid data if you don't have it already
    download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

    knitr::opts_chunk$set(echo = TRUE)
    set.seed(230)
$\$
For loops are useful when you want to repeat a piece of code many times under similar conditions.
Print the numbers from 1 to 50...

    for (i in 1:50) {
      print(i)
    }
$\$
For loops are particularly useful in combination with vectors that can store the results.
Create a vector with the squares of the numbers from 1 to 50.

    # create a loop that creates a vector with the squares of the numbers from 1 to 50
    the_results <- NULL
    for (i in 1:50) {
      the_results[i] <- i^2
    }

    # plot the results
    plot(the_results, type = "o", xlab = "x", ylab = "x^2")
$\$
Use a for loop to create a vector called `the_results` that holds the values at multiples of 3 from 3 to 300; i.e., `the_results` should hold the numbers 3, 6, 9, ..., 300.
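One possible solution to the exercise above, as a minimal sketch: since 300 = 3 × 100, the loop can run over the positions 1 to 100 and store `3 * i` at position `i`. Other approaches (for example, looping over `seq(3, 300, by = 3)` with a separate counter) would work equally well.

```r
# one possible solution: position i holds the i-th multiple of 3
the_results <- NULL
for (i in 1:100) {
  the_results[i] <- 3 * i
}

# inspect the first few values, the last value, and the length
head(the_results)
the_results[100]
length(the_results)
```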
$\$
R has built-in functions to generate data from different distributions. All these functions start with the letter `r`.
We can set the random number generator seed to always get the same sequence of random numbers.
Let's get a sample of n = 200 random points from the uniform distribution using `runif()`.

    # set the seed to a specific number to always get the same sequence of random numbers
    set.seed(530)

    # generate n = 200 points from U(0, 1) using the runif() function
    rand_data <- runif(200)

    # plot a histogram of these random numbers
    hist(rand_data)
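To see what setting the seed does, the short sketch below draws from `runif()` twice after resetting the same seed, and then once more without resetting it. The variable names (`first_draw`, `second_draw`, `third_draw`) are just illustrative.

```r
# resetting the seed to the same value reproduces the same random numbers
set.seed(530)
first_draw <- runif(5)

set.seed(530)
second_draw <- runif(5)

# TRUE: both draws started from the same seed
identical(first_draw, second_draw)

# drawing again without resetting the seed continues the sequence,
# so these values differ from the first draw
third_draw <- runif(5)
identical(first_draw, third_draw)
```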
There are many other distributions we can get random numbers from, including:

* `rnorm()`
* `runif()`
* `rexp()`
* And many more!

The first argument to all these functions is the number of random points you want to generate (`n`), and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution).

    # generate n = 1000 points from standard normal distribution N(0, 1)
    rand_data <- rnorm(1000)

    # plot a histogram of these random numbers
    hist(rand_data, breaks = 50)
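As a small illustration of those additional arguments, the sketch below sets the parameters explicitly using the documented argument names (`mean` and `sd` for `rnorm()`, `min` and `max` for `runif()`, `rate` for `rexp()`). The particular parameter values are arbitrary choices for this example.

```r
set.seed(230)

# normal distribution with mean 10 and standard deviation 2
normal_data <- rnorm(1000, mean = 10, sd = 2)

# uniform distribution on the interval (-1, 1)
uniform_data <- runif(1000, min = -1, max = 1)

# exponential distribution with rate 2 (so the mean is 1/2)
exp_data <- rexp(1000, rate = 2)

# the sample statistics should be close to the parameters we set
mean(normal_data)
sd(normal_data)
range(uniform_data)
mean(exp_data)
```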
$\$
We can sample `n` random values from a vector `v` using the `sample(v, n)` function.
We can also set the `replace` argument to `TRUE` to sample values with replacement; e.g., `sample(v, n, replace = TRUE)`.
Let's create a vector of numbers from 1 to 100 and sample 30 of them randomly (i.e., n = 30).

    # set the seed to always get the same results
    # in general, best to just do this once at the top of the RMarkdown document
    set.seed(230)

    # create a vector of values from 1 to 100
    my_vec <- 1:100

    # sample 30 random values
    rand_sample <- sample(my_vec, 30)

    # print the values sorted using the sort() function
    sort(rand_sample)

    # sample 30 random values with replacement
    rand_sample_with_replacement <- sample(my_vec, 30, replace = TRUE)
    sort(rand_sample_with_replacement)
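The difference between the two sampling modes can be checked directly. In this sketch, `anyDuplicated()` confirms that sampling without replacement never repeats a value, and `tryCatch()` captures the error R raises if we ask for more values than the vector holds without replacement. The variable names are illustrative only.

```r
set.seed(230)
my_vec <- 1:100

without_replacement <- sample(my_vec, 30)
with_replacement <- sample(my_vec, 30, replace = TRUE)

# 0 means no value appears more than once
anyDuplicated(without_replacement)

# with replacement, the number of distinct values can be less than 30
length(unique(with_replacement))

# without replacement we cannot draw more values than the vector contains;
# tryCatch() stores the error message instead of stopping the script
too_many <- tryCatch(sample(my_vec, 101),
                     error = function(e) conditionMessage(e))
too_many
```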
$\$
A distribution of statistics is called a sampling distribution.
Can you generate and plot an approximate sampling distribution for:

* sample means $\bar{x}$'s
* sample size n = 100
* data that come from a uniform distribution

Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).

    sampling_dist <- NULL

    # create a sampling distribution of the mean using data from a uniform distribution
    for (i in 1:1000) {
      rand_sample <- runif(100)
      sampling_dist[i] <- mean(rand_sample)
    }

    # plot a histogram of the sampling distribution of these means
    hist(sampling_dist, nclass = 100, xlab = bquote(bar(x)),
         main = "Sampling distribution of the sample mean")
$\$
The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate an approximate standard error for the sampling distribution you created above?

    (SE <- sd(sampling_dist))
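As a sanity check, the simulated standard error can be compared with the textbook formula $\sigma / \sqrt{n}$. For the U(0, 1) distribution, $\sigma = \sqrt{1/12}$, so with n = 100 the theoretical value is about 0.029. The sketch below rebuilds the sampling distribution so that it runs on its own; the names `simulated_SE` and `theoretical_SE` are just for this example.

```r
set.seed(230)
n <- 100

# rebuild the approximate sampling distribution of the mean
sampling_dist <- NULL
for (i in 1:1000) {
  sampling_dist[i] <- mean(runif(n))
}

# standard error estimated from the simulation
simulated_SE <- sd(sampling_dist)

# theoretical standard error: sigma / sqrt(n), with sigma = sqrt(1/12) for U(0, 1)
theoretical_SE <- sqrt(1 / 12) / sqrt(n)

simulated_SE
theoretical_SE
```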
$\$
We can also generate samples from an actual data set using the `sample()` function.
Let's start by generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.

    # read in the okcupid data
    profiles <- read.csv("profiles_revised.csv")

    # get the heights for the OkCupid data
    heights <- profiles$height

    # get one random sample of heights from 100 people
    height_sample <- sample(heights, 100)

    # get the mean of this sample
    mean(height_sample)
$\$
We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.

    # repeat the process 1,000 times
    sampling_dist <- NULL
    for (i in 1:1000) {
      height_sample <- sample(heights, 100)     # sample n = 100 random heights
      sampling_dist[i] <- mean(height_sample)   # save the mean
    }

    # plot a histogram of this sampling distribution
    hist(sampling_dist)
Question: What would have to be true for this to be an actual sampling distribution?
Answer: The population of interest would have to be just the data in the OkCupid profiles data frame. Also, we would have to calculate all possible statistics (from all possible samples) for it to be a completely accurate sampling distribution. Since there are 59,946 heights in the OkCupid data, if we used a sample size of n = 100, there would be $\binom{59946}{100}$ possible samples, which is a very large number.
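To get a feel for how large that number is, we can try computing it in R. The count is too big for R's double-precision numbers, so `choose()` is expected to overflow; the sketch below instead uses `lchoose()`, which returns the natural log of the binomial coefficient, and converts it to a base-10 exponent. The variable names are illustrative, and the 59,946 figure is taken from the answer above.

```r
n_heights <- 59946   # number of heights, as stated above
sample_size <- 100

# direct computation; the result exceeds what a double can represent
choose(n_heights, sample_size)

# work on the log scale instead: lchoose() gives log(choose(n, k))
log10_count <- lchoose(n_heights, sample_size) / log(10)

# the number of possible samples is roughly 10 raised to this power
log10_count
```

The exponent comes out at a little under 320, so the number of possible samples has roughly 320 digits.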
$\$
The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges in distribution to a normal distribution.
Put another way, if we define the average of a random (i.i.d.) sample {$X_1$, $X_2$, ..., $X_n$} of size n as:
$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$
then the CLT tells us that:
$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$
where $\mu$ and $\sigma^{2}$ are the mean and variance of each $X_i$.
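A quick simulation can make this statement concrete. The sketch below assumes, purely for illustration, that the data come from an exponential distribution with rate 1 (so $\mu = 1$ and $\sigma^2 = 1$), a distribution that is far from normal. It computes $\sqrt{n}(S_n - \mu)$ many times and compares the result with the N(0, $\sigma^2$) density.

```r
set.seed(230)

n <- 100      # sample size
mu <- 1       # mean of the Exp(1) distribution
sigma <- 1    # standard deviation of the Exp(1) distribution

# compute sqrt(n) * (S_n - mu) for many independent samples
normalized_means <- NULL
for (i in 1:5000) {
  S_n <- mean(rexp(n, rate = 1))
  normalized_means[i] <- sqrt(n) * (S_n - mu)
}

# histogram on the density scale with the N(0, sigma^2) curve on top
hist(normalized_means, breaks = 50, freq = FALSE,
     main = "Normalized sample means", xlab = "sqrt(n) * (S_n - mu)")
curve(dnorm(x, mean = 0, sd = sigma), add = TRUE)

# the mean should be near 0 and the variance near sigma^2 = 1
mean(normalized_means)
var(normalized_means)
```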
You will explore this more through simulations on homework 2.
$\$