$\$
    # get some data and install a package that is needed
    # install.packages("latex2exp")

    # get some images that are used in this document
    SDS230::download_image("which_are_prob_densities.png")
    SDS230::download_image("area_pdf.png")
    SDS230::download_image("probability_area.png")
    SDS230::download_image("Combined_Cumulative_Distribution_Graphs.png")

    download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

    knitr::opts_chunk$set(echo = TRUE)
    set.seed(230)
$\$
For loops are useful when you want to repeat a piece of code many times under similar conditions.
Print the numbers from 1 to 50...
    for (i in 1:50) {
      print(i)
    }
$\$
For loops are particularly useful in combination with vectors that can store the results.
Create a vector with the squares of the numbers from 1 to 50.
    # create a loop that creates a vector with the squares of the numbers from 1 to 50.

    # plot the results
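One possible way to fill in the exercise above is sketched here. The vector name `squares` is only an illustrative choice, and pre-allocating the vector with `numeric(50)` is one option among several (starting from `NULL` and growing the vector also works).

```r
# pre-allocate a numeric vector to hold the 50 results
# (the name `squares` is an assumption made for this sketch)
squares <- numeric(50)

# fill in the i-th element with the square of i
for (i in 1:50) {
  squares[i] <- i^2
}

# plot the results
plot(1:50, squares, xlab = "i", ylab = "i squared")
```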
$\$
Use a for loop to create a vector called the_results that holds the values at multiples of 3 from 3 to 300; i.e., the_results should hold the numbers 3, 6, 9, ..., 300.
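A minimal sketch of one approach: since 300 = 3 × 100, the loop can run over i = 1, ..., 100 and store 3 × i at position i. Other loop designs would work equally well.

```r
# start with an empty vector and fill it in one element at a time
the_results <- NULL

for (i in 1:100) {
  the_results[i] <- 3 * i
}

# look at the first and last few values
head(the_results)
tail(the_results)
```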
$\$
R has built-in functions to generate data from different distributions. All these functions start with the letter r.
We can set the random number generator seed to always get the same sequence of random numbers.
Let's get a sample of n = 100 random points from the uniform distribution using runif().
    # set the seed to a specific number to always get the same sequence of random numbers
    set.seed(530)

    # generate n = 100 points from U(0, 1) using runif() function

    # plot a histogram of these random numbers
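The steps in the comments above could be carried out roughly as follows. The variable name `rand_unif` and the number of histogram breaks are assumptions made for this sketch.

```r
# set the seed so the same random numbers are generated each time
set.seed(530)

# generate n = 100 points from U(0, 1); min = 0 and max = 1 are the defaults
rand_unif <- runif(100)

# plot a histogram of these random numbers
hist(rand_unif, breaks = 20,
     main = "100 draws from U(0, 1)", xlab = "x")
```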
There are many other distributions we can get random numbers from including:
rnorm()
rexp()
rbinom()
And many more!
The first argument to all these functions is the number of random points you want to generate (n), and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution).
    # generate n = 1000 points from standard normal distribution N(0, 1)

    # plot a histogram of these random numbers
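As a sketch of the normal case, the parameters `mean` and `sd` can be passed to `rnorm()` after the sample size. The seed value and the name `rand_norm` are illustrative choices, not part of the original exercise.

```r
# set a seed so the sketch is reproducible (seed value is arbitrary)
set.seed(530)

# generate n = 1000 points from N(0, 1); mean = 0 and sd = 1 are the defaults
rand_norm <- rnorm(1000, mean = 0, sd = 1)

# plot a histogram of these random numbers
hist(rand_norm, breaks = 30,
     main = "1000 draws from N(0, 1)", xlab = "x")
```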
$\$
$\$
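Probability density functions can be used to model random events. All probability density functions, f(x), have these properties:
* f(x) is non-negative for every x; i.e., $f(x) \ge 0$
* the total area under f(x) is 1; i.e., $\int_{-\infty}^{\infty}f(x)dx = 1$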
Which of the following are probability density functions?
$\$
For continuous (quantitative) data, we use a probability density function f(x) to find the probability (i.e., the long-run frequency) that a random number X is between two values a and b using:
$P(a < X < b) = \int_{a}^{b}f(x)dx$
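To make the integral concrete, here is a small numerical sketch that uses base R's `integrate()` on the standard normal density `dnorm()`. The choice of the standard normal and of the endpoints a = -1 and b = 1 is only for illustration.

```r
# endpoints of the interval (illustrative values)
a <- -1
b <- 1

# numerically integrate the standard normal density f(x) from a to b
prob <- integrate(dnorm, lower = a, upper = b)

# the estimated probability P(a < X < b), roughly 0.68 for these endpoints
prob$value
```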
$\$
$\$
If we want to plot the true probability density function for the standard uniform distribution U(0, 1) we can use the dunif() function. All density functions in base R start with d.
    # the x-value domain for the density function f(x)

    # plot the probability density function
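A sketch of how the two steps above might look. The range of x-values (a bit wider than [0, 1], so the drop to zero is visible) and the variable names are assumptions made here.

```r
# the x-value domain for the density function f(x)
x_vals <- seq(-0.5, 1.5, by = 0.01)

# evaluate the U(0, 1) density at each x-value
density_vals <- dunif(x_vals)

# plot the probability density function as a line
plot(x_vals, density_vals, type = "l",
     xlab = "x", ylab = "f(x)", main = "Density of U(0, 1)")
```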
Question: Can you create a density plot for the standard normal distribution?
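One possible answer, sketched with `dnorm()`. The plotting range of -4 to 4 is an arbitrary choice that covers nearly all of the standard normal's probability mass.

```r
# x-values covering most of the standard normal distribution
x_norm <- seq(-4, 4, by = 0.01)

# evaluate the N(0, 1) density at each x-value
norm_density <- dnorm(x_norm)

# plot the density curve
plot(x_norm, norm_density, type = "l",
     xlab = "x", ylab = "f(x)", main = "Density of N(0, 1)")
```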
$\$
Cumulative probability distribution functions give us the probability of getting a random number X that is less than (or equal to) a particular value x; i.e., they give us $P(X \le x)$. For example, they could be used to give us the probability that a random number will be less than 2: $P(X \le 2)$.
Cumulative probability distribution functions are obtained by integrating a probability density function:
$P(X \le x) = F_X(x) = \int_{-\infty}^x f(t)dt$
where f is a probability density function and $F_X(x)$ is the cumulative distribution function.
$\$
To get the probability that a random number X is less than (or equal to) a particular value x using R, we can use a series of functions that start with the letter p.
For example, to get the probability a random number X generated from the standard uniform distribution U(0, 1) will be less than .25; i.e., $P(X \le .25)$ we can use punif().
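As a quick check, cumulative probabilities in base R come from the p-prefixed functions such as `punif()`. The sketch below also integrates the density `dunif()` numerically to show that the two approaches agree; the variable names are illustrative.

```r
# P(X <= 0.25) for X ~ U(0, 1) using the cumulative distribution function
cdf_val <- punif(0.25)
cdf_val

# the same probability as the area under the density from 0 to 0.25
area <- integrate(dunif, lower = 0, upper = 0.25)$value
area
```

Both values should be 0.25, since the U(0, 1) density is 1 on the interval from 0 to 0.25.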
$\$
A distribution of statistics is called a sampling distribution.
Can you generate and plot an approximate sampling distribution for:
* sample means $\bar{x}$'s
* sample size n = 100
* for data that come from uniform distribution
Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).
    # create a sampling distribution of the mean using data from a uniform distribution
    sampling_dist <- NULL

    # plot a histogram of the sampling distribution of these means
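A minimal sketch of one way to complete this exercise. The number of repetitions (1,000), the seed, and the name `curr_sample` are assumptions; more repetitions give a smoother approximation.

```r
# set a seed so the sketch is reproducible (seed value is arbitrary)
set.seed(230)

# create a sampling distribution of the mean using data from a uniform distribution
sampling_dist <- NULL

for (i in 1:1000) {
  curr_sample <- runif(100)              # one sample of size n = 100
  sampling_dist[i] <- mean(curr_sample)  # store the statistic (the sample mean)
}

# plot a histogram of the sampling distribution of these means
hist(sampling_dist, breaks = 30,
     main = "Approximate sampling distribution of the mean",
     xlab = "sample mean")
```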
$\$
The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?
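One way to approximate the SE is to take the standard deviation of the simulated sample means. The sketch below rebuilds a sampling distribution so it runs on its own, and compares the result with the theoretical value $\sigma/\sqrt{n}$, using the fact that the variance of U(0, 1) is 1/12. The names `SE_approx` and `SE_theory` are illustrative.

```r
# rebuild an approximate sampling distribution (n = 100, 1,000 repetitions)
set.seed(230)
sampling_dist <- NULL
for (i in 1:1000) {
  sampling_dist[i] <- mean(runif(100))
}

# the approximate standard error is the standard deviation of the sample means
SE_approx <- sd(sampling_dist)
SE_approx

# theoretical standard error: sigma / sqrt(n), with sigma^2 = 1/12 for U(0, 1)
SE_theory <- sqrt(1 / 12) / sqrt(100)
SE_theory
```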
$\$
We can generate samples from an actual data set using the sample() function.
Let's start by generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.
    # read in the okcupid data
    profiles <- read.csv("profiles_revised.csv")

    # get the heights for the OkCupid data

    # get one random sample of heights from 100 people

    # get the mean of this sample
$\$
We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.
    # repeat the process 1,000 times
    sampling_dist <- NULL

    # plot a histogram of this sampling distribution
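The sketch below shows the shape of the sample-then-repeat procedure. To keep it runnable without the CSV file, it uses a simulated stand-in vector called `heights`; these are not the real OkCupid values. With the real data, `heights` would instead be taken from the `profiles` data frame (the column name `height` is an assumption here), with missing values removed.

```r
set.seed(230)

# HYPOTHETICAL stand-in for the real heights (simulated, in inches);
# with the real data one might use: heights <- na.omit(profiles$height)
heights <- rnorm(5000, mean = 68, sd = 4)

# get one random sample of heights from 100 people and its mean
height_sample <- sample(heights, 100)
mean(height_sample)

# repeat the process 1,000 times
sampling_dist <- NULL
for (i in 1:1000) {
  height_sample <- sample(heights, 100)
  sampling_dist[i] <- mean(height_sample)
}

# plot a histogram of this sampling distribution
hist(sampling_dist, breaks = 30,
     main = "Approximate sampling distribution of mean height",
     xlab = "sample mean")
```

By default `sample()` draws without replacement, so each sample of 100 contains distinct entries of `heights`.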
Question: What would have to be true for this to be an actual sampling distribution?
$\$
The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges to a normal distribution.
Put another way, if we define the average of a random (i.i.d.) sample {$X_1$, $X_2$, ..., $X_n$} of size n, drawn from a distribution with mean $\mu$ and variance $\sigma^2$, as:
$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$
then the CLT tells us that:
$\sqrt{n}(S_{n} - \mu) \xrightarrow{d} N(0,\sigma^{2})$
You will explore this more through simulations on homework 2.
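As a small preview of what such a simulation might look like, the sketch below draws samples from an exponential distribution with rate 1 (so $\mu = 1$ and $\sigma = 1$), computes $\sqrt{n}(S_n - \mu)$ for each sample, and overlays the $N(0, \sigma^2)$ density. The choice of distribution, sample size, and number of repetitions are all assumptions made for illustration.

```r
set.seed(230)

n <- 100      # sample size
mu <- 1       # mean of Exp(rate = 1)
sigma <- 1    # standard deviation of Exp(rate = 1)

# compute sqrt(n) * (S_n - mu) for many independent samples
normalized_means <- NULL
for (i in 1:5000) {
  curr_sample <- rexp(n, rate = 1)
  normalized_means[i] <- sqrt(n) * (mean(curr_sample) - mu)
}

# histogram on the density scale, with the N(0, sigma^2) density overlaid
hist(normalized_means, breaks = 50, freq = FALSE,
     main = "sqrt(n) * (S_n - mu) for Exp(1) data", xlab = "value")
x_vals <- seq(-4, 4, by = 0.01)
lines(x_vals, dnorm(x_vals, mean = 0, sd = sigma), col = "red")
```

Even though the exponential data are strongly skewed, the histogram should sit close to the red normal curve.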
$\$