This template is going to explore sampling distributions. Make sure you write down the definitions of all highlighted terms.
# Load your libraries here
library('lehmansociology')
library('ggplot2')
library('gtools')
Let's set up our tiny population
# Type your code here
age <- c(20, 21, 22, 23, 24)
Let's plot this population data.
ggplot(as.data.frame(age), aes(x = age)) + geom_histogram(binwidth = 1)
This is called a uniform distribution because every value of age occurs equally often, so all of the bars have the same height.
Let's also get the mean, standard deviation and sample variance of age. Also let's get the population variance by multiplying the sample variance by (n-1)/n where n is 5, the number of observations in our data.
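A minimal sketch of these calculations, assuming `age` is defined as above. The name `pop_var_age` is only an illustrative choice, not part of the template.

```r
age <- c(20, 21, 22, 23, 24)
n <- length(age)           # 5 observations

mean(age)                  # mean of the population: 22
sd(age)                    # standard deviation (n - 1 denominator)
var(age)                   # sample variance (n - 1 denominator): 2.5

# Population variance: rescale the sample variance by (n - 1) / n
pop_var_age <- var(age) * (n - 1) / n
pop_var_age                # 2
```

R's `var()` and `sd()` always divide by n - 1, which is why the rescaling step is needed when the data are the whole population.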
Now let's get all of the samples of size 2.
Notice that repeats.allowed is set to TRUE. This means we are sampling with replacement, the same idea as setting replace = TRUE in sample().
The n parameter gives the size of the population (5) and r indicates how many observations to select.
all_samples_size_2 <- gtools::permutations(n = length(age), r = 2, v = age, repeats.allowed = TRUE)
all_samples_size_2
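As a cross-check that does not depend on gtools, base R's `expand.grid()` can list the same ordered pairs with repeats. This is only a sketch: the names `pairs_2`, `first` and `second` are illustrative, and the rows come out in a different order than in the gtools result.

```r
age <- c(20, 21, 22, 23, 24)

# expand.grid() builds every combination of its arguments, so passing
# age twice gives every ordered pair, including repeats such as (20, 20)
pairs_2 <- expand.grid(first = age, second = age)

nrow(pairs_2)   # 5 * 5 = 25 possible samples of size 2
head(pairs_2)
```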
Now let's take the means of all 25 samples. Notice that the rows are samples and the columns are people.
The samples are our observations.
Because the data are set up in rows, we have to use a new function, rowMeans().
all_samples_means_2 <- rowMeans(all_samples_size_2)
all_samples_means_2
This list of 25 means is the sampling distribution of sample means for size 2.
So now the list of 25 means is our dataset.
Now let's look at the mean, variance and standard deviation of those means.
mean(all_samples_means_2)
var(all_samples_means_2)
sd(all_samples_means_2)
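It can help to compare these numbers with the population values. The sketch below rebuilds the 25 samples in base R so it runs on its own; the names `samples_2`, `means_2`, `pop_var` and `var_means_2` are illustrative and not part of the template.

```r
age <- c(20, 21, 22, 23, 24)
pop_var <- mean((age - mean(age))^2)    # population variance of age: 2

# All 25 ordered samples of size 2, and the mean of each one
samples_2 <- expand.grid(age, age)
means_2 <- rowMeans(samples_2)

mean(means_2)                           # 22, the same as mean(age)

# Treat the 25 means as a complete population: divide by 25, not 24
var_means_2 <- mean((means_2 - mean(means_2))^2)
var_means_2                             # 1
pop_var / 2                             # 1, the population variance divided by the sample size
```

Because `var()` divides by n - 1, `var(all_samples_means_2)` comes out slightly larger (by a factor of 25/24) than the value computed here.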
Now let's repeat the whole process for a sample size of 3. Remember, this time there will be 125 samples, which is 5 × 5 × 5.
all_samples_size_3 <- gtools::permutations(n = length(age), r = 3, v = age, repeats.allowed = TRUE)
all_samples_means_3 <- rowMeans(all_samples_size_3)
all_samples_means_3
# Using our data set of all means
mean(all_samples_means_3)
var(all_samples_means_3)
sd(all_samples_means_3)
We might as well finish up with 4 and 5.
Show your calculation strategy.
all_samples_size_4 <- gtools::permutations(n = length(age), r = 4, v = age, repeats.allowed = TRUE)
all_samples_means_4 <- rowMeans(all_samples_size_4)
all_samples_means_4
# Using our data set of all means
mean(all_samples_means_4)
var(all_samples_means_4)
sd(all_samples_means_4)
Set the last one up yourself.
Let's look more closely at how the standard error (special name for the standard deviation of a sampling distribution) changes as we increase sample size. We can do this two ways, with numbers or graphically.
# Change from 2 to 3
sd(all_samples_means_2) - sd(all_samples_means_3)
# Change from 3 to 4
sd(all_samples_means_3) - sd(all_samples_means_4)
# Change from 4 to 5
sd(all_samples_means_4) - sd(all_samples_means_5)
ggplot(as.data.frame(all_samples_means_2), aes(x = all_samples_means_2)) + geom_histogram(binwidth = .5)
ggplot(as.data.frame(all_samples_means_3), aes(x = all_samples_means_3)) + geom_histogram(binwidth = .33)
ggplot(as.data.frame(all_samples_means_4), aes(x = all_samples_means_4)) + geom_histogram(binwidth = .25)
ggplot(as.data.frame(all_samples_means_5), aes(x = all_samples_means_5)) + geom_histogram(binwidth = .2)
ggplot(data.frame(x = c(-5, 5)), aes(x)) + stat_function(fun = dnorm)
This is a really important result in statistics.
The mean of the sampling distribution of the means will equal the population mean.
As sample size gets larger, the sampling distribution approaches a normal distribution.
As sample size gets larger, the standard error gets smaller, in proportion to $1/\sqrt{n}$.
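The last statement can be checked directly on our tiny population. The sketch below compares the standard error found by listing every possible sample with the value from the formula $\sigma/\sqrt{n}$. It uses base R only, and `se_by_enumeration` is a hypothetical helper written for this illustration, not a function from any package.

```r
age <- c(20, 21, 22, 23, 24)
sigma <- sqrt(mean((age - mean(age))^2))       # population standard deviation

# Hypothetical helper: population-style SD of all sample means of size n
se_by_enumeration <- function(n) {
  samples <- expand.grid(rep(list(age), n))    # all 5^n ordered samples
  m <- rowMeans(samples)
  sqrt(mean((m - mean(m))^2))                  # divide by the count, not count - 1
}

sizes <- 2:5
enumerated <- sapply(sizes, se_by_enumeration)
by_formula <- sigma / sqrt(sizes)

data.frame(n = sizes, enumerated = enumerated, by_formula = by_formula)
```

The two columns should agree. They differ slightly from `sd()` of the sample means because `sd()` divides by one less than the number of means.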
What this means.
With random sampling, the sample mean equals the population mean on average. But we don't expect any single sample mean to match the population mean exactly very often.
Under random sampling, most sample means will be fairly close to the population mean.
Bigger samples are better, but because the standard error shrinks in proportion to $1/\sqrt{n}$, the biggest changes happen when we increase small samples to somewhat bigger samples.
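To see the diminishing returns in numbers, here is a small sketch using the population standard deviation of age ($\sqrt{2}$). The sample sizes are arbitrary illustrative choices, each four times the previous one.

```r
sigma <- sqrt(2)                      # population SD of age (the variance is 2)
n <- c(1, 4, 16, 64, 256)             # each sample size is 4 times the last
se <- sigma / sqrt(n)                 # standard error at each sample size

# Each fourfold increase in n only cuts the standard error in half,
# so the drop from one row to the next keeps shrinking
data.frame(n = n, se = se, drop_from_previous = c(NA, -diff(se)))
```

Going from 1 to 4 observations lowers the standard error by about 0.71, while going from 64 to 256 lowers it by less than 0.1.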
Because the sampling distribution is approximately normal for big samples, we can use the Empirical Rule and other rules from the normal distribution to say certain things. For example, we can say that 95% of all sample means will be within 1.96 (or roughly 2) standard errors of the population mean. Also, 99% of all sample means will be within 2.58 standard errors of the population mean. And (from the Empirical Rule) 99.7% will be within 3 standard errors of the population mean.
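These cutoffs come from the standard normal distribution and can be looked up with base R's `qnorm()` and `pnorm()`. The variable names below are illustrative.

```r
# Cutoffs that leave 2.5% and 0.5% in each tail of the standard normal
z_95 <- qnorm(0.975)                  # about 1.96: middle 95%
z_99 <- qnorm(0.995)                  # about 2.58: middle 99%

# Share of a normal distribution within 3 standard deviations of the mean
within_3 <- pnorm(3) - pnorm(-3)      # about 0.997

c(z_95 = z_95, z_99 = z_99, within_3 = within_3)
```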
This is really powerful. It is the basis for significance tests and confidence intervals, both of which you will see in articles.