In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

# get some data and install a package that is needed
#install.packages("latex2exp")


# download the okcupid data if you don't have it already
download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")

knitr::opts_chunk$set(echo = TRUE)

set.seed(230)

$\$

Part 0: For loops

For loops are useful when you want to repeat a piece of code many times under similar conditions

Print the numbers from 1 to 50...

for (i in 1:50) {
    print(i)
}

$\$

For loops are particular useful in combination with vectors that can store the results.

Create a vector with the squares of the numbers from 1 to 50.

# create a loop that creates a vector with the squares of the numbers from 1 to 50. 
the_results <- NULL
for (i in 1:50) {
    the_results[i] <- i^2
}


# plot the results
plot(the_results, 
     type = "o",
     xlab = "x", 
     ylab = "2^x")

$\$

Try this at home!

Use a for loop to create a vector called the_results that holds the values at multiples of 3 from 3 to 300; i.e., the_results should hold the numbers 3, 6, 9, ..., 300

$\$

Part 1: Probability functions and generating random numbers in R

Part 1.1a: Generating random numbers in R

R has built in functions to generate data from different distributions. All these functions start with the letter r.

We can set the random number generator seed to always get the same sequence of random numbers.

Let's get a sample of n = 200 random points from the uniform distribution using runif()

# set the seed to a specific number to always get the same sequence of random numbers
set.seed(530)

# generate n = 100 points from U(0, 1) using runif() function 
rand_data <- runif(200) 

# plot a histogram of these random numbers
hist(rand_data)

Part 1.1b: Generating random numbers in R

There are many other distributions we can get random numbers from including:

Normal distributions: rnorm()
Uniform distribution runif()
Exponential distributions: rexp()

And many more!

The first argument to all these functions is the number of random points you want to generate (n) and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution),

# generate n = 1000 points from standard normal distribution N(0, 1)
rand_data <- rnorm(1000) 


# plot a histogram of these random numbers
hist(rand_data, breaks = 50)

$\$

Part 1.1c: Sampling random values from a vector

We can sample random n random values from a vector v using the sample(v, n) function.

We can also set the replace argument to TRUE to sample values with replacement; e.g., to sample with replacement we can use sample(v, n, replace = TRUE).

Let's create a vector of numbers from 1 to 100 and sample 30 of them randomly (i.e., n = 30).

# set the seed to always get the same results
# in general, best to just do this once at the top of the RMarkdown document
set.seed(230)

# create a vector of values from 1 to 100
my_vec <- 1:100


# sample 30 random values 
rand_sample <- sample(my_vec, 30)

# plot the values sorted usng the sort() function
sort(rand_sample)


# sample 30 random values with replacement
rand_sample_with_replacement <- sample(my_vec, 30, replace = TRUE)

sort(rand_sample_with_replacement)

$\$

Part 2: Sampling distributions

Part 2.1

A distribution of statistics is called a sampling distribution.

Can you generate and plot an approximate sampling distribution for: * sample means $\bar{x}$'s * sample size n = 100 * for data that come from uniform distribution

Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).

sampling_dist <- NULL

# create a sampling distribution of the mean using data from a uniform distribution
for (i in 1:1000){

  rand_sample <- runif(100) 
  sampling_dist[i] <- mean(rand_sample)

}


# plot a histogram of the sampling distribution of these means 
hist(sampling_dist, nclass = 100,
     xlab = bquote(bar(x)), 
     main = "Sampling distribution of the sample mean")

$\$

Part 2.2: The standard error (SE)

The deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?

(SE <- sd(sampling_dist))

$\$

Part 2.3: Approximating a sampling distributions from a data set

We generate samples from an actual data set we have using the sample() function.

Let's start by just generate a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.

# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")


# get the heights for the OkCupid data
heights <- profiles$height

# get one random sample of heights from 100 people
height_sample <- sample(heights, 100)

# get the mean of this sample
mean(height_sample)

$\$

We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.

# repeat the process 1,000 times
sampling_dist <- NULL 
for (i in 1:1000) {
     height_sample <- sample(heights, 100)    # sample n = 100 random heights
     sampling_dist[i] <- mean(height_sample)    # save the mean
}


# plot a histogram of this sampling distribution 
hist(sampling_dist)

Question: What would have to be true for this to be an actual sampling distribution?

Answer: The population of interest would have to be just the data in the OkCupid profiles data frame. Also, we would have to calculate all possible statistics (from all possible samples) for it to be a completely accurate sampling distribution. Since there are 59,946 heights in the OkCupid data, if we used a sample size of n = 100, this would be $59946 \choose 100$ samples which is a very large number.

$\$

Part 2.4: The central limit theorm

The central limit theorm (CTL) establishes that (in most situations) when independent random variables are added their (normalized) sum converges to a normal distribution.

Put another way, if we define the average random (i.i.d) sample {$X_1$, $X_2$, ..., $X_n$} of size n as:

$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$

then the CTL tells us that:

$\sqrt{n}(S_{n} - \mu)$ $\xrightarrow {d} N(0,\sigma^{2})$

You will explore this more through simulations on homework 2.

$\$

emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

emeyers/SDS230
Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Part 0: For loops

Try this at home!

Part 1: Probability functions and generating random numbers in R

Part 1.1a: Generating random numbers in R

Part 1.1b: Generating random numbers in R

Part 1.1c: Sampling random values from a vector

Part 2: Sampling distributions

Part 2.1

Part 2.2: The standard error (SE)

Part 2.3: Approximating a sampling distributions from a data set

Part 2.4: The central limit theorm

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230 Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Part 0: For loops

Try this at home!

Part 1: Probability functions and generating random numbers in R

Part 1.1a: Generating random numbers in R

Part 1.1b: Generating random numbers in R

Part 1.1c: Sampling random values from a vector

Part 2: Sampling distributions

Part 2.1

Part 2.2: The standard error (SE)

Part 2.3: Approximating a sampling distributions from a data set

Part 2.4: The central limit theorm

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230
Tools for the class Data Exploration and Analysis