# In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\\$

# get some data and install a package that is needed
#install.packages("latex2exp")

# get some images that are used in this document
SDS230::download_image("which_are_prob_densities.png")
SDS230::download_image("area_pdf.png")
SDS230::download_image("probability_area.png")
SDS230::download_image("Combined_Cumulative_Distribution_Graphs.png")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/profiles_revised.csv", "profiles_revised.csv", mode = "wb")
knitr::opts_chunk$set(echo = TRUE)
set.seed(230)

$\\$

## Part 0: For loops

For loops are useful when you want to repeat a piece of code many times under similar conditions.

Print the numbers from 1 to 50...

for (i in 1:50) {
  print(i)
}

$\\$

For loops are particularly useful in combination with vectors that can store the results. Create a vector with the squares of the numbers from 1 to 50.

# create a loop that creates a vector with the squares of the numbers from 1 to 50.

# plot the results

$\\$

### Try this at home!

Use a for loop to create a vector called the_results that holds the multiples of 3 from 3 to 300; i.e., the_results should hold the numbers 3, 6, 9, ..., 300.

$\\$

## Part 1: Probability functions and generating random numbers in R

### Part 1.1a: Generating random numbers in R

R has built-in functions to generate data from different distributions. All these functions start with the letter r.

We can set the random number generator seed to always get the same sequence of random numbers.

Let's get a sample of n = 100 random points from the uniform distribution using runif().

# set the seed to a specific number to always get the same sequence of random numbers
set.seed(530)

# generate n = 100 points from U(0, 1) using runif() function

# plot a histogram of these random numbers

### Part 1.1b: Generating random numbers in R

There are many other distributions we can get random numbers from including:

• Normal distributions: rnorm()
• Exponential distributions: rexp()
• Binomial distributions: rbinom()

And many more!
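As a rough sketch of how the generators listed above are called, the block below draws five values from each distribution. The seed, the number of draws, and every parameter value (mean, sd, rate, size, prob) are arbitrary choices made for illustration, not values taken from the class materials.

```r
# Illustrative only: the seed and all parameter values are assumptions
set.seed(230)

# 5 draws from a normal distribution with mean 0 and standard deviation 1
normal_draws <- rnorm(5, mean = 0, sd = 1)

# 5 draws from an exponential distribution with rate 2
exp_draws <- rexp(5, rate = 2)

# 5 draws from a binomial distribution: number of successes in 10 trials,
# each with success probability 0.5
binom_draws <- rbinom(5, size = 10, prob = 0.5)

normal_draws
exp_draws
binom_draws
```

Re-running the block with the same seed should reproduce the same numbers; changing the seed gives a different sequence.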
The first argument to all these functions is the number of random points you want to generate (n), and then there are additional arguments that can be used to control the shape of the distribution (i.e., that set the "parameters" of the distribution).

# generate n = 1000 points from standard normal distribution N(0, 1)

# plot a histogram of these random numbers

$\\$

#### Part 1.2a: Probability density functions

$\\$

Probability density functions can be used to model random events. All probability density functions, f(x), have these properties:

1. The function is always non-negative.
2. The area under the function integrates (sums) to 1.

Which of the following are probability density functions?

$\\$

For continuous (quantitative) data, we use a density function f(x) to find the probability (e.g., the long run frequency) that a random number X is between two values a and b using:

$P(a < X < b) = \int_{a}^{b}f(x)dx$

$\\$

### Part 1.2b: Probability density functions in R

$\\$

If we want to plot the true probability density function for the standard uniform distribution U(0, 1) we can use the dunif() function. All density functions in base R start with the letter d.

# the x-value domain for the density function f(x)

# plot the probability density function

Question: Can you create a density plot for the standard normal distribution?

$\\$

### Part 1.3a: Cumulative probability distribution functions

Cumulative probability distribution functions give us the probability of getting a random number X that is less than (or equal to) a particular value x; i.e., they give us $P(X \le x)$. For example, they could be used to give us the probability that a random number will be less than 2: $P(X \le 2)$.
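One way to check the "probability as area under the density" idea numerically is with base R's integrate() function. The sketch below assumes X follows a standard normal distribution and picks a = -1 and b = 1; both the distribution and the endpoints are illustrative assumptions, not choices made in the class materials.

```r
# Assumption for illustration: X ~ N(0, 1), with density function dnorm()

# P(a < X < b) as the area under the density between a and b
a <- -1
b <- 1
area_ab <- integrate(dnorm, lower = a, upper = b)$value

# P(X <= 2) as the area under the density to the left of 2
area_left_of_2 <- integrate(dnorm, lower = -Inf, upper = 2)$value

# the total area under a probability density function should be 1
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

c(area_ab, area_left_of_2, total_area)
```

Under these assumptions the three values should come out at roughly 0.68, 0.98, and 1, up to numerical integration error.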
Cumulative probability distribution functions are obtained by integrating a probability density function:

$P(X \le x) = F_X(x) = \int_{-\infty}^x f(x)dx$

where f(x) is a probability density function and $F_X(x)$ is the cumulative distribution function.

$\\$

### Part 1.3b: Cumulative probability distribution functions in R

To get the probability that a random number X is less than a particular value x using R, we can use a series of functions that start with the letter p. For example, to get the probability a random number X generated from the standard uniform distribution U(0, 1) will be less than .25; i.e., $P(X \le .25)$ we can use punif().

$\\$

## Part 2: Sampling distributions

#### Part 2.1

A distribution of statistics is called a sampling distribution.

Can you generate and plot an approximate sampling distribution for:

* sample means $\bar{x}$'s
* sample size n = 100
* for data that come from a uniform distribution

Note the shape of the sampling distribution can be quite different from the shape of the data distribution (which is uniform here).

# create a sampling distribution of the mean using data from a uniform distribution
sampling_dist <- NULL

# plot a histogram of the sampling distribution of these means

$\\$

#### Part 2.2: The standard error (SE)

The standard deviation of a sampling distribution is called the standard error (SE). Can you calculate (an approximate) standard error for the sampling distribution you created above?

$\\$

#### Part 2.3: Approximating a sampling distribution from a data set

We can generate samples from an actual data set we have using the sample() function. Let's start by just generating a single sample of size n = 100 from the OkCupid users' heights and calculating the mean of this sample.
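Before working with the real data, here is a minimal sketch of how sample() behaves. The vector stand_in_heights is a hypothetical, simulated stand-in (its size, mean, and standard deviation are made up for illustration); it is not the OkCupid data and makes no claim about that data set's column names or values.

```r
# Hypothetical stand-in for a vector of heights (in inches); NOT the real data
set.seed(230)
stand_in_heights <- rnorm(5000, mean = 68, sd = 4)

# draw one random sample of 100 values
# (by default sample() draws without replacement)
one_sample <- sample(stand_in_heights, 100)

# the statistic of interest: the mean of this one sample
sample_mean <- mean(one_sample)
sample_mean
```

Each call to sample() returns a different subset (unless the seed is reset), so the sample mean varies from sample to sample.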
# read in the okcupid data
profiles <- read.csv("profiles_revised.csv")

# get the heights for the OkCupid data

# get one random sample of heights from 100 people

# get the mean of this sample

$\\$

We can then create an approximation of a sampling distribution from the OkCupid users' data set by repeating this many times in a for loop.

# repeat the process 1,000 times
sampling_dist <- NULL

# plot a histogram of this sampling distribution

Question: What would have to be true for this to be an actual sampling distribution?

$\\$

#### Part 2.4: The central limit theorem

The central limit theorem (CLT) establishes that (in most situations) when independent random variables are added, their (normalized) sum converges to a normal distribution.

Put another way, if we define the average of a random (i.i.d.) sample {$X_1$, $X_2$, ..., $X_n$} of size n as:

$S_{n}:={\frac{X_{1}+\cdots +X_{n}}{n}}$

then the CLT tells us that:

$\sqrt{n}(S_{n} - \mu)\xrightarrow {d} N(0,\sigma^{2})$

where $\mu$ and $\sigma^{2}$ are the mean and variance of each $X_i$.

You will explore this more through simulations on homework 2.

$\\$
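The CLT statement above can be checked with a small simulation. The sketch below assumes the data come from an exponential distribution with rate 1 (so that $\mu = 1$ and $\sigma = 1$); the distribution, the sample size n = 50, the number of simulations, and the seed are all illustrative assumptions rather than anything specified in the class materials.

```r
# Illustrative assumptions: Exponential(rate = 1) data, n = 50, 5000 simulations
set.seed(230)

n <- 50          # size of each sample
n_sims <- 5000   # number of simulated samples
mu <- 1          # mean of the Exponential(rate = 1) distribution
sigma <- 1       # standard deviation of the Exponential(rate = 1) distribution

# for each simulated sample, compute sqrt(n) * (S_n - mu)
scaled_means <- NULL
for (i in 1:n_sims) {
  curr_sample <- rexp(n, rate = 1)
  scaled_means[i] <- sqrt(n) * (mean(curr_sample) - mu)
}

# if the normal approximation is reasonable, these should be close to
# 0 and sigma, respectively
mean(scaled_means)
sd(scaled_means)
```

A histogram of scaled_means, for example via hist(scaled_means), should look roughly bell-shaped even though the exponential data distribution itself is strongly skewed.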

emeyers/SDS230 documentation built on Jan. 13, 2023, 5:16 a.m.