Introduction to R

"Measuring programming progress by lines of code is like measuring aircraft building progress by weight."

--- Bill Gates

After reading this chapter you will be able to:

R Resources

R is both a programming language and software environment for statistical computing, which is free and open-source. To get started, you will need to install two pieces of software:

- R, the actual programming language, which can be downloaded from CRAN.
- RStudio, an excellent integrated development environment (IDE) for working with R.

The popularity of R is on the rise, and every day it becomes a better tool for statistical analysis. It even generated this book! (A skill you will learn in this course.) There are many good resources for learning R. They are not necessary for this course, but you may find them useful if you would like a deeper understanding of R.

RStudio has a large number of useful keyboard shortcuts. A list of these can be found using a keyboard shortcut -- the keyboard shortcut to rule them all: Option + Shift + K on macOS, or Alt + Shift + K on Windows and Linux.

The RStudio team has developed a number of "cheatsheets" for working with both R and RStudio. This particular cheatsheet for Base R will summarize many of the concepts in this document.

When programming, it is often a good practice to follow a style guide. (Where do spaces go? Tabs or spaces? Underscores or CamelCase when naming variables?) No style guide is "correct" but it helps to be aware of what others do. The more important thing is to be consistent within your own code.

For this course, our main deviation from most common style guides is the use of = in place of <-. (More on that later.)

R Basics

Basic Calculations

To get started, we'll use R like a simple calculator. Note, in R the # symbol is used for comments. In this book, lines which begin with two such symbols, ##, indicate output.

Addition, Subtraction, Multiplication and Division {-}

3 + 2
3 - 2
3 * 2
3 / 2

Exponents {-}

3 ^ 2
2 ^ (-3)
100 ^ (1 / 2)
sqrt(1 / 2)
exp(1)

Mathematical Constants {-}

pi
exp(1)

Logarithms {-}

log(10)           # natural log
log10(1000)       # base 10 log
log2(8)           # base 2 log
log(16, base = 4) # base 4 log

Trigonometry {-}

sin(pi / 2)
cos(0)

Getting Help

In using R as a calculator, we have seen a number of functions: sqrt(), exp(), log() and sin(). To get documentation about a function in R, simply put a question mark in front of the function name and RStudio will display the documentation, for example:

?log
?sin
?paste
?lm
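
The ? operator is shorthand for the help() function. When you do not know a function's exact name, the ?? operator (shorthand for help.search()) searches the documentation instead. A quick sketch:

help(log)       # identical to ?log
??"logarithm"   # search all installed documentation for a topic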

Frequently one of the most difficult things to do when learning R is asking for help. First, you need to decide to ask for help, then you need to know how to ask for help. Your very first line of defense should be to Google your error message or a short description of your issue. (The ability to solve problems using this method is quickly becoming an extremely valuable skill.) If that fails, and it eventually will, you should ask for help. There are a number of things you should include when emailing an instructor, or posting to a help website such as Stack Exchange.

If you follow these steps, you will get your issue resolved much more quickly, and possibly learn more in the process. Do not be discouraged by running into errors and difficulties when learning R. (Or any technical skill.) It is simply part of the learning process.

Installing Packages

R comes with a number of built-in functions and datasets, but one of the main strengths of R as an open-source project is its package system. Packages add additional functions and data. Frequently if you want to do something in R, and it isn't available by default, there is a good chance that there is a package that will fulfill your needs.

To install a package, use the install.packages() function. Think of this as buying a recipe book from the store, bringing it home, and putting it on your shelf.

install.packages("ggplot2")

Once a package is installed, it must be loaded into your current R session before being used. Think of this as taking the book off of the shelf and opening it up to read.

library(ggplot2)

Once you close R, all the packages are closed and put back on the imaginary shelf. The next time you open R, you do not have to install the package again, but you do have to load any packages you intend to use by invoking library().

Data Types

R has a number of basic data types:

- Numeric (also called double), for example 42.5
- Integer, for example 42L
- Logical, that is TRUE and FALSE
- Character, for example "statistics"
- Complex, for example 4 + 2i

R also has a number of basic data structures. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type).

| Dimension | Homogeneous | Heterogeneous |
|-----------|-------------|---------------|
| 1         | Vector      | List          |
| 2         | Matrix      | Data Frame    |
| 3+        | Array       |               |
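
As a small illustration of this distinction (a sketch; the output is what base R would produce): combining mixed types with c() coerces everything to a single type, while a list keeps each element's type.

c(42, "a", TRUE)     # all elements coerced to character
c(TRUE, FALSE, 2)    # logicals coerced to numeric: 1 0 2
list(42, "a", TRUE)  # a list stores each type unchanged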

Vectors

Many operations in R make heavy use of vectors. Vectors in R are indexed starting at 1. That is what the [1] in the output indicates: the first element of the displayed row is the first element of the vector. Larger vectors will start additional rows with [*], where * is the index of the first element of that row.

Possibly the most common way to create a vector in R is using the c() function, which is short for "combine." As the name suggests, it combines a list of numbers separated by commas.

c(1, 3, 5, 7, 8, 9)

Here R simply outputs this vector. If we would like to store this vector in a variable we can do so with the assignment operator =. In this case the variable x now holds the vector we just created, and we can access the vector by typing x.

x = c(1, 3, 5, 7, 8, 9)
x

As an aside, there is a long history of the assignment operator in R, partially due to the keys available on the keyboards of the creators of the S language. (Which preceded R.) For simplicity we will use =, but know that often you will see <- as the assignment operator.

The pros and cons of these two are well beyond the scope of this book, but know that for our purposes you will have no issue if you simply use =. If you are interested in the weird cases where the difference matters, check out The R Inferno.

If you wish to use <-, you will still need to use =, however only for argument passing. Some users like to keep assignment (<-) and argument passing (=) separate. No matter what you choose, the more important thing is that you stay consistent. Also, if working on a larger collaborative project, you should use whatever style is already in place.
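
As a minimal illustration of this convention (using a hypothetical variable some_var):

some_var <- 7   # assignment with the arrow operator
some_var = 7    # assignment with equals, the style used in this book
mean(x = 1:10)  # inside a call, = passes the argument x; it does not create x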

Frequently you may wish to create a vector based on a sequence of numbers. The quickest and easiest way to do this is with the : operator, which creates a sequence of integers between two specified integers.

(y = 1:100)

Here we see R labeling the rows after the first since this is a large vector. Also, we see that by putting parentheses around the assignment, R both stores the vector in a variable called y and automatically outputs y to the console.

To subset a vector, we use square brackets, [].

x
x[1]
x[3]

We see that x[1] returns the first element, and x[3] returns the third element.

x[-2]

We can also exclude certain indexes, in this case the second element.

x[1:3]
x[c(1,3,4)]

Lastly we see that we can subset based on a vector of indices.

One of the biggest strengths of R is its use of vectorized operations. (Frequently the lack of understanding of this concept leads to a belief that R is slow. R is not the fastest language, but it has a reputation for being slower than it really is.)

x = 1:10
x + 1
2 * x
2 ^ x
sqrt(x)
log(x)

We see that when a function like log() is called on a vector x, a vector is returned which has applied the function to each element of the vector x.

vec_1 = 1:10
vec_2 = 1:1000
vec_3 = 42

The length of a vector can be obtained with the length() function.

length(vec_1)
length(vec_2)
length(vec_3)

Note that scalars do not exist in R. They are simply vectors of length 1.
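
We can verify this directly:

is.vector(42)  # TRUE, even for a single number
length(42)     # 1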

If we want to create a sequence that isn't limited to integers and increasing by 1 at a time, we can use the seq() function.

seq(from = 1.5, to = 4.2, by = 0.1)

We will discuss functions in detail later, but note here that the input labels from, to, and by are optional.

seq(1.5, 4.2, 0.1)

Another common operation to create a vector is rep(), which can repeat a single value a number of times.

rep("A", times = 10)

Or, rep() can be used to repeat a vector a number of times.

rep(x, times = 3)

We have now seen four different ways to create vectors:

- c()
- :
- seq()
- rep()

So far we have mostly used them in isolation, but they are often used together.

c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)

Summary Statistics

R has built in functions for a large number of summary statistics.

y

Central Tendency {-}

mean(y)
median(y)

Spread {-}

var(y)
sd(y)
IQR(y)
min(y)
max(y)
range(y)

Matrices

R can also be used for matrix calculations. Matrices have rows and columns containing a single data type. In a matrix, the order of rows and columns is important. (This is not true of data frames, which we will see later.)

Matrices can be created using the matrix function.

x = 1:9
x
X = matrix(x, nrow = 3, ncol = 3)
X

Note here that we are using two different variables: lower case x, which stores a vector and capital X, which stores a matrix. (Following the usual mathematical convention.) We can do this because R is case sensitive.

By default the matrix function reorders a vector into columns, but we can also tell R to use rows instead.

Y = matrix(x, nrow = 3, ncol = 3, byrow = TRUE)
Y

We can also create a matrix of a specified dimension where every element is the same, in this case 0.

Z = matrix(0, 2, 4)
Z

Like vectors, matrices can be subsetted using square brackets, []. However, since matrices are two-dimensional, we need to specify both a row and a column when subsetting.

X
X[1, 2]

Here we accessed the element in the first row and the second column. We could also subset an entire row or column.

X[1, ]
X[, 2]

We can also use vectors to subset more than one row or column at a time. Here we subset to the first and third column of the second row.

X[2, c(1, 3)]

Matrices can also be created by combining vectors as columns, using cbind, or combining vectors as rows, using rbind.

x = 1:9
rev(x)
rep(1, 9)
cbind(x, rev(x), rep(1, 9))
rbind(x, rev(x), rep(1, 9))

R can then be used to perform matrix calculations.

x = 1:9
y = 9:1
X = matrix(x, 3, 3)
Y = matrix(y, 3, 3)
X
Y
X + Y
X - Y
X * Y
X / Y

Note that X * Y is not matrix multiplication. It is element by element multiplication. (Same for X / Y). Instead, matrix multiplication uses %*%. Other matrix functions include t() which gives the transpose of a matrix and solve() which returns the inverse of a square matrix if it is invertible.

X %*% Y
t(X)
Z = matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE)
Z
solve(Z)

R has a number of matrix specific functions for obtaining dimension and summary information.

X = matrix(1:6, 2, 3)
X
dim(X)
rowSums(X)
colSums(X)
rowMeans(X)
colMeans(X)

The diag() function can be used in a number of ways. We can extract the diagonal of a matrix.

diag(Z)

Or create a matrix with specified elements on the diagonal. (And 0 on the off-diagonals.)

diag(1:5)

Or, lastly, create a square matrix of a certain dimension with 1 for every element of the diagonal and 0 for the off-diagonals.

diag(5)

Calculations with Vectors and Matrices {-}

Certain operations in R, for example %*%, have different behavior on vectors and matrices. To illustrate this, we will first create two vectors.

a_vec = c(1, 2, 3)
b_vec = c(2, 2, 2)

Note that these are indeed vectors. They are not matrices.

c(is.vector(a_vec), is.vector(b_vec))
c(is.matrix(a_vec), is.matrix(b_vec))

When this is the case, the %*% operator is used to calculate the dot product, also known as the inner product, of the two vectors.

The dot product of vectors $\boldsymbol{a} = \lbrack a_1, a_2, \cdots, a_n \rbrack$ and $\boldsymbol{b} = \lbrack b_1, b_2, \cdots, b_n \rbrack$ is defined to be

[ \boldsymbol{a} \cdot \boldsymbol{b} = \sum_{i = 1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n. ]

a_vec %*% b_vec # inner product
a_vec %o% b_vec # outer product

The %o% operator is used to calculate the outer product of the two vectors.

When vectors are coerced to become matrices, they are column vectors. So a vector of length $n$ becomes an $n \times 1$ matrix after coercion.

as.matrix(a_vec)

If we use the %*% operator on matrices, %*% again performs the expected matrix multiplication. So you might expect the following to produce an error, because the dimensions are incorrect.

as.matrix(a_vec) %*% b_vec

At face value this is a $3 \times 1$ matrix, multiplied by a $3 \times 1$ matrix. However, when b_vec is automatically coerced to be a matrix, R decided to make it a "row vector", a $1 \times 3$ matrix, so that the multiplication has conformable dimensions.

If we had coerced both, then R would produce an error.

as.matrix(a_vec) %*% as.matrix(b_vec)

Another way to calculate a dot product is with the crossprod() function. Given two vectors, the crossprod() function calculates their dot product. The function has a rather misleading name.

crossprod(a_vec, b_vec)  # inner product
tcrossprod(a_vec, b_vec)  # outer product

This function could be very useful later. When used with matrices $X$ and $Y$ as arguments, it calculates

[ X^\top Y. ]

When dealing with linear models, the calculation

[ X^\top X ]

is used repeatedly.

C_mat = matrix(c(1, 2, 3, 4, 5, 6), 2, 3)
D_mat = matrix(c(2, 2, 2, 2, 2, 2), 2, 3)

This is useful both as a shortcut for a frequent calculation and as a more efficient implementation than using t() and %*%.

crossprod(C_mat, D_mat)
t(C_mat) %*% D_mat
all.equal(crossprod(C_mat, D_mat), t(C_mat) %*% D_mat)
crossprod(C_mat, C_mat)
t(C_mat) %*% C_mat
all.equal(crossprod(C_mat, C_mat), t(C_mat) %*% C_mat)

Data Frames

We have previously seen vectors and matrices for storing data as we introduced R. We will now introduce a data frame which will be the most common way that we store and interact with data in this course.

example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          y = rep("Hello", 10),
                          z = rep(c(TRUE, FALSE), 5))

Unlike a matrix, which can be thought of as a vector rearranged into rows and columns, a data frame is not required to have the same data type for each element. A data frame is a list of vectors. So, each vector must contain the same data type, but the different vectors can store different data types.

example_data
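
Since a data frame is a list of vectors, list operations apply to it. A quick sketch using the example_data frame created above:

is.list(example_data)        # TRUE, a data frame is a list underneath
sapply(example_data, class)  # the data type stored in each column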

The data.frame() function above is one way to create a data frame. We can also import data from various file types into R, as well as use data stored in packages.

write.csv(example_data, "data/example_data.csv", row.names = FALSE)

The example data above can also be stored as a .csv file, as the write.csv() call above does. To read this data into R, we would use the read.csv() function.

example_data_from_csv = read.csv("data/example_data.csv")

This particular line of code assumes that the file example_data.csv exists in a folder called data in your current working directory.
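
You can check the current working directory with getwd() and change it with setwd(); the path below is purely hypothetical.

getwd()                      # print the current working directory
# setwd("~/path/to/project") # hypothetical path, substitute your own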

Alternatively, we could use the "Import Dataset" feature in RStudio which can be found in the environment window. (By default, the top-right pane of RStudio.)

RStudio Import Screen

Once completed, this process will automatically generate the code to import a file. The resulting code will be shown in the console window.

Earlier we looked at installing packages, in particular the ggplot2 package. (A package for visualization. While not necessary for this course, it is quickly growing in popularity.)

library(ggplot2)

Inside the ggplot2 package is a dataset called mpg. By loading the package using the library() function, we can now access mpg.

When using data from inside a package, there are three things we would generally like to do:

- Look at the raw data.
- Understand the data. (Where did it come from? What are the observations and variables?)
- Visualize the data.

To look at the data, we have two useful commands: head() and str().

head(mpg, n = 10)

The function head() will display the first n observations of the data frame.

str(mpg)

The function str() will display the "structure" of the data frame. It will display the number of observations and variables, list the variables, give the type of each variable, and show some elements of each variable.

It is important to note that while matrices have rows and columns, data frames instead have observations and variables. When displayed in the console or viewer, each row is an observation and each column is a variable. However, generally speaking, their order does not matter; it is simply a side-effect of how the data was entered or stored.

In this dataset an observation is for a particular model-year of a car, and the variables describe attributes of the car, for example its highway fuel efficiency.

To understand more about the data set, we use the ? operator to pull up the documentation for the data.

?mpg

R has a number of functions for quickly working with and extracting basic information from data frames. To quickly obtain a vector of the variable names, we use the names() function.

names(mpg)

To access one of the variables as a vector, we use the $ operator.

mpg$year
mpg$hwy

We can use the dim(), nrow() and ncol() functions to obtain information about the dimension of the data frame.

dim(mpg)
nrow(mpg)
ncol(mpg)

Here nrow() is also the number of observations, which in most cases is the sample size.

Subsetting data frames can work much like subsetting matrices using square brackets, [,]. Here, we find fuel efficient vehicles earning over 35 miles per gallon and only display manufacturer, model and year.

mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]

An alternative would be to use the subset() function, which has a much more readable syntax.

subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year"))

Lastly, we could use the filter and select functions from the dplyr package which introduces the %>% operator from the magrittr package. This is not necessary for this course, however the dplyr package is something you should be aware of as it is becoming a popular tool in the R world.

library(dplyr)
mpg %>% filter(hwy > 35) %>% select(manufacturer, model, year)

All three approaches produce the same results. Which you use will be largely based on a given situation as well as user preference.

Plotting

Now that we have some data to work with, and we have learned about the data at the most basic level, our next task is to visualize the data. Often, a proper visualization can illuminate features of the data that can inform further analysis.

We will look at three methods of visualizing data that we will use throughout the course:

- Histograms
- Boxplots
- Scatterplots

Histograms

When visualizing a single numerical variable, a histogram will be our go-to tool, which can be created in R using the hist() function.

hist(mpg$cty)

The histogram function has a number of parameters which can be changed to make our plot look much nicer. Use the ? operator to read the documentation for hist() to see a full list of these parameters.

hist(mpg$cty,
     xlab   = "Miles Per Gallon (City)",
     main   = "Histogram of MPG (City)",
     breaks = 12,
     col    = "dodgerblue",
     border = "darkorange")

Importantly, you should always be sure to label your axes and give the plot a title. The argument breaks is specific to hist(). Entering an integer will give a suggestion to R for how many bars to use for the histogram. By default R will attempt to intelligently guess a good number of breaks, but as we can see here, it is sometimes useful to modify this yourself.

Boxplots

To visualize the relationship between a numerical and categorical variable, we will use a boxplot. In the mpg dataset, the drv variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.

unique(mpg$drv)

First note that we can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. To do so in R, we use the boxplot() function.

boxplot(mpg$hwy)

However, more often we will use boxplots to compare a numerical variable for different values of a categorical variable.

boxplot(hwy ~ drv, data = mpg)

Here we used the boxplot() command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The R syntax hwy ~ drv, data = mpg reads "Plot the hwy variable against the drv variable using the dataset mpg." We see the use of a ~ (which specifies a formula) and also a data = argument. This will be a syntax that is common to many functions we will use in this course.

boxplot(hwy ~ drv, data = mpg,
     xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
     ylab   = "Miles Per Gallon (Highway)",
     main   = "MPG (Highway) vs Drivetrain",
     pch    = 20,
     cex    = 2,
     col    = "darkorange",
     border = "dodgerblue")

Again, boxplot() has a number of additional arguments which have the ability to make our plot more visually appealing.

Scatterplots

Lastly, to visualize the relationship between two numeric variables we will use a scatterplot. This can be done with the plot() function and the ~ syntax we just used with a boxplot. (The function plot() can also be used more generally; see the documentation for details.)

plot(hwy ~ displ, data = mpg)
plot(hwy ~ displ, data = mpg,
     xlab = "Engine Displacement (in Liters)",
     ylab = "Miles Per Gallon (Highway)",
     main = "MPG (Highway) vs Engine Displacement",
     pch  = 20,
     cex  = 2,
     col  = "dodgerblue")

Distributions

When working with different statistical distributions, we often want to make probabilistic statements based on the distribution.

We typically want to know one of four things:

This used to be done with statistical tables printed in the back of textbooks. Now, R has functions for obtaining density, distribution, quantile and random values.

The general naming structure of the relevant R functions is:

- dname calculates the density (pdf) at input x.
- pname calculates the distribution (cdf) at input q.
- qname calculates the quantile at an input probability.
- rname generates a random draw from a particular distribution.

Note that name represents the name of the given distribution.

For example, consider a random variable $X$ which is $N(\mu = 2, \sigma^2 = 25)$. (Note, we are parameterizing using the variance $\sigma^2$. R however uses the standard deviation.)

To calculate the value of the pdf at x = 3, that is, the height of the curve at x = 3, use:

dnorm(x = 3, mean = 2, sd = 5)

To calculate the value of the cdf at x = 3, that is, $P(X \leq 3)$, the probability that $X$ is less than or equal to 3, use:

pnorm(q = 3, mean = 2, sd = 5)

Or, to calculate the quantile for probability 0.975, use:

qnorm(p = 0.975, mean = 2, sd = 5)

Lastly, to generate a random sample of size n = 10, use:

rnorm(n = 10, mean = 2, sd = 5)

These functions exist for many other distributions, including but not limited to:

| Command | Distribution |
|---------|--------------|
| *binom  | Binomial     |
| *t      | t            |
| *pois   | Poisson      |
| *f      | F            |
| *chisq  | Chi-Squared  |

Where * can be d, p, q, and r. Each distribution will have its own set of parameters which need to be passed to the functions as arguments. For example, dbinom() would not have arguments for mean and sd, since those are not parameters of the distribution. Instead a binomial distribution is usually parameterized by $n$ and $p$, however R chooses to call them something else. To find the names that R uses we would use ?dbinom and see that R instead calls the arguments size and prob. For example:

dbinom(x = 6, size = 10, prob = 0.75)

Also note that, when using the dname functions with discrete distributions, they are the pmf of the distribution. For example, the above command is $P(Y = 6)$ if $Y \sim b(n = 10, p = 0.75)$. (The probability of flipping an unfair coin 10 times and seeing 6 heads, if the probability of heads is 0.75.)
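
As a quick sanity check of the relationship between the pmf and cdf for discrete distributions, summing dbinom() over $0, 1, \ldots, 6$ should reproduce the value of pbinom() at 6:

sum(dbinom(0:6, size = 10, prob = 0.75))  # P(Y = 0) + ... + P(Y = 6)
pbinom(q = 6, size = 10, prob = 0.75)     # P(Y <= 6), the same value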

Programming Basics

Logical Operators

| Operator | Summary                       | Example              | Result |
|----------|-------------------------------|----------------------|--------|
| x < y    | x less than y                 | 3 < 42               | TRUE   |
| x > y    | x greater than y              | 3 > 42               | FALSE  |
| x <= y   | x less than or equal to y     | 3 <= 42              | TRUE   |
| x >= y   | x greater than or equal to y  | 3 >= 42              | FALSE  |
| x == y   | x equal to y                  | 3 == 42              | FALSE  |
| x != y   | x not equal to y              | 3 != 42              | TRUE   |
| !x       | not x                         | !(3 > 42)            | TRUE   |
| x \| y   | x or y                        | (3 > 42) \| TRUE     | TRUE   |
| x & y    | x and y                       | (3 < 4) & (42 > 13)  | TRUE   |

In R, logical operators are vectorized. To demonstrate this, we will use the following height and weight data.

heights = c(110, 120, 115, 136, 205, 156, 175)
weights = c(64, 67, 62, 60, 77, 70, 66)

First, using the < operator, we can find which heights are less than 121. Further, we could also find which heights are less than 121 or exactly equal to 156.

heights < 121
heights < 121 | heights == 156

Often, a vector of logical values is useful for subsetting a vector. For example, we can find the heights that are larger than 150. We can then use the resulting vector to subset the heights vector, thus actually returning the heights that are above 150, instead of a vector of which values are above 150. Here we also obtain the weights corresponding to heights above 150.

heights > 150
heights[heights > 150]
weights[heights > 150]

When comparing vectors, be sure you are comparing vectors of the same length.

a = 1:10
b = 2:4
a < b

What happened here? R still performed the operation, but it also gives us a warning. (To perform the operation, R automatically made b longer by repeating b as needed.)

The one exception to this behavior is comparing to a vector of length 1. R does not warn us in this case, as comparing each value of a vector to a single value is a common operation that is usually reasonable to perform.

a > 5

Often we will want to convert TRUE and FALSE values to 1 and 0. When performing mathematical operations on TRUE and FALSE, this is done automatically through type coercion.

5 + (a > 5)

By calling sum() on a vector of logical values, we can essentially count the number of TRUE values.

sum(a > 5)

Here we count the elements of a that are larger than 5. This is an extremely useful feature.
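
Similarly, calling mean() on a logical vector gives the proportion of TRUE values, a trick we will rely on again when simulating later in this chapter.

mean(a > 5)  # proportion of elements of a larger than 5, here 0.5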

Control Flow

In R, the if/else syntax is:

if (...) {
  some R code
} else {
  more R code
}

For example,

x = 1
y = 3
if (x > y) {
  z = x * y
  print("x is larger than y")
} else {
  z = x + 5 * y
  print("x is less than or equal to y")
}

z

R also has a special function ifelse() which is very useful. It returns one of two specified values based on a conditional statement.

ifelse(4 > 3, 1, 0)

The real power of ifelse() comes from its ability to be applied to vectors.

fib = c(1, 1, 2, 3, 5, 8, 13, 21)
ifelse(fib > 6, "Foo", "Bar")

Now a for loop example,

x = 11:15
for (i in 1:5) {
  x[i] = x[i] * 2
}

x

Note that this for loop is very normal in many programming languages, but not in R. In R we would not use a loop; instead, we would simply use a vectorized operation.

x = 11:15
x = x * 2
x

Functions

So far we have been using functions, but haven't actually discussed some of their details.

function_name(arg1 = 10, arg2 = 20)

To use a function, you simply type its name, followed by an open parenthesis, then specify values of its arguments, then finish with a closing parenthesis.

An argument is a variable which is used in the body of the function. Specifying the values of the arguments is essentially providing the inputs to the function.

We can also write our own functions in R. For example, we often like to "standardize" variables, that is, subtracting the sample mean, and dividing by the sample standard deviation.

[ \frac{x - \bar{x}}{s} ]

In R we would write a function to do this. When writing a function, there are three things you must do:

- Give the function a name.
- Specify the arguments of the function.
- Write the body of the function, the last line of which determines what the function returns.

standardize = function(x) {
  m = mean(x)
  std = sd(x)
  result = (x - m) / std
  result
}

Here the name of the function is standardize, and the function has a single argument x which is used in the body of function. Note that the output of the final line of the body is what is returned by the function. In this case the function returns the vector stored in the variable result.
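
An explicit return() produces the same behavior; the following sketch is equivalent to the definition above:

standardize = function(x) {
  return((x - mean(x)) / sd(x))  # return() makes the returned value explicit
}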

To test our function, we will take a random sample of size n = 10 from a normal distribution with a mean of 2 and a standard deviation of 5.

(test_sample = rnorm(n = 10, mean = 2, sd = 5))
standardize(x = test_sample)

This function could be written much more succinctly, simply performing all the operations on one line and immediately returning the result, without storing any of the intermediate results.

standardize = function(x) {
  (x - mean(x)) / sd(x)
}

When specifying arguments, you can provide default arguments.

power_of_num = function(num, power = 2) {
  num ^ power
}

Let's look at a number of ways that we could run this function to perform the operation 10^2 resulting in 100.

power_of_num(10)
power_of_num(10, 2)
power_of_num(num = 10, power = 2)
power_of_num(power = 2, num = 10)

Note that without using the argument names, the order matters. The following code will not evaluate to the same output as the previous example.

power_of_num(2, 10)

Also, the following line of code would produce an error since arguments without a default value must be specified.

power_of_num(power = 5)

To further illustrate a function with a default argument, we will write a function that calculates sample variance two ways.

By default, it will calculate the unbiased estimate of $\sigma^2$, which we will call $s^2$.

[ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2 ]

It will also have the ability to return the biased estimate (based on maximum likelihood) which we will call $\hat{\sigma}^2$.

[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 ]

get_var = function(x, biased = FALSE) {
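  # divide by n - 1 when biased = FALSE (the default), by n when biased = TRUE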
  n = length(x) - 1 * !biased
  (1 / n) * sum((x - mean(x)) ^ 2)
}
get_var(test_sample)
get_var(test_sample, biased = FALSE)
var(test_sample)

We see the function is working as expected, and when returning the unbiased estimate it matches R's built in function var(). Finally, let's examine the biased estimate of $\sigma^2$.

get_var(test_sample, biased = TRUE)

Hypothesis Tests in R

A prerequisite for STAT 420 is an understanding of the basics of hypothesis testing. Recall the basic structure of hypothesis tests:

- A null hypothesis, $H_0$, and an alternative hypothesis, $H_1$.
- A test statistic, and its distribution under the null hypothesis.
- A decision rule based on the p-value (or a rejection region) at a chosen significance level $\alpha$.

We'll do some quick review of two of the most common tests to show how they are performed using R.

One Sample t-Test: Review

Suppose $x_{i} \sim \mathrm{N}(\mu,\sigma^{2})$ and we want to test $H_{0}: \mu = \mu_{0}$ versus $H_{1}: \mu \neq \mu_{0}.$

Assuming $\sigma$ is unknown, we use the one-sample Student's $t$ test statistic:

[ t = \frac{\bar{x}-\mu_{0}}{s/\sqrt{n}} \sim t_{n-1}, ]

where $\bar{x} = \displaystyle\frac{\sum_{i=1}^{n}x_{i}}{n}$ and $s = \sqrt{\displaystyle\frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.

A $100(1 - \alpha)$\% confidence interval for $\mu$ is given by,

[ \bar{x} \pm t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} ]

where $t_{n-1}(\alpha/2)$ is the critical value such that $P\left(t>t_{n-1}(\alpha/2)\right) = \alpha/2$ for $n-1$ degrees of freedom.

One Sample t-Test: Example

Suppose a grocery store sells "16 ounce" boxes of Captain Crisp cereal. A random sample of 9 boxes was taken and weighed. The weights in ounces are stored in the data frame capt_crisp.

capt_crisp = data.frame(weight = c(15.5, 16.2, 16.1, 15.8, 15.6, 16.0, 15.8, 15.9, 16.2))

The company that makes Captain Crisp cereal claims that the average weight of a box is at least 16 ounces. We will assume the weight of cereal in a box is normally distributed and use a 0.05 level of significance to test the company's claim.

To test $H_{0}: \mu \geq 16$ versus $H_{1}: \mu < 16$, the test statistic is

[ t = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}} ]

The sample mean $\bar{x}$ and the sample standard deviation $s$ can be easily computed using R. We also create variables which store the hypothesized mean and the sample size.

x_bar = mean(capt_crisp$weight)
s     = sd(capt_crisp$weight)
mu_0  = 16
n     = 9

We can then easily compute the test statistic.

t = (x_bar - mu_0) / (s / sqrt(n))
t

Under the null hypothesis, the test statistic has a $t$ distribution with $n - 1$ degrees of freedom, in this case $n - 1 = 8$.

To complete the test, we need to obtain the p-value of the test. Since this is a one-sided test with a less-than alternative, we need the area to the left of the observed test statistic, $t = -1.2$, for a $t$ distribution with $8$ degrees of freedom. That is,

[ P(t_{8} < -1.2) ]

pt(t, df = n - 1)

We now have the p-value of our test, which is greater than our significance level (0.05), so we fail to reject the null hypothesis.

Alternatively, this entire process could have been completed using one line of R code.

t.test(x = capt_crisp$weight, mu = 16, alternative = c("less"), conf.level = 0.95)

We supply R with the data, the hypothesized value of $\mu$, the alternative, and the confidence level. R then returns a wealth of information including:

- The value of the test statistic.
- The degrees of freedom of the distribution under the null hypothesis.
- The p-value of the test.
- The confidence interval which corresponds to the test.
- An estimate of the mean of the sample.

Since the test was one-sided, R returned a one-sided confidence interval. If instead we wanted a two-sided interval for the mean weight of boxes of Captain Crisp cereal we could modify our code.

capt_test_results = t.test(capt_crisp$weight, mu = 16,
                           alternative = c("two.sided"), conf.level = 0.95)

This time we have stored the results. By doing so, we can directly access portions of the output from t.test(). To see what information is available we use the names() function.

names(capt_test_results)

We are interested in the confidence interval which is stored in conf.int.

capt_test_results$conf.int

Let's check this interval "by hand." The one piece of information we are missing is the critical value, $t_{n-1}(\alpha/2) = t_{8}(0.025)$, which can be calculated in R using the qt() function.

qt(0.975, df = 8)

So, the 95\% CI for the mean weight of a cereal box is calculated by plugging into the formula,

[ \bar{x} \pm t_{n-1}(\alpha/2) \frac{s}{\sqrt{n}} ]

c(mean(capt_crisp$weight) - qt(0.975, df = 8) * sd(capt_crisp$weight) / sqrt(9),
  mean(capt_crisp$weight) + qt(0.975, df = 8) * sd(capt_crisp$weight) / sqrt(9))

Two Sample t-Test: Review

Suppose $x_{i} \sim \mathrm{N}(\mu_{x}, \sigma^{2})$ and $y_{i} \sim \mathrm{N}(\mu_{y}, \sigma^{2}).$

Want to test $H_{0}: \mu_{x} - \mu_{y} = \mu_{0}$ versus $H_{1}: \mu_{x} - \mu_{y} \neq \mu_{0}.$

Assuming $\sigma$ is unknown, use the two-sample Student's $t$ test statistic:

[ t = \frac{(\bar{x} - \bar{y})-\mu_{0}}{s_{p}\sqrt{\frac{1}{n}+\frac{1}{m}}} \sim t_{n+m-2}, ]

where $\displaystyle\bar{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}$, $\displaystyle\bar{y}=\frac{\sum_{i=1}^{m}y_{i}}{m}$, and $s_p^2 = \displaystyle\frac{(n-1)s_x^2+(m-1)s_y^2}{n+m-2}$.

A $100(1-\alpha)$\% CI for $\mu_{x}-\mu_{y}$ is given by

[ (\bar{x} - \bar{y}) \pm t_{n+m-2}(\alpha/2) \left(s_{p}\textstyle\sqrt{\frac{1}{n}+\frac{1}{m}}\right), ]

where $t_{n+m-2}(\alpha/2)$ is the critical value such that $P\left(t>t_{n+m-2}(\alpha/2)\right)=\alpha/2$.

Two Sample t-Test: Example

Assume that the distributions of $X$ and $Y$ are $\mathrm{N}(\mu_{1},\sigma^{2})$ and $\mathrm{N}(\mu_{2},\sigma^{2})$, respectively. Given the $n = 6$ observations of $X$,

x = c(70, 82, 78, 74, 94, 82)
n = length(x)

and the $m = 8$ observations of $Y$,

y = c(64, 72, 60, 76, 72, 80, 84, 68)
m = length(y)

we will test $H_{0}: \mu_{1} = \mu_{2}$ versus $H_{1}: \mu_{1} > \mu_{2}$.

First, note that we can calculate the sample means and standard deviations.

x_bar = mean(x)
s_x   = sd(x)
y_bar = mean(y)
s_y   = sd(y)

We can then calculate the pooled standard deviation.

[ s_{p} = \sqrt{\frac{(n-1)s_{x}^{2}+(m-1)s_{y}^{2}}{n+m-2}} ]

s_p = sqrt(((n - 1) * s_x ^ 2 + (m - 1) * s_y ^ 2) / (n + m - 2))

Thus, the relevant $t$ test statistic is given by

[ t = \frac{(\bar{x}-\bar{y})-\mu_{0}}{s_{p}\sqrt{\frac{1}{n}+\frac{1}{m}}}. ]

t = ((x_bar - y_bar) - 0) / (s_p * sqrt(1 / n + 1 / m))
t

Note that $t \sim t_{n + m - 2} = t_{12}$, and the observed test statistic is $t \approx 1.8234$, so we can calculate the p-value, which is

[ P(t_{12} > 1.8234). ]

1 - pt(t, df = n + m - 2)

But, then again, we could have simply performed this test in one line of R.

t.test(x, y, alternative = c("greater"), var.equal = TRUE)

Recall that a two-sample $t$-test can be done with or without an equal variance assumption. Here var.equal = TRUE tells R we would like to perform the test under the equal variance assumption.
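
For comparison, omitting var.equal = TRUE (it defaults to FALSE) performs the Welch two-sample t-test, which does not assume equal variances; a quick sketch:

t.test(x, y, alternative = c("greater"))  # Welch test, unequal variances allowed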

Above we carried out the analysis using two vectors x and y. In general, we will have a preference for using data frames.

t_test_data = data.frame(values = c(x, y),
                         group  = c(rep("A", length(x)), rep("B", length(y))))

We now have the data stored in a single variable (values) and have created a second variable (group) which indicates which "sample" the value belongs to.

t_test_data

Now to perform the test, we still use the t.test() function but with the ~ syntax and a data argument.

t.test(values ~ group, data = t_test_data,
       alternative = c("greater"), var.equal = TRUE)

Simulation

Simulation and model fitting are related but opposite processes.

Simulation vs Modeling

Often we will simulate data according to a process we decide, then use a modeling method seen in class. We can then verify how well the method works, since we know the data generating process.

One of the biggest strengths of R is its ability to carry out simulations using built-in functions for generating random samples from certain distributions. We'll look at two very simple examples here, however simulation will be a topic we revisit several times throughout the course.

Paired Differences

Consider the model:

[ \begin{split} X_{11}, X_{12}, \ldots, X_{1n} \sim N(\mu_1,\sigma^2) \\ X_{21}, X_{22}, \ldots, X_{2n} \sim N(\mu_2,\sigma^2) \end{split} ]

Assume that $\mu_1 = 6$, $\mu_2 = 5$, $\sigma^2 = 4$ and $n = 25$.

Let

[ \begin{aligned} \bar{X}_1 &= \displaystyle\frac{1}{n}\sum_{i=1}^{n}X_{1i} \\ \bar{X}_2 &= \displaystyle\frac{1}{n}\sum_{i=1}^{n}X_{2i} \\ D &= \bar{X}_1 - \bar{X}_2. \end{aligned} ]

Suppose we would like to calculate $P(0 < D < 2)$. First we will need to obtain the distribution of $D$.

Recall,

[ \bar{X}_1 \sim N\left(\mu_1,\frac{\sigma^2}{n}\right) ]

and

[ \bar{X}_2 \sim N\left(\mu_2,\frac{\sigma^2}{n}\right). ]

Then,

[ D = \bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1-\mu_2, \frac{\sigma^2}{n} + \frac{\sigma^2}{n}\right) = N\left(6-5, \frac{4}{25} + \frac{4}{25}\right). ]

So,

[ D \sim N(\mu = 1, \sigma^2 = 0.32). ]

Thus,

[ P(0 < D < 2) = P(D < 2) - P(D < 0). ]

This can then be calculated using R without a need to first standardize, or use a table.

pnorm(2, mean = 1, sd = sqrt(0.32)) - pnorm(0, mean = 1, sd = sqrt(0.32))

An alternative approach would be to simulate a large number of observations of $D$, then use the empirical distribution to calculate the probability.

Our strategy will be to repeatedly:

- Generate a sample of 25 random observations from $N(\mu_1 = 6, \sigma^2 = 4)$. Call the mean of this sample $\bar{x}_{1s}$.
- Generate a sample of 25 random observations from $N(\mu_2 = 5, \sigma^2 = 4)$. Call the mean of this sample $\bar{x}_{2s}$.
- Calculate the difference of the means, $d_s = \bar{x}_{1s} - \bar{x}_{2s}$.

We will repeat the process a large number of times. Then we will use the distribution of the simulated observations of $d_s$ as an estimate for the true distribution of $D$.

set.seed(42)
num_samples = 10000
differences = rep(0, num_samples)

Before starting our for loop to perform the operation, we set a seed for reproducibility, create and set a variable num_samples which will define the number of repetitions, and lastly create a variable differences which will store the simulated values, $d_s$.

By using set.seed() we can reproduce the random results of rnorm() each time starting from that line.
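
A quick demonstration (the final set.seed() call restores the state so the simulation below is unaffected):

set.seed(42)
rnorm(3)     # three "random" values
set.seed(42)
rnorm(3)     # the same three values again
set.seed(42) # re-set so the loop below starts from this same state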

for (s in 1:num_samples) {
  x1 = rnorm(n = 25, mean = 6, sd = 2)
  x2 = rnorm(n = 25, mean = 5, sd = 2)
  differences[s] = mean(x1) - mean(x2)
}

To estimate $P(0 < D < 2)$ we will find the proportion of values of $d_s$ (among the num_samples = 10000 values of $d_s$ generated) that are between 0 and 2.

mean(0 < differences & differences < 2)

Recall that above we derived the distribution of $D$ to be $N(\mu = 1, \sigma^2 = 0.32)$.

If we look at a histogram of the differences, we find that it looks very much like a normal distribution.

hist(differences, breaks = 20, 
     main   = "Empirical Distribution of D",
     xlab   = "Simulated Values of D",
     col    = "dodgerblue",
     border = "darkorange")

Also the sample mean and variance are very close to what we would expect.

mean(differences)
var(differences)

We could have also accomplished this task with a single line of more "idiomatic" R.

set.seed(42)
diffs = replicate(10000, mean(rnorm(25, 6, 2)) - mean(rnorm(25, 5, 2)))

Use ?replicate to take a look at the documentation for the replicate function and see if you can understand how this line performs the same operations that our for loop above executed.

mean(differences == diffs)

We see that by setting the same seed for the randomization, we actually obtain identical results!

Distribution of a Sample Mean

For another example of simulation, we will simulate observations from a Poisson distribution, and examine the empirical distribution of the sample mean of these observations.

Recall, if

[ X \sim Pois(\mu) ]

then

[ E[X] = \mu ]

and

[ Var[X] = \mu. ]

Also, recall that for a random variable $X$ with finite mean $\mu$ and finite variance $\sigma^2$, the central limit theorem tells us that the mean, $\bar{X}$ of a random sample of size $n$ is approximately normal for large values of $n$. Specifically, as $n \to \infty$,

[ \bar{X} \overset{d}{\to} N\left(\mu, \frac{\sigma^2}{n}\right). ]

The following verifies this result for a Poisson distribution with $\mu = 10$ and a sample size of $n = 50$.

set.seed(1337)
mu          = 10
sample_size = 50
samples     = 100000
x_bars      = rep(0, samples)
for(i in 1:samples){
  x_bars[i] = mean(rpois(sample_size, lambda = mu))
}
x_bar_hist = hist(x_bars, breaks = 50, 
                  main = "Histogram of Sample Means",
                  xlab = "Sample Means")

Now we will compare sample statistics from the empirical distribution with their known values based on the parent distribution.

c(mean(x_bars), mu)
c(var(x_bars), mu / sample_size)
c(sd(x_bars), sqrt(mu) / sqrt(sample_size))

And here, we will calculate the proportion of sample means that are within 2 standard deviations of the population mean.

mean(x_bars > mu - 2 * sqrt(mu) / sqrt(sample_size) &
     x_bars < mu + 2 * sqrt(mu) / sqrt(sample_size))

This last histogram uses a bit of a trick to approximately shade the bars that are within two standard deviations of the mean.

shading = ifelse(x_bar_hist$breaks > mu - 2 * sqrt(mu) / sqrt(sample_size) & 
                   x_bar_hist$breaks < mu + 2 * sqrt(mu) / sqrt(sample_size),
                  "darkorange", "dodgerblue")

x_bar_hist = hist(x_bars, breaks = 50, col = shading,
                  main = "Histogram of Sample Means, Two Standard Deviations",
                  xlab = "Sample Means")

