In ehassler/MontgomeryDAE: MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"

source('_format.R')

R is an interpreted language for analyzing and visualizing data. It's very different from products like Minitab and JMP even though the users' goals are often the same. Perhaps the most important factor for the success of R is that it's free and open source. Because of its accessibility, many people have chosen to adopt it for research, business purposes, and teaching. Some of them have created packages that would let R do the analysis they needed to do, or as a way to enable people to do a new kind of analysis altogether. R grew and grew. Today it is an indispensable tool for many statisticians and data scientists.

There's no way to cover most or even much of R and the important packages. Instead, I've tried to provide just a taste of R, enough to get people through the examples contained in the subsequent sections. Each following chapter corresponds to the same numbered chapter in Design and Analysis of Experiments. This document is very much a companion --- on its own it will likely make little to no sense, but used with the textbook it should allow R users to follow along with the examples.

This chapter is divided into two main parts. First, I will describe how to get and install R and associated tools. Second, I will cover how key features of the R language work.

Getting Started

To get started we will first download the R interpreter so that we can run R code. Then we will download the RStudio editor to make editing nice and easy. Finally we will install the MontgomeryDAE package from GitHub to bring in all of the data sets from the textbook.

Installing the R Interpreter

At present you can find R through the website https://www.r-project.org. There you can download a version of R best suited for your operating system, and learn what code names for each release. (The present release 3.6.1 is Action of the Toes.)

For macOS users, instead of installing R from the pkg, I'd recommend installing homebrew and installing R using homebrew. At present you can do this by running the following from Terminal: ```{bash eval=FALSE} /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

\noindent Next, to install R using homebrew simply run the following in the terminal:
```{bash eval=FALSE}
brew install r

\noindent The homebrew installation method makes things such as installing R source packages easier down the road. Windows users can install Rtools to provide a toolchain for compiling source packages as well.

When reading about R you'll see CRAN a lot. CRAN standards for Comprehensive R Archive Network and it's where you'll actually be downloading R and most packages. Packages are collections of code (in R, C, Fortran, or other languages), data sets, and documentation that we load using the library function. There is a huge variety of R packages that cover almost anything you could want (note the "almost" qualifier).

Microsoft began seriously investing in R back in 2015 when it acquired Revolution Analytics. Since then they've developed Microsoft R Open (MRO) which is their own runtime for R that has improved performance (see https://mran.microsoft.com/open) but often lags behind the current version of CRAN R. Microsoft has integrated R into other products as well --- you can run R code directly inside SQL Server queries. I tend to use CRAN R over MRO since that's the version packages are tested against.

Install RStudio

Most people use an integrated development environment (IDE) called RStudio (https://www.rstudio.com/) when they work with R. In fact, once you start using RStudio if you ever have to go back to the regular R environment it feels very punishing. A free version of RStudio is available for most operating systems. The RStudio IDE is also extremely useful for publishing via RMarkdown, package development, and a host of other R-related activities.

R Studio Window

The RStudio window has four main panes. The top left is where you can edit files and highlight code to run. The bottom left is the interactive R session and you can enter commands there directly if you wish. The bottom left also has a Terminal tab so that you can use the command prompt or shell from within the application.

options(tinytex.verbose = TRUE)
include_graphics('rstudio.png')

\noindent The top right includes a list of all variables in the global environment and also has a History tab of all commands you've run. The bottom right will show files, packages, plots, and help files while you work. RStudio has many more features which I will not cover.

Packages and Libraries

As I mentioned earlier, packages are collections of code, data, and help files that extend the capabilities of R. Some packages, such as MASS, are shipped with R. Others can be downloaded and installed by R using the install.packages function (or in RStudio go to Tools > Install Packages). Let's load the MASS package and use fractions to convert decimals to their nearest rational expression:

library("MASS")
fractions(0.333333)

Above and throughout this document the output of commands will be displayed following the command as a comment. Comments in R are everything after # to the end of the line. Above, the output of the fractions command is 1/3, and so it is displayed as # [1] 1/3.

The MontgomeryDAE package contains data files from the textbook as well as a copy of this document. Download this package from https://github.com/ehassler/MontgomeryDAE/releases looking for the most current release. A PDF version of this guide will also be attached to the release.

To install a downloaded source package in RStudio you can use Install Packages from the Tools menu and select Package Archive File (tar.gz) from the Install From drop-down menu. Then, browse to the location where the MontgomeryDAE package was downloaded. If you're using R without RStudio simply use the command

install.packages("...path to MontgomeryDAE package...", repos = NULL, type = "source")

\noindent to install the package.

\noindent Once installed, you can use the help system to see this document as a series of help files:

help(package = 'MontgomeryDAE')

\noindent Follow the link for user guides, package vignettes, and other documentation to see an HTML version of this document for each chapter.

The R Language

If you've worked with a C-like language you'll feel comfortable with R. The syntax of R is something between Javascript and Python. You don't have to worry about declaring types or compiling your code. Instead you can work with R pretty interactively, typing different things into a session and seeing how things work out. This enables an R user to efficiently do exploratory data analysis while preserving the reproducibility of that exploration (since we can simply re-run that set of commands to get the same results).

Everything in R is an object. You can assign functions to variables and return functions from functions. An odd thing about R is that it has functions to interact with the parse tree of the code that's currently being evaluated. Sometimes you'll see R somehow "knows" the name of the variable you passed into a function and uses that as a label on a graph. Don't be alarmed. R is not becoming self-aware.

One example of this oddity is in loading libraries. Previously we loaded the MASS package using

library("MASS")

\noindent but we also could have used

library(MASS)

\noindent to load it.

R is an interpreted language with an interactive interpreter. Most of the time I work by writing code, highlighting it, and then have RStudio execute it in the interactive session. I'll then play around with that session to figure out what kind of bugs are occurring or to pull up a help file. R's help system is very convenient. Once I feel like my analysis is where I want it to be I'll perform a final check before submitting it to a stakeholder: I'll restart my R environment and run the script I've made to make sure the results are what I expect.

To get help with a function or package in R all you have to do is enter a question mark followed by the name. Help files usually include examples as well. Try entering the following:

?cat

\noindent We can use this function to print to the output:

cat('hello world')

\noindent We also could have used the print function:

print("Hello cat!")

We printed the string 'hello world' and "Hello cat!", each using different quotation marks. This makes no difference, the objects are the same.

print(class("Hello"))
print(class('Hello'))
print('Hello' == "Hello")

Arithmetic operators work as expected as well. For example, if we wish to raise 12 to the second power and add one we simply do:

print(12^2 + 1)

Assignment to a variable can be done a number of ways, but I try to stick to using <-. Let's create a variable and change its value:

friends <- 1
friends <- friends + 1
print(friends)

\noindent Here we see that we assigned a value of 1 to friends, then we added 1 to it, to end up with 2 friends. Use of = is also very common. Using that our code would be:

friends = 1
friends = friends + 1
print(friends)

An odd thing about R is that you can use a period in variable names. The identifier total.weight is perfectly valid, but one should be careful to note that this is not referencing the weight property on a total object. Instead, it's pretty much equivalent to total_weight. The use of a period instead of an underscore can be seen lots of places in R because way back in the day underscores were how assignment was performed. It's not anymore so don't worry about it.

There are some important constants in R as well. TRUE and FALSE mean what you'd expect but must be all capitalized, as does NULL. R also has NA to represent a not-available value. Let's use the c function to concatenate several numbers and then get the median:

x <- c(1, 2.2, 3.2e-1)
print(x)
print(median(x))

\noindent If we add NA into our data then the median would also be not available, so:

x <- c(1, 2.2, 3.2e-1, NA)
print(median(x))

\noindent However, we can tell the median function to just ignore NA values:

print(median(x, na.rm=TRUE))

\noindent Here, I used the name of the argument instead of its position to call the function. This simplifies calling functions with many arguments that have default values.

Functions

Let's create a function to increase our friends. I'll give the second argument a default value so that, if we leave it out of our function call it will take on that value:

f <- function(friend.count, new.friends=1){
  return(friend.count + new.friends)
}

friends <- 2
print(f(friends, 2))
print(f(friends))
print(f(new.friends=1, friend.count=3))

\noindent Note that at the end we reversed the order of the arguments, but since we named them everything acts as we would expect.

In R almost all arguments of a function are passed by value --- that is to say a copy is made and then thrown away when the function completes. Consider this example:

x <- 1

f <- function(y){
  y <- y + 1
  return(y)
}

print(f(x))
print(f(x))

One final note about functions. The last expression in a function is returned automatically if no return value is specified. In the interactive session the returned value is sent to print. Thus we can rewrite our previous function as follows:

x <- 1

f <- function(y){
  y + 1
}

f(x)

Vectors

Most things in R are stored as vectors of the same type. Usually your vector will be numeric, logical (booleans), characters, or a more general object of some kind. R will coerce values to make sure they are the same type within a vector.

c(1, 2, 3, '4')

\noindent Above, R decided that character was the best way to store all of those values together.

c(1, 2, 3, as.numeric('4'))

\noindent Here, as.numeric converted the string to a numeric value and so the vector remained as numeric.

Instead of writing the whole sequence of numbers from one to four we could have used R's shortcut notation:

1:4

\noindent For more complicated sequences we can use the seq function.

Vectors can also have names for their elements. This is very helpful in displaying data and accessing it.

x <- c('first'=1, 'second'=2, 'third'=3)
x

\noindent We can then retrieve a specific element either with the numeric offset (starting with 1 for the first element) or by the name

print(x[1])
print(x['second'])

\noindent It's worth repeating that indexing in R starts with 1. Many programming languages start with 0, and R does not emit an error when you attempt to access the 0th element.

x[0]

\noindent This is a common source of bugs for in code.

When we compare vectors we get a boolean vector back, and we can use that vector to select elements. Consider the following example of random high temperatures in Phoenix, AZ:

x <- c(102, 82, 93, 118, 117, 89, 84, 74, 114, 115)
index <- x < 100
print(index)
x[index]

\noindent We can also use an integer vector of offsets to select from a vector:

index <- 1:(length(x)/2)
print(index)
index <- index * 2
print(index)
print(x[index])

\noindent Finally, if we provide no index at all we get the full vector back:

print(x[])

Matrices and Arrays

Matrices are essentially two-dimensional vectors, and are a type of array. In R we can define an arbitrary-dimensional vector structure with the array function. The methods for accessing elements in a matrix are much the same as the methods we used to access elements of vectors except now we will have commas separating each dimension of the data structure (so matrices have two dimensions, and arrays have one or more dimensions).

First, let's construct a matrix:

X <- matrix(c(1,0,0,0,1,0,0,0,1), ncol=3)
X

\noindent We can use our fancy indices to access sub-matrices:

X[c(1,2),2]

\noindent A three dimensional array can also easily be constructed:

W <- array(
  c(
    c(2,1,1,1,2,1,1,1,2),
    c(3,1,1,1,3,1,1,1,3),
    c(4,1,1,1,4,1,1,1,4)
  ),
  dim=c(3,3,3)
)
W

\noindent We can access elements of an array in the same way:

W[1,1:2,1:2]

\noindent As in the case of vectors, if we leave an index blank it returns everything along that dimension. I'll leave the first two places of the index of W empty and choose a value for the third.

W[ , , 2]

\noindent Certain operators are expressed between percent symbols. For example, matrix multiplication uses %*%.

X <- diag(c(1,2,3))
Y <- matrix(c(0,1,0,1,0,0,0,0,1), ncol=3)
X %*% Y

Lists

Lists are a less strict association of data. A list doesn't require that the types of entries be the same.

y <- list(2, 4, '6', expression(8))
print(y)

\noindent Note the "key" is given between [[ and ]]. This is because we can access an element of a list or get a sublist composed of parts of the original list by using [ and ] instead. The following code returns an element:

print(y[[1]])
print(y[[2]])

\noindent Whereas the following code return a new list:

print(y[1:2])
print(y[1])

\noindent Lists can have string keys as well:

friends <- list(
  'Caroline'='Phoenix',
  'John'='California',
  'Tommy'='California',
  'Gina'='Iowa',
  'Adam'='Vancouver',
  'Paul'='Tempe',
  'Mollie'='Utah'
)
print(friends)
print(friends[['Adam']])
print(friends[c('Adam', 'Caroline')])

\noindent Note that the key was printed using a $ when we printed this list. This is another way we can access elements of a list:

friends$Paul

Data Frames

Data frames are the bread-and-butter data structure of R. A data.frame is a collection of column vectors that we'll use to feed data into our statistical models:

df <- data.frame(
  'time'=c(12.1, 13.3, 13.2, 12.6, 12.8),
  'temp'=c(251, 250, 261, 249, 255)
)
df

\noindent We can access elements of a data.frame similar to how we access elements of matrices, for example:

df[,'time']

Investigating a Variable

Two very useful commands for figuring out what a variable is are str and dput. The str function displays the structure of a variable:

str(df)

\noindent Here we can see that the data.frame has five observations in each of the two columns. Further, both columns are numeric.

The dput function writes an ASCII text representation of the object (i.e. we can paste dput's output into R and be guaranteed to get the same object back):

dput(df)

\noindent We can show their equivalence:

df2 <- structure(list(time = c(12.1, 13.3, 13.2, 12.6, 12.8), temp = c(251, 
250, 261, 249, 255)), class = "data.frame", row.names = c(NA, 
-5L))
df==df2

\noindent You'll note that dput told us that under the hood the data.frame is a list, and indeed we can access it like a list:

df$time

Factors

Sometimes we want to indicate that a column of a data.frame represents qualitative information. For example, if we had assigned people to groups then it doesn't make sense for group 2 to be twice as much a group as group 1 and group 4 to be twice as much a group as group 2. In R we handle this with factors.

factor(c(1,2,3,1,3,2,3,3,2,1,2,2,3))

\noindent Some functions understand that levels represent some qualitative information. Let's load and plot the canonical Iris data set that describes 50 of each of three species of iris flower in terms of the petal length and width and the sepal length and width (see https://en.wikipedia.org/wiki/Iris_flower_data_set for more about the data set. You can find the data set in the datasets package as iris or iris3.)

library(datasets)
data(iris)

This loaded the iris data set into a data.frame named iris. To verify this we can use str to examine the structure of the iris object:

str(iris)

We can get summary statistics on each column of the data.frame as follows:

summary(iris)

\noindent We see there are 3 kinds of iris. Let's plot Petal.Length in terms of Species.

species <- iris[,'Species']
petalLength <- iris[,'Petal.Length']
plot(species, petalLength)

\noindent R decided that, since we're using a factor for $x$, a boxplot is most appropriate for showing the data. Now let's plot Petal.Length by Petal.Width:

plot(iris[,c('Petal.Width','Petal.Length')])

\noindent We get a totally different kind of plot. This scatter plot relates the petal width to the petal length in a way we'd expect. Let's go hog wild!

plot(iris)

\noindent This produces a lattice plot that we can use to see all kinds of two way relationships. You may be asking, "Why does Species have levels '1.0', '2.0', and '3.0'?" Well, those are the indices of the levels of the species factor:

levels(iris$Species)

\noindent If we were to convert the factor to numeric we would get back those offsets:

as.numeric(iris$Species)

\noindent If we wanted to get the original strings back we could do

as.character(iris$Species)

Formula

Another odd thing about R (the third odd thing for those keeping count) is formula objects. Let's say we want to perform a linear regression to predict petal length using petal width and species. R uses something called Wilkinson's notation to describe this relationship. The following is an example formula relating Petal.Length to Petal.Width and Species.

formula <- Petal.Length ~ Petal.Width + Species
print(class(formula))
print(formula)

\noindent We can use the formula with the lm function to get the least squares fit:

lm(Petal.Length ~ Petal.Width + Species, data=iris)

\noindent We can also pass a string to lm to get the same results:

lm("Petal.Length ~ Petal.Width + Species", data=iris)

\noindent The formula also describes how the raw data is to be transformed for modeling. If we used * instead of + then R would assume we want Petal.Width, Species, and the two-factor interaction term Species:Petal.Width:

lm(Petal.Length ~ Petal.Width * Species, data=iris)

We can do a lot of fancy things using Wilkinson's notation and it's becoming common to see it used in Python (patsy package) and Matlab. I'll refer you to a short guide for R from Richard Hahn and Matlab's documentation for more information about the notation.

Where to Learn More

There are a lot of resources you can use to learn more about R.

A (Very) Short Introduction to R by Torfs and Braur is an excellent concise introductory PDF.
The book R In Action and website Quick-R by Rob Kabacoff are also excellent resources.
DataCamp has a free introduction to R course.
Though much older, the book "Modern Applied Statistics With S-Plus" by Venables and Ripley is a great go-to resource for doing statistical analysis in S. This often explains some of the more arcane details of how to use R itself.
The R community and user groups remind me of a friendly data cult: my experience has been of people who are passionate and go out of their way to be helpful to people learning R. Perform a web search to find an R user group or meetup in your area.

------------------

# Edgar's Notes for Himself

 * Determining Sample Size --- The JMP table in the book seems to want way more samples than what I calculate.
 * Smelting Experiment of Section 3.8.3 --- I match the variance component point estimates but the CIs are way off from mine.
 * Example 11.6 --- I can't get results to match jmp.  The rsm package doesn't do no intercept.
 * Chapter 14 --- Not sure how to get whole, sub, and residual data... This whole chapter is trouble.

ehassler/MontgomeryDAE documentation built on March 20, 2021, 7:08 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ehassler/MontgomeryDAE
MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"

In ehassler/MontgomeryDAE: MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"

Getting Started

Installing the R Interpreter

Install RStudio

R Studio Window

Packages and Libraries

The R Language

Functions

Vectors

Matrices and Arrays

Lists

Data Frames

Investigating a Variable

Factors

Formula

Where to Learn More

R Package Documentation

Browse R Packages

We want your feedback!

ehassler/MontgomeryDAE MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"

In ehassler/MontgomeryDAE: MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"

Getting Started

Installing the R Interpreter

Install RStudio

R Studio Window

Packages and Libraries

The R Language

Functions

Vectors

Matrices and Arrays

Lists

Data Frames

Investigating a Variable

Factors

Formula

Where to Learn More

R Package Documentation

Browse R Packages

We want your feedback!

ehassler/MontgomeryDAE
MontgomeryDAE - Data and examples from Montgomery's "Design and Analysis of Experiments"