(PART) Base R Basics {-}

R Basics {#baser-rbasics}

Here is a quick overview of the basics. We will dive deep into R's basic data structures and then how to subset those data structures later in the course. This will give us a good overview of base R and the background needed to dive into R for Data Science.

The three most important functions in R ?, ??, and str:

``{block2, type='rmdtip'}glimpseis a tidyverse equivalent tostr` but with nicer output for complicated data structures.

See this [vocabulary list](http://adv-r.had.co.nz/Vocabulary.html) for a good starting point on the basics functions in base R and some important libraries.  

A book to learn the basics is [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/)

In R there three basic constructs[^constructs]; objects, functions, and environments.

[^constructs]: Technically speaking functions and environments are objects which allows one to do things in R you can't do in many other languages.

## Assignment Operators

We saw this is Coding Style. Use `<-` for assignment and use `=` for parameters.  While you can use `=` for assignment it is generally considered bad practice.

## Naming Rules

R has strict rules about what constitutes a valid name. A __syntactic__ name must consist of letters[^letters], digits, `.` and `_`, and can't begin with `_`. Additionally, it can not be one of a list of __reserved words__ like `TRUE`, `NULL`, `if`, and `function` (see the complete list in `?Reserved`). Names that don't follow these rules are called __non-syntactic__ names, and if you try to use them, you'll get an error:

```r
_abc <- 1
#> Error: unexpected input in "_"

if <- 10
#> Error: unexpected assignment in "if <-"

``{block type='rmdwarning'} WhileTRUEandFALSEare reserved wordsTandFare not. However, you can useTandFas logical. If someone assigns either of those a different value you will get a __very__ hard to track down bug. Always spell out theTRUEandFALSE`.

[^letters]: Surprisingly, what constitutes a letter is determined by your current locale. That means that the syntax of R code actually differs from computer to computer, and it's possible for a file that works on one computer to not even parse on another!

## Objects

### Vector

You create a vector with `c`.  These have to be the same data type (See next section).

```r
v <- c("my", "first", "vector")
v

# length of our vector
length(v)

There are several shortcut functions for common vector creation.

# create an ordered sequence
2:10
9:3

# generate regular sequences
seq(1, 20, by = 3)

# replicate a number n times
rep(3, times = 4)

# arguments are generally vectorized
rep(1:3, times = 3:1)

# common mistake using 1:length(n) in loops
# but if n = 0
1:0

# use seq_len(n) instead and the loop won't execute
seq_len(0)

# another common mistake
n <- 6
1:n+1        # is (1:n) + 1, so 2:(n + 1)
1:(n+1)      # usually what is meant
seq_len(n+1) # a better way

Atomic Vectors

There are many "atomic" types of data: logical, integer, double and character (in this order, see below). There are also raw and complex but they are rarely used.

You can't mix types in an atomic vector (you can in a list). Coercion will automatically occur if you mix types:

(a <- FALSE)
typeof(a)

(b <- 1:10)
typeof(b)
c(a, b)         ## FALSE is coerced to integer 0

(c <- 10.5)
typeof(c)
(d <- c(b, c))  ## coerced to double

c(d, "a")       ## coerced to character

50 < "7"

You can force coercion with as.logical, as.integer, as.double, as.numeric, and as.character. Most of the time the coercion rules are straight forward, but not always.

x <- c(TRUE, FALSE)
typeof(x)

as.integer(x)
as.numeric(x)
as.character(x)

However, coercion is not associative.

x <- c(TRUE, FALSE)

x2 <- as.integer(x)
x3 <- as.numeric(x2)
as.character(x3)

What would you expect this to return?

x <- c(TRUE, FALSE)

as.integer(as.character(x))

You can test for an "atomic" types of data with: is.logical, is.integer, is.double, is.numeric[^numeric], and is.character.

[^numeric]: is.numeric() is a general test for the “numberliness” of a vector and returns TRUE for both integer and double vectors. It is not a specific test for double vectors, which are often called numeric.

x <- c(TRUE, FALSE)

is.logical(x)
is.integer(x)

What would you expect these to return?

x <- 2

is.integer(x)
is.numeric(x)
is.double(x)

Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c(), or you can create NAs of a specific type with NA_real_ (a double vector), NA_integer_ and NA_character_.

Matrix

Matrices are 2D vectors, with all elements of the same type. Generally used for mathematics.

# fill in column order (default)
matrix(1:12, nrow = 3)

# fill in row order
matrix(1:12, nrow = 3, byrow = TRUE)

# can also specify the number of columns instead
matrix(1:12, ncol = 3)

You find the dimensions of a matrix with nrow, ncol, and dim

m <- matrix(1:12, ncol = 3)
dim(m)
nrow(m)
ncol(m)

List

A list is a generic vector containing other objects. These do NOT have to be the same type or the same length.

s <- c("aa", "bb", "cc", "dd", "ee") 
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE) 
# x contains copies of n, s, b and our matrix from above
x <- list(n = c(2, 3, 5) , s, b, 3, m)   
x

# length gives you length of the list not the elements in the list
length(x)

We'll discuss lists in more detail later in the course.

Data frame

A data frame is a list with each vector of the same length. This is the main data structure used and is analogous to a data set in SAS. While these look like matrices they behave very different.

df = data.frame(n = c(2, 3, 5),
                s = c("aa", "bb", "cc"),
                b = c(TRUE, FALSE, TRUE),
                y = v
                )       # df is a data frame 
df

# dimensions
dim(df)
nrow(df)
ncol(df)
length(df)

We'll discuss data frames in greater detail later in the course.

Comparision

log_ops <- data.frame(Operator = c(">", ">=", "<", "<=", "==", "!="),
                      Description = c("greater than",
                                      "greater than or equal to",
                                      "less than",
                                      "less than or equal to",
                                      "exactly equal to",
                                      "not equal to "))

knitr::kable(
  log_ops, booktabs = TRUE,
  caption = 'Logical Operators'
)
v <- 1:12
v[v > 9]

Equality can be tricky to test for since real numbers can't be expressed exactly in computers.

x <- sqrt(2)
(y <- x^2)
y == 2
print(y, digits = 20)
all.equal(y, 2)          ## equality with some tolerance
all.equal(y, 3)
isTRUE(all.equal(y, 3))  ## if you want a boolean, use isTRUE()

Logical and sets

x <- c(TRUE, FALSE)
df <- data.frame(expand.grid(x, x))
names(df) <- c("x", "y")
df$and  <- df$x & df$y     # logical and
df$or   <- df$x | df$y     # logical or
df$notx <- !df$x           # negation
df$xor  <- xor(df$x, df$y) # exlusive or
df

R has two versions of the logical operators & and && (| and ||). The single version is the vectorized version while the the double version returns a length-one vector. Use the double version in logical control structures (if, for, while, etc).

df$x && df$y  # only and the first elements
df$x || df$y  # only or the first elements

This is a common source of bugs in control structures (if, for, while, etc) where you must have a single TRUE / FALSE.

``{block type='rmdcaution'}=is used for assignment while==is used for comparison. A common bug is to use=instead of==` inside a control structure.

It also has useful helpers `any` and `all`

```r
x <- c(FALSE, FALSE, FALSE, TRUE)
any(x)
all(x)
all(!x[1:3])

And also some useful set operations intersect, union, setdiff, setequal

x <- 1:5
y <- 3:7

intersect(x, y) # in x and in y
union(x, y)     # different than c()
c(x,y)          # not a set operation
setdiff(x, y)   # in x but not in y
setdiff(y, x)   # in y but not in x
setequal(x, y)
z <- 5:1
setequal(x, z)

Control Structures

Control structures allow you to put some "logic" into your R code, rather than just always executing the same R code every time. Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.

Commonly used control structures are

if-else

The if-else combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it's true or false.

For starters, you can just use the if statement.

if(<condition>) {
        # do something
} 
# Continue with rest of code

The above code does nothing if the condition is false. If you have an action you want to execute when the condition is false, then you need an else clause.

if(<condition>) {
        # do something
} 
else {
        # do something else
}

You can have a series of tests by following the initial if with any number of else ifs.

if(<condition1>) {
        # do something
} else if(<condition2>)  {
        # do something different
} else {
        # do something else different
}

``{block type='rmdwarning'} There is also anifelsefunction which is vectorized version. It is essentially anif-elsewrapped in afor` loop so that the condition, and action, is performed on each element in a vector.

### `for` Loops

For loops are pretty much the only looping construct that you will
need in R. While you may occasionally find a need for other types of
loops, in my experience doing data analysis, I've found very few
situations where a for loop wasn't sufficient. 

In R, for loops take an iterator variable and assign it successive
values from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.)

The following three loops all have the similar behavior.

```r
x <- c("a", "b", "c", "d")

for(i in 1:length(x)) {
        ## Print out each element of 'x'
        print(x[i])  
}

The seq_along() function is commonly used in conjunction with for loops in order to generate an integer sequence based on the length of an object (in this case, the object x).

## Generate a sequence based on length of 'x'
for(i in seq_along(x)) {   
        print(x[i])
}

It is not necessary to use an index-type variable.

for(letter in x) {
        print(letter)
}

```{block2, type='rmdtip'} Nested loops are commonly needed for multidimensional or hierarchical data structures (e.g. matrices, lists). Be careful with nesting though. Nesting beyond 2 to 3 levels often makes it difficult to read/understand the code. If you find yourself in need of a large number of nested loops, you probably want to break up the loops by using functions (discussed later).

We will discus looping and the other control structures in more detail when we get to the section on iterators.

## Vectorization & Recycling

Many operations in R are _vectorized_, meaning that operations occur in parallel in certain R objects. This allows you to write code that is efficient, concise, and easier to read than in non-vectorized languages.

The simplest example is when adding two vectors together.

```r
x <- 1:3
y <- 11:13
z <- x + y
z

In most other languages you would have to do something like

z <- numeric(length(x))

for(i in seq_along(x)) {
      z[i] <- x[i] + y[i]
}
z

We saw a form of vectorization above in the logical operators.

x
x > 2
x[x > 2]

Matrix operations are also vectorized, making for nice compact notation. This way, we can do element-by-element operations on matrices without having to loop over every element.

x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
x
y
x * y  # element-wise multiplication
x / y  # element-wise division
x %*% y  # true matrix multiplication

R also recycles arguments.

x <- 1:10
z <- x + .1  # add .1 to each element
z

While you usually either want the same length vector or a length one vector. You are not limited to just these options.

x <- 1:10
y <- x + c(.1, .2) 
y
z <- x + c(.1, .2, .3)
z

Example

One (not so good) way to estimate pi is through Monte-Carlo simulation.

Suppose we wish to estimate the value of pi using a Monte-Carlo method. Essentially, we throw darts at the unit square and count the number of darts that fall within the unit circle. We'll only deal with quadrant one. Thus the $Area = \frac{\pi}{4}$

Monte-Carlo pseudo code:

  1. Initialize hits = 0
  2. for i in 1:N
  3. Generate two random numbers, $U_1$ and $U_2$, between 0 and 1
  4. If $U_1^2 + U_2^2 < 1$, then hits = hits + 1
  5. end for
  6. Area estimate = hits / N
  7. $\hat{pi} = 4 * Area Estimate$
pi_naive <- function(N) {
  hits <- 0
  for(i in seq_len(N)) {
    U1 <- runif(1)
    U2 <- runif(1)
    if ((U1^2 + U2^2) < 1) {
      hits <- hits + 1
    }
  }

  4*hits/N
}
N <- 1e6
(t1 <- system.time(pi_est_naive <- pi_naive(N)))
pi_est_naive

That's a long run time (and bad estimate). Let's vectorize it.

pi_vect <- function(N) {
  U1 <- runif(N)
  U2 <- runif(N)
  hits <- sum(U1^2 + U2^2 < 1)
  4*hits/N
}
(t2 <- system.time(pi_est_vect <- pi_vect(N)))
pi_est_vect

The speed up from vectorization is impressive.

floor(t1/t2)

Reading For Next Class

  1. Read the chapter on Tibbles
  2. Try the exercises at the end of the chapter.
    • Problem 2: Create a tibble (or convert the data frame) and compare. Also compare str or glimpse on the objects.
    • Problem 3: Try to extract the column "mpg" from the mtcars data frame (convert to a tibble and compare) using var <- "mpg". This illustrates a common source of confusion for people coming from SAS.
    • Problem 4: non-syntactic names usually occur when importing data from various file types. Knowing how to use / correct them is very useful. Don't worry about creating the plot.

Exercises

  1. Browse this vocabulary list and read the help file for functions that interest you.
  2. Re-run the three cases in the For loop section with x <- NULL
  3. Vectorization / function practice.

We'll calculate pi using the Gregory-Leibniz series. Mathematicians will be quick to point out that this is a poor way to calculate pi, since the series converges very slowly. But our goal is not calculating pi, our goal is examining the performance benefit that be be achieved using vectorization.

Here is a formula for the Gregory-Leibniz series:

\begin{equation} 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots = \frac{\pi}{4} \end{equation}

Here is the Gregory-Leibniz series in summation notation:

\begin{equation} \sum_{\text{n}=0}^{\infty} \frac{(-1)^n}{2\cdot n + 1} = \frac{\pi}{4} \end{equation}

The straightforward implementation using an R loop would look like this:

GL_naive <- function (limit) {
  p <- 0
  for (n in 0:limit) {
    p <- (-1)^n/(2 * n + 1) + p
    }
  4*p
}

N <- 1e7
system.time(pi_est <- GL_naive(N))
pi_est

Your task is to vectorize this function. Do not use any looping or apply functions. This one is a bit tricky. Hint: It may be easier to think about it in terms of the series notation and not the summation notation.

GL_vect <- function(limit) {
  # your code here
  # use only base functions and no looping mechanisms
}


DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.