Here is a quick overview of the basics. We will dive deep into R's basic data structures and then how to subset those data structures later in the course. This will give us a good overview of base R and the background needed to dive into R for Data Science.
The three most important functions in R are `?`, `??`, and `str`:

- `?topic` provides access to the documentation for topic.
- `??topic` searches the documentation for topic.
- `str` displays the structure of an R object in human-readable form.

```{block2, type='rmdtip'}
`glimpse` is a tidyverse equivalent to `str` but with nicer output for complicated data structures.
```
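As a small illustration of `str`, the sketch below inspects a made-up list. The object `course` and its contents are invented for this example and are not part of the course material.

```r
# A hypothetical nested object, used only to demonstrate str()
course <- list(
  title  = "Intro to R",
  weeks  = 1:3,
  topics = c("vectors", "lists", "data frames")
)

# Compact, human-readable summary of the object's structure
str(course)

# Help lookups follow the same pattern for any function,
# e.g. ?str for the help page or ??structure for a keyword search
```

Running `str` on an unfamiliar object is usually the quickest way to see its type, length, and the first few values of each component.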
See this [vocabulary list](http://adv-r.had.co.nz/Vocabulary.html) for a good starting point on the basic functions in base R and some important libraries. A book to learn the basics is [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/).

In R there are three basic constructs[^constructs]: objects, functions, and environments.

[^constructs]: Technically speaking, functions and environments are objects, which allows one to do things in R you can't do in many other languages.

## Assignment Operators

We saw this in Coding Style. Use `<-` for assignment and use `=` for parameters. While you can use `=` for assignment, it is generally considered bad practice.

## Naming Rules

R has strict rules about what constitutes a valid name. A __syntactic__ name must consist of letters[^letters], digits, `.` and `_`, and can't begin with `_` or a digit. Additionally, it cannot be one of a list of __reserved words__ like `TRUE`, `NULL`, `if`, and `function` (see the complete list in `?Reserved`). Names that don't follow these rules are called __non-syntactic__ names, and if you try to use them, you'll get an error:

```r
_abc <- 1
#> Error: unexpected input in "_"

if <- 10
#> Error: unexpected assignment in "if <-"
```

```{block type='rmdwarning'}
While `TRUE` and `FALSE` are reserved words, `T` and `F` are not. However, you can use `T` and `F` as logical values. If someone assigns either of those a different value, you will get a __very__ hard to track down bug. Always spell out `TRUE` and `FALSE`.
```
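To make the warning about `T` and `F` concrete, here is a minimal sketch of how the bug arises. It assumes a fresh session in which nothing has redefined `T` yet.

```r
T == TRUE        # TRUE in a fresh session
T <- 0           # legal, because T is just an ordinary variable name
isTRUE(T)        # FALSE: code relying on T as "true" now misbehaves

# TRUE <- 0 would be rejected with an error, because TRUE is reserved

rm(T)            # remove our variable so T refers to the base value again
```

Because the reassignment is silent, the failure typically shows up far from where `T` was changed, which is why spelling out `TRUE` and `FALSE` is the safer habit.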
[^letters]: Surprisingly, what constitutes a letter is determined by your current locale. That means that the syntax of R code actually differs from computer to computer, and it's possible for a file that works on one computer to not even parse on another!

## Objects

### Vector

You create a vector with `c`. The elements have to be the same data type (see the next section).

```r
v <- c("my", "first", "vector")
v

# length of our vector
length(v)
```

There are several shortcut functions for common vector creation.

```r
# create an ordered sequence
2:10
9:3

# generate regular sequences
seq(1, 20, by = 3)

# replicate a number n times
rep(3, times = 4)

# arguments are generally vectorized
rep(1:3, times = 3:1)

# common mistake: using 1:n (or 1:length(x)) in loops
# if n = 0 this counts down from 1 to 0
1:0

# use seq_len(n) instead and the loop won't execute
seq_len(0)

# another common mistake
n <- 6
1:n+1         # is (1:n) + 1, so 2:(n + 1)
1:(n+1)       # usually what is meant
seq_len(n+1)  # a better way
```
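The `1:length(x)` pitfall mentioned in the comments above is easiest to see by counting loop iterations over an empty vector. This is a small sketch; the counter names `bad` and `good` are made up for the illustration.

```r
x <- character(0)   # an empty vector

# 1:length(x) is 1:0, i.e. c(1, 0), so the body runs twice
bad <- 0
for (i in 1:length(x)) {
  bad <- bad + 1
}
bad

# seq_along(x) is integer(0), so the body never runs
good <- 0
for (i in seq_along(x)) {
  good <- good + 1
}
good
```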
There are many "atomic" types of data: `logical`, `integer`, `double` and `character` (in this order, see below). There are also `raw` and `complex` but they are rarely used.

You can't mix types in an atomic vector (you can in a list). Coercion will automatically occur if you mix types:

```r
(a <- FALSE)
typeof(a)

(b <- 1:10)
typeof(b)

c(a, b)         ## FALSE is coerced to integer 0

(c <- 10.5)
typeof(c)

(d <- c(b, c))  ## coerced to double

c(d, "a")       ## coerced to character

50 < "7"
```
You can force coercion with `as.logical`, `as.integer`, `as.double`, `as.numeric`, and `as.character`. Most of the time the coercion rules are straightforward, but not always.

```r
x <- c(TRUE, FALSE)
typeof(x)

as.integer(x)
as.numeric(x)
as.character(x)
```

However, coercion is not associative: converting through intermediate types can give a different result than converting directly.

```r
x <- c(TRUE, FALSE)
x2 <- as.integer(x)
x3 <- as.numeric(x2)
as.character(x3)
```

What would you expect this to return?

```r
x <- c(TRUE, FALSE)
as.integer(as.character(x))
```
You can test for an "atomic" type of data with `is.logical`, `is.integer`, `is.double`, `is.numeric`[^numeric], and `is.character`.

[^numeric]: `is.numeric()` is a general test for the “numberliness” of a vector and returns `TRUE` for both integer and double vectors. It is not a specific test for double vectors, which are often called numeric.

```r
x <- c(TRUE, FALSE)
is.logical(x)
is.integer(x)
```

What would you expect these to return?

```r
x <- 2
is.integer(x)
is.numeric(x)
is.double(x)
```
Missing values are specified with `NA`, which is a logical vector of length 1. `NA` will always be coerced to the correct type if used inside `c()`, or you can create `NA`s of a specific type with `NA_real_` (a double vector), `NA_integer_` and `NA_character_`.
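A short sketch of how `NA` takes on the type of its surroundings, using only base R functions:

```r
typeof(NA)             # "logical"
typeof(c(NA, 1L))      # "integer"
typeof(c(NA, 2.5))     # "double"
typeof(c(NA, "a"))     # "character"
typeof(NA_character_)  # "character"

# is.na() tests each element for missingness
is.na(c(1, NA, 3))
```

The typed versions matter mostly when a function insists on a particular type, for example when filling a character column with missing values.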
Matrices are 2D vectors, with all elements of the same type. Generally used for mathematics.
```r
# fill in column order (default)
matrix(1:12, nrow = 3)

# fill in row order
matrix(1:12, nrow = 3, byrow = TRUE)

# can also specify the number of columns instead
matrix(1:12, ncol = 3)
```

You find the dimensions of a matrix with `nrow`, `ncol`, and `dim`.

```r
m <- matrix(1:12, ncol = 3)
dim(m)
nrow(m)
ncol(m)
```
A list is a generic vector containing other objects. These do NOT have to be the same type or the same length.
```r
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# x contains a numeric vector named n, copies of s and b,
# the number 3, and our matrix m from above
x <- list(n = c(2, 3, 5), s, b, 3, m)
x

# length gives you the length of the list, not of the elements in the list
length(x)
```
We'll discuss lists in more detail later in the course.
A data frame is a list in which every vector has the same length. This is the main data structure used and is analogous to a data set in SAS. While data frames look like matrices, they behave very differently.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE),
                 y = v)

# df is a data frame
df

# dimensions
dim(df)
nrow(df)
ncol(df)
length(df)
```
We'll discuss data frames in greater detail later in the course.
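The claim that a data frame is a list, and not a matrix, can be checked directly. The sketch below uses a made-up data frame `scores` and a made-up matrix `m2`; both are assumptions for illustration only.

```r
# hypothetical example data
scores <- data.frame(id = 1:3, grade = c("A", "B", "A"))

is.list(scores)                  # TRUE: a data frame is built on a list
length(scores) == ncol(scores)   # TRUE: one list element per column
scores[["grade"]]                # list-style extraction of a column

# a matrix holds a single type, so mixing numbers and text coerces everything
m2 <- cbind(id = 1:3, grade = c("A", "B", "A"))
typeof(m2)                       # "character"
```

This is why `length(df)` above returns the number of columns rather than the number of cells.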
```r
log_ops <- data.frame(Operator = c(">", ">=", "<", "<=", "==", "!="),
                      Description = c("greater than",
                                      "greater than or equal to",
                                      "less than",
                                      "less than or equal to",
                                      "exactly equal to",
                                      "not equal to"))
knitr::kable(
  log_ops, booktabs = TRUE,
  caption = 'Logical Operators'
)
```

```r
v <- 1:12
v[v > 9]
```
Equality can be tricky to test for since real numbers can't be expressed exactly in computers.
```r
x <- sqrt(2)
(y <- x^2)
y == 2
print(y, digits = 20)

all.equal(y, 2)          ## equality with some tolerance
all.equal(y, 3)
isTRUE(all.equal(y, 3))  ## if you want a boolean, use isTRUE()
```
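The same issue appears with simple decimal fractions. The sketch below also defines a small helper, `near`, which is a hypothetical name introduced here; its default tolerance of `1e-8` is an arbitrary assumption, not a recommendation from the course.

```r
0.1 + 0.2 == 0.3                    # FALSE: the sum is not exactly 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance

# hypothetical helper wrapping the idea; works element-wise on vectors
near <- function(a, b, tol = 1e-8) abs(a - b) < tol
near(sqrt(2)^2, 2)
```

Unlike `all.equal`, which returns a single value (or a description of the difference), a helper like this returns one logical per element, so it can be used for subsetting.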
```r
x <- c(TRUE, FALSE)
df <- data.frame(expand.grid(x, x))
names(df) <- c("x", "y")
df$and <- df$x & df$y       # logical and
df$or <- df$x | df$y        # logical or
df$notx <- !df$x            # negation
df$xor <- xor(df$x, df$y)   # exclusive or
df
```
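R has two versions of the logical operators: `&` and `&&` (`|` and `||`). The single version is vectorized, while the double version takes and returns a length-one vector. Use the double version in control structures that need a single condition (`if`, `while`, etc.).

```r
# && and || expect length-one operands
df$x[1] && df$y[1]   # and of the first elements only
df$x[1] || df$y[1]   # or of the first elements only

# Before R 4.3.0, df$x && df$y silently used only the first elements;
# since R 4.3.0 it is an error.
```

Using the vectorized version by mistake is a common source of bugs in control structures (`if`, `while`, etc.), where you must have a single `TRUE` / `FALSE`.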
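A minimal sketch of getting a single `TRUE` / `FALSE` for a control structure. It uses the base R functions `any()` and `all()` to collapse a logical vector, and relies on the fact that `&&` short-circuits. The variable names are made up for the example.

```r
x <- c(-1, 4, 9)

# any()/all() collapse a logical vector to the single value that if() needs
if (all(x > 0)) {
  msg <- "all positive"
} else if (any(x > 0)) {
  msg <- "some positive"
} else {
  msg <- "none positive"
}
msg

# && short-circuits: the right side is skipped when the left side is FALSE,
# so the missing first element of y is never tested
y <- numeric(0)
length(y) > 0 && y[1] > 0
```

Writing `if (x > 0)` here instead would hand `if` a length-three condition, which is exactly the kind of bug described above.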
```{block type='rmdcaution'}
`=` is used for assignment while `==` is used for comparison. A common bug is to use `=` instead of `==` inside a control structure.
```
R also has the useful helpers `any` and `all`.

```r
x <- c(FALSE, FALSE, FALSE, TRUE)
any(x)
all(x)
all(!x[1:3])
```

There are also some useful set operations: `intersect`, `union`, `setdiff`, and `setequal`.

```r
x <- 1:5
y <- 3:7

intersect(x, y)  # in x and in y
union(x, y)      # different than c()
c(x, y)          # not a set operation
setdiff(x, y)    # in x but not in y
setdiff(y, x)    # in y but not in x
setequal(x, y)

z <- 5:1
setequal(x, z)
```
Control structures allow you to put some "logic" into your R code, rather than just always executing the same R code every time. Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.
Commonly used control structures are

- `if` and `else`: testing a condition and acting on it
- `for`: execute a loop a fixed number of times
- `while`: execute a loop while a condition is true
- `repeat`: execute an infinite loop (must `break` out of it to stop)
- `break`: break the execution of a loop
- `next`: skip an iteration of a loop

### `if`-`else`
The `if`-`else` combination is probably the most commonly used control structure in R (or perhaps any language). This structure allows you to test a condition and act on it depending on whether it's true or false.

For starters, you can just use the `if` statement.

```r
if(<condition>) {
        # do something
}
# Continue with rest of code
```

The above code does nothing if the condition is false. If you have an action you want to execute when the condition is false, then you need an `else` clause.

```r
if(<condition>) {
        # do something
} else {
        # do something else
}
```

You can have a series of tests by following the initial `if` with any number of `else if`s.

```r
if(<condition1>) {
        # do something
} else if(<condition2>) {
        # do something different
} else {
        # do something else different
}
```
```{block type='rmdwarning'}
There is also an `ifelse` function, which is a vectorized version. It is essentially an `if`-`else` wrapped in a `for` loop so that the condition, and action, is performed on each element in a vector.
```
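The relationship between `ifelse` and an explicit loop can be sketched as follows. The vector `x` and the labels are invented for the example.

```r
x <- c(-2, 0, 3)

# vectorized: one call handles every element
sign_vec <- ifelse(x >= 0, "non-negative", "negative")

# the equivalent element-by-element loop
sign_loop <- character(length(x))
for (i in seq_along(x)) {
  if (x[i] >= 0) {
    sign_loop[i] <- "non-negative"
  } else {
    sign_loop[i] <- "negative"
  }
}

identical(sign_vec, sign_loop)
```

Both approaches give the same labels; the `ifelse` call is simply shorter and avoids managing the index by hand.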
### `for` Loops

For loops are pretty much the only looping construct that you will need in R. While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop wasn't sufficient.

In R, for loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.).

The following three loops all have the same behavior.

```r
x <- c("a", "b", "c", "d")

for(i in 1:length(x)) {
        ## Print out each element of 'x'
        print(x[i])
}
```

The `seq_along()` function is commonly used in conjunction with `for` loops in order to generate an integer sequence based on the length of an object (in this case, the object `x`).

```r
## Generate a sequence based on length of 'x'
for(i in seq_along(x)) {
        print(x[i])
}
```

It is not necessary to use an index-type variable.

```r
for(letter in x) {
        print(letter)
}
```
```{block2, type='rmdtip'}
Nested loops are commonly needed for multidimensional or hierarchical data structures (e.g. matrices, lists). Be careful with nesting though. Nesting beyond 2 to 3 levels often makes it difficult to read/understand the code. If you find yourself in need of a large number of nested loops, you probably want to break up the loops by using functions (discussed later).
```
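As a minimal sketch of a two-level nested loop, the code below walks over every cell of a small matrix. The matrix `mat` and the accumulator `total` are made-up names for this illustration.

```r
# hypothetical 2 x 3 matrix used only for illustration
mat <- matrix(1:6, nrow = 2, ncol = 3)
total <- 0

for (i in seq_len(nrow(mat))) {      # outer loop over rows
  for (j in seq_len(ncol(mat))) {    # inner loop over columns
    total <- total + mat[i, j]
  }
}

total   # same value as sum(mat)
```

For a task this simple, `sum(mat)` is the idiomatic choice; the loop is shown only to make the nesting pattern visible.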
We will discuss looping and the other control structures in more detail when we get to the section on iterators.

## Vectorization & Recycling

Many operations in R are _vectorized_, meaning that operations occur in parallel in certain R objects. This allows you to write code that is efficient, concise, and easier to read than in non-vectorized languages.

The simplest example is when adding two vectors together.

```r
x <- 1:3
y <- 11:13
z <- x + y
z
```

In most other languages you would have to do something like

```r
z <- numeric(length(x))
for(i in seq_along(x)) {
  z[i] <- x[i] + y[i]
}
z
```

We saw a form of vectorization above in the logical operators.

```r
x
x > 2
x[x > 2]
```

Matrix operations are also vectorized, making for nice compact notation. This way, we can do element-by-element operations on matrices without having to loop over every element.

```r
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
x
y

x * y    # element-wise multiplication
x / y    # element-wise division
x %*% y  # true matrix multiplication
```

R also recycles arguments.

```r
x <- 1:10
z <- x + .1  # add .1 to each element
z
```

While you usually want either vectors of the same length or a length-one vector, you are not limited to just these options.

```r
x <- 1:10
y <- x + c(.1, .2)      # length 2 divides 10, recycled without complaint
y
z <- x + c(.1, .2, .3)  # length 3 does not divide 10, recycled with a warning
z
```
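The recycling rule can be made explicit by capturing the warning R issues when the longer length is not a multiple of the shorter one. This is a small sketch; the vectors are arbitrary and `res` is a made-up name.

```r
x <- 1:6

# length 2 divides 6: the short vector is reused three times, no warning
x + c(10, 20)
# 11 22 13 24 15 26

# length 4 does not divide 6: R still recycles but signals a warning,
# which we catch here and turn into its message text
res <- tryCatch(x + c(10, 20, 30, 40),
                warning = function(w) conditionMessage(w))
res
```

Silent recycling is convenient for scalars but can hide length mismatches, so an unexpected recycling warning is usually worth investigating.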
One (not so good) way to estimate `pi` is through Monte-Carlo simulation.

Suppose we wish to estimate the value of `pi` using a Monte-Carlo method. Essentially, we throw darts at the unit square and count the number of darts that fall within the unit circle. We'll only deal with quadrant one. Thus the $Area = \frac{\pi}{4}$.

Monte-Carlo pseudo code:

1. Initialize `hits = 0`
2. For each of `N` darts, draw two uniform random numbers `U1` and `U2` between 0 and 1
3. If `U1^2 + U2^2 < 1`, then `hits = hits + 1`
4. The area estimate is `hits / N`, so the estimate of `pi` is `4 * hits / N`

```r
pi_naive <- function(N) {
  hits <- 0
  for(i in seq_len(N)) {
    U1 <- runif(1)
    U2 <- runif(1)
    if ((U1^2 + U2^2) < 1) {
      hits <- hits + 1
    }
  }
  4*hits/N
}

N <- 1e6
(t1 <- system.time(pi_est_naive <- pi_naive(N)))
pi_est_naive
```

That's a long run time (and a bad estimate). Let's vectorize it.

```r
pi_vect <- function(N) {
  U1 <- runif(N)
  U2 <- runif(N)
  hits <- sum(U1^2 + U2^2 < 1)
  4*hits/N
}

(t2 <- system.time(pi_est_vect <- pi_vect(N)))
pi_est_vect
```

The speed-up from vectorization is impressive.

```r
floor(t1/t2)
```
`str` or `glimpse` on the objects.

`var <- "mpg"`. This illustrates a common source of confusion for people coming from SAS.

`x <- NULL`
We'll calculate pi using the Gregory-Leibniz series. Mathematicians will be quick to point out that this is a poor way to calculate pi, since the series converges very slowly. But our goal is not calculating pi; our goal is examining the performance benefit that can be achieved using vectorization.

Here is a formula for the Gregory-Leibniz series:

\begin{equation}
1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots = \frac{\pi}{4}
\end{equation}

Here is the Gregory-Leibniz series in summation notation:

\begin{equation}
\sum_{n=0}^{\infty} \frac{(-1)^n}{2 \cdot n + 1} = \frac{\pi}{4}
\end{equation}

The straightforward implementation using an R loop would look like this:

```r
GL_naive <- function(limit) {
  p <- 0
  for (n in 0:limit) {
    p <- (-1)^n/(2 * n + 1) + p
  }
  4*p
}

N <- 1e7
system.time(pi_est <- GL_naive(N))
pi_est
```

Your task is to vectorize this function. Do not use any looping or apply functions. This one is a bit tricky. Hint: It may be easier to think about it in terms of the series notation and not the summation notation.

```r
GL_vect <- function(limit) {
  # your code here
  # use only base functions and no looping mechanisms
}
```