knitr::opts_chunk$set( # code chunk options echo = TRUE , eval = TRUE , warning = FALSE , message = FALSE , cached = FALSE , exercise = TRUE , exercise.completion = TRUE )
library(learnr) library(dplyr) # library(rvest)
In this tutorial, we will refresh your basic R programming skills. It covers some of the essential programming concepts we will need to scrape and clean web data. Specifically, we will review
If you're having trouble to understand the code in other tutorials, we recommend you go through this tutorial and make sure you understand each of the concepts introduced.
To create an object in R, we use the <-
operator.
For example, we can create an object called 'x' with the value 1 as follows:
x <- 1
This is called assignment.
So the expression x <- 1
reads "I assign the value 1 to an object called 'x'."
<-
, not =
for assignmentActually, you could also use =
to assign values to objects.
However, it is an R code convention to reserve =
for passing values to function arguments.
This makes reading code easier.
For example:
res <- length(x = 1:3)
After assigning an object, it can be used by referring to its name. (The technical term is "evaluation".)
# create 'x' x <- 1 # "call"/"print" 'x' (i.e. evaluate it) x
R has many data types (see ?typeof
).
The most common (atomic) types are:
NULL
The simplest data type is "logical".
There are only three valid logical values: TRUE
, FALSE
, or NA
("missing").
You can check whether a value has type logical using is.logical()
.
typeof(TRUE) typeof(FALSE) typeof(NA) is.logical(TRUE) is.logical(FALSE) is.logical(NA)
Integer values are numbers: ..., -1, 0, 1, ...
You can check whether a value has type logical using is.integer()
.
A missing integer can be created with NA_integer_
.
Note that you have to explicitly force a number to type integer for R to recognize it as such.
This can be done by adding an L
behind the number (e.g. 1L
).
You need to force this behavior, because R interprets numbers as "double" type by default.
is.integer(1) # FALSE !!! is.integer(1L) # TRUE =) is.integer(NA_integer_)
Integer values are real numbers in (-∞, ∞)
You can check whether a value has type logical using is.double()
.
A missing integer can be created with NA_real_
(not NA_double_
!!!).
typeof(1) is.double(1) is.double(NA_real_)
Note that because the double type implements real-valued mathematical operations in R, there are two special values reserved for infinity and not-a-number (undefined values):
is.double(Inf) Inf -Inf
typeof(NaN) is.na(NaN) # the logical NA and NaN are two different pairs of shoes identical(NA, NaN) # the same applies to double-type NA identical(NA_real_, NaN)
Character values are strings, for example "hello world!"
You can create a character value by wrapping it inside single or double quotation marks.
A missing character values can be created with NA_character_
.
You can check whether a value has type character using is.character()
.
is.character('this is a valid character values') is.character("this is also a valid character values") is.character(NA_character_)
NULL
NULL
is an object that represents "no data."
Contrast this with the NA*
values introduced above.
# in the case of NULL, there is really nothing! length(NULL) # NA values, in contrast, represent some data # (the value is unknown, but this is considered valid information) length(NA) length(NA_integer_) length(NA_real_) length(NA_character_)
(Ordered) collections of objects of the same type are called vectors. You probably know vectors from math in school.
1 # a "scalar" (1, 2, 0) # a "vector"
In the above example, the vector has three elements: 1, 2, and 0
In R, you create vectors by combining logical, integer, double or character values using the c()
function.
For example, you can combine the three valid logical values in a logical vector:
# create a logical vector logicals <- c(TRUE, FALSE, NA) typeof(logicals) is.logical(logicals)
Note however that you can always only combine objects of the same type in an (atomic) vector
If you violate this rule, R will convert all values to the type according to the following hierarchy: logical
↦ integer
↦ double
↦ character
.
That is, if you combine a logical and an integer value, both will be represented as integers; if you combine a logical, an integer and a double value, all will be represented as double values; and so on.
This is called type conversion.
# logical converts (upwards) to integer typeof(c(TRUE, 1L)) c(TRUE, 1L) # logical and integer convert (upwards) to double typeof(c(TRUE, 1L, 1.1)) c(TRUE, 1L, 1.1) # logical, integer and double convert (upwards) to character typeof(c(TRUE, 1L, 1.1, "one")) c(TRUE, 1L, 1.1, "one")
NULL
is ignoredNote that because NULL
represents "no data," it is ignored when creating an atomic vector!
c(NULL, "value")
Accordingly, it is also ignored when performing type conversion.
c(NULL, TRUE, 1L, 1.1, "one") c("NULL", NULL)
One thing you haven't seen so far in this tutorial are named vectors. A named vector is a vector where each element has a corresponding character value that denotes its "name".
You can create a named vector using name--value syntax, or with the setNames()
function.
You can assigning names to an already existing vector using the names()
function.
# creating a named vector c("apple" = 3, "banana" = 2, "lemon" = 1.5) setName(c(3, 2, 1.5), c("apple", "banana", "lemon")) # assigning names to an existing vector capitals <- c("Lisbon", "Bejing", "Cape Town") names(capitals) <- c("Portugal", "China", "South Afric")
You can access the names of a vector using the names()
function.
prices <- c("apple" = 3, "banana" = 2, "lemon" = 1.5) names(prices)
And because names()
returns NULL
if a vector is unnamed, you can it to check if a vector is named.
# create unnamed vector x <- logical(3) # is named? is.null(names(x)) # `FALSE` # name names(x) <- c("a", "b", "c") is.null(names(x)) # TRUE
Once you have created a vector, you can access its values by indexing it.
To do so, use the [
function and provide integer values to access positions as follows:
fruits <- c("apple", "banana", "lemon") # access first value fruits[1] # access first two values fruits[1:2]
Note that by negating indexes, for example, by adding a negative sign ("dash") in front of an integer index value, you can omit ("drop") values:
fruits <- c("apple", "banana", "lemon") # omit first value fruits[-1] # omit first two values fruits[-c(1:2)]
You can also access/omit the values of a vector using logical values.
(Note that !
can be used to negate boolean values.)
idxs <- c(TRUE, FALSE, FALSE) # access first value fruits[idxs] # negate to access last two values fruits[!idxs]
Named vectors are useful because they allow indexing by names:
prices <- c("apple" = 3, "banana" = 2, "lemon" = 1.5) prices["apple"] prices[c("apple", "lemon")] # NOTE: # ... negative indexing doesn't work with names prices[-c("apple", "lemon")] # this throws an ERROR! # ... but you can use logical indexing prices[!names(prices) %in% c("apple", "lemon")]
Empty vectors can be created with the following functions:
logical() integer() double() character()
Note that if you pass an positive numeric value to the length
argument of these functions (the default is length = 0L
, see also ?vector
),
you can create vectors of given length.
Note, however, that the initial values picked vary between types.
logical(3) # all values are `FALSE` integer(3) # all values are `0L` double(3) # all values are `0` character(3) # all values are `""`
For certain types, there are shortcuts for creating vectors.
For example, seq()
creates a sequence of numeric (double) values.
Specifically, seq()
can be used to create sequence of numbers that in- or decreases by a specific amount specified by the by
argument (the default is by = 1L
).
# increasing in steps of 1 seq(-10, 10) # equivalent to `-10:10` # increasing in steps of 5 seq(-10, 10, by = 5) # NOTE: if you don't force the value passed to `by` to an integer, # `seq()` will return a double vector typeof(seq(-10, 10, by = 5)) typeof(seq(-10, 10, by = 5L))
You can also create decreasing sequences! But remember to bring the start and end values in the correct order.
seq(-10, 10, by = -5L)
seq()
can also be called with the argument length.out
.
This allows creating a sequence of a pre-defined length.
seq(-10, 10, length.out = 11)
Other useful functions are seq_len()
and seq_along()
.
seq_len()
takes a non-negative number as input and returns an integer sequence of equal length.
seq_len(0) seq_len(3)
seq_along()
takes a vector as input and returns an integer sequence of equal length as the input vector.
seq_along(c()) seq_along(c("a", "b", "c"))
The data types you've encountered so far in this tutorial (logical, integer, double, character, and NULL
) are atomic types.
They are "atomic" in the sense that they cannot "nest" objects of other type.
So if you use c()
to create an atomic vector, all values need to have the same type.
Remember that if you violate this rule, R performs type conversion following hierarchy: logical
↦ integer
↦ double
↦ character
.
In contrast to atomic types, recursive types can nest objects of other types.
The most important recursive type is "list".
You can create a list with the list()
functions
list(TRUE, 1L, 1.1, "one")
You'll note that no "type conversion" is performed inside lists!
# types are preserved list(TRUE, 1L, 1.1, "one") # types are n o t preserved c(TRUE, 1L, 1.1, "one")
You can use c()
to combine multiple list.
c(list(1), list(2))
But note that when combining a list/lists and a vector/vectors, the vector(s) will be converted to a list(s)! So type conversion applies beyond atomic vectors
c( list("hello"), # a list with one character element c("goodbye") # a character vector (!) with one element )
Like vectors, lists can be named:
# create named list x <- list("logical" = TRUE, "integer" = 1L, "double" = 1.1, "character" = "one") names(x) # assign names to an existing list x <- list(TRUE, 1L, 1.1, "one") names(x) <- c("logical", "integer", "double", "character") names(x)
The difference between atomic and recursive types is important because it matters for how you can index objects, that is, access their values.
As already shown, the elements of atomic vectors can be accessed with a single square bracket [
.
In contrast, we use double square brackets [[
to access the elements of a list.
vec <- c(1, 2) vec[1] lst <- list(1, 2) lst[[1]]
This difference in syntax is necessary because, in contrast to atomic types, lists can be multiple levels deep. This is illustrated by the example below:
# (a list of lists of lists) # grandma Paula has two children: ... paula <- list( # ... Frank "frank" = list( # Frank has two children ... "laura" = list(), "beth" = list() ), # ... and Tom "tom" = list( # ... Tom has no children ) )
To access elements at the first level (grandma Paula's children), we need to use [[
.
For example, So to get the data for Paula' first child, Frank' we need to index the first element of the fist level.
# get data for Paula's first child paula[[1]] # alternatively: paula[["frank"]] paula$frank
And to get at the values at the second level, we need to index even deeper. Below, we extract the information for the first child of Frank, who in turn is grandma Paula's first child.
# get data for first child of Paula's first child
paula[[1]][[1]]
In R, there are several data structures that combine multiple values into a single object. You have already encountered two of the most important data structures in R:
vector()
): sequence of values of a certain typelist()
): collection of objects, potentially of different typesA third, very important data structure is the data frame. A data frame is a tabular data format similar to an Excel spread sheet or a CSV file.
Data frames allow us to combine many vectors or lists of the same length into a single object.
We create data frames with the data.frame()
function.
student_names <- c("Bill", "Jane", "Sarah", "Fred", "Paul") math_scores <- c(80, 75, 91, 67, 56) verbal_scores <- c(72, 90, 99, 60, 68) students <- data.frame(student_names, math_scores, verbal_scores) students
Note that student_names
has a different type (character) than math_scores
(numeric).
Nevertheless, you can combine them in a data frame.
(This is because a data frame is really just a list of lists with same length.)
Because data frames are tabular data structures, we refer to the vectors they combine as columns.
The length of individual columns is equal to the number of rows of a data frame.
You can access this information with the ncol()
and nrow()
functions
We can also add new columns to an existing data frames by using the assignment operator:
student_names <- c("Bill", "Jane", "Sarah", "Fred", "Paul") math_scores <- c(80, 75, 91, 67, 56) verbal_scores <- c(72, 90, 99, 60, 68) students <- data.frame(student_names, math_scores, verbal_scores) # add 'final_score' column students$final_scores <- (students$math_scores + students$verbal_scores)/2
Often we want to combine two or more data frames.
If we want to combine them side-by-side, we can "bind" them column-wise using the cbind()
function:
a <- data.frame(letters = letters[1:2]) b <- data.frame(Letters = LETTERS[1:2]) cbind(a, b)
Note, however, that to allow for column-wise combination, the data frames need to have the same amount of rows!
a <- data.frame(letters = letters[1:3]) b <- data.frame(Letters = LETTERS[1:2]) nrow(a) nrow(b) cbind(a, b)
If we want to stack two or more data frames on top of one another, we can "bind" them row-wise with the rbind()
functions:
a <- data.frame(col = letters[1:2]) b <- data.frame(col = LETTERS[1:2]) rbind(a, b)
Note, however, that the rbind()
function throws an error if the data frames you want to combine have different column names.
a <- data.frame(letters = letters[1:2]) b <- data.frame(Letters = LETTERS[1:2]) rbind(a, b)
a <- data.frame(letters = letters[1:2]) b <- data.frame(Letters = LETTERS[1:2]) names(b) <- "letters" rbind(a, b)
A very common scenario is when we have a list of data frames, and we want to bind them together.
We can do this by passing the list and our desired bind function as arguments to the do.call()
function
results <- list() # let's say here you're scraping 3 websites results[[1]] <- data.frame(domain = "google", url = "www.google.com") results[[2]] <- data.frame(domain = "facebook", url = "www.facebook.com") results[[3]] <- data.frame(domain = "twitter", url = "www.twitter.com") # and now we want to combine all 3 data frames do.call(rbind, results)
Alternatively, we can use the bind_rows
functions in the dplyr package.
bind_rows(results)
Sometimes what you want to do depends on circumstances. Implementing this behavior in code is called control flow because we control which of many potential streams of actions are executed.
Like in other programming languages, control flow is implemented with if-else
statements.
Depending on whether a condition is true or false, we can execute different chunks of code.
Let's start with if
.
The logic is: "If something is true, do something."
print_this <- FALSE if (print_this) { print("Hello") }
In the above code chunk, because print_this
evaluates to FALSE, "Hello" is not printed.
Note that you can write everything inside the parentheses after if
that evaluates to a single logical value (excluding NA
).
if (0 == 1) { print("All numbers are equal!") }
Sometimes you do one thing if a condition applies, and something else if it doesn't.
This can be implemented in R with if
-else
control flow:
if (0 == 1) { print("I escape the laws of math!") } else { print("zero and one are not equal") }
Note that there is the shorthand ifelse()
function
ifelse(0 == 1, "I escape the laws of math!", "zero and one are not equal")
If you think beyond binary options, you want to use else if
:
x <- 4 y <- 5 if (y > x) { print("`y` is greater than `x`") } else if (x > y) { print("`x` is greater than `y`") } else { print("`x` and `y` are equal") }
We use loops whenever we need to execute the same chunk of code multiple times with some varying input.
For example, we may use a loop whenever we have multiple Twitter accounts and we want to run sentiment analysis for tweets posted by each of them.
for
-loopsfor
-loops are probably the most common type of loop and are easily implemented in R
vec <- 1:10 for (i in vec) { print(i) }
for
-loopWe can abstract from this to note that a for loop follows the following logic:
for (i in VECTOR) { do something with i }
Now, let's have a look at the anatomy of a for
-loop:
for
tells R that you want repeat some code {
and }
)(i in VECTOR)
VECTOR
is a vector (or list) with several elementsi
is taking on the value of an element of VECTOR
; first the value of the first element, then the value of the second element, and so on.Consequently, when the code inside your loop (i.e., inside the curly braces) involves i
,
in each iteration it is evaluated with the current value of i
.
This allows to repeat code while varying i
.
i
am a placeholderLook at the following example and not that this code runs smoothly too.
This shows that instead of i
, you can use different word --- the word is is just a placeholder!
for (number in 1:3) { print(number) }
There are many scenarios where you want to "collect" the results of individual for-loop iterations in a separate object. Say we have a vector of integer numbers, we want to know if they are greater than 5, and we want to collect the result of this test in a new vector.
numbers <- 4:6 results <- logical(length = length(numbers)) for (i in seq_along(numbers)) { results[i] <- numbers[i] > 5 } results
However, in the above example it is assumed that you know the length of the resulting object in advance.
Specifically, we know we will evaluate each element of numbers
against the number five.
Because numbers
has three elements, we know that we will generate 3 results.
Hence results
must have three elements too.
If you don't know how long the vector you fill will be after all iterations have been completed, you can apply the following logic:
Appending to the last position can be done by assigning a value to the position of the vector/list taht comes after the actual end of this vector, that is, at position length(vector)+1
numbers <- 4:6 results <- logical() # create zero-length/empty vector for (i in seq_along(numbers)) { results[length(results)+1] <- numbers[i] > 5 } results
You can verify yourself that because in this example
results
vector is empty in the begging (i.e., has length zero)we can simply write:
numbers <- 4:6 results <- logical() for (i in seq_along(numbers)) { results[i] <- numbers[i] > 5 } results
Two important features of for
loops is that you can skip iterations or stop them entirely if desired.
You skip an iteration using the keyword next
for (i in 1:4) { # skip if `i` is less than 3 if (i < 3) { next # otherwise print current value of `i` } else { print(i) } }
You can interrupt a running fro loop with the break
keyword:
for (i in 1:4) { # stop if `i` is greater than 4 if (i > 3) { next } # otherwise print current value of `i` print(i) }
A function defines a procedure to return some output based on some inputs.
We create functions in R using the function()
function.
hello <- function() { print("Hello") }
We can call this function as follows:
hello()
Note that because our hello()
function does not have any arguments (i.e., parameters based on which it adapts its behavior), we do not pass it any values inside the parentheses.
This changes if we define our function in a way that it requires an input when called. In our example, this input is called "name".
hello <- function(name) { print(paste("Hello", name)) }
If we now call the function without passing an input, we get an error
hello()
So we have to pass a value to the name
argument
hello("Hauke")
In the functions above, we have just printed a value when calling them. Usually, we use functions to compute something which the function is expected to return.
We can return values with the return()
function that works only inside functions.
add <- function(x, y) { res <- x + y return(res) } add(1, 2)
Note: You could actually omit the return in the above function and instead write
add <- function(x, y) { x + y } add(1, 2)
But it aids the readability of your code a lot if you explicitly call return()
.
return()
can be called at any position in a function.
This is a particularly powerful feature when combined with control flow:
divide <- function(numerator, denominator) { if (denominator == 0) return(NaN) if (numerator == 0) return(Inf) return(numerator/denominator) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.