library(learnr)
library(testwhat)
library(tidyverse)

options(repos = "https://cloud.r-project.org")
tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)

plot.microbenchmark <- function(object, ..., log = TRUE) {
  object <- object %>% 
    dplyr::mutate(
      ntime = microbenchmark:::convert_to_unit(time, "t")
    ) %>% 
    dplyr::group_by(expr) %>% 
    dplyr::filter(
      !(ntime %in% boxplot.stats(ntime)$out)
    )
  y_min <- min(object$ntime)
  y_max <- max(object$ntime)
  plt <- object %>% 
    ggplot(aes(x = expr, y = ntime)) + 
    geom_violin() + 
    scale_x_discrete(name = "") + 
    coord_flip() + 
    theme_bw()
  y_label <- sprintf("Time [%s]", attr(object$ntime, "unit"))
  plt <- if (log) {
    plt + 
      scale_y_log10(
        name = y_label, 
        limits = c(y_min, y_max)
      )
  }
  else {
    plt + scale_y_continuous(
      name = y_label, 
      limits = c(y_min, y_max)
    )
  }
  plt
}

Required packages

| Package | Install Command | |----------------------------------------------------------------------|------------------------------------------------| | rmarkdown | install.packages("rmarkdown") | | remotes | install.packages("remotes") | | learnr | install.packages("learnr") | | teachr | remotes::install_github("astamm/teachr") | | testwhat | remotes::install_github("datacamp/testwhat") | | tidyverse | install.packages("tidyverse") | | rlang | install.packages("rlang") | | lobstr | install.packages("lobstr") | | microbenchmark | install.packages("microbenchmark") | | janitor | install.packages("janitor") |

Functions

vctrs logo

A function is an R object like any other R objects. For instance, you can create a list of functions. A function is an object that you can call via my_function(), which usually takes a certain number of input arguments and returns an R object. Let us write a function that takes two input arguments x and y and returns their sum. The general syntax would be something like:

my_add <- function(x, y = 1, ...) {
  return(x + y)
}

Note the <- assignment operator, which is not =. We strongly advice against using = for assignment because

Input arguments to a function can be of three forms:

Now we can comment our my_add function. It takes one mandatory argument x and adds it to a second optional argument y which takes the value of $1$ if not provided by the user when calling the function. It is also possible to pass extra arguments through the ... argument although any additional arguments that would be supplied are currently not used by the function implementation.

A function defined as above can be used right away. More generally, you can use rlang::as_function() to generate ready-to-use functions from the three following inputs:

add1 <- rlang::as_function(my_add)
add2 <- rlang::as_function("my_add")
add3 <- rlang::as_function("+")
add4 <- rlang::as_function(~ .x + .y)
add1
add2
add3
add4
add1(1, 2)
add2(1, 2)
add3(1, 2)
add4(1, 2)

Exercises

Exercise 1

Write a function that approximates the golden number using the Fibonacci sequence.


# The golden number can be approximated as 
# the ratio of two consecutive elements of 
# the Fibonacci sequence.
# Write a first function that generates 
# the $n$-th element of the Fibonacci 
# sequence. Next use that function in 
# a second function that approximates 
# the golden number.
fibonacci <- function(n) {
  if (n == 0) return(0)
  if (n == 1) return(1)
  fibonacci(n - 1) + fibonacci(n - 2)
}
golden_number <- function(n = 10) {
  fibonacci(n + 1) / fibonacci(n)
}
golden_number()

Exercise 2

Write a function that takes any number of input arguments and returns the list of input arguments itself by default, or, optionally, the number of input arguments.


# Use rlang::list2(...) to capture the 
# arguments from the dot argument.
# Add an extra optional argument (such 
# as "type") and use an if conditional 
# statement in the function implementation 
# to handle output according to the value 
# of argument "type".
parse_arguments <- function(..., num = FALSE) {
  if (num) return(rlang::dots_n(...))
  return(rlang::list2(...))
}
parse_arguments(10, b = 4, po = "a")
parse_arguments(10, b = 4, po = "a", num = TRUE)

Function-Oriented Programming

There are two kinds of programming languages:

While other popular programming languages such as Python and C++ are object-oriented, R is primarily function-oriented. The R6 package allows the users to define objects in an OOP fashion with associated methods just like in C++ using so-called R6 classes but this is not how R was originally designed and it is out of the scope here.

In R instead, you can for example implement the mean() function for various objects. An implementation for a specific object, say my_object, needs to be named mean.my_object while the mother function mean() calls a lower-level function UseMethod() that will dispatch the correct implementation given the input object type.

Tip: to reveal the internal code of a function, just type the name of the function with no parentheses.

Exercise

In the code cell below, we define a new R object, specifically for handling positive numbers. Provide a function that computes the geometric mean of a set of positive numbers using internally the mean.default() function, which provides the arithmetic mean of a set of real numbers.

positive_numbers <- function(x) {
  # The input x is a vector of real number
  stopifnot(all(x > 0))
  class(x) <- c("positive_numbers", class(x))
  x
}
# You can create a vector of numbers using 
# the c() function with the syntax 
# x <- c(1.1, 2.4, 4.3) which will generate 
# a vector with 3 real numbers in this case.
# You can alternatively generate a vector of 
# n positive numbers by sampling from a 
# probability distribution defined on the 
# set of real positive numbers such as the 
# exponential distribution. To do so in R, 
# use the rexp() function as in 
# x <- rexp(n).
# You need to implement the function 
# mean.positive_numbers().
?exp
?log
mean.positive_numbers <- function(x) {
    exp(mean.default(log(x)))
}

x <- rexp(10)
mean(x)
mean(positive_numbers(x))

The tidyverse

tidyverse logo

The approach that we are taking to learn R is called the tidyverse. It is a coding philosophy that takes the form of a series of packages that handles $5$ main tasks:

  1. Better handling of basic R data types by designing a dedicated package for each of them;
  2. Easy data import by designing $3$ packages tailored to import specific file formats;
  3. Easy data wrangling and transformation by designing a first package dedicated to reshaping data sets and a second package dedicated to transforming a data set;
  4. Easy data visualization by providing a dedicated package that contains an entire grammar for plotting;
  5. Easier data modeling by providing a set of dedicated packages regrouped under the tidymodels package that harmonize data pre-processing steps and calls to different implementations of a same statistical method.

The last task is out of scope for the first half of the course.

The authors of the tidyverse dedicated a whole book available online on these topics that is an excellent complement to the tutorials seen in this class: R for Data Science.

In our task of cleaning and analyzing data sets, we need to treat separately different kinds of variables: numerical, categorical, binary, date-times, time-of-day. In addition, it can be convenient from a programming point of view to be able to combine together variables of different types to go towards objects that can properly store data sets. Let us dig into each type.

Atomic vectors

Creating and subsetting vectors

Atomic vectors are containers of the same base type among $4$: integers, doubles, strings, boolean. To create an atomic vector, use c():

# Atomic vector of doubles
x <- c(1, 3, 5)
x
# Atomic vector of strings
y <- c("a", "b", "c")
y
# Atomic vector of booleans
z <- c(FALSE, TRUE, FALSE)
z
# Atomic vector of integers
w <- c(0L, 0L, 7L)
w
# To check the type of the data, use class()
class(w)

Vector elements can be named, either when they are defined:

x <- c(a = 1, b = 3, c = 5)
x

or, after their definition:

x <- c(1, 3, 5)
y
names(x) <- y
x

To subset an atomic vector, use the operator [:

# By position, using a vector of doubles or integers
# (each number refer to the position of elements to keep)
x[2]
x[c(1, 3)]
# Negative numbers for selecting all but these indices
x[-c(1, 3)]
x[-2]
# Using a vector of boolean of the same size
# (TRUE to keep, FALSE to remove)
z
x[z]
x[c(TRUE, FALSE, TRUE)]
# By name (when elements are named)
x["b"]
x[c("a", "c")]

Numerical variables

vctrs logo

Doubles and integers

Observations from numerical variables are stored as either integer or double objects which are integer-precision or double-precision numeric vectors respectively. In R, numbers are doubles by default. To explicitly mark a number as integer, place an L after the number:

typeof(1)
typeof(1L)

The distinction between integers and doubles is usually not important, but there are two important differences that you should be aware of:

  1. Doubles are approximations. Doubles represent floating point numbers that cannot always be precisely represented with a fixed amount of memory. This means that you should consider all doubles to be approximations. For example, if we compare $\sqrt{2}^2$ and $2$ using ==, this is what we get:

    r a <- sqrt(2)^2 b <- 2 a == b abs(a - b)

    This behaviour is common when working with floating point numbers: most calculations include some approximation error. Instead of comparing floating point numbers using ==, you should check that their difference in absolute value is not greater than a small tolerance value. For this purpose, you can use the function dplyr::near().

    r dplyr::near(sqrt(2)^2, 2)

  2. Number of special values. Integers have one special value: NA, while doubles have four: NA, NaN, Inf and -Inf. The NaN value stands for not-a-number.

    In R, the special value that encodes missing values is NA. Missing values can be detected in a vector by invoking the function is.na(). Note that it detects both NA and NaN entries. In the case of doubles, all three other special values NaN, Inf and -Inf can arise during division:

    r c(-1, 0, 1) / 0

    Here is now what is happening if we use specifically integers:

    r c(-1L, 0L, 1L) %/% 0L

    Note the use of the special operator %/% to perform the integer division. In effect, even if input arguments to the real division operator / are integers the result is converted to double. So we must use the integer division operator if we want to preserve an integer output.

    Avoid using == to check for these special values. Instead use the helper functions is.finite(), is.infinite(), and is.nan().

Concatenation, sequences

The function c() that allows to create atomic vectors, also allows to concatenate them:

x <- c(1, 7, 6, 4)
y <- c(3, 2, 9, 7)
c(x, y)

You can also create a vector that linearly spans a given interval on a regular grid with seq().

# By specifying the length of the grid
x <- seq(-1, 1, length.out = 10)
x
# By specifying the step of the grid
y <- seq(-1, 1, by = 0.1)
y

For sequences of integers of step $1$, there are two simplified functions. Say you want to generate a vector containing all integers from $1$ to $7$. You can then use either:

s1 <- seq_len(7)

or the syntax start:stop:

1:7

Of course, you can also use the generic function seq() for generating sequences of contiguous integers:

seq(1, 7, by = 1)

The last approach is however not recommended because it is memory-inefficient as illustrated by the following example:

s1 <- seq_len(1e6)
s2 <- 1:1e6
s3 <- seq(1, 1e6, by = 1)
s4 <- seq(1L, 1e6L, by = 1L)
tibble(
  Variable = c("s1", "s2", "s3", "s4"), 
  `Memory (bytes)` = purrr::map_dbl(Variable, ~ lobstr::obj_size(get(.)))
)

In fact, the first two functions use ALTREP which stands for ALTernative REPresentations and was introduced in R 3.5. Prior to R 3.5, : and seq_len() would compute all the requested integers and store the results into one big vector, e.g. 1:1000 would require around $4,000$ bytes. As of R 3.5, this sequence is instead represented by an ALTREP vector, so none of the values actually exist in memory. Instead, each time R accesses a particular value in the sequence, that value is computed on-the-fly. This saves memory and execution time, and allows users to use sequences which would otherwise be too big to fit in memory. The generic seq() function instead does not use ALTREP vector. Moreover, we need to be explicit about numbers being integers, e.g. using 1L, otherwise memory allocation doubles because integers are being seen as real numbers.

Usual mathematical functions and operations

You can find all the universal mathematical functions in the base package: abs(), sqrt(), log(), exp(), trigonometric functions, etc. These functions are all vectorized. You can find all usual mathematical operations in the base package as well: addition (+), subtraction (-), multiplication (*), division /, powering (^), integer division (%/%), modulo (%%) and all comparison operators. See the exhaustive list of available mathematical functions and operators with ?groupGeneric:

?groupGeneric

Recycling. Be careful that operations are also completely vectorized and R will not prevent you from performing operations between two vectors of different lengths (although it will warn you to some extent). Experiment in the following code box for understanding how R handles this situation:

x <- 1:5
y <- 1:10
x + y

x <- seq_len(5)
y <- seq_len(11)
x + y

If you experimented enough, you will have realized that R sometimes warn you but not always. Did you figure out when the warning occurs? In any event, even when R warns you, it recycles anyway. You can prevent recycling from happening for those situations where R warns you by starting your R script with the command options(warn = 2) which tells R to convert all warnings into errors. But beware that all other kind of warnings will be turned into errors as well. Plus, in the situation of recycling without warning, R will still recycle. So our recommendation is to never rely on recycling in your R code.

For completeness, we shall mention that you can also create and manipulate vectors of complex numbers. Complex numbers can be created via complex() from the base package. We will not detail complex vectors because it is pretty rare (but it happens) that they end up appearing in data sets for statistical analysis.

There is a work-in-progress (WIP) package within the tidyverse called vctrs that attempts to implement a more robust and more generic vector class that should be better suited for representing numeric variables in the near future.

Exercises

Exercise 1

Find out which entries of the following numeric variable are finite using both is.finite() and the negation of is.infinite(). What can you conclude?

x <- c(Inf, 3, NA, NaN)
Exercise 2

We propose here a benchmarking (timing of the execution times) of the seq_len() and the start:stop syntax:

mbm <- microbenchmark::microbenchmark(
  seq_len = seq_len(10000000), 
  start_stop = 1:10000000, 
  times = 1000
)
plot(mbm)

In view of these results, which approach would you recommend? Why are there two possible approaches then?

Exercise 3

Compute an approximation of $\pi$ using trigonometric functions.


Exercise 4

Extract even numbers from the following vector of integers:

x <- rpois(20)

Categorical variables

Categorical variables like the hair color for instance take values that are substantially words (dark, brown, blond, etc.). The natural way to store a sample of observations of categorical variables in R is a character vector which contains a sample of those words. There is a second type of vectors, called factors, that constrains character vectors to take on a finite set of possible values called levels. Values that are not among the expected levels are considered missing and treated as NA. The main advantages are that you control the order of the levels and thus you can customize plot appearances or variable summary. Let us dig into these two types of vectors.

Character vectors

stringr logo

Character vectors are vectors made of strings. There is a dedicated package within the tidyverse for manipulating strings (and character vectors) called stringr. The names of all the functions for manipulating character vectors in this package start with str_*. You can therefore see the exhaustive list when typing str_ thanks to code completion. The final part of the function names is usually explicit about the intended purpose of the function. You can then invoke the help on any function you think you might need to see the precise syntax for using it.

Be aware that all str_* functions are vectorized meaning that they will apply what they are intended to do to each string within a character vector.

Useful str_* functions for later exploring data sets include:

The function str_detect() is particularly useful for filtering observations in data sets by keeping only those for which categorical variables match certain values. To get the best experience of this function, one should learn how to match patterns with regular expressions. We recommend the interested reader to use str_view() which offers a nice visualization of the effect of regular expressions. For example, if we want to understand what kind of patterns the regular expression ".a." matches, we can use:

x <- c("apple", "banana", "pear")
stringr::str_view(x, ".a.")

We understand from the output that it is matching pieces of strings of length $3$ with the letter a in the middle.

However, the same regular expressions should then be used in combination with str_detect() when it comes to filtering data sets, because str_view() is provided only for educational purposes. So if you want to get rid of "apple" in the previous example because it does not contain any letter a surrounded by two other letters, you should do something like:

x <- c("apple", "banana", "pear")
x[stringr::str_detect(x, ".a.")]

Factors

forcats logo

Categorical variables are in fact variables that have a fixed and known set of possible values. Samples of observations of such variables are particularly well handled using factors in R. Functions for creating and manipulating factors can be found in the dedicated forcats package.

Why factors?

There are two major advantages in using factors over simpler character vectors for representing observations from categorical variables:

  1. When creating a factor, you can explicitly define the set of allowed values that the variable can take, which are called the levels. Hence, any other value in the character vector will be changed into missing value.
  2. When creating a factor, you also decide the order of the levels, which is by default alphabetical. This is particularly useful for reporting summaries of categorical variables, be them tables or bar plots.

However, it is considerably simpler to manipulate character vectors w.r.t. factors. As a result, we highly recommend that you stick to character vectors for representing categorical variables and only switch to the factor representation when it comes to summarizing or plotting them.

To convert a numeric or character vector, say x, into a factor, use factor(). Once you converted it into a factor, you can then use any of the fct_* functions of the forcats package to manipulate it. Again, thanks to code completion, you can easily get a list of all these functions and the end of the function name is usually self-explicit about the intended purpose.

Creating a factor

Imagine that you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

You can fix both of these problems with a factor. To create a factor you can start by creating a list of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can coerce your character vector of months into a factor:

y1 <- factor(x1, levels = month_levels)
y1
sort(y1)

Observe that any values not in the set of authorized levels will be silently converted to NA:

y2 <- factor(x2, levels = month_levels)
y2

If you'd rather want a warning, you can use readr::parse_factor():

y2 <- readr::parse_factor(x2, levels = month_levels)

If you omit the levels, they will be taken from the data in alphabetical order:

factor(x1)

Sometimes you could prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the facts, with forcats::fct_inorder():

# When creating the factor
f1 <- factor(x1, levels = unique(x1))
f1
# Afterwards
f2 <- forcats::fct_inorder(factor(x1))
f2

If you ever need to access the set of valid levels directly, you can do so with levels():

levels(f2)
Modifying factor levels

Sometimes, raw data sets are not that clean and typos in categorical variables is a major source of problems. Sometimes, the number of categories is unnecessarily high, making the frequency of some categories too small for statistical significance. We might also want to clarify categories for publication or collapse levels for high-level displays. This is where fct_recode(), fct_collapse() and fct_lump() come into play.

Let us experiment with these functions. We will use two categorical variables extracted from the forcats::gss_cat data set. This data set is a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. We will focus on the two categorical variables partyid and relig:

partyid <- forcats::gss_cat$partyid
relig <- forcats::gss_cat$relig

Let us first analyze the variable partyid. The first thing to do is to inspect the frequency of each category for this variable:

knitr::kable(table(partyid))

The function table() from the base package does the job of counting the number of observations in each category. The function knitr::kable() is just cosmetic for a nice formatting of the table in your browser. There are two things that we can notice:

  1. Category labels are not always very clear because of abbreviations.
  2. There is one category with a frequency of $1$ which, for the purpose of statistical analysis, is irrelevant.

We can solve both these problems with the forcats::fct_recode() function, for example as follows:

new_partyid <- forcats::fct_recode(
  partyid,
  "Republican (strong)"           = "Strong republican",
  "Republican (weak)"             = "Not str republican",
  "Independent (near republican)" = "Ind,near rep",
  "Independent (near democrat)"   = "Ind,near dem",
  "Democrat (weak)"               = "Not str democrat",
  "Democrat (strong)"             = "Strong democrat", 
  "Unsure"                        = "No answer",
  "Unsure"                        = "Don't know"
)
knitr::kable(table(new_partyid))

In the function forcats::fct_recode(), we only need to list the categories that we aim at relabeling. There other ones will be left as is. If we attribute the same new label to several categories, the latter will be merged together, as we did for categories "No answer" and "Don't know" that we grouped under the category "Unsure".

Exercises

Exercise 1
  1. Experiment with the function stringr::str_length() on the following vector:

    ```r x <- c("a", "R for data science", NA)

    ```

  2. What does it do? Did you expect such a result?

  3. If you did not expect this result, what piece of R code would compute what you intended?

    ```r x <- c("a", "R for data science", NA)

    ```

Exercise 2

Use one or more of the stringr::str_*() functions to change the following sentence in order to make you happy (but me less):

x <- "Centrale Lyon will win again the Challenge this year, vive Centrale Lyon !"
Exercise 3

Explore how to use the function forcats::fct_collapse() to group categories into the four following macro-categories: other, republicans, independents and democrats.

knitr::kable(table(partyid))
new_partyid <- forcats::fct_collapse(
  partyid,
  other = c("No answer", "Don't know", "Other party"),
  republicans = c("Strong republican", "Not str republican"),
  independents = c("Ind,near rep", "Independent", "Ind,near dem"),
  democrats = c("Not str democrat", "Strong democrat")
)
knitr::kable(table(new_partyid))
Exercise 4

Explore how to use the function forcats::fct_lump() to automatically group the smallest categories of the relig variable so that we end up with exactly $10$ categories.

knitr::kable(table(relig))
new_relig <- forcats::fct_lump(relig, n = 10)
knitr::kable(table(new_relig))

Binary variables

Binary variables are variables which can take only two outcomes, which can be encoded in R as $0$'s and $1$'s or as a boolean (TRUE or FALSE or NA).

Logical vectors are vectors of booleans that can therefore take only three possible values: FALSE, TRUE, and NA. As a result, they are particularly well suited for representing binary variables. Logical vectors are usually constructed with comparison operators ==, !=, <, <=, >, >=.

Write a piece of code that returns a binary vector with TRUE entries for pair integers up to $10$ and FALSE otherwise:


?`%%`

Remember that you can also create binary vectors by hand with c():

c(TRUE, TRUE, FALSE, NA)

Missing values

Logical, integer, real, complex and character vectors all come with their own missing value:

NA            # logical
NA_integer_   # integer
NA_real_      # double
NA_complex_   # complex
NA_character_ # character

Normally you do not need to know about these different types because you can always use NA and it will be converted to the correct type using coercion rules internal to R. However, there are some functions that are strict about their inputs so it might be helpful to know about this.

Date and date-times variables

lubridate logo

Dates and times are arguably the most complex data types. Just think of the nightmare of determining leap years. It is not only a year divisible by $4$. The full rule has actually three parts. And this is just the beginning of our problems with dates and times (think about time zones, daylight savings times, etc.). Fortunately for us, the lubridate package solves seamlessly all of these problems.

Note that, bottom line, dates and times are nothing but numbers because, internally, they are stored as the number of seconds separating a date/time from a fixed date. Hence, all the things described for numeric vectors will generally be allowed for date/time objects but we strongly recommend to manipulate them through the lubridate package because operation rules on date/time objects are not the same as for classical numbers.

Creating date and date-time objects

Most of the time, you will create a date or date-time object:

Extracting components from date and date-time objects

The reverse operations can be done, i.e. it is possible to extract individual components from a date or date-time object with the accessor functions year(), month(), etc.

Take a look at the help of each accessor functions to find the right syntax for extracting the year, month, day, hours, minutes, seconds of the following date:

datetime <- lubridate::ymd_hms("2016-07-08 12:34:56")

How did you extract the day? What is the difference between the different functions to perform day extraction?

For month() and wday(), you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

datetime <- lubridate::ymd_hms("2016-07-08 12:34:56")
lubridate::month(datetime, label = TRUE)
lubridate::wday(datetime, label = TRUE, abbr = FALSE)

Is the last output from lubridate::wday() what you expected? How can you fix this?

Modifying a date or date-time object

You can also use each accessor function to set the components of a date or date-time object:

datetime <- lubridate::ymd_hms("2016-07-08 12:34:56")

lubridate::year(datetime) <- 2020
datetime
lubridate::month(datetime) <- 01
datetime
lubridate::hour(datetime) <- lubridate::hour(datetime) + 1
datetime

Alternatively, rather than modifying in place, you can create a new date or date-time object with update(). This also allows you to set multiple values at once.

update(datetime, year = 2020, month = 2, mday = 2, hour = 2)

Note that update() is a method from the stats package, an implementation of which is proposed in the lubridate package for date and date-time objects. Hence, it is not part of the lubridate package and should therefore not be referred to as such.

If values are too big, they will roll-over:

datetime <- lubridate::ymd("2015-02-01")
update(datetime, mday = 30)
update(datetime, hour = 400)

Lists

purrr logo{style="float: right; margin-left: 10 px; margin-right: 10px;" width="64"}

Atomic vectors can only store booleans, numeric entries or strings and are required to contain only one type of these. In fact, R is very permissive and allows you to mix all types into a vector but results might be unexpected:

c(1, 3, TRUE)
c(1, 3, TRUE, "a")

Did you expect this outcome?

Lists are a special kind of vectors that provide a solution to $2$ major issues of atomic vectors:

  1. Creating vectors with entries that have heterogeneous types and that keep their own type when grouped together in the vector;
  2. Creating vectors with homogeneous entries that have more complex data types, such as matrices, data sets, results as output by R modeling functions or any other R object.

To create a list, use the function list2() from the rlang package. For example, using rlang::list2() instead of c() in the previous examples allows you to preserve the original data types of the entries:

rlang::list2(1, 3, TRUE)
rlang::list2(1, 3, TRUE, "a")

Note however that, while for atomic vectors you use c() for both creation and concatenation, you cannot do that with rlang::list2():

l <- rlang::list2(TRUE, "a")
rlang::list2(1, 3, l)

Instead, when working with lists, creation and concatenation are two different operations handled by two different functions: rlang::list2() for creation and the usual c() for concatenation:

l <- rlang::list2(TRUE, "a")
c(1, 3, l)

This actually means that c() is in fact the generic function for concatenation. Whenever it detects a list as one of its arguments, it automatically converts non-lists object into length-1 lists and produce a concatenated list. Following this logic, any atomic vector can be viewed as a list. Indeed, a numeric vector (such as 1:10) is nothing but a list with only numeric entries.

There are two operators for performing list subsetting: [ and [[. For both operators, you can subset either by position using numeric vectors or by name using character vectors. The difference between the two operators is that

l <- rlang::list2(a = 1, b = 2)
l[1]
l["a"]
l[1:2]
l[c("a", "b")]
l[[1]]
l[["a"]]

List manipulation is handled by the purrr package. We invite you to see the webpage of the package for an in-depth presentation of all the operations you can perform on lists. We here focus on the family of functions with the map prefix. The map functions transform their input by applying a function to each element (or a subset of elements) and returning a list or a vector of the same length as the input. In particular:

The first argument to these functions is the input list or atomic vector. The second argument is the function to be applied to each element of the list. It can be specified in three ways:

Let see a first example, Say we want to compute the sample means of samples of size $10$ from the normal distribution with unit standard deviation but means varying from $1$ to $10$. We can solve this in a very compact and easy-to-read way using purrr::map*() functions:

# Either using an explicit function
set.seed(123)
purrr::map_dbl(purrr::map(1:10, function(x) rnorm(10, x)), mean)
# Or a lambda function
set.seed(123)
purrr::map_dbl(purrr::map(1:10, ~ rnorm(10, .)), mean)

Additional parameters to the map functions are further arguments to be passed on the mapped function. The mapped input is passed as the first argument of the mapped function that has not yet been assigned by the named additional parameters in the map function call. Let see that in action to better understand how input mapping is done. Try to reproduce the last example using only the function name in the second argument of the map call:


set.seed(123)

::: {#map-yourturn-hint} Hint: Remember to always name your additional arguments in the map function calls. :::

purrr::map_dbl(purrr::map(1:10, rnorm, n = 10), mean)
ex() %>% 
  check_function("map") %>% {
    check_arg(., ".x") %>% check_equal()
    check_arg(., ".f") %>% check_equal()
    check_arg(., "n") %>% check_equal()
  }

The map2 and pmap families of functions are variants of the map family but iterate over multiple inputs simultaneously. They are parallel in the sense that each input is processed in parallel with the others, not in the sense of multicore computing.

Tibbles

tibble logo

What is a tibble ?

Data sets are tables that store in each row an observation, such as an individual, and in each column a variable that is observed for such an individual. Variables can be of various type: numerical, categorical, binary or date-time. Hence, a data set can be column-wise heterogeneous. There is a package called tibble that handles data set with the R object tbl_df that is called a tibble. Bottom line, tibbles are nothing but customized lists where:

Behavior of a tbl_df object

Creating tibbles

To create a tibble, use tibble() (column-wise creation) or tribble() (row-wise creation):

To coerce a data set already imported as an R object different from tbl_df (currently supporting coercion from data.frame, list, matrix and table), use as_tibble:

l <- rlang::list2(x = c("a", "b"), y = rlang::list2(1:3, 4:6))
l
tibble::as_tibble(l)

Finally, if you want to have a compact view of a data set, you can use glimpse(). Try it on the iris data set:

tibble::glimpse(iris)

Subsetting

Experiment the different ways to use the [ operator (by position, name, etc.) on tibbles (note that you can now subset both rows and/or columns):

iris_tbl <- tibble::as_tibble(iris)

To subset a single column, you can also use the [[ extractor operator, which will automatically coerce the result into the simplest possible R object representation:

iris_tbl <- tibble::as_tibble(iris)
# By position
iris_tbl[[1]]
# By name
iris_tbl[["Sepal.Length"]]

Finally, one can use the $ operator to extract a given column using its name in an unquoted fashion:

iris_tbl$Sepal.Length

Matrices

It happens sometimes (especially in simulation setups) that you end up with a data set with only numerical variables. In this situation, it might be suitable to use a matrix representation instead of a tibble. There are basic functions for handling matrices in the base package.

Useful functions to handle matrices

To create a matrix from a vector, use matrix():

# Fill matrix by columns
matrix(data = 1:6, nrow = 2, ncol = 3)
# Fill matrix by rows
matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)

To create a matrix by concatenating rows or columns, use rbind() and cbind() respectively:

# Concatenate vectors as rows
rbind(1:3, 4:6)
# Concatenate vectors as columns
cbind(c(1, 4), c(2, 5), c(3, 6))

To subset a matrix, the subset operator [ can be used with two arguments:

To create a matrix from a data set made of numerical variables, use as.matrix():

df <- tibble(
  height = c(1.3, 4.7, 6.3, 7.1), 
  weight = c(43, 65, 12, 78), 
  width = c(34, 75, 23, 54)
)
X <- as.matrix(df)
X

Note that when converting a data set into a matrix, the column of the matrix are named after the variable names in the data set. You might want to change these names or even remove them. This can be done via the function colnames(). In particular, assigning a new character vector to colnames(X) redefines the column names, or assigning NULL removes them:

colnames(X) <- NULL
X

To know the dimensions of the matrix (number of rows and columns), use dim():

dim(X)

To transpose a matrix, use t():

t(X)

Universal mathematical functions and operators +, -, * and / can be used without any problem. They just apply the function or the operator element-wise. For operations, matrix dimensions should match. An error will be thrown if it is not the case. The element-wise multiplication is not the usual matrix multiplication. The latter can be performed using the operator %*%. For example, compute $X^\top X$ and store it in matrix $A$:

A <- t(X) %*% X
A

When we will learn about regression analysis, we will make extensive use of solving linear systems of the form $A x = b$. This is handled in an algorithmic efficient way by solve(A, b), which is a far better solution than the intuitive $x = A^{-1} b$. If the b argument is not specified, the function provides the inverse of $A$ (which needs to be squared and invertible):

# Solve Ax = b with
b <- rep(1, 3)
# Either directly via:
solve(A, b)
# Or first by finding the inverse of A via:
Ainv <- solve(A)
# and then multiplying it with b:
as.numeric(Ainv %*% b)

The first strategy is much more efficient both in terms of accuracy and computation time. We can for instance benchmark the computation time of the two approaches for a matrix $A$ of size $100 \times 100$:

X <- matrix(rnorm(20000), nrow = 200, ncol = 100)
A <- t(X) %*% X
b <- rep(1, 100)
mbm <- microbenchmark::microbenchmark(
  solve = solve(A, b), 
  inv = as.numeric(solve(A) %*% b), 
  check = "equal"
)
plot(mbm)

To perform matrix exponential, logarithm or powering, use the expm package. It provides the functions expm(), logm() and the operator %^% for matrix powering.

Use of matrices in statistics

In statistics, a data set of $p$ numerical variables of which we observed $n$ realizations is often represented as an $n \times p$ matrix $X$ where variables are in columns and observations in rows. We can get easily the sample mean using colMeans(X) and the sample covariance matrix using cov(X):

X <- as.matrix(cars)
m <- colMeans(X)
m
S <- cov(X)
S

To perform the spectral decomposition of a square matrix, use eigen(). Here for instance, we could need to compute the spectrum of the sample covariance matrix, i.e. its two eigenvalues and associated eigenvectors:

eigen(S, symmetric = TRUE)
eigen(S, symmetric = TRUE, only.values = TRUE)

When the number $p$ of variables (matrix columns) exceeds the number $n$ of observations (matrix rows), the sample covariance matrix is known to be a poor estimator of the true unknown covariance matrix. In particular it is not invertible. The corpcor package provides a number of functions to perform shrinkage estimation of both the covariance and precision matrices.

Exercise

The iris_tbl data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for $50$ flowers from each of $3$ species of iris:

iris_tbl <- iris %>% 
  as_tibble() %>% 
  janitor::clean_names()
iris_tbl
  1. Define a new tibble containing only sepal_length, sepal_width and petal_length and convert it into a matrix that you will call X.
  2. Extract the variable petal_width and name it y and compute $\mathbf{b} = X^\top \mathbf{y}$.
  3. Compute the spectrum of $A = X^\top X$ to prove that it is positive definite and not only semi-definite.
  4. Compute $\widehat{\boldsymbol{\beta}} = \left( X^\top X \right)^{-1} X^\top \mathbf{y}$.

X <- iris_tbl %>%
  dplyr::select(1:3) %>%
  as.matrix()
X
y <- iris_tbl$petal_width
b <- t(X) %*% y
A <- t(X) %*% X
eigen(A, only.values = TRUE, symmetric = TRUE)
beta <- solve(A, b)
beta

The pipe operator

magrittr logo

What is it?

The tidyverse philosophy adopts a coding style that relies on the pipe operator %>% available from the magrittr package, which is part of the tidyverse as well (meaning that loading the tidyverse package into your R environment automatically makes the pipe operator available for use). The idea is that you start with an object you intend to modify through some function(s), and pipe it into a function using %>%. The object is then passed as the first argument of the function. Hence, the code

iris %>% head()

is completely equivalent to

head(iris)

What makes the pipe operator powerful is that one can chain several operations one after the other. This makes code easier to read because:

The dot place-holder

Piping creates chains where at each position in the chain, the current result of previous transformations of the input object is available through the dot place-holder for further processing. For example, the following code extracts the first three letters of the alphabet and print them in the format position: letter:

1:3 %>% 
  paste(letters[.], sep = ": ")

Indeed, this is equivalent to:

paste(1:3, letters[1:3], sep = ": ")

because the vector 1:3 is passed twice:

  1. as first argument of the function paste();
  2. within letters[1:3] as second argument of the function paste().

Exercices

Exercice 1. Write a piece of code that extracts the even numbers lower than 100 and perform their sum:


?`[`
?`%%`
?sum
1:100 %>% 
  `[`(. %% 2 == 0) %>% 
  sum()
ex() %>% {
  check_operator(., "%>%", not_called_msg = "You are expected to use chaining with the pipe operator %>%.")
  check_operator(., "`[`", not_called_msg = "Have you used the `[` subset operator to filter out the odd integers prior to performing the sum?")
  check_operator(., "%%", not_called_msg = "You did not use the %% modulo operator which is needed for finding out which integer is even")
  check_code(., ". %% 2", missing_msg = "You misused the %% modulo operator. The current associated code does not filter out odd integers.")
  check_operator(., "%>%", index = 2, not_called_msg = "You are expected to use at least two pipe operators.")
  check_function(., "sum") %>% 
    check_arg("...") %>% 
    check_equal(incorrect_msg = "Did you manage to correctly keep only even integers from the first 100 integers prior to calling the sum() function?")
  check_result(.) %>% check_equal()
}

Exercice 2. It is possible to force the pipe not to pass the current result in the chain to the first argument of the next function. This is done by surrounding the next action in the chain with curly brackets. Modify the following code to display only the minimum, mean and maximum summary of the first 10 integers:

1:10 %>% 
  c(min(.), mean(.), max(.))

Exercice 3. Simplify the following code using pipes:

a <- dplyr::filter(mtcars, carb > 1)
b <- dplyr::group_by(a, cyl)
c <- dplyr::summarise(b, Avg_mpg = mean(mpg))
d <- dplyr::arrange(c, dplyr::desc(Avg_mpg))
d



astamm/teachr documentation built on Jan. 12, 2023, 7:21 a.m.