library(learnr) library(testwhat) library(tidyverse) options(repos = "https://cloud.r-project.org") tutorial_options( exercise.timelimit = 60, exercise.checker = testwhat::testwhat_learnr ) knitr::opts_chunk$set(comment = NA) plot.microbenchmark <- function(object, ..., log = TRUE) { object <- object %>% dplyr::mutate( ntime = microbenchmark:::convert_to_unit(time, "t") ) %>% dplyr::group_by(expr) %>% dplyr::filter( !(ntime %in% boxplot.stats(ntime)$out) ) y_min <- min(object$ntime) y_max <- max(object$ntime) plt <- object %>% ggplot(aes(x = expr, y = ntime)) + geom_violin() + scale_x_discrete(name = "") + coord_flip() + theme_bw() y_label <- sprintf("Time [%s]", attr(object$ntime, "unit")) plt <- if (log) { plt + scale_y_log10( name = y_label, limits = c(y_min, y_max) ) } else { plt + scale_y_continuous( name = y_label, limits = c(y_min, y_max) ) } plt }
| Package | Install Command |
|----------------------------------------------------------------------|------------------------------------------------|
| rmarkdown | install.packages("rmarkdown")
|
| remotes | install.packages("remotes")
|
| learnr | install.packages("learnr")
|
| teachr | remotes::install_github("astamm/teachr")
|
| testwhat | remotes::install_github("datacamp/testwhat")
|
| tidyverse | install.packages("tidyverse")
|
| rlang | install.packages("rlang")
|
| lobstr | install.packages("lobstr")
|
| microbenchmark | install.packages("microbenchmark")
|
| janitor | install.packages("janitor")
|
A function is an R object like any other R objects. For instance, you can create a list of functions. A function is an object that you can call via my_function()
, which usually takes a certain number of input arguments and returns an R object. Let us write a function that takes two input arguments x
and y
and returns their sum. The general syntax would be something like:
my_add <- function(x, y = 1, ...) { return(x + y) }
Note the <-
assignment operator, which is not =
. We strongly advice against using =
for assignment because
==
for testing equality;Input arguments to a function can be of three forms:
...
: the dot argument is useful whenever you do not know, or do not want to fix, the number of inputs that the function can accept; in this case, the number of input parameters can be retrieved via the function dots_n()
from the rlang package; you can capture the arguments passed to ...
as a list via rlang::list2()
.Now we can comment our my_add
function. It takes one mandatory argument x
and adds it to a second optional argument y
which takes the value of $1$ if not provided by the user when calling the function. It is also possible to pass extra arguments through the ...
argument although any additional arguments that would be supplied are currently not used by the function implementation.
A function defined as above can be used right away. More generally, you can use rlang::as_function()
to generate ready-to-use functions from the three following inputs:
add1 <- rlang::as_function(my_add) add2 <- rlang::as_function("my_add") add3 <- rlang::as_function("+") add4 <- rlang::as_function(~ .x + .y) add1 add2 add3 add4 add1(1, 2) add2(1, 2) add3(1, 2) add4(1, 2)
Write a function that approximates the golden number using the Fibonacci sequence.
# The golden number can be approximated as # the ratio of two consecutive elements of # the Fibonacci sequence.
# Write a first function that generates # the $n$-th element of the Fibonacci # sequence. Next use that function in # a second function that approximates # the golden number.
fibonacci <- function(n) { if (n == 0) return(0) if (n == 1) return(1) fibonacci(n - 1) + fibonacci(n - 2) }
golden_number <- function(n = 10) { fibonacci(n + 1) / fibonacci(n) } golden_number()
Write a function that takes any number of input arguments and returns the list of input arguments itself by default, or, optionally, the number of input arguments.
# Use rlang::list2(...) to capture the # arguments from the dot argument.
# Add an extra optional argument (such # as "type") and use an if conditional # statement in the function implementation # to handle output according to the value # of argument "type".
parse_arguments <- function(..., num = FALSE) { if (num) return(rlang::dots_n(...)) return(rlang::list2(...)) } parse_arguments(10, b = 4, po = "a") parse_arguments(10, b = 4, po = "a", num = TRUE)
There are two kinds of programming languages:
While other popular programming languages such as Python and C++ are object-oriented, R is primarily function-oriented. The R6 package allows the users to define objects in an OOP fashion with associated methods just like in C++ using so-called R6
classes but this is not how R was originally designed and it is out of the scope here.
In R instead, you can for example implement the mean()
function for various objects. An implementation for a specific object, say my_object
, needs to be named mean.my_object
while the mother function mean()
calls a lower-level function UseMethod()
that will dispatch the correct implementation given the input object type.
Tip: to reveal the internal code of a function, just type the name of the function with no parentheses.
In the code cell below, we define a new R object, specifically for handling positive numbers. Provide a function that computes the geometric mean of a set of positive numbers using internally the mean.default()
function, which provides the arithmetic mean of a set of real numbers.
positive_numbers <- function(x) { # The input x is a vector of real number stopifnot(all(x > 0)) class(x) <- c("positive_numbers", class(x)) x }
# You can create a vector of numbers using # the c() function with the syntax # x <- c(1.1, 2.4, 4.3) which will generate # a vector with 3 real numbers in this case. # You can alternatively generate a vector of # n positive numbers by sampling from a # probability distribution defined on the # set of real positive numbers such as the # exponential distribution. To do so in R, # use the rexp() function as in # x <- rexp(n).
# You need to implement the function # mean.positive_numbers().
?exp ?log
mean.positive_numbers <- function(x) { exp(mean.default(log(x))) } x <- rexp(10) mean(x) mean(positive_numbers(x))
The approach that we are taking to learn R is called the tidyverse. It is a coding philosophy that takes the form of a series of packages that handles $5$ main tasks:
The last task is out of scope for the first half of the course.
The authors of the tidyverse dedicated a whole book available online on these topics that is an excellent complement to the tutorials seen in this class: R for Data Science.
In our task of cleaning and analyzing data sets, we need to treat separately different kinds of variables: numerical, categorical, binary, date-times, time-of-day. In addition, it can be convenient from a programming point of view to be able to combine together variables of different types to go towards objects that can properly store data sets. Let us dig into each type.
Atomic vectors are containers of the same base type among $4$: integers, doubles, strings, boolean. To create an atomic vector, use c()
:
# Atomic vector of doubles x <- c(1, 3, 5) x # Atomic vector of strings y <- c("a", "b", "c") y # Atomic vector of booleans z <- c(FALSE, TRUE, FALSE) z # Atomic vector of integers w <- c(0L, 0L, 7L) w # To check the type of the data, use class() class(w)
Vector elements can be named, either when they are defined:
x <- c(a = 1, b = 3, c = 5) x
or, after their definition:
x <- c(1, 3, 5) y names(x) <- y x
To subset an atomic vector, use the operator [
:
# By position, using a vector of doubles or integers # (each number refer to the position of elements to keep) x[2] x[c(1, 3)] # Negative numbers for selecting all but these indices x[-c(1, 3)] x[-2] # Using a vector of boolean of the same size # (TRUE to keep, FALSE to remove) z x[z] x[c(TRUE, FALSE, TRUE)] # By name (when elements are named) x["b"] x[c("a", "c")]
Observations from numerical variables are stored as either integer
or double
objects which are integer-precision or double-precision numeric vectors respectively. In R, numbers are doubles by default. To explicitly mark a number as integer, place an L
after the number:
typeof(1) typeof(1L)
The distinction between integers and doubles is usually not important, but there are two important differences that you should be aware of:
Doubles are approximations. Doubles represent floating point numbers that cannot always be precisely represented with a fixed amount of memory. This means that you should consider all doubles to be approximations. For example, if we compare $\sqrt{2}^2$ and $2$ using ==
, this is what we get:
r
a <- sqrt(2)^2
b <- 2
a == b
abs(a - b)
This behaviour is common when working with floating point numbers: most calculations include some approximation error. Instead of comparing floating point numbers using ==
, you should check that their difference in absolute value is not greater than a small tolerance value. For this purpose, you can use the function dplyr::near()
.
r
dplyr::near(sqrt(2)^2, 2)
Number of special values. Integers have one special value: NA
, while doubles have four: NA
, NaN
, Inf
and -Inf
. The NaN
value stands for not-a-number.
In R, the special value that encodes missing values is NA
. Missing values can be detected in a vector by invoking the function is.na()
. Note that it detects both NA
and NaN
entries. In the case of doubles, all three other special values NaN
, Inf
and -Inf
can arise during division:
r
c(-1, 0, 1) / 0
Here is now what is happening if we use specifically integers:
r
c(-1L, 0L, 1L) %/% 0L
Note the use of the special operator %/%
to perform the integer division. In effect, even if input arguments to the real division operator /
are integers the result is converted to double. So we must use the integer division operator if we want to preserve an integer output.
Avoid using ==
to check for these special values. Instead use the helper functions is.finite()
, is.infinite()
, and is.nan()
.
The function c()
that allows to create atomic vectors, also allows to concatenate them:
x <- c(1, 7, 6, 4) y <- c(3, 2, 9, 7) c(x, y)
You can also create a vector that linearly spans a given interval on a regular grid with seq()
.
# By specifying the length of the grid x <- seq(-1, 1, length.out = 10) x # By specifying the step of the grid y <- seq(-1, 1, by = 0.1) y
For sequences of integers of step $1$, there are two simplified functions. Say you want to generate a vector containing all integers from $1$ to $7$. You can then use either:
s1 <- seq_len(7)
or the syntax start:stop
:
1:7
Of course, you can also use the generic function seq()
for generating sequences of contiguous integers:
seq(1, 7, by = 1)
The last approach is however not recommended because it is memory-inefficient as illustrated by the following example:
s1 <- seq_len(1e6) s2 <- 1:1e6 s3 <- seq(1, 1e6, by = 1) s4 <- seq(1L, 1e6L, by = 1L)
tibble( Variable = c("s1", "s2", "s3", "s4"), `Memory (bytes)` = purrr::map_dbl(Variable, ~ lobstr::obj_size(get(.))) )
In fact, the first two functions use ALTREP which stands for ALTernative REPresentations and was introduced in R 3.5. Prior to R 3.5, :
and seq_len()
would compute all the requested integers and store the results into one big vector, e.g. 1:1000
would require around $4,000$ bytes. As of R 3.5, this sequence is instead represented by an ALTREP vector, so none of the values actually exist in memory. Instead, each time R accesses a particular value in the sequence, that value is computed on-the-fly. This saves memory and execution time, and allows users to use sequences which would otherwise be too big to fit in memory. The generic seq()
function instead does not use ALTREP vector. Moreover, we need to be explicit about numbers being integers, e.g. using 1L
, otherwise memory allocation doubles because integers are being seen as real numbers.
You can find all the universal mathematical functions in the base package: abs()
, sqrt()
, log()
, exp()
, trigonometric functions, etc. These functions are all vectorized. You can find all usual mathematical operations in the base package as well: addition (+
), subtraction (-
), multiplication (*
), division /
, powering (^
), integer division (%/%
), modulo (%%
) and all comparison operators. See the exhaustive list of available mathematical functions and operators with ?groupGeneric
:
?groupGeneric
Recycling. Be careful that operations are also completely vectorized and R will not prevent you from performing operations between two vectors of different lengths (although it will warn you to some extent). Experiment in the following code box for understanding how R handles this situation:
x <- 1:5 y <- 1:10 x + y x <- seq_len(5) y <- seq_len(11) x + y
If you experimented enough, you will have realized that R sometimes warn you but not always. Did you figure out when the warning occurs? In any event, even when R warns you, it recycles anyway. You can prevent recycling from happening for those situations where R warns you by starting your R script with the command options(warn = 2)
which tells R to convert all warnings into errors. But beware that all other kind of warnings will be turned into errors as well. Plus, in the situation of recycling without warning, R will still recycle. So our recommendation is to never rely on recycling in your R code.
For completeness, we shall mention that you can also create and manipulate vectors of complex numbers. Complex numbers can be created via complex()
from the base package. We will not detail complex vectors because it is pretty rare (but it happens) that they end up appearing in data sets for statistical analysis.
There is a work-in-progress (WIP) package within the tidyverse called vctrs that attempts to implement a more robust and more generic vector class that should be better suited for representing numeric variables in the near future.
Find out which entries of the following numeric variable are finite using both is.finite()
and the negation of is.infinite()
. What can you conclude?
x <- c(Inf, 3, NA, NaN)
We propose here a benchmarking (timing of the execution times) of the seq_len()
and the start:stop
syntax:
mbm <- microbenchmark::microbenchmark( seq_len = seq_len(10000000), start_stop = 1:10000000, times = 1000 ) plot(mbm)
In view of these results, which approach would you recommend? Why are there two possible approaches then?
Compute an approximation of $\pi$ using trigonometric functions.
Extract even numbers from the following vector of integers:
x <- rpois(20)
Categorical variables like the hair color for instance take values that are substantially words (dark, brown, blond, etc.). The natural way to store a sample of observations of categorical variables in R is a character vector which contains a sample of those words. There is a second type of vectors, called factors
, that constrains character vectors to take on a finite set of possible values called levels. Values that are not among the expected levels are considered missing and treated as NA
. The main advantages are that you control the order of the levels and thus you can customize plot appearances or variable summary. Let us dig into these two types of vectors.
Character vectors are vectors made of strings. There is a dedicated package within the tidyverse for manipulating strings (and character vectors) called stringr. The names of all the functions for manipulating character vectors in this package start with str_*
. You can therefore see the exhaustive list when typing str_
thanks to code completion. The final part of the function names is usually explicit about the intended purpose of the function. You can then invoke the help on any function you think you might need to see the precise syntax for using it.
Be aware that all str_*
functions are vectorized meaning that they will apply what they are intended to do to each string within a character vector.
Useful str_*
functions for later exploring data sets include:
str_c()
: for concatenating strings together;str_detect()
: for finding entries with strings that match a given pattern;str_replace()
and str_replace_all()
: for finding and replacing a given pattern by another one;str_replace_na()
: for handling missing values;str_sub()
: for subsetting a string.The function str_detect()
is particularly useful for filtering observations in data sets by keeping only those for which categorical variables match certain values. To get the best experience of this function, one should learn how to match patterns with regular expressions. We recommend the interested reader to use str_view()
which offers a nice visualization of the effect of regular expressions. For example, if we want to understand what kind of patterns the regular expression ".a."
matches, we can use:
x <- c("apple", "banana", "pear") stringr::str_view(x, ".a.")
We understand from the output that it is matching pieces of strings of length $3$ with the letter a
in the middle.
However, the same regular expressions should then be used in combination with str_detect()
when it comes to filtering data sets, because str_view()
is provided only for educational purposes. So if you want to get rid of "apple"
in the previous example because it does not contain any letter a
surrounded by two other letters, you should do something like:
x <- c("apple", "banana", "pear") x[stringr::str_detect(x, ".a.")]
Categorical variables are in fact variables that have a fixed and known set of possible values. Samples of observations of such variables are particularly well handled using factors
in R. Functions for creating and manipulating factors
can be found in the dedicated forcats package.
There are two major advantages in using factor
s over simpler character vectors for representing observations from categorical variables:
factor
, you can explicitly define the set of allowed values that the variable can take, which are called the levels. Hence, any other value in the character vector will be changed into missing value.factor
, you also decide the order of the levels, which is by default alphabetical. This is particularly useful for reporting summaries of categorical variables, be them tables or bar plots.However, it is considerably simpler to manipulate character vectors w.r.t. factor
s. As a result, we highly recommend that you stick to character vectors for representing categorical variables and only switch to the factor
representation when it comes to summarizing or plotting them.
To convert a numeric or character vector, say x
, into a factor
, use factor()
. Once you converted it into a factor
, you can then use any of the fct_*
functions of the forcats package to manipulate it. Again, thanks to code completion, you can easily get a list of all these functions and the end of the function name is usually self-explicit about the intended purpose.
Imagine that you have a variable that records month:
x1 <- c("Dec", "Apr", "Jan", "Mar")
Using a string to record this variable has two problems:
There are only twelve possible months, and there is nothing saving you from typos:
r
x2 <- c("Dec", "Apr", "Jam", "Mar", "Try")
It does not sort in a useful way:
r
sort(x1)
You can fix both of these problems with a factor
. To create a factor you can start by creating a list of the valid levels:
month_levels <- c( "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" )
Now you can coerce your character vector of months into a factor
:
y1 <- factor(x1, levels = month_levels) y1 sort(y1)
Observe that any values not in the set of authorized levels will be silently converted to NA
:
y2 <- factor(x2, levels = month_levels) y2
If you'd rather want a warning, you can use readr::parse_factor()
:
y2 <- readr::parse_factor(x2, levels = month_levels)
If you omit the levels, they will be taken from the data in alphabetical order:
factor(x1)
Sometimes you could prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor
by setting levels to unique(x)
, or after the facts, with forcats::fct_inorder()
:
# When creating the factor f1 <- factor(x1, levels = unique(x1)) f1 # Afterwards f2 <- forcats::fct_inorder(factor(x1)) f2
If you ever need to access the set of valid levels directly, you can do so with levels()
:
levels(f2)
Sometimes, raw data sets are not that clean and typos in categorical variables is a major source of problems. Sometimes, the number of categories is unnecessarily high, making the frequency of some categories too small for statistical significance. We might also want to clarify categories for publication or collapse levels for high-level displays. This is where fct_recode()
, fct_collapse()
and fct_lump()
come into play.
Let us experiment with these functions. We will use two categorical variables extracted from the forcats::gss_cat
data set. This data set is a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. We will focus on the two categorical variables partyid
and relig
:
partyid <- forcats::gss_cat$partyid relig <- forcats::gss_cat$relig
Let us first analyze the variable partyid
. The first thing to do is to inspect the frequency of each category for this variable:
knitr::kable(table(partyid))
The function table()
from the base package does the job of counting the number of observations in each category. The function knitr::kable()
is just cosmetic for a nice formatting of the table in your browser. There are two things that we can notice:
We can solve both these problems with the forcats::fct_recode()
function, for example as follows:
new_partyid <- forcats::fct_recode( partyid, "Republican (strong)" = "Strong republican", "Republican (weak)" = "Not str republican", "Independent (near republican)" = "Ind,near rep", "Independent (near democrat)" = "Ind,near dem", "Democrat (weak)" = "Not str democrat", "Democrat (strong)" = "Strong democrat", "Unsure" = "No answer", "Unsure" = "Don't know" ) knitr::kable(table(new_partyid))
In the function forcats::fct_recode()
, we only need to list the categories that we aim at relabeling. There other ones will be left as is. If we attribute the same new label to several categories, the latter will be merged together, as we did for categories "No answer"
and "Don't know"
that we grouped under the category "Unsure"
.
Experiment with the function stringr::str_length()
on the following vector:
```r x <- c("a", "R for data science", NA)
```
What does it do? Did you expect such a result?
If you did not expect this result, what piece of R code would compute what you intended?
```r x <- c("a", "R for data science", NA)
```
Use one or more of the stringr::str_*()
functions to change the following sentence in order to make you happy (but me less):
x <- "Centrale Lyon will win again the Challenge this year, vive Centrale Lyon !"
Explore how to use the function forcats::fct_collapse()
to group categories into the four following macro-categories: other, republicans, independents and democrats.
knitr::kable(table(partyid))
new_partyid <- forcats::fct_collapse( partyid, other = c("No answer", "Don't know", "Other party"), republicans = c("Strong republican", "Not str republican"), independents = c("Ind,near rep", "Independent", "Ind,near dem"), democrats = c("Not str democrat", "Strong democrat") ) knitr::kable(table(new_partyid))
Explore how to use the function forcats::fct_lump()
to automatically group the smallest categories of the relig
variable so that we end up with exactly $10$ categories.
knitr::kable(table(relig))
new_relig <- forcats::fct_lump(relig, n = 10) knitr::kable(table(new_relig))
Binary variables are variables which can take only two outcomes, which can be encoded in R as $0$'s and $1$'s or as a boolean (TRUE
or FALSE
or NA
).
Logical vectors are vectors of booleans that can therefore take only three possible values: FALSE
, TRUE
, and NA
. As a result, they are particularly well suited for representing binary variables. Logical vectors are usually constructed with comparison operators ==
, !=
, <
, <=
, >
, >=
.
Write a piece of code that returns a binary vector with TRUE
entries for pair integers up to $10$ and FALSE
otherwise:
?`%%`
Remember that you can also create binary vectors by hand with c()
:
c(TRUE, TRUE, FALSE, NA)
Logical, integer, real, complex and character vectors all come with their own missing value:
NA # logical NA_integer_ # integer NA_real_ # double NA_complex_ # complex NA_character_ # character
Normally you do not need to know about these different types because you can always use NA
and it will be converted to the correct type using coercion rules internal to R. However, there are some functions that are strict about their inputs so it might be helpful to know about this.
Dates and times are arguably the most complex data types. Just think of the nightmare of determining leap years. It is not only a year divisible by $4$. The full rule has actually three parts. And this is just the beginning of our problems with dates and times (think about time zones, daylight savings times, etc.). Fortunately for us, the lubridate package solves seamlessly all of these problems.
Note that, bottom line, dates and times are nothing but numbers because, internally, they are stored as the number of seconds separating a date/time from a fixed date. Hence, all the things described for numeric vectors will generally be allowed for date/time objects but we strongly recommend to manipulate them through the lubridate package because operation rules on date/time objects are not the same as for classical numbers.
Most of the time, you will create a date
or date-time
object:
date
object from a string is the job of the yms()
function and friends; basically you just need to permute the order of the letters y
, m
and d
according to the order of appearance of the year, month and day in the input string to form a new three-letter function that exists in the lubridate package and will automatically work out the format once you specify the order of the components; functions of the type ymd_hms()
also exist for creating date-time
objects with the same philosophy.date
object from the year, month and day is the job of make_date()
; the function make_datetime()
allows to create date-time
objects.The reverse operations can be done, i.e. it is possible to extract individual components from a date
or date-time
object with the accessor functions year()
, month()
, etc.
Take a look at the help of each accessor functions to find the right syntax for extracting the year, month, day, hours, minutes, seconds of the following date:
datetime <- lubridate::ymd_hms("2016-07-08 12:34:56")
How did you extract the day? What is the difference between the different functions to perform day extraction?
For month()
and wday()
, you can set label = TRUE
to return the abbreviated name of the month or day of the week. Set abbr = FALSE
to return the full name.
datetime <- lubridate::ymd_hms("2016-07-08 12:34:56") lubridate::month(datetime, label = TRUE) lubridate::wday(datetime, label = TRUE, abbr = FALSE)
Is the last output from lubridate::wday()
what you expected? How can you fix this?
You can also use each accessor function to set the components of a date
or date-time
object:
datetime <- lubridate::ymd_hms("2016-07-08 12:34:56") lubridate::year(datetime) <- 2020 datetime lubridate::month(datetime) <- 01 datetime lubridate::hour(datetime) <- lubridate::hour(datetime) + 1 datetime
Alternatively, rather than modifying in place, you can create a new date
or date-time
object with update()
. This also allows you to set multiple values at once.
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
Note that update()
is a method from the stats package, an implementation of which is proposed in the lubridate package for date and date-time objects. Hence, it is not part of the lubridate package and should therefore not be referred to as such.
If values are too big, they will roll-over:
datetime <- lubridate::ymd("2015-02-01") update(datetime, mday = 30) update(datetime, hour = 400)
{style="float: right; margin-left: 10 px; margin-right: 10px;" width="64"}
Atomic vectors can only store booleans, numeric entries or strings and are required to contain only one type of these. In fact, R is very permissive and allows you to mix all types into a vector but results might be unexpected:
c(1, 3, TRUE) c(1, 3, TRUE, "a")
Did you expect this outcome?
Lists are a special kind of vectors that provide a solution to $2$ major issues of atomic vectors:
To create a list, use the function list2()
from the rlang package. For example, using rlang::list2()
instead of c()
in the previous examples allows you to preserve the original data types of the entries:
rlang::list2(1, 3, TRUE) rlang::list2(1, 3, TRUE, "a")
Note however that, while for atomic vectors you use c()
for both creation and concatenation, you cannot do that with rlang::list2()
:
l <- rlang::list2(TRUE, "a") rlang::list2(1, 3, l)
Instead, when working with lists, creation and concatenation are two different operations handled by two different functions: rlang::list2()
for creation and the usual c()
for concatenation:
l <- rlang::list2(TRUE, "a") c(1, 3, l)
This actually means that c()
is in fact the generic function for concatenation. Whenever it detects a list as one of its arguments, it automatically converts non-lists object into length-1 lists and produce a concatenated list. Following this logic, any atomic vector can be viewed as a list. Indeed, a numeric vector (such as 1:10
) is nothing but a list with only numeric entries.
There are two operators for performing list subsetting: [
and [[
. For both operators, you can subset either by position using numeric vectors or by name using character vectors. The difference between the two operators is that
[
always return a list even when a single element of the original list is requested while [[
converts the result into the appropriate simpler R object;[[
however works only for single-element subsetting.l <- rlang::list2(a = 1, b = 2) l[1] l["a"] l[1:2] l[c("a", "b")] l[[1]] l[["a"]]
List manipulation is handled by the purrr package. We invite you to see the webpage of the package for an in-depth presentation of all the operations you can perform on lists. We here focus on the family of functions with the map
prefix. The map
functions transform their input by applying a function to each element (or a subset of elements) and returning a list or a vector of the same length as the input. In particular:
map()
always returns a list;map_lgl()
, map_int()
, map_dbl()
and map_chr()
return an atomic vector of the indicated type.The first argument to these functions is the input list or atomic vector. The second argument is the function to be applied to each element of the list. It can be specified in three ways:
A function
object, in which case it is directly used.
A formula of the form ~ .x + 2
, which defines a lambda function, internally converted by rlang::as_function()
into a proper function. There are three ways to refer to the arguments of a lambda function depending of the number of parameters it takes:
.
;.x
and .y
;..1
, ..2
, ..3
, etc.A string or a number, in which case it is converted to an extractor function. A string performs extraction by name while a number performs extraction by position. If a component is not found, the value of .default
will be returned.
Let see a first example, Say we want to compute the sample means of samples of size $10$ from the normal distribution with unit standard deviation but means varying from $1$ to $10$. We can solve this in a very compact and easy-to-read way using purrr::map*()
functions:
# Either using an explicit function set.seed(123) purrr::map_dbl(purrr::map(1:10, function(x) rnorm(10, x)), mean) # Or a lambda function set.seed(123) purrr::map_dbl(purrr::map(1:10, ~ rnorm(10, .)), mean)
Additional parameters to the map
functions are further arguments to be passed on the mapped function. The mapped input is passed as the first argument of the mapped function that has not yet been assigned by the named additional parameters in the map
function call. Let see that in action to better understand how input mapping is done. Try to reproduce the last example using only the function name in the second argument of the map
call:
set.seed(123)
::: {#map-yourturn-hint}
Hint: Remember to always name your additional arguments in the map
function calls.
:::
purrr::map_dbl(purrr::map(1:10, rnorm, n = 10), mean)
ex() %>% check_function("map") %>% { check_arg(., ".x") %>% check_equal() check_arg(., ".f") %>% check_equal() check_arg(., "n") %>% check_equal() }
The map2
and pmap
families of functions are variants of the map
family but iterate over multiple inputs simultaneously. They are parallel in the sense that each input is processed in parallel with the others, not in the sense of multicore computing.
Data sets are tables that store in each row an observation, such as an individual, and in each column a variable that is observed for such an individual. Variables can be of various type: numerical, categorical, binary or date-time. Hence, a data set can be column-wise heterogeneous. There is a package called tibble that handles data set with the R object tbl_df
that is called a tibble. Bottom line, tibbles are nothing but customized lists where:
nrow()
function on a tibble;names
attribute that is a character vector the same length as the underlying list.tbl_df
objectTo create a tibble, use tibble()
(column-wise creation) or tribble()
(row-wise creation):
tibble()
builds columns sequentially. When defining a column, you can refer to columns created earlier in the call. Only columns of length one are recycled.
r
tibble::tibble(
x = 1:5,
y = x * 2
)
tribble()
will create a list-column if the value in any cell is not a scalar:
r
tibble::tribble(
~x, ~y,
"a", 1:3,
"b", 4:6
)
To coerce a data set already imported as an R object different from tbl_df
(currently supporting coercion from data.frame
, list
, matrix
and table
), use as_tibble
:
l <- rlang::list2(x = c("a", "b"), y = rlang::list2(1:3, 4:6)) l tibble::as_tibble(l)
Finally, if you want to have a compact view of a data set, you can use glimpse()
. Try it on the iris
data set:
tibble::glimpse(iris)
Experiment the different ways to use the [
operator (by position, name, etc.) on tibbles (note that you can now subset both rows and/or columns):
iris_tbl <- tibble::as_tibble(iris)
To subset a single column, you can also use the [[
extractor operator, which will automatically coerce the result into the simplest possible R object representation:
iris_tbl <- tibble::as_tibble(iris) # By position iris_tbl[[1]] # By name iris_tbl[["Sepal.Length"]]
Finally, one can use the $
operator to extract a given column using its name in an unquoted fashion:
iris_tbl$Sepal.Length
It happens sometimes (especially in simulation setups) that you end up with a data set with only numerical variables. In this situation, it might be suitable to use a matrix representation instead of a tibble. There are basic functions for handling matrices in the base package.
To create a matrix from a vector, use matrix()
:
# Fill matrix by columns matrix(data = 1:6, nrow = 2, ncol = 3) # Fill matrix by rows matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
To create a matrix by concatenating rows or columns, use rbind()
and cbind()
respectively:
# Concatenate vectors as rows rbind(1:3, 4:6) # Concatenate vectors as columns cbind(c(1, 4), c(2, 5), c(3, 6))
To subset a matrix, the subset operator [
can be used with two arguments:
To create a matrix from a data set made of numerical variables, use as.matrix()
:
df <- tibble( height = c(1.3, 4.7, 6.3, 7.1), weight = c(43, 65, 12, 78), width = c(34, 75, 23, 54) ) X <- as.matrix(df) X
Note that when converting a data set into a matrix, the column of the matrix are named after the variable names in the data set. You might want to change these names or even remove them. This can be done via the function colnames()
. In particular, assigning a new character vector to colnames(X)
redefines the column names, or assigning NULL
removes them:
colnames(X) <- NULL X
To know the dimensions of the matrix (number of rows and columns), use dim()
:
dim(X)
To transpose a matrix, use t()
:
t(X)
Universal mathematical functions and operators +
, -
, *
and /
can be used without any problem. They just apply the function or the operator element-wise. For operations, matrix dimensions should match. An error will be thrown if it is not the case. The element-wise multiplication is not the usual matrix multiplication. The latter can be performed using the operator %*%
. For example, compute $X^\top X$ and store it in matrix $A$:
A <- t(X) %*% X A
When we will learn about regression analysis, we will make extensive use of solving linear systems of the form $A x = b$. This is handled in an algorithmic efficient way by solve(A, b)
, which is a far better solution than the intuitive $x = A^{-1} b$. If the b
argument is not specified, the function provides the inverse of $A$ (which needs to be squared and invertible):
# Solve Ax = b with b <- rep(1, 3) # Either directly via: solve(A, b) # Or first by finding the inverse of A via: Ainv <- solve(A) # and then multiplying it with b: as.numeric(Ainv %*% b)
The first strategy is much more efficient both in terms of accuracy and computation time. We can for instance benchmark the computation time of the two approaches for a matrix $A$ of size $100 \times 100$:
X <- matrix(rnorm(20000), nrow = 200, ncol = 100) A <- t(X) %*% X b <- rep(1, 100) mbm <- microbenchmark::microbenchmark( solve = solve(A, b), inv = as.numeric(solve(A) %*% b), check = "equal" ) plot(mbm)
To perform matrix exponential, logarithm or powering, use the expm package. It provides the functions expm()
, logm()
and the operator %^%
for matrix powering.
In statistics, a data set of $p$ numerical variables of which we observed $n$ realizations is often represented as an $n \times p$ matrix $X$ where variables are in columns and observations in rows. We can get easily the sample mean using colMeans(X)
and the sample covariance matrix using cov(X)
:
X <- as.matrix(cars) m <- colMeans(X) m S <- cov(X) S
To perform the spectral decomposition of a square matrix, use eigen()
. Here for instance, we could need to compute the spectrum of the sample covariance matrix, i.e. its two eigenvalues and associated eigenvectors:
eigen(S, symmetric = TRUE) eigen(S, symmetric = TRUE, only.values = TRUE)
When the number $p$ of variables (matrix columns) exceeds the number $n$ of observations (matrix rows), the sample covariance matrix is known to be a poor estimator of the true unknown covariance matrix. In particular it is not invertible. The corpcor package provides a number of functions to perform shrinkage estimation of both the covariance and precision matrices.
The iris_tbl
data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for $50$ flowers from each of $3$ species of iris:
iris_tbl <- iris %>% as_tibble() %>% janitor::clean_names() iris_tbl
sepal_length
, sepal_width
and petal_length
and convert it into a matrix that you will call X
.petal_width
and name it y
and compute $\mathbf{b} = X^\top \mathbf{y}$.X <- iris_tbl %>% dplyr::select(1:3) %>% as.matrix() X y <- iris_tbl$petal_width b <- t(X) %*% y A <- t(X) %*% X eigen(A, only.values = TRUE, symmetric = TRUE) beta <- solve(A, b) beta
The tidyverse philosophy adopts a coding style that relies on the pipe operator %>%
available from the magrittr package, which is part of the tidyverse as well (meaning that loading the tidyverse package into your R environment automatically makes the pipe operator available for use). The idea is that you start with an object you intend to modify through some function(s), and pipe it into a function using %>%
. The object is then passed as the first argument of the function. Hence, the code
iris %>% head()
is completely equivalent to
head(iris)
What makes the pipe operator powerful is that one can chain several operations one after the other. This makes code easier to read because:
Piping creates chains where at each position in the chain, the current result of previous transformations of the input object is available through the dot place-holder for further processing. For example, the following code extracts the first three letters of the alphabet and print them in the format position: letter
:
1:3 %>% paste(letters[.], sep = ": ")
Indeed, this is equivalent to:
paste(1:3, letters[1:3], sep = ": ")
because the vector 1:3
is passed twice:
paste()
;letters[1:3]
as second argument of the function paste()
.Exercice 1. Write a piece of code that extracts the even numbers lower than 100 and perform their sum:
?`[`
?`%%`
?sum
1:100 %>% `[`(. %% 2 == 0) %>% sum()
ex() %>% { check_operator(., "%>%", not_called_msg = "You are expected to use chaining with the pipe operator %>%.") check_operator(., "`[`", not_called_msg = "Have you used the `[` subset operator to filter out the odd integers prior to performing the sum?") check_operator(., "%%", not_called_msg = "You did not use the %% modulo operator which is needed for finding out which integer is even") check_code(., ". %% 2", missing_msg = "You misused the %% modulo operator. The current associated code does not filter out odd integers.") check_operator(., "%>%", index = 2, not_called_msg = "You are expected to use at least two pipe operators.") check_function(., "sum") %>% check_arg("...") %>% check_equal(incorrect_msg = "Did you manage to correctly keep only even integers from the first 100 integers prior to calling the sum() function?") check_result(.) %>% check_equal() }
Exercice 2. It is possible to force the pipe not to pass the current result in the chain to the first argument of the next function. This is done by surrounding the next action in the chain with curly brackets. Modify the following code to display only the minimum, mean and maximum summary of the first 10 integers:
1:10 %>% c(min(.), mean(.), max(.))
Exercice 3. Simplify the following code using pipes:
a <- dplyr::filter(mtcars, carb > 1) b <- dplyr::group_by(a, cyl) c <- dplyr::summarise(b, Avg_mpg = mean(mpg)) d <- dplyr::arrange(c, dplyr::desc(Avg_mpg)) d
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.