library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

x <- c("one", "two", "three", "four", "five")
x2 <- c(10, 3, NA, 5, 8, 1, NA)
x3 <- c(abc = 1, def = 2, xyz = 5)

df <- tibble(
  x = 1:3,
  y = c("a", "e", "f"),
  z = runif(3)
)

df0 <- tibble(
  x = c(2, 3, 1, 1, NA),
  y = letters[1:5],
  z = runif(5)
)

df2 <- data.frame(x1 = 1)

tb <- tibble(
  x = 1:4,
  y = c(10, 4, 1, 21)
)

tb2 <- tibble(x1 = 1)

list1 <- list(
  a = 1:3,
  b = "a string",
  c = pi,
  d = list(-1, -5)
)

L <- list(15, 16, 17, 18, 19)
L2 <- list(4, 16, 25, 49, 64)

df4 <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
num_cols <- sapply(df4, is.numeric)

# DK: Delete objects we don't use.

# DK: I had some real annoyance with code like map(paths, readxl::read_excel).
# This works fine if you just Run Document locally. But it fails with R CMD
# check with an error about "Error in `exists(dbname)`: first argument has
# length > 1". Could not solve! Adding purrr:: in this set up chunk worked. Not
# sure what is going on. Perhaps a different map() function is called. But
# tidyverse_conflicts() does not show any problems.

paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files <- purrr::map(paths, readxl::read_excel)
files.2 <- vector("list", length(paths))
This tutorial covers Chapter 27: A field guide to base R from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
In this chapter, we'll focus on four big topics: subsetting with [
, subsetting with [[
and $
, the apply family of functions, and for
loops. To finish off, we'll briefly discuss two essential plotting functions.
The [
bracket can be used to extract sub-components from data frames and vectors, with the syntax x[i]
. x
represents the vector and i
represents the position of the value inside of x
(1st element is position 1, second element is position 2, and so forth).
Load the tidyverse library.
library(...)
library(tidyverse)
There are five main types of things that you can subset a vector with, which will be covered in the following exercises:
1) A vector of positive integers
2) A vector of negative integers
3) A logical vector
4) A character vector
5) Nothing
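As a quick preview of the five kinds of subsetting listed above, here is a minimal sketch. The named vector v is hypothetical and used only for this illustration; it is not one of the tutorial's objects.

```r
# Hypothetical named vector, used only for this illustration.
v <- c(a = 10, b = 20, c = 30, d = 40)

pos   <- v[c(1, 3)]                      # positive integers: keep positions 1 and 3
neg   <- v[-c(1, 3)]                     # negative integers: drop positions 1 and 3
lgl   <- v[c(TRUE, FALSE, FALSE, TRUE)]  # logical vector: keep positions that are TRUE
chr   <- v[c("b", "d")]                  # character vector: select elements by name
all_v <- v[]                             # nothing: return the complete vector
```

The exercises that follow work through each of these cases one at a time.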
Press "Run Code".
x <- c("one", "two", "three", "four", "five")
x <- c("one", "two", "three", "four", "five")
This code sets the variable x
to the vector c("one", "two", "three", "four", "five")
.
Extract the first element of x
by typing x[]
with the number 1
inside the brackets
x[...]
x[1]
As you can see, the code extracts the first value of the x
vector, printing out the value "one".
We can also pass in a vector to []
, containing various positions to extract from x
. In the code chunk below, extract the 2nd, 3rd, and 4th values of x
, using c()
x[c(..., ..., ...)]
x[c(2, 3, 4)]
When you run the code above, you should see that it extracts "two", "three", and "four". By using vectors inside []
, you can extract multiple elements at once.
You can also pass in a vector of negative values. In the code chunk below, type x[]
, placing the vector c(-1, -2)
inside the brackets. Observe what happens.
x[c(..., ...)]
x[c(-1, -2)]
Negative values drop the elements at the specified positions; the code above drops the first and second elements, returning "three", "four", and "five".
Logical vectors are another type of thing that you can subset a vector with. Create a vector with the values 10
, 3
, NA
, 5
, 8
, 1
, and NA
. Save this to a variable named x2
.
... <- ...(10, 3, ..., 5, 8, 1, ...)
x2 <- c(10, 3, NA, 5, 8, 1, NA)
Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.
is.na()
is a function that identifies missing values in vectors, data frames, etc. In the code chunk below, type x2[]
, placing is.na(x2)
inside of the brackets.
x2[...(...)]
x2[is.na(x2)]
This code prints out the missing values (the NA
's) stored inside x2
. Unlike filter()
, NA indices will be included in the output as NA
's.
The modulo operator, %%
, returns the remainder of the division of two numbers. In the code chunk below, type in 1:10 %% 2
and press "Run Code".
1:10 ... 2
1:10 %% 2
This code goes through each integer from 1 to 10 and calculates the remainder when divided by 2.
In the code chunk below, copy the previous code and add == 0
. Observe the output.
1:10 %% 2 ... 0
1:10 %% 2 == 0
As you can see, the output of this code is a series of FALSE
's and TRUE
's. This code goes through each integer (from 1 to 10) and checks whether that number (when divided by 2) has a remainder of 0.
In general, if a number divided by 2 produces a remainder of 0, it is perfectly divisible by 2, which means the number is even. This code checks whether each integer is even, outputting TRUE
if it is perfectly divisible and FALSE
if not.
Knowing how to utilize the modulo operator, let's extract all the even values of x2
. In the code chunk below, type x2[]
. Inside the brackets, type x2 %% 2
and compare that expression to 0
, using the ==
operator.
x2[... %% 2 == ...]
x2[x2 %% 2 == 0]
The %%
operator is used to calculate the remainder of the division of two numbers. So, by placing x2 %% 2 == 0
inside the brackets, the code will search through each element of x2
and return all of the numbers with a remainder of 0 when divided by 2 (thus being an even number). And as mentioned previously, all NA
indices will be included in the output as NA.
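If you do not want those NA values in the result, two common base R idioms are which() and an explicit !is.na() test. The sketch below reuses the x2 vector from this exercise; the names with_na, via_which, and via_is_na are only for illustration.

```r
x2 <- c(10, 3, NA, 5, 8, 1, NA)

# Logical subsetting returns an NA for every NA in the index.
with_na <- x2[x2 %% 2 == 0]

# which() turns the logical vector into the positions that are TRUE,
# dropping the positions where the test is NA.
via_which <- x2[which(x2 %% 2 == 0)]

# An explicit !is.na() test gives the same result.
via_is_na <- x2[!is.na(x2) & x2 %% 2 == 0]
```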
The last kind of vector that you can subset with is a character vector. Using the vector below, let's extract the xyz
element. On a new line, type x3
followed with a pair of square brackets, passing in the string "xyz"
.
x3 <- c(abc = 1, def = 2, xyz = 5)
x3 <- c(abc = 1, def = 2, xyz = 5)
x3["..."]

x3 <- c(abc = 1, def = 2, xyz = 5)
x3["xyz"]
Besides subsetting with integer, logical, and character vectors, you can also subset with nothing. For example, running x[]
returns the complete vector x
.
Sub-setting can not only be used on vectors; it works on data sets too. In the code chunk below, create a tibble (using tibble()
) called df
. The first argument, x
, should contain a range of numbers from 1 to 3. The second argument, y
, should be set to the vector c("a", "e", "f")
. The third argument, z
, should be set to runif(3)
. After completing this, run df
on a new line.
df <- ...( ... = 1:3, y = ...("a", "e", "f"), z = ...(3) )
df <- tibble( x = 1:3, y = c("a", "e", "f"), z = runif(3) )
There are many ways to use [
with data sets, but the most common way is to subset by selecting rows and columns, with the syntax df[row, col]
.
Also, the runif
function is part of the Uniform Distribution set of functions. For more details type ?runif
in the console or go here.
Using df
, let's extract the letter "a". On a new line, type df
followed by a pair of brackets. In the brackets, type 1,2
.
df[... , ...]
df[1,2]
This extracts the element in the 1st row (contains first element for variables x
, y
and z
), and in the 2nd column (variable y
represents this column), which is the letter "a". Because df is a tibble, the result is a 1 x 1 tibble rather than a bare string.
You can also leave the row/column input blank when extracting data from a data set. For example, df[rows, ]
returns the specified row(s) and all columns in the data set, while df[, cols]
returns all rows and the specified column(s) in the data set.
In the code chunk below, extract all the rows in df
, as well as the columns x
and z
.
df[, c("...", "...")]
df[, c("x", "z")]
As you can see, by leaving the first part of the subset blank (the part before the comma), the code returns all of the rows in df
, but only returns columns x
and z
due to the vector inputted after the comma.
There's an important difference between tibbles and data frames when it comes to [
. In the tutorials, we've mainly used tibbles, which are data frames, but tibbles tweak some behaviors to make your life a little easier. In most places, you can use "tibble" and "data frame" interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write data.frame.
Create a data frame with data.frame()
and set x
equal to the integers from 1 to 3. Then save the data frame into a variable named df1
.
Then create a tibble with x
equaling numbers from 1 to 3. Then, save the tibble into df2
.
Lastly, display the data frame and tibble by typing the variables on two different lines.
df1 <- data.frame(x = ...:...)
df2 <- tibble(x = ...:...)
...
...

df1 <- data.frame(x = 1:3)
df2 <- tibble(x = 1:3)
df1
df2
If df
is a data.frame, then df[, cols]
will return a vector if cols
selects a single column and a data frame if it selects more than one column. If df
is a tibble, then [
will always return a tibble.
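With a data.frame you can avoid this ambiguity by passing drop = FALSE. Here is a minimal sketch using only base R; the object names are for illustration.

```r
df1 <- data.frame(x = 1:3, y = c("a", "e", "f"))

one_col_vector <- df1[, "x"]                # a single column is simplified to a vector
one_col_frame  <- df1[, "x", drop = FALSE]  # drop = FALSE keeps it as a data frame
two_cols       <- df1[, c("x", "y")]        # more than one column is always a data frame
```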
Several dplyr verbs are special cases of [
.
Create a tibble with three columns, x
, y
, and z
. For x
, create a vector consisting of the values 2, 3, 1, 1, NA
. For y
, use the letters from a to e (a character vector, not a list). Make z
equal to runif(5)
. Save the tibble into df0
and run that in a new line.
df0 <- tibble(
  x = c(...),
  ... = letters[1:5],
  z = runif(...)
)
df0

df0 <- tibble(
  x = c(2, 3, 1, 1, NA),
  y = letters[1:5],
  z = runif(5)
)
df0
The dplyr
package contains many equivalents of subsetting, such as the filter()
, arrange()
, and select()
functions.
Let's observe the filter()
equivalent. Pipe df0
(with the |>
) to the filter()
function. Pass x > 1
into filter()
.
... |> ...(x > ...)
df0 |> filter(x > 1)
filter()
is equivalent to subsetting the rows with a logical vector, taking care to exclude any missing values. For this scenario, the equivalent subset code would be df0[!is.na(df0$x) & df0$x > 1, ]
. Running this code produces the same result.
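Other base R idioms you may see for the same job are which() and subset(). The sketch below uses a plain data.frame version of df0 (without the random z column) so that it runs without the tidyverse; the result names are for illustration.

```r
df0 <- data.frame(x = c(2, 3, 1, 1, NA), y = letters[1:5])

# Explicitly exclude missing values, as described above.
explicit <- df0[!is.na(df0$x) & df0$x > 1, ]

# which() drops the NA positions as a side effect.
via_which <- df0[which(df0$x > 1), ]

# subset() also drops rows where the condition is NA.
via_subset <- subset(df0, x > 1)

# Without any of these precautions, the NA in x produces an extra row of NAs.
naive <- df0[df0$x > 1, ]
```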
The $
symbol is used to pull out columns from data frames.
arrange()
is equivalent to subsetting the rows with an integer vector, usually created with order(). In the code chunk below, pipe df0
to arrange()
, passing in x, y
. Then, on a new line, paste its equivalent: df0[order(df0$x, df0$y),]
... |> ...(x, y)
df0[order(df0$x, df0$y), ]

df0 |> arrange(x, y)
df0[order(df0$x, df0$y), ]
$
is specialized for access by name.
You can use order(decreasing = TRUE)
to sort all columns in descending order or -rank(col)
to sort columns in decreasing order individually.
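To make the two options concrete, here is a small sketch. It uses a hypothetical data frame without missing values so that the resulting order is easy to follow.

```r
# Hypothetical data, used only for this illustration.
d <- data.frame(x = c(2, 3, 1, 1), y = c("a", "b", "c", "d"))

# Every sort key in descending order.
all_desc <- d[order(d$x, d$y, decreasing = TRUE), ]

# x ascending, y descending: negating the rank reverses one key only.
mixed <- d[order(d$x, -rank(d$y)), ]
```

Here all_desc orders y as b, a, d, c, while mixed orders y as d, c, a, b.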
select()
is equivalent to subsetting columns with a character vector. In the code chunk below, pipe df0
to select()
, passing in x, z
to select()
. Then, on a new line, paste its equivalent: df0[, c("x", "z")]
... |> ...(x, z)
df0[, c("x", "z")]

df0 |> select(x, z)
df0[, c("x", "z")]
filter()
, arrange()
, and select()
are very useful functions that help organize data. You will use these functions quite often when analyzing data.
The single bracket operator, [
, which selects many elements, is paired with the double bracket operator, [[
, and $
, which extract a single element. In this section, we'll show you how to use [[
and $
to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [
and [[
when used with lists.
In the code chunk below, create a tibble. The first column should be x
, which is equal to a range of integers from 1 to 4. The second column should be y
, which is a vector containing 10, 4, 1, and 21. Save this tibble to the name tb
and print it on a new line.
... <- tibble(x = ..:.., y = c(..., ..., ..., ...))
tb

tb <- tibble(x = 1:4, y = c(10, 4, 1, 21))
tb
The double bracket operator, [[
, and dollar sign, $
, can be used to extract columns out of a data frame. [[
can access by position or by name, and $
is specialized for access by name.
Let's extract the elements of column x
. In the code chunk below, type in tb
followed by [[]]
. Inside the inner bracket, type in 1
.
tb[[...]]
tb[[1]]
In this scenario, the [[]]
are being used to return the values by position. By placing the number 1
inside the brackets, the code returns the values in the first position, which are the range of integers stored in column x
.
Now, let's extract the same elements by name. Copy the code above, replacing the code inside the inner bracket with "x"
tb[["..."]]
tb[["x"]]
By entering either the position or the name of the column, you can extract the values it contains.
As mentioned previously, the $
is specialized for accessing columns by name. In the code chunk below, let's extract the elements inside y
. Type in the name of the tibble, tb
, followed by $
and the name of the column, y
.
tb...y
tb$y
As you can see, the $
extracted the values inside column y
, which are 10, 4, 1, and 21.
The $
can also be used to create new columns, which is the base R equivalent of the mutate()
function. In the code chunk below, type in tb$z
, setting it equal to tb$x + tb$y
(use <-
to do this). On a new line, type in tb$z
.
tb$z <- ... + ...
tb$z

tb$z <- tb$x + tb$y
tb$z
There are several other base R approaches to creating new columns including with transform()
, with()
, and within()
. Hadley Wickham, one of the authors of R for Data Science, collected a few examples.
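As a rough sketch of how these three functions compare with $, the code below adds the same column four ways. It uses a plain data.frame version of tb so that it runs in base R alone; the result names are for illustration.

```r
tb <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# Assignment with $ modifies a copy of the data frame in place.
via_dollar <- tb
via_dollar$z <- via_dollar$x + via_dollar$y

# transform() and within() return a new data frame with the extra column.
via_transform <- transform(tb, z = x + y)
via_within    <- within(tb, z <- x + y)

# with() returns only the computed vector, not a data frame.
z_only <- with(tb, x + y)
```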
Let's take a glance at the diamonds data.
Type diamonds
in the code chunk below.
diamonds
Using $
directly is convenient when performing quick summaries. Let's use this to compute some summaries of the diamonds data.
Let's find out the maximum carat in the diamonds data, using the max()
function. In the code chunk below, type in max()
. Inside the parentheses, type in diamonds$carat
.
max(...$...)
max(diamonds$carat)
In this scenario, the code above would be the Base R equivalent of the summarize()
function.
The dplyr package also provides an equivalent to the base R double bracket operator, [[
, and $
called pull()
. pull()
takes either a variable name or variable position and returns just that column.
Let's replicate the code above, using pull()
. Pipe the diamonds
dataframe to pull()
, passing in carat
. Then, continue the pipe to the max()
function.
diamonds |> ...(carat) |> ...()
diamonds |> pull(carat) |> max()
Just like the previous exercise, the code returns the maximum carat, which is 5.01
.
Run levels()
with the argument diamonds$cut
.
levels(...)
levels(diamonds$cut)
We see 5 different levels of cut
: Fair, Good, Very Good, Premium and Ideal. Now let's recreate the code using pull()
Pipe diamonds
to pull()
with the argument cut
. Next pipe it to levels()
.
... |> pull(...) |> levels()
diamonds |> pull(cut) |> levels()
We see the same 5 levels once again here: Fair, Good, Very Good, Premium and Ideal.
An important difference between tibbles and data frames is that tibbles are much more strict when extracting columns with $
.
Create a data.frame()
and set x1
equal to 1
. Save the data frame as df2
. Then on a new line type df2$x
to extract a column from the data frame.
... <- data.frame(...)
...$x

df2 <- data.frame(x1 = 1)
df2$x
Although there is no column named x
in df2
, the code is still able to output the values in column x1
. This is because data frames are able to match the prefix of any variable's name (so-called partial matching) without returning an error if the column doesn't exist.
However, tibbles are much stricter: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn't exist.
Create a tibble()
and set x1
equal to 1
. Save the tibble in tb2
. Then on a new line type tb2$x
to extract a column from the tibble.
... <- tibble(...)
...$x

tb2 <- tibble(x1 = 1)
tb2$x
Since there is no column named exactly x
in the tibble tb2
, the code will print a warning message and NULL
. For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
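Partial matching in a data.frame applies to $ only. A minimal base R sketch of the contrast with [[, which matches names exactly unless you ask otherwise (the result names are for illustration):

```r
df2 <- data.frame(x1 = 1)

partial  <- df2$x                      # $ partially matches x to x1
exact    <- df2[["x"]]                 # [[ matches exactly by default, so this is NULL
opted_in <- df2[["x", exact = FALSE]]  # partial matching must be requested with [[
```

If you work with data.frames a lot, base R has an option, warnPartialMatchDollar, that makes $ warn whenever it falls back to a partial match.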
The double bracket operator [[
and dollar sign $
are really important when working with lists, and have differences compared to the single bracket operator [
.
Using the list()
function, create a list. Set the first list element to a
, which is equal to a range of integers from 1 to 3. Set the second list element to b
, which is equal to the string "a string"
. Set the third list element to c
, which is equal to pi
, and set the fourth list element to d
, which is equal to another list containing -1
and -5
. Then, set the entire list equal to the variable list1
. Type list1
on the next line to see the result.
list1 <- ...(
  ... = 1:3,
  b = "...",
  c = ...,
  ... = list(-1, -5)
)
...

list1 <- list(
  a = 1:3,
  b = "a string",
  c = pi,
  d = list(-1, -5)
)
list1
Remember that the single bracket operator [
is used to extract sub-components, while the double bracket operator [[
is used to extract single elements.
The [
can be used to extract a sub-list. In the code chunk below, type in str()
. Place list1[1:2]
inside the parentheses.
str(list1[...:...])
str(list1[1:2])
As you can see, the code returns a single list with two elements, a
and b
. It doesn't matter how many elements you extract with the single bracket operator, the result will always be a list.
Unlike the single bracket operator, which extracts a sub-list, the double bracket operator, [[
, extracts a single component from a list.
In the code chunk below, type in str()
. Inside the function, type in list1
, followed by double brackets, [[]]
. Inside the double brackets, type in 4
. This will extract the list d
. We can also replace the 4
with "d"
and obtain the same result because [[]]
works with both variable names as well as the element number.
str(list1[[...]])
str(list1[[4]])
As you can see, the double bracket operator, [[
, extracts the contents of element d
in list1
.
Like the double bracket operator, the $
operator also extracts a single component from a list.
In the code chunk below, type in str()
, placing list1
inside the function. Right next to list1
, type in $d
.
str(...$...)
str(list1$d)
With the $
operator, instead of passing in the element's position (ex: 2nd element is position 2), we pass in the name of the element itself, which in this case is d
.
The difference between [
and [[
is particularly important for lists because [[
drills down into the list while [
returns a new, smaller list. Click here to see a visualization of the differences.
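The same difference can be seen directly in code. This sketch reuses list1 from the exercises above; the result names are for illustration.

```r
list1 <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

sub_list <- list1["a"]          # a list of length 1 that contains a
element  <- list1[["a"]]        # the integer vector stored in a
nested   <- list1[["d"]][[2]]   # drill down twice to reach the second item of d
```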
In Chapter 26, you learned tidyverse techniques for iteration like dplyr::across()
and the map family of functions. In this section, you'll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying "map a function over each element of a vector" is "apply a function over each element of a vector". Here we'll give you a quick overview of this family so you can recognize them in the wild.
The most important member of this family is lapply()
. Run ?lapply()
in the console and read the Description section. CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
There's no exact base R equivalent to across()
but you can get close by using [
with lapply()
. This works because under the hood, data frames are lists of columns, so calling lapply()
on a data frame applies the function to each column.
The function sapply()
is very similar to lapply()
. Run ?sapply()
in the console and read the Description section. CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
sapply()
is similar to lapply()
but it always tries to simplify the result, hence the "s" in its name.
Create a tibble where a = 1
, b = 2
, c = "a"
, d = "b"
and e = 4
. Then save the tibble into variable df4
.
df4 <- tibble(..., ..., ..., ..., e = 4)
df4 <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
We can use the sapply()
and lapply()
functions to modify this data.
Let's find all the numeric columns in df4
, using sapply()
. In the code chunk below, type sapply()
, passing in df4
as the first argument and is.numeric
as the second argument. Save the results to the variable num_cols
. Then, on a new line, print num_cols
.
num_cols <- ...(df4, ...)
...

num_cols <- sapply(df4, is.numeric)
num_cols
As mentioned previously, sapply()
always tries to simplify the results, hence producing a logical vector instead of a list in the code above. We don't recommend using it for programming, because the simplification can fail and give you an unexpected type, but it's usually fine for interactive use.
Now, let's transform each column with lapply()
and then replace the original values. In the code chunk below, type in lapply()
. In the first argument, pass in df4[, num_cols, drop = FALSE]
. In the second argument, type in \(x) x * 2
. Save the results to df4[, num_cols]
. Then on a new line, type df4
.
df4[, ...] <- ...(df4[, ..., ... = FALSE], \(x) x * 2)
...

df4[, num_cols] <- lapply(df4[, num_cols, drop = FALSE], \(x) x * 2)
df4
This code transforms each numeric column of df4
to store the original number multiplied by 2. For example, the original value of column "e" was 4. After running the lapply()
code, that number has been changed to 8.
Base R provides a stricter version of sapply()
called vapply()
, short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input.
We can replace the sapply()
code from Exercise 4 with vapply()
, making sure to specify that is.numeric()
returns a logical vector of length 1.
On a new line, type in vapply()
. The first argument should be df4
, the second argument should be is.numeric
, and the third argument should be logical(1)
.
vapply(... , ... , logical(...))
vapply(df4, is.numeric, logical(1))
The distinction between sapply()
and vapply()
is really important when they're inside a function (because it makes a big difference to the function's robustness to unusual inputs), but it doesn't usually matter in data analysis.
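One example of an unusual input is a list with no elements. A minimal sketch of how the two functions differ in that case, and of how vapply() reacts when a result has the wrong type (the result names are for illustration):

```r
# With ordinary input, sapply() simplifies to a logical vector.
sapply_full <- sapply(list(1, "a"), is.numeric)

# With empty input, sapply() has nothing to simplify and returns an empty list.
sapply_empty <- sapply(list(), is.numeric)

# vapply() returns the declared type even when the input is empty.
vapply_empty <- vapply(list(), is.numeric, logical(1))

# vapply() throws an error if any result does not match the declared type.
mismatch <- tryCatch(
  vapply(list(1, "a"), identity, numeric(1)),
  error = function(e) "error"
)
```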
Another important member of the apply family is tapply()
which computes a single grouped summary. For comparison, let's first compute a grouped summary with dplyr on the diamonds
data. In the code chunk below, pipe diamonds
to group_by(cut)
. Then continue the pipe to summarize()
with the argument price = mean(price)
diamonds |> group_by(...) |> summarize(...)
diamonds |> group_by(cut) |> summarize(price = mean(price))
If you want to see how you might use tapply()
or other base techniques to perform other grouped summaries, Hadley has collected a few techniques.
In the code chunk below, type tapply()
. The first argument should be diamonds$price
, the second argument should be diamonds$cut
, and the third argument should be mean
. Press "Run Code" after you are done.
tapply(...$..., ...$..., ...)
tapply(diamonds$price, diamonds$cut, mean)
Unfortunately tapply()
returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it's certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work).
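To give a sense of those gymnastics, here is one possible sketch. It uses the built-in mtcars data instead of diamonds so that it runs without ggplot2; the names means, summary_df, and via_aggregate are for illustration.

```r
# Mean miles per gallon for each number of cylinders.
means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The group labels live in the names, so rebuild a data frame by hand.
summary_df <- data.frame(
  cyl = names(means),
  mpg = as.vector(means)
)

# aggregate() is a base R alternative that returns a data frame directly.
via_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```

Note that the group labels come back as character strings in summary_df, while aggregate() keeps cyl numeric.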
The final member of the apply family is the titular apply()
, which works with matrices and arrays. However, this rarely comes up in data science because we usually work with data frames and not matrices.
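In case you do meet apply() in the wild, here is a minimal sketch on a small hypothetical matrix. The second argument selects the dimension to work over: 1 for rows and 2 for columns.

```r
# Hypothetical 2 x 3 matrix, filled column by column.
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one result per row
col_sums <- apply(m, 2, sum)  # one result per column
```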
for
loops are the fundamental building block of iteration that both the apply() and map() families use under the hood. The structure for for
loops looks like this:
for (element in vector) {
  # do something with element
}
Use c()
to create a vector of values from 15 to 19. Assign that vector to x
. Then use a for
loop to print out each element of vector x
.
... <- c(...)
for (i in x) {
  ...(i)
}

x <- c(15, 16, 17, 18, 19)
for (i in x) {
  print(i)
}
The for
loop iterates through each element in the vector, printing it out.
Type dir()
with "data/gapminder"
as the first argument.
dir("data/...")
dir("data/gapminder")
Recall that dir()
lists all the files in a given location.
Add "\\.xlsx$"
as the value for the pattern
argument to the dir("data/gapminder")
command you used for the previous question.
dir("data/gapminder", pattern = "...")
dir("data/gapminder", pattern = "\\.xlsx$")
This pattern
argument ensures that only files which match the pattern are returned by dir()
. In this case, the file names must end with ".xlsx". The regular expression needs one backslash to "escape" the period, and the R string needs a second backslash to escape the first. Using an R raw string, r"(\.xlsx$)"
, would produce the same answer and is probably easier to read.
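You can check how a pattern behaves without touching the file system by testing it on a few strings with grepl(). The file names below are hypothetical, and raw strings require R 4.0.0 or later.

```r
# Hypothetical file names, used only to test the patterns.
file_names <- c("1952.xlsx", "notes.txt", "my_xlsx_notes.txt", "1957.xlsx.bak")

escaped <- grepl("\\.xlsx$", file_names)    # escaped period, anchored at the end
raw     <- grepl(r"(\.xlsx$)", file_names)  # the same pattern as a raw string
loose   <- grepl(".xlsx", file_names)       # unescaped and unanchored
```

Only the first name matches the escaped pattern. The loose pattern also matches the third and fourth names, because an unescaped period matches any character and nothing forces the match to the end of the name.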
Add TRUE
as the value for the full.names
argument to the dir("data/gapminder", pattern = "\\.xlsx$")
command you used for the previous question.
dir("data/gapminder", pattern = "\\.xlsx$", full.names = ...)
dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
Setting full.names
to TRUE
provides the relative directory path to the files which match the pattern
and are located in the path
, which is the first argument to dir()
.
Assign the result of the dir()
command to a new variable, paths
.
... <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
Whenever we need to do the same thing to a bunch of files, we first need to create a list of those files, along with their locations.
Let's take a look at one of these files. Use readxl::read_excel()
with the path
argument set to the last file, which is "data/gapminder/2007.xlsx"
.
readxl::read_excel(... = "data/gapminder/2007.xlsx")
readxl::read_excel(path = "data/gapminder/2007.xlsx")
This returns a tibble with 142 rows, each corresponding to a different country. But we don't want to have to type out this command for every file. That would be a bother with the 12 files we have in this case. It would be impossible with hundreds or thousands of files.
Because we have all the locations of the files stored in the paths
vector, we can use (elements from) it instead. Run the same readxl::read_excel()
again but, instead of providing the location by hand, use paths[12]
as the argument to path
.
readxl::read_excel(path = ...)
readxl::read_excel(path = paths[12])
Keep track of which name refers to what. path
is the first argument in the readxl::read_excel()
function. paths
is a character vector, which we created with dir()
, of paths to the gapminder data.
Use the paths
object you created as the first argument to map()
. Set the second argument to readxl::read_excel
. Remember not to include the parentheses when passing a function to map()
.
...(paths, readxl::...)
# Not sure why including this test makes for an error. Code does work for
# students. So, just leaving it commented out for now.
# map(paths, readxl::read_excel)
map()
takes the function in the second argument and applies it to each element of the vector in the first argument, returning a list of the same length as that vector. The last element of the list is a tibble with 142 rows, the same object as we created when we applied readxl::read_excel()
to the last element of paths
.
It is useful to see how we might accomplish this same task "by hand," using an explicit for
loop. Start by creating an object in which we can store the results. Run this code.
files.2 <- vector("list", length(paths))
files.2

files.2 <- vector("list", length(paths))
files.2
files.2
is a list of length 12. We will store one tibble in each element of files.2
.
Run seq_along()
with paths
as its argument.
seq_along(...)
seq_along(paths)
Using the indices is important because it allows us to link each position in the input with the corresponding position in the output. seq_along()
is the best method for generating those indices because of how well it handles edge cases, like a zero length input vector.
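To see the edge case concretely, compare seq_along() with the tempting alternative 1:length() on an empty vector. This is a minimal sketch; the object names are for illustration.

```r
empty <- character(0)

good <- seq_along(empty)   # integer(0), so a loop over it never runs
bad  <- 1:length(empty)    # 1:0 counts down, giving the two values 1 and 0

# Count how many times each loop body runs.
runs_good <- 0
for (i in seq_along(empty)) runs_good <- runs_good + 1

runs_bad <- 0
for (i in 1:length(empty)) runs_bad <- runs_bad + 1
```

With 1:length() the loop body runs twice on an empty input, which usually leads to an error or a nonsense result further down.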
Write a for()
loop. The argument in the ()
after for
should be i in seq_along(paths)
. The code in the body of the for()
loop should be files[[i]] <- readxl::read_excel(paths[i])
.
for (... in seq_along(paths)) { ... <- readxl::read_excel(paths[...]) }
for (i in seq_along(paths)) { files[[i]] <- readxl::read_excel(paths[i]) }
The trickiest part of this code is that we have files[[i]]
with two pairs of nested brackets and paths[i]
with only one pair. The reason for the difference is that paths
is a simple vector. We just need one pair of brackets to access each element of a vector. files
, on the other hand, is a list and, therefore, needs nested brackets.
To combine the list of tibbles into a single tibble you can use do.call()
and rbind()
. Run do.call()
with two arguments: rbind
and files
.
...(rbind, ...)
do.call(rbind, files)
Functions from the purrr package have largely replaced do.call()
in modern R code.
Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece. Run this code.
out <- NULL
for (path in paths) {
  out <- rbind(out, readxl::read_excel(path))
}

out <- NULL
for (path in paths) {
  out <- rbind(out, readxl::read_excel(path))
}
We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is.
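The two patterns can be compared side by side without reading any files. In this sketch, make_piece() is a hypothetical helper that stands in for readxl::read_excel(), and years is a made-up input vector.

```r
years <- c(1952, 1957, 1962)

# Hypothetical stand-in for reading one file.
make_piece <- function(year) data.frame(year = year, n = 1)

# Pattern to avoid: the result is copied and regrown on every iteration.
grown <- NULL
for (year in years) {
  grown <- rbind(grown, make_piece(year))
}

# Preferred pattern: fill a preallocated list, then combine once at the end.
out <- vector("list", length(years))
for (i in seq_along(years)) {
  out[[i]] <- make_piece(years[i])
}
combined <- do.call(rbind, out)
```

Both versions produce the same rows. The difference only shows up in running time as the number of pieces grows.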
Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they're so concise: it takes very little typing to do a basic exploratory plot.
There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with plot()
and hist()
respectively.
Run ?hist()
in the console and look at the Description section. CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Let's use hist()
and the diamonds
data to create a histogram. In the code chunk below, type in hist()
, passing in diamonds$carat
.
hist(... $ ...)
hist(diamonds$carat)
As you can see, this code creates a basic histogram of the data in the carat
column. The hist()
function is a quick and easy way to create a histogram of your data.
Now let's use the plot()
function. The plot()
function creates a scatterplot of the specified data.
In the code chunk below, type in plot()
, passing in diamonds$carat
and diamonds$price
.
plot(...$... , ...$...)
plot(diamonds$carat, diamonds$price)
Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using $
or some other technique.
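One such technique is with(), which evaluates an expression with the columns of a data frame in scope. The sketch below uses the built-in mtcars data so that it runs without ggplot2. The pdf(NULL) and dev.off() calls are only there so the code runs without opening a window or writing a file; you can leave them out in an interactive session.

```r
pdf(NULL)  # open a graphics device that draws nowhere

# Pull the columns out with $ ...
plot(mtcars$wt, mtcars$mpg)

# ... or let with() look the columns up for you.
with(mtcars, plot(wt, mpg))

# hist() can also return its bins without drawing anything.
h <- hist(mtcars$mpg, plot = FALSE)

invisible(dev.off())  # close the device again
```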
This tutorial covered Chapter 27: A field guide to base R from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
In this tutorial, you have learned:
how to subset with the single bracket operator, [
how to subset with the double bracket operator, [[
and dollar sign, $
how to use functions from the apply family, such as lapply()
, sapply()
, vapply()
, and tapply()
how to use for
loops as a form of iteration
how to create plots without tidyverse and ggplot2