Numbers
In r4ds.tutorials: Tutorials for "R for Data Science"

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local") 

df <- tribble(
  ~x, ~y,
  1,  3,
  5,  2,
  7, NA,
)

x <- c(1, 2, 3, 4, NA)
ranktypes <- tibble(x = x)

numbers <- tibble(id = 1:10)

times_visited <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

repetition <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)

Introduction

This tutorial covers Chapter 13: Numbers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will use two core packages of the Tidyverse, readr and dplyr. Key functions in this tutorial include parse_double() for parsing numbers directly from strings, parse_number() for parsing numbers from strings while ignoring non-numeric characters, count() which counts the unique values of one or more variables, pmin() and pmax() which take two or more vectors and return their element-wise minima or maxima, round() which rounds values in its first argument to the specified number of decimal places, and min_rank() which ranks the values of a vector and gives tied values the same rank.

Making Numbers

In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.

Exercise 1

This chapter mostly uses functions from base R, which are available without loading any packages. But we still need tidyverse functions like mutate() and filter().

Use library() to load in the tidyverse package.

library(...)

library(tidyverse)

The readr package is one of the nine core packages in the Tidyverse. It provides two useful functions for parsing strings into numbers: parse_double() and parse_number().

Exercise 2

In many of the exercises in this tutorial, we will provide the code for creating an example object. You will then just add code for working with that object. Don't delete the object creation code! Below, we create an object x.

On the next line, run parse_double() on x.

x <- c("1.2", "5.6", "1e3")

x <- c("1.2", "5.6", "1e3")
parse_double(...)

x <- c("1.2", "5.6", "1e3")
parse_double(x)

This should return an output of: #> [1] 1.2 5.6 1000.0. parse_double() works well on regular numbers. You can use parse_integer() if all the inputs are integers.

Exercise 3

Use parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages.

Use parse_number() with y as the argument to ignore this non-numeric text.

y <- c("$1,234", "USD 3,513", "59%")

y <- c("$1,234", "USD 3,513", "59%")
parse_number(...)

y <- c("$1,234", "USD 3,513", "59%")
parse_number(y)

The result is #> [1] 1234 3513 59. Note how parse_number() returns only the underlying number, ignoring all the other characters in the inputs.
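
To make the difference between the two parsers concrete, here is a minimal sketch that runs both on the same strings. It assumes readr is installed; the variable names prices, cleaned, and strict are illustrative only.

```r
library(readr)

prices <- c("$1,234", "USD 3,513", "59%")  # same strings as in the exercise

# parse_number() drops the non-numeric text around each number
# (with the default locale, "," is treated as a grouping mark).
cleaned <- parse_number(prices)
cleaned

# parse_double() expects a bare number, so these inputs fail to parse
# and come back as NA. readr reports the failures with a warning,
# which is silenced here to keep the output short.
strict <- suppressWarnings(parse_double(prices))
strict
```

In practice, parse_double() is the stricter choice when you want malformed input to surface as NA, while parse_number() is the forgiving one.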

Counts

It’s surprising how much data science you can do with just counts and a little basic arithmetic, so the dplyr package strives to make counting as easy as possible with count(). This function is great for quick exploration and checks during analysis.

Exercise 1

Use library() to load the nycflights13 package.

library(...)

library(nycflights13)

This dataset from the nycflights13 package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013.

Exercise 2

Type in flights and hit "Run Code."

flights

flights

Run ?flights to pull up the help page in order to learn more about the data.

Exercise 3

Let's count the number of flights to each destination. Pipe flights to the count() function. Within the call to count(), put dest.

flights |> count(dest)

flights |> count(dest)

We usually put count() on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.

Exercise 4

If you want to see the most common values, use the previous pipe and add the parameter sort = TRUE to the function count().

flights |> count(dest, sort = ...)

flights |> count(dest, sort = TRUE)

To see all the values, you can use |> View() or |> print(n = Inf).

Exercise 5

Pipe flights to summarize(), within which set the argument n equal to n().

flights |> 
  ...(
    n = n())

flights |> 
  summarize(
    n = n())

We are using two totally different n's here. The first n is the name of a variable which we are creating. We could use any variable name we want. The second n, in n(), is the name of a function which returns the number of rows in the current group (here, the whole data set).

Exercise 6

Using the same code, add another argument to summarize(): delay = mean(arr_delay, na.rm = TRUE).

flights |> 
  summarize(
    n = n(),
    delay = ...(arr_delay, ... = TRUE))

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE))

n() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs like summarize().

Exercise 7

Using the same code, add another argument to summarize(): carriers = n_distinct(carrier). Don't forget to separate out each of the arguments in summarize() using a comma.

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = ...(carrier))

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = n_distinct(carrier))

n_distinct(x) counts the number of distinct (unique) values of one or more variables. It is part of the same family of functions as count() and n().

Exercise 8

Using the same code, add another argument to summarize(): .by = dest.

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = n_distinct(carrier),
    ... = dest)

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = n_distinct(carrier),
    .by = dest)

The .by argument modifies summarize() so that it does the same calculation for each value of dest. Until now, it has been showing us the results for the entire data set.

Exercise 9

Continue the pipe by adding arrange(desc(carriers)).

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = n_distinct(carrier),
    .by = dest) |> 
  ...(desc(...))

flights |> 
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE),
    carriers = n_distinct(carrier),
    .by = dest) |> 
  arrange(desc(carriers))

Keep track of variable names. carrier is the variable in flights which tells us the individual carrier for a specific flight. carriers is a variable we created within summarize(). If we did not create it, it would not be available for use by desc() and arrange().

Exercise 10

Start a new pipe by starting with flights again and then adding summarize(miles = sum(distance), .by = tailnum).

flights |> 
  summarize(miles = sum(distance), .by = tailnum)

This is the distance that each plane flew. In a sense, we have count()'d those miles by using sum().

Exercise 11

We can accomplish the same goal by using the wt argument to count(). Pipe flights to count(tailnum, wt = distance).

flights |> count(tailnum, wt = distance)

The answers are the same, although to confirm that claim you would need to compare the resulting tibbles since the two commands output the data in different orders.
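
One way to carry out that comparison is sketched below on a small made-up tibble rather than the full flights data. It assumes dplyr 1.1.0 or later (for .by); toy_flights and its values are hypothetical.

```r
library(dplyr)

# Hypothetical miniature of flights: tail numbers and distances in miles.
toy_flights <- tibble(
  tailnum  = c("N2", "N1", "N2", "N3", "N1"),
  distance = c(100, 250, 300, 80, 50)
)

# summarize() with .by keeps groups in order of first appearance,
# so sort by tailnum before comparing.
by_sum <- toy_flights |>
  summarize(miles = sum(distance), .by = tailnum) |>
  arrange(tailnum)

# count() already returns its groups sorted; name = "miles" renames
# the default output column n so the two results line up.
by_count <- toy_flights |>
  count(tailnum, wt = distance, name = "miles")

# Compare column by column.
same <- identical(by_sum$tailnum, by_count$tailnum) &&
  isTRUE(all.equal(by_sum$miles, by_count$miles))
same
```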

Exercise 12

Pipe flights to summarize(n_cancelled = sum(is.na(dep_time)), .by = dest).

flights |> 
  summarize(n_cancelled = sum(is.na(dep_time)),
            .by = dest)

This code counts the missing values by combining sum() and is.na(). In the flights dataset this represents flights that are cancelled.

Numeric transformations

This section will go over transforming numeric vectors with the function mutate() and using various mathematical methods to make new columns.

In an operation like flights |> mutate(air_time = air_time / 60), the two sides of / have different lengths: there are 336,776 individual numbers on the left but only one number on the right. This would normally be a problem, but R handles these mismatched lengths by recycling, or repeating, the shorter vector.

By the end of this section we will be making a plot that looks like this:

plots <- flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct() |> 
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line() + 
  geom_point(aes(size = n)) 

plots

Exercise 1

On a new line, divide the vector x by 5.

x <- c(1, 2, 10, 20)

x <- c(1, 2, 10, 20)
x / ...

x <- c(1, 2, 10, 20)
x / 5

This operation is shorthand for x / c(5, 5, 5, 5). Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector.

Exercise 2

Multiply the vector x by the vector c(1, 2, 3).

x <- c(1, 2, 10, 20)

x <- c(1, 2, 10, 20)
x * c(...)

x <- c(1, 2, 10, 20)
x * c(1, 2, 3)

R usually (but not always) gives you a warning if the length of the longer vector isn’t a multiple of the length of the shorter one.

Exercise 3

Create a new pipeline. Pipe flights to the filter() function, and add an argument that checks if month equals c(1, 2).

flights |> 
  filter(month == c(...))

flights |> 
  filter(month == c(1, 2))

The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because flights has an even number of rows.
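
The pitfall is easier to see on a short vector. The sketch below uses a made-up month vector as a stand-in for the column in flights, and contrasts == with base R's %in% operator, which is one common way to express "January or February" (the tutorial text above does not itself name an alternative).

```r
month <- c(1, 2, 1, 2, 2, 1)  # toy stand-in for flights$month

# c(1, 2) is recycled to c(1, 2, 1, 2, 1, 2), so this is a
# position-by-position comparison, not a membership test.
recycled <- month == c(1, 2)
recycled

# %in% asks "is each month one of these values?" instead.
member <- month %in% c(1, 2)
member
```

The last two elements are FALSE under == only because of where they sit in the vector, which is exactly the odd-row/even-row effect described above.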

Exercise 4

Most arithmetic functions work with pairs of variables. Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row.

Let's utilize a premade dataframe df. Type in df and hit "Run Code."

df

df

Exercise 5

Create a new pipeline. Pipe df to mutate(). Within the call to mutate(), create a new variable called min and set it to pmin(x, y).

df |> 
  mutate(
    min = ...)

df |> 
  mutate(
    min = pmin(x, y)
  )

The min variable is NA in the third row because y is missing there: by default, pmin() returns NA whenever any of its inputs is NA.

Exercise 6

Using the previous pipe, add the argument na.rm to the pmin() function call within the mutate() function and set it to TRUE.

df |> 
  mutate(
    min = pmin(x, y, na.rm = ...)
  )

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE)
  )

If na.rm is FALSE an NA value in any of the arguments will cause a value of NA to be returned, otherwise NA values are ignored.

Exercise 7

Using the previous pipeline, within the call to the mutate() function, create a new variable max and set it to pmax(x, y).

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE), 
    max = ...)

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE), 
    max = pmax(x, y)
  )

The max variable is NA in the third row because y is missing there and na.rm has not yet been set for pmax().

Exercise 8

Using the previous pipe, add the argument na.rm to the pmax() function call within the mutate() function and set it to TRUE.

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE), 
    max = pmax(x, y, na.rm = ...)
  )

df |> 
  mutate(
    min = pmin(x, y, na.rm = TRUE), 
    max = pmax(x, y, na.rm = TRUE)
  )

Note that these are different to the summary functions min() and max() which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value.
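
A short base R sketch of the difference, using the same values as df but as plain vectors:

```r
x <- c(1, 5, 7)
y <- c(3, 2, NA)  # same values as the x and y columns of df

# Parallel versions: one result per position (per row).
row_min <- pmin(x, y, na.rm = TRUE)
row_max <- pmax(x, y, na.rm = TRUE)

# Summary versions: a single result across everything supplied.
all_min <- min(x, y, na.rm = TRUE)
all_max <- max(x, y, na.rm = TRUE)

row_min
all_min
```

Inside mutate(), the single value from min() or max() would be recycled to every row, which is why identical values in every row are a sign of the wrong form.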

Exercise 9

Divide the vector z by 3 using integer division (%/%).

z <- 1:10

z <- 1:10
z ... 3

z <- 1:10
z %/% 3

We can also find the remainder using %%.

Exercise 10

Compute the remainder of the vector z divided by 3 by typing in z %% 3 and hitting "Run Code."

z <- 1:10

z <- 1:10
z ... 3

z <- 1:10
z %% 3

Exercise 11

Create a new pipeline. Pipe flights to the function mutate(). Within the mutate() function, create a new variable called hour and set it to sched_dep_time %/% 100.

flights |> 
  mutate(
    hour = ...)

flights |> 
  mutate(
    hour = sched_dep_time %/% 100)

You can read more about other arithmetic operators by running ?Arithmetic in the Console.

Exercise 12

Using the previous code, within the mutate() function, create a new variable called minute and set it to sched_dep_time %% 100.

flights |> 
  mutate(
    hour = sched_dep_time %/% 100, 
    minute = ...)

flights |> 
  mutate(
    hour = sched_dep_time %/% 100, 
    minute = sched_dep_time %% 100)

Modular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute.
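
As a small worked example, the sketch below unpacks a few made-up HHMM values (not taken from flights) and then recombines them to check that nothing was lost.

```r
sched_dep_time <- c(515L, 1340L, 2359L)  # illustrative HHMM values

hour   <- sched_dep_time %/% 100L  # integer division keeps the leading digits
minute <- sched_dep_time %% 100L   # remainder keeps the last two digits

hour
minute

# Recombining the parts gives back the original values.
rebuilt <- hour * 100L + minute
rebuilt
```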

Exercise 13

Using your previous code, within the mutate() function, add an argument called .keep and set it to "used".

flights |> 
  mutate(
    hour = sched_dep_time %/% 100, 
    minute = sched_dep_time %% 100, 
    .keep = ...
  )

flights |> 
  mutate(
    hour = sched_dep_time %/% 100, 
    minute = sched_dep_time %% 100, 
    .keep = "used"
  )

Read more about the .keep argument in ?mutate.

Exercise 14

Let's use all of the stuff we've learnt to create a graph. This is what the graph is supposed to look like.

plots

Create a new pipeline. Pipe flights to the function mutate(). Within the call to mutate(), create a new variable called hour and set it to sched_dep_time %/% 100.

flights |> 
  mutate(... = sched_dep_time %/% ...)

flights |> 
  mutate(hour = sched_dep_time %/% 100)

We can combine that with a trick using mean(is.na(x)) to see how the proportion of cancelled flights varies over the course of the day.

Exercise 15

Copy your previous code. Within the call to mutate(), create a new variable called prop_cancelled and set it to mean(is.na(dep_time)).

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = ...(is.na(...)))

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)))

is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else, which we can use to find all the rows with a missing dep_time.
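
The sum(is.na(x)) and mean(is.na(x)) tricks both rely on TRUE counting as 1 and FALSE as 0. A minimal sketch on a made-up vector:

```r
dep_time <- c(517, NA, 533, NA)  # toy values; NA stands for a cancelled flight

missing <- is.na(dep_time)
missing

n_cancelled    <- sum(missing)   # TRUE counts as 1, so this counts the NAs
prop_cancelled <- mean(missing)  # the mean of 0s and 1s is a proportion

n_cancelled
prop_cancelled
```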

Exercise 16

Copy your previous code. Within the call to mutate(), create another variable called n and set it to n().

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = ...())

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n())

n() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs like mutate() and summarize().

Exercise 17

Copy your previous code. Within the call to mutate(), add an argument called .by and set it to hour, so that we can group by hour.

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = ...)

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour)

Exercise 18

Copy your previous code. Add the function filter() to the pipeline. Within the call to filter() create a new argument that checks if the variable hour is > 1.

... |> 
  filter(... > 1)

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1)

Exercise 19

Continue the pipe to select(), with the columns hour, prop_cancelled, and n as arguments.

... |>
  ...(hour, ..., n)

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n)

Exercise 20

Now continue the pipe to distinct().

... |>
  ...()

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct()

Recall that the distinct() function removes any duplicate rows.

Exercise 21

Copy your previous code. Add the function ggplot() to the pipeline. Within the function ggplot(), map x to hour and y to prop_cancelled using aes().

... |> 
  ggplot(aes(x = ..., y = ...))

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct() |> 
  ggplot(aes(x = hour, y = prop_cancelled))

The plot shows only empty axes because we have not yet provided a geom.

Exercise 22

Using your previous code, add the function geom_line() to the pipeline.

... +
  ...()

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct() |> 
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line()

Don't forget to use + instead of |> to separate out the component parts of your ggplot() object.

Exercise 23

Using your previous code, add the function geom_point() to the pipeline.

... + 
  geom_...()

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct() |> 
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line() + 
  geom_point()

Exercise 24

Copy your previous code. Within the function geom_point() map size to n.

... + 
  geom_point(aes(size = ...))

flights |> 
  mutate(hour = sched_dep_time %/% 100, 
         prop_cancelled = mean(is.na(dep_time)), 
         n = n(), 
         .by = hour) |> 
  filter(hour > 1) |>
  select(hour, prop_cancelled, n) |>
  distinct() |> 
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line() + 
  geom_point(aes(size = n))

What we have here is a line plot with scheduled departure hour on the x-axis and the proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.

Reminder: Your plot should look somewhat like this.

plots

Other number functions

This section will allow you to practice with many other number functions with many different uses.

Exercise 1

Run ?log in the Console and copy paste the Description below. (Don't worry about formatting)

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 6)

Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and converting exponential growth to linear growth.

Exercise 2

Use the function round() with the argument 123.456 and hit "Run Code."

round(...)

round(123.456)

The function round(x) rounds a number to the nearest integer. You can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-digits so digits = 2 will round to the nearest 0.01.

Exercise 3

Copy your previous code. Add the digits argument, with a value of 2, to the round() function.

round(123.456, digits = ...)

round(123.456, digits = 2)

With digits = 2, 123.456 is rounded to the nearest 0.01, giving 123.46.

Exercise 4

Copy your previous code and change the digits argument to -2.

round(123.456, digits = ...)

round(123.456, digits = -2)

The -2 argument in the round() function will round the number 123.456 to the nearest hundred which would be 100 in this case.

Exercise 5

Type in round(c(1.5, 2.5)) and hit "Run Code."

round(...)

round(c(1.5, 2.5))

round() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.
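
The unbiasedness claim can be checked on a few exact halves. The sketch below compares round() with a hand-written "round half up" rule; the floor(x + 0.5) expression is only an illustration for non-negative inputs, not something base R provides under that name.

```r
halves <- c(0.5, 1.5, 2.5, 3.5)

# Base R: ties go to the even neighbour.
half_even <- round(halves)
half_even

# Illustrative "round half up" for non-negative values.
half_up <- floor(halves + 0.5)
half_up

# Round-half-even preserves the mean of these values; half-up inflates it.
mean(halves)
mean(half_even)
mean(half_up)
```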

Exercise 6

The functions floor() and ceiling() are paired with round(). Use the function floor() with the numerical argument of 123.456.

floor(...)

floor(123.456)

The function floor() will always round down to the nearest integer.

Exercise 7

Type in ceiling() with the numerical argument of 123.456.

ceiling(...)

ceiling(123.456)

The function ceiling() always rounds up to the nearest integer.

Exercise 8

Make a new variable x and set it to the number 123.456.

x <- ...

x <- 123.456

The floor() and ceiling() functions don’t have a digits argument, so you can instead scale down, round, and then scale back up.

Exercise 9

Let's round down to two decimal places (the nearest 0.01). Type in floor(x / 0.01) * 0.01 and hit "Run Code."

x <- 123.456

x <- 123.456
floor(...) * ...

x <- 123.456
floor(x / 0.01) * 0.01

Exercise 10

Let's round up to two decimal places (the nearest 0.01). Type in ceiling(x / 0.01) * 0.01 and hit "Run Code."

x <- 123.456

x <- 123.456
ceiling(...) * ...

x <- 123.456
ceiling(x / 0.01) * 0.01

You can use the same technique if you want to round() to a multiple of some other number.
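
The scale, round, and scale-back pattern can be wrapped in small helpers. The functions round_to(), floor_to(), and ceiling_to() below are hypothetical names introduced for this sketch; they are not part of base R or dplyr.

```r
# Hypothetical helpers: divide by the accuracy, round, multiply back.
round_to   <- function(x, accuracy) round(x / accuracy) * accuracy
floor_to   <- function(x, accuracy) floor(x / accuracy) * accuracy
ceiling_to <- function(x, accuracy) ceiling(x / accuracy) * accuracy

x <- 123.456

down_cents <- floor_to(x, 0.01)    # round down to the nearest 0.01
up_cents   <- ceiling_to(x, 0.01)  # round up to the nearest 0.01
to_four    <- round_to(x, 4)       # nearest multiple of 4
to_quarter <- round_to(x, 0.25)    # nearest multiple of 0.25

c(down_cents, up_cents, to_four, to_quarter)
```

Because the results pass through floating-point multiplication, compare them with all.equal() rather than == if you need to test for a specific value.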

Exercise 11

Let's round to the nearest multiple of 4. Type in round(x / 4) * 4 and hit "Run Code."

x <- 123.456

x <- 123.456
round(...) * ...

x <- 123.456
round(x / 4) * 4

Exercise 12

This time let's round to the nearest 0.25. Type in round(x / 0.25) * 0.25 and hit "Run Code."

x <- 123.456

x <- 123.456
round(...) * ...

x <- 123.456
round(x / 0.25) * 0.25

Exercise 13

Use cut() on the numeric vector x and add in the breaks argument and set breaks to c(0, 5, 10, 15, 20).

x <- c(1, 2, 5, 10, 15, 20)

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = ...)

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))

Use cut() to break up (aka bin) a numeric vector into discrete buckets.

Exercise 14

The breaks don't need to be evenly spaced. Copy the previous code and change the breaks argument and set it to c(0, 5, 10, 100).

x <- c(1, 2, 5, 10, 15, 20)

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = ...)

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 100))

You can optionally supply your own labels. Note that there should be one fewer label than breaks.

Exercise 15

Copy your previous code and add a new argument called labels and set it to c("sm", "md", "lg").

x <- c(1, 2, 5, 10, 15, 20)

x <- c(1, 2, 5, 10, 15, 20)
cut(x, 
    breaks = ...,
    labels = c(...))

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 100), 
    labels = c("sm", "md", "lg"))

Any values outside of the range of the breaks will become NA. Run ?cut in the Console to see the documentation for other useful arguments like right and include.lowest, which control whether the intervals are [a, b) or (a, b] and whether the lowest interval should be [a, b].
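
To see what those arguments change, here is a sketch that bins the same values three ways. The vector is chosen for illustration so that it includes the lowest break (0), values that sit exactly on a break, and one value beyond the last break.

```r
x      <- c(0, 1, 5, 10, 15, 200)
breaks <- c(0, 5, 10, 100)
labels <- c("sm", "md", "lg")

# Default: intervals are (0, 5], (5, 10], (10, 100].
# 0 and 200 fall outside every interval and become NA.
default_bins <- cut(x, breaks = breaks, labels = labels)

# right = FALSE: intervals are [0, 5), [5, 10), [10, 100),
# so 5 and 10 each move up one bucket and 0 is now included.
left_closed <- cut(x, breaks = breaks, labels = labels, right = FALSE)

# include.lowest = TRUE: the first interval becomes [0, 5].
with_lowest <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

data.frame(x, default_bins, left_closed, with_lowest)
```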

Exercise 16

Make a new vector x and set it to 1:10.

x <- ...

x <- 1:10

The function cumsum() computes a running (cumulative) sum: each element of the result is the sum of all elements of the vector up to and including that position.

Exercise 17

Use the function cumsum() on x.

x <- 1:10

x <- 1:10
cumsum(...)

x <- 1:10
cumsum(x)

Base R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice.

General transformations

This section will utilize the package dplyr and allow you to make use of transformations on an actual data frame.

Exercise 1

We provide a new x vector. On the next line, run min_rank() on x.

x <- c(1, 2, 3, 4, NA)

x <- c(1, 2, 3, 4, NA)
min_rank(...)

x <- c(1, 2, 3, 4, NA)
min_rank(x)

dplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank(). It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.

Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks.

Exercise 2

Run min_rank() on desc(x).

x <- c(1, 2, 3, 4, NA)

x <- c(1, 2, 3, 4, NA)
min_rank(...)

x <- c(1, 2, 3, 4, NA)
min_rank(desc(x))

If min_rank() doesn’t do what you need, look at the variants dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist(). See the documentation for details.
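
The variants only differ when there are ties. The sketch below assumes a vector with one tie (2 appears twice), which is not the vector used in the exercises, and uses made-up column names so that they do not clash with the function names.

```r
library(dplyr)

tied <- tibble(x = c(1, 2, 2, 3, 4, NA))  # illustrative vector with one tie

ranked <- tied |>
  mutate(
    rank_min      = min_rank(x),        # ties share the lowest rank, then a gap
    rank_dense    = dense_rank(x),      # ties share a rank, no gap afterwards
    rank_row      = row_number(x),      # ties are broken by position
    rank_min_desc = min_rank(desc(x))   # largest value gets rank 1
  )

ranked
```

In all four columns the NA stays NA.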

Exercise 3

Type in ranktypes and hit "Run Code."

ranktypes

ranktypes

Right now, ranktypes is just a tibble which includes the vector x.

Exercise 4

Pipe ranktypes to the mutate() function. Within the call to mutate(), create a variable row_number which equals row_number(x).

ranktypes |> 
  mutate(row_number = ...)

ranktypes |> 
  mutate(row_number = row_number(x))

Exercise 5

Using the same pipe as above, create another variable, within the call to mutate(), called dense_rank equal to dense_rank(x).

ranktypes |> 
  mutate(row_number = row_number(x), 
         dense_rank = ...)

ranktypes |> 
  mutate(row_number = row_number(x),
         dense_rank = dense_rank(x))

Exercise 6

Using the same pipe as above, create another variable, within the call to mutate(), called percent_rank equal to percent_rank(x).

ranktypes |> 
  mutate(row_number = row_number(x),
         dense_rank = dense_rank(x),
         percent_rank = ...)

ranktypes |> 
  mutate(row_number = row_number(x),
         dense_rank = dense_rank(x),
         percent_rank = percent_rank(x))

Exercise 7

Using the same pipe as above, create another variable, within the call to mutate(), called cume_dist equal to cume_dist(x).

ranktypes |> 
  mutate(row_number = row_number(x),
         dense_rank = dense_rank(x),
         percent_rank = percent_rank(x),
         cume_dist = ...)

ranktypes |> 
  mutate(row_number = row_number(x),
         dense_rank = dense_rank(x),
         percent_rank = percent_rank(x),
         cume_dist = cume_dist(x))

You can achieve many of the same results by picking the appropriate ties.method argument to base R’s rank() function; you’ll probably also want to set na.last = "keep" to keep NAs as NA.
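
A base R sketch of that correspondence, again on an illustrative vector with one tie. The pairing of ties.method values with dplyr functions in the comments reflects how the results line up on this example.

```r
x <- c(1, 2, 2, 3, 4, NA)  # illustrative vector with one tie

# Same result as min_rank() on this input.
base_min <- rank(x, ties.method = "min", na.last = "keep")

# Same result as row_number(x) on this input.
base_first <- rank(x, ties.method = "first", na.last = "keep")

# The default method, "average", gives tied values the mean of their ranks.
base_average <- rank(x, ties.method = "average", na.last = "keep")

base_min
base_first
base_average
```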

Exercise 8

Type in numbers and hit "Run Code."

numbers

numbers

The numbers tibble just has one variable, id, with values from 1 through 10.

Exercise 9

Pipe numbers to the mutate() function. Within the call to mutate(), create a variable row0 which equals row_number() minus 1.

numbers |> 
  mutate(
    row0 = ... - ...
  )

numbers |> 
  mutate(
    row0 = row_number() - 1)

row_number() can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row.

Exercise 10

Using the same pipe as above, create another variable, within the call to mutate(), called three_groups equal to row0 %% 3.

numbers |> 
  mutate(
    row0 = row_number() - 1, 
    three_groups = ... %% ...
  )

numbers |> 
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3)

When combined with %% or %/%, row_number() can be a useful tool for dividing data into similarly sized groups.
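
The two operators split the rows in different ways, which is easiest to see by counting the resulting groups. This sketch rebuilds the numbers tibble so that it is self-contained and assumes dplyr is installed.

```r
library(dplyr)

grouped <- tibble(id = 1:10) |>
  mutate(
    row0 = row_number() - 1,           # zero-based row index: 0, 1, ..., 9
    three_groups = row0 %% 3,          # cycles 0, 1, 2, 0, 1, 2, ...
    three_in_each_group = row0 %/% 3   # blocks 0, 0, 0, 1, 1, 1, ...
  )

# %% gives a fixed number of groups (3) of roughly equal size.
by_remainder <- grouped |> count(three_groups)
by_remainder

# %/% gives groups of a fixed size (3), with a smaller final group.
by_quotient <- grouped |> count(three_in_each_group)
by_quotient
```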

Exercise 11

Using the same pipe as above, create another variable, within the call to mutate(), called three_in_each_group set to row0 %/% 3.

numbers |> 
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = ... %/% ...
  )

numbers |> 
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3)

dplyr::lag() and dplyr::lead() allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end.

Exercise 12

We provide a new vector x. On the next line, use the function lag() with x as its argument.

x <- c(2, 5, 11, 11, 19, 35)

x <- c(2, 5, 11, 11, 19, 35)
lag(...)

x <- c(2, 5, 11, 11, 19, 35)
lag(x)

Exercise 13

Use the function lead() taking in x as its argument.

x <- c(2, 5, 11, 11, 19, 35)

x <- c(2, 5, 11, 11, 19, 35)
lead(...)

x <- c(2, 5, 11, 11, 19, 35)
lead(x)

Exercise 14

Subtract lag(x) from x.

x <- c(2, 5, 11, 11, 19, 35)

x <- c(2, 5, 11, 11, 19, 35)
x - lag(...)

x <- c(2, 5, 11, 11, 19, 35)
x - lag(x)

z - lag(z) will give you the difference between the current and previous value for all the elements of the vector z.

Note that z stands for any variable; replace it with the variable you are using (e.g. apples - lag(apples)).

Exercise 15

Type in x == lag(x) and hit "Run Code."

x <- c(2, 5, 11, 11, 19, 35)

x <- c(2, 5, 11, 11, 19, 35)
x == ...

x <- c(2, 5, 11, 11, 19, 35)
x == lag(x)

x == lag(x) is TRUE when the current value equals the previous one and FALSE when it differs, so it tells you where the value changes. You can lead or lag by more than one position by using the second argument, n.
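
The sketch below collects the lag() and lead() results for the exercise vector in one place and adds a lag of two positions. The dplyr:: prefix is used so that the calls cannot be confused with stats::lag(), which is a different function.

```r
x <- c(2, 5, 11, 11, 19, 35)

previous  <- dplyr::lag(x)          # shifted right, NA at the start
following <- dplyr::lead(x)         # shifted left, NA at the end
change    <- x - dplyr::lag(x)      # difference from the previous value
repeated  <- x == dplyr::lag(x)     # TRUE where the value did not change
two_back  <- dplyr::lag(x, n = 2)   # two NAs at the start

previous
following
change
repeated
two_back
```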

Exercise 16

When you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x minutes since the last activity. For example, the times_visited dataset has the times when someone visited a website.

Type in times_visited and hit "Run Code."

times_visited

times_visited

Sometimes you want to start a new group every time some event occurs. In the next exercises, we compute the time between consecutive visits and flag each gap that is big enough to start a new session.

Exercise 17

Pipe times_visited to the mutate() function. Within the call to mutate(), create a variable diff which is set to time minus lag(time, default = first(time)).

times_visited |> 
  mutate(diff = ...)

times_visited |> 
  mutate(diff = time - lag(time, default = first(time)))

Exercise 18

Using the same pipe as above, create another variable, within the call to mutate(), called has_gap set to diff >= 5.

times_visited |> 
  mutate(
    diff = time - lag(time, default = first(time)), 
    has_gap = ...)

times_visited |> 
  mutate(diff = time - lag(time, default = first(time)),
         has_gap = diff >= 5)

But how do we go from that logical vector to something that we can use .by with? cumsum() comes to the rescue here.

Exercise 19

Using the same pipe as above, create another variable, within the call to mutate(), called group set to cumsum(has_gap).

times_visited |> 
  mutate(
    diff = time - lag(time, default = first(time)), 
    has_gap = diff >= 5,
    group = ...)

times_visited |> 
  mutate(diff = time - lag(time, default = first(time)),
         has_gap = diff >= 5,
         group = cumsum(has_gap))
         group = cumsum(has_gap))

When has_gap is TRUE, cumsum() will increment group by 1.
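
Putting the three steps together and then using the new group column with .by might look like the sketch below. It rebuilds times_visited so that it is self-contained, assumes dplyr 1.1.0 or later, and the final summarize() with start, end, and visits is an illustrative addition rather than part of the exercises.

```r
library(dplyr)

times_visited <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

sessions <- times_visited |>
  mutate(
    diff = time - lag(time, default = first(time)),  # minutes since last visit
    has_gap = diff >= 5,                             # TRUE starts a new session
    group = cumsum(has_gap)                          # running count of gaps so far
  )

sessions

# Illustrative use of the session id as a grouping variable.
session_summary <- sessions |>
  summarize(start = min(time), end = max(time), visits = n(), .by = group)

session_summary
```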

Exercise 20

Imagine you have a dataframe with a bunch of repeated values. Type in repetition and hit "Run Code."

repetition

repetition

Exercise 21

Create a new pipeline and pipe repetition to the mutate() function. Within the call to mutate(), create a variable id and set it to consecutive_id(x).

repetition |> 
  mutate(id = ...)

repetition |> 
  mutate(id = consecutive_id(x))

Another approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes.
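
The difference between grouping by consecutive_id(x) and grouping by x itself shows up once a value reappears later in the data. The sketch below rebuilds the repetition tibble so that it is self-contained and assumes dplyr 1.1.0 or later, where consecutive_id() and the by argument of slice_head() are available.

```r
library(dplyr)

repetition <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)

with_id <- repetition |>
  mutate(id = consecutive_id(x))

with_id$id  # a new id starts whenever x differs from the previous row

# First row of each run: "a" and "b" appear twice because they recur.
first_of_run <- with_id |>
  slice_head(n = 1, by = id)

# First row of each distinct value: one row per letter.
first_of_value <- repetition |>
  slice_head(n = 1, by = x)

nrow(first_of_run)
nrow(first_of_value)
```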

Exercise 22

Using your previous code, within your call to the mutate() function, add x and y as additional arguments so that these columns are listed explicitly.

repetition |> 
  mutate(id = consecutive_id(x), x, ...)

repetition |> 
  mutate(id = consecutive_id(x), x, y)

Exercise 23

Using the same pipe as above, add the function slice_head() to the pipeline. Within this function add the argument n and set it to 1.

repetition |> 
  mutate(id = consecutive_id(x), x, y) |>
  slice_head(n = ...)

repetition |> 
  mutate(id = consecutive_id(x), x, y) |>
  slice_head(n = 1)

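Without a by argument, slice_head(n = 1) keeps only the first row of the whole tibble. Grouping by id, as in the next exercise, keeps the first row from each run of repeated x values.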

Exercise 24

Using your previous code, within the slice_head() function, add the by argument and set it to id.

repetition |> 
  mutate(id = consecutive_id(x), x, y) |>
  slice_head(n = 1, by = ...)

repetition |> 
  mutate(id = consecutive_id(x), x, y) |>
  slice_head(n = 1, by = id)

Numeric summaries

This section introduces more summary functions for describing the center, spread, and position of your data, including median(), quantile(), IQR(), first(), nth(), and last().

Exercise 1

Create a new pipeline. Pipe flights to the mutate() function. Within the mutate() function, create a new variable mean and set it to mean(dep_delay, na.rm = TRUE).

flights |> 
  mutate(
    mean = ...
  )

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE)
  )

An alternative to mean() is median(), which finds a value that lies in the “middle” of the vector. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center.

Exercise 2

Using the same pipe as above, create another variable, within the call to mutate(), called median and set it to median(dep_delay, na.rm = TRUE).

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = ...
  )

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE)
  )

Exercise 3

Using the same pipe as above, create another variable, within the call to mutate(), called n and set it to n().

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = ...
  )

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n()
  )

Exercise 4

Using your previous code, within your call to the mutate() function add an argument called .by and set it to c(year, month, day).

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(...)
  )

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)
  )

Exercise 5

Now continue the pipe to select(), using the columns year, month, day, mean, median, and n as the arguments.

... |>
  select(year, ..., day, ..., median, ...)

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n)

Exercise 6

Now continue the pipe to distinct().

... |>
  ...()

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct()

Exercise 7

Using the same pipe as above, add the ggplot() function to the pipeline. Within the ggplot() function, map x to mean and y to median.

... |> 
  ggplot(aes(x = ..., y = ...))

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median))

Exercise 8

Using the same pipe as above, add the geom_abline() function to the pipeline. Within the geom_abline() function, add the argument slope and set it to 1.

... + 
  geom_abline(slope = ...)

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median)) + 
  geom_abline(slope = 1)

Exercise 9

Using the same pipe as above, within the geom_abline() function, add another argument called intercept and set it to 0.

... + 
  geom_abline(slope = 1, intercept = ...)

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median)) + 
  geom_abline(slope = 1, intercept = 0)

Exercise 10

Using the same pipe as above, within the geom_abline() function, add another argument called color and set it to "white".

... + 
  geom_abline(slope = 1, intercept = 0, color = ...)

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median)) + 
  geom_abline(slope = 1, intercept = 0, color = "white")

Exercise 11

Using the same pipe as above, within the geom_abline() function, add another argument called linewidth and set it to 2.

... + 
  geom_abline(slope = 1, intercept = 0, color = "white", linewidth = ...)

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median)) + 
  geom_abline(slope = 1, intercept = 0, color = "white", 
              linewidth = 2)

Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
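
A quick way to see why skew matters is to compare the two measures on a small made-up vector with one extreme value. The numbers below are hypothetical delays, not taken from flights.

```r
# Hypothetical departure delays in minutes, with one very late flight
delays <- c(-3, -1, 0, 2, 4, 300)

delay_mean   <- mean(delays)    # pulled upward by the single 300-minute delay
delay_median <- median(delays)  # middle of the sorted values, unaffected by the outlier

delay_mean
delay_median
```

The single extreme value drags the mean far above what a typical flight experienced, while the median stays close to the bulk of the data.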

Exercise 12

Using the same pipe as above, add the geom_point() function to the pipeline.

... + 
  geom_...()

flights |> 
  mutate(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)) |>
  select(year, month, day, mean, median, n) |>
  distinct() |> 
  ggplot(aes(x = mean, y = median)) +
  geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
  geom_point()

This scatterplot compares the mean and the median departure delay (in minutes) for each day, showing how summarizing daily departure delay with the median differs from using the mean. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.

Exercise 13

Create a new pipeline. Pipe flights to the mutate() function. Within the mutate() function, create a new variable max and set it to max(dep_delay, na.rm = TRUE).

flights |> 
  mutate(
    max = ...
  )

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE)
  )

min() and max() will give you the smallest and largest values, respectively.

Exercise 14

Using the same pipe as above, create another variable, within the call to mutate(), called q95 and set it to quantile(dep_delay, 0.95, na.rm = TRUE).

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE), 
    q95 = ...
  )

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE)
  )

Another powerful tool is quantile() which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.
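
The relationship between quantile() and median() can be checked on a simple made-up vector. This sketch uses the integers 1 to 101 (a hypothetical example chosen so the quantiles land on whole numbers) and R's default quantile type.

```r
# Hypothetical data: the integers 1 to 101
x <- 1:101

q25 <- quantile(x, 0.25)  # value above roughly 25% of the data
q50 <- quantile(x, 0.5)   # same as the median
q95 <- quantile(x, 0.95)  # value above roughly 95% of the data

q25
q50
q95
median(x)
```

quantile() returns a named vector (for example, the name "50%"), so wrap it in unname() if you need a bare number.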

Exercise 15

Using your previous code, within your call to the mutate() function, add the .by argument and set it to c(year, month, day).

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .by = ...
  )

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .by = c(year, month, day)
  )

Exercise 16

Continue the pipe to select(), using the year, month, day, max, and q95 columns as the arguments.

... |>
  select(year, ..., day, ..., q95)

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .by = c(year, month, day)) |>
  select(year, month, day, max, q95)

Exercise 17

Now finish the pipeline by piping it to distinct().

... |>
  distinct()

flights |> 
  mutate(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .by = c(year, month, day)) |>
  select(year, month, day, max, q95) |>
  distinct()

For the flights data, we look at the 95% quantile of delays rather than the maximum, because it ignores the most delayed 5% of flights, which can be quite extreme.

Exercise 18

Create a new pipeline. Pipe flights to the mutate() function. Within the mutate() function, create a new variable distance_sd and set it to IQR(distance).

flights |> 
  mutate(
    distance_sd = ...
  )

flights |> 
  mutate(
    distance_sd = IQR(distance)
  )

Two commonly used summaries of spread are the standard deviation, sd(x), and the inter-quartile range, IQR(x). IQR() gives us the width of the range that contains the middle 50% of the data and is calculated as quantile(x, 0.75) - quantile(x, 0.25).
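
The definition of the inter-quartile range can be verified directly. The vector below is hypothetical and includes one outlier to show how differently sd() and IQR() react to it.

```r
# Hypothetical values with one outlier (30)
x <- c(2, 4, 4, 5, 7, 9, 30)

# Inter-quartile range computed by hand from the two quartiles
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

iqr_manual
IQR(x)  # built-in equivalent
sd(x)   # standard deviation, which is much more sensitive to the outlier
```

Because IQR() only looks at the middle half of the data, the outlier does not affect it, whereas it inflates the standard deviation.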

Exercise 19

Using the same pipe as above, create another variable, within the call to mutate(), called n and set it to n().

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n = ...
  )

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n()
  )

Exercise 20

Using your previous code, within your call to mutate() add the .by argument and set it to c(origin, dest).

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n(),
    .by = ...
  )

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n(),
    .by = c(origin, dest)
  )

Exercise 21

Using the same pipe as above, add the filter() function to the pipeline. Within the filter() function add the argument distance_sd > 0.

... |> 
  filter(... > 0)

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n(),
    .by = c(origin, dest)) |>
  filter(distance_sd > 0)

Exercise 22

Continue the pipe to select(), with the columns origin, dest, distance_sd, and n as the arguments.

... |>
  select(origin, ..., distance_sd, ...)

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n(),
    .by = c(origin, dest)) |>
  filter(distance_sd > 0) |>
  select(origin, dest, distance_sd, n)

Exercise 23

Finally, continue the pipe to distinct().

... |>
  ...()

flights |> 
  mutate(
    distance_sd = IQR(distance),
    n=n(),
    .by = c(origin, dest)) |>
  filter(distance_sd > 0) |>
  select(origin, dest, distance_sd, n) |>
  distinct()

We can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code above reveals a data oddity for airport EGE (Eagle County Regional Airport).

Exercise 24

Describe the oddity.

question_text(NULL,
    message = "The distance between any two airports should be constant, obviously. Airports don't move! For some reason, the distances between EGE, on one hand, and JFK/EWR on the other hand, are not always the same. That seems odd!",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 25

Create a new pipeline. Pipe flights to the filter() function. Within the call to filter(), use the argument dep_delay < 120, which keeps only flights with a departure delay of less than 2 hours.

flights |> 
  filter(...)

flights |> 
  filter(dep_delay < 120)

Exercise 26

Using the same pipe as above, add the ggplot() function to the pipeline. Within the ggplot() function, map x to dep_delay and group to interaction(day, month).

flights |> 
  filter(dep_delay < 120) |> 
  ggplot(aes(x = ..., group = ...))

flights |> 
  filter(dep_delay < 120) |> 
  ggplot(aes(x = dep_delay, group = interaction(day, month)))

Exercise 27

Using the same pipe as above, add the geom_freqpoly() function to the pipeline. Within the geom_freqpoly() function set the argument binwidth to 5 and alpha to 1/5.

flights |> 
  filter(dep_delay < 120) |> 
  ggplot(aes(x = dep_delay, group = interaction(day, month))) + 
  geom_freqpoly(...)

flights |> 
  filter(dep_delay < 120) |>
  ggplot(aes(x=dep_delay, group = interaction(day, month))) +
  geom_freqpoly(binwidth = 5, alpha = 1/5)

In this plot, 365 frequency polygons of dep_delay, one for each day, are overlaid. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.

Exercise 28

Create a new pipeline. Pipe flights to the mutate() function. Within the mutate() function, create a new variable first_dep and set it to first(dep_time, na_rm = TRUE).

flights |> 
  mutate(
    first_dep = ...
  )

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE))

Exercise 29

Using the same pipe as above, create another variable, within the call to mutate(), called fifth_dep and set it to nth(dep_time, 5, na_rm = TRUE).

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = ...
  )

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE))

Exercise 30

Using the same pipe as above, create another variable, within the call to mutate(), called last_dep and set it to last(dep_time, na_rm = TRUE).

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = ...
  )

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE))

Exercise 31

Using your previous code, within your call to mutate() add the .by argument and set it to c(year, month, day).

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE),
    .by = ...
  )

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE),
    .by = c(year, month, day))

Exercise 32

Continue the pipe to select(), with the columns year, month, day, first_dep, fifth_dep, and last_dep as the arguments.

... |>
  select(year, ..., day, ..., fifth_dep, ...)

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE),
    .by = c(year, month, day)) |>
  select(year, month, day, first_dep, fifth_dep, last_dep)

Exercise 33

Finally, continue the pipe to distinct().

... |>
  distinct()

flights |> 
  mutate(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE),
    .by = c(year, month, day)) |>
  select(year, month, day, first_dep, fifth_dep, last_dep) |>
  distinct()

The functions first(x), last(x), and nth(x, n) extract a value at a specific position.
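
Here is a brief sketch of how these positional helpers behave, including the na_rm argument used in the exercises. The vector dep is hypothetical, and the sketch assumes dplyr 1.1.0 or later, where na_rm is supported.

```r
library(dplyr)

# Hypothetical departure times with a missing first value
dep <- c(NA, 517, 533, 542, 544, 554)

first_raw   <- first(dep)                  # the first element, which is NA here
first_clean <- first(dep, na_rm = TRUE)    # first non-missing value
fifth_raw   <- nth(dep, 5)                 # fifth element of the original vector
fifth_clean <- nth(dep, 5, na_rm = TRUE)   # fifth element after dropping NAs
last_value  <- last(dep)                   # final element
beyond_end  <- nth(dep, 10)                # position past the end gives NA by default
```

Unlike indexing with dep[5], these functions let you skip missing values and return a default when the requested position does not exist.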

Exercise 34

Create a new pipeline and pipe flights to the mutate() function. Within the mutate() function, create a new variable y and set it to min_rank(sched_dep_time).

flights |> 
  mutate(y = min_rank(...))

flights |> 
  mutate(y = min_rank(sched_dep_time))

Exercise 35

Using your previous code, within your call to the mutate() function, add another argument called .by and set it to c(year, month, day).

flights |> 
  mutate(y = min_rank(sched_dep_time), .by = ...)

flights |> 
  mutate(y = min_rank(sched_dep_time),
         .by = c(year, month, day))

As the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules, they can also be usefully paired with mutate(), as we have done in this tutorial.

When an expression returns more than one value per group, the functions to use are mutate() and reframe(): since dplyr 1.1.0, using summarize() this way is deprecated. reframe() is a more advanced function than you need at this stage, so ideally the only function you should be using is mutate().
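
The difference between pairing a summary function with summarize() and with mutate() can be seen on a tiny made-up tibble. The names scores, g, and v are hypothetical, and the sketch assumes dplyr 1.1.0 or later for the .by argument.

```r
library(dplyr)

# Hypothetical data: two groups of unequal size
scores <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

# summarize() collapses each group to a single row
collapsed <- scores |>
  summarize(avg = mean(v), .by = g)

# mutate() keeps every row and recycles the group mean across it
recycled <- scores |>
  mutate(avg = mean(v), .by = g)

collapsed
recycled
```

This recycling is why the exercises in this section follow mutate() with select() and distinct() to get back to one row per group.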

Exercise 36

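Using the same pipe as above, add the filter() function to the pipeline. Within the filter() function, add the argument y %in% c(1, max(y)) and set .by to c(year, month, day), so that max(y) is computed within each day rather than across the whole dataset.

flights |> 
  mutate(y = min_rank(sched_dep_time), .by = c(year, month, day)) |> 
  filter(..., .by = c(year, month, day))

flights |> 
  mutate(y = min_rank(sched_dep_time),
         .by = c(year, month, day)) |>
  filter(y %in% c(1, max(y)),
         .by = c(year, month, day))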

Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row.

Summary

This tutorial covered Chapter 13: Numbers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We have utilized two core packages from the Tidyverse: readr and dplyr. Key commands that you learned include parse_double(), which parses numbers directly from strings; parse_number(), which strips non-numeric characters before parsing numbers from strings; count(), which counts the unique values of one or more variables; pmin() and pmax(), which take one or more vectors and return their parallel minima or maxima; round(), which rounds the values in its first argument to the specified number of decimal places; and min_rank(), which ranks an input vector and gives every tie the same value.

r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.
