Strings

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(babynames)
library(gghighlight)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

x <- "n"

x1 <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")

set.seed(9)

df <- tibble(name = c("Flora", "Preceptor", "Terra", NA))

df1 <- tibble(name = c("Flora", "Preceptor", "Terra", NA))

df2 <- tribble(
  ~ name, ~ fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)

y <- c("Apple", "Banana", "Pear")

x2 <- "text\nEl Ni\xf1o was particularly bad this year"

x3 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

u <- c("\u00fc", "u\u0308")

tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""


Introduction

This tutorial covers Chapter 14: Strings from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Some important functions which we will learn include: str_c(), str_glue(), str_flatten(), separate_longer_delim(), and more.

Creating a String

We’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes (").

Exercise 1

Type this is a string in double quotes and assign it to a variable named string1 using <-.


string1 <- "..."
string1 <- "This is a string"

Now there’s no difference in behavior between the single quotes and double quotes, but in the interests of consistency, the Tidyverse style guide recommends using ", unless the string itself contains double quotes.

Exercise 2

Now type hello world with one double quote, the other missing, and assign to str2.


str2 <- ...
str2 <- "hello world"

You'll get an error because the string is never closed. In the console, if you forget to close a quote, you’ll see +: the continuation prompt. If this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.

Exercise 3

Now type hello "world" in double quotes and use another pair of double quotes on world and assign to str3.


str3 <- "..."

The code str3 <- "hello "world"" doesn't work because it is a syntax error. Without an escape character, R interprets the second double quote as the end of the string and cannot parse what follows.

Exercise 4

In the previous exercise, we encountered a syntax error when trying to run str3 <- "hello "world"". To fix this error, change the outer double quotes to single quotes and keep the inner double quotes intact. This way, the inner double quotes become part of the string.


str3 <- 'hello "world"'
str3 <- 'hello "world"'

If you wanted to include single quotes in a string you could put the outer quotes as double quotes.

Exercise 5

Now create a new variable x and assign it the string "n" (in double quotes) using <-. Then, on a new line, call x.


x <- "n"
x
x <- "n"
x

R prints a character vector with each element wrapped in quotes, which is why we see "n" rather than a bare n.

Exercise 6

To view only the string and not the quotes, let's load the stringr package using library().


library(...)
library(stringr)

The stringr package is part of the Tidyverse, so loading the tidyverse package loads stringr automatically.

Exercise 7

To utilize the variable x that we previously set, let's type str_view() and specify x as the argument.


str_view(...)
str_view(x)

The printed representation of a string is not the same as the string itself because the printed representation shows the escapes. To see the raw contents of the string, we use str_view().

Exercise 8

To perform the same action with single quotes, let's create a new variable y and assign it the value "'" using the <- operator. Then, on a new line, type str_view() and pass y as the argument.


y <- ...
str_view(...)
y <- "'"
str_view(y)

Exercise 9

There's another way to include quotes in a string. To include a literal single or double quote, you can use the backslash \ to escape it. For example, "\"" will return '"'.

Let's create a string with one single quote and assign it to the variable single_quote. Then, on a new line, type str_view() and pass single_quote as the argument.


single_quote <- '\''
str_view(...)
single_quote <- '\''
str_view(single_quote)

You might wonder what to do if you want to include a literal backslash in your string. We'll see how in the next exercise.

Exercise 10

To include a literal backslash in your string, type two backslashes: the first escapes the second, so the string contains a single backslash. Type two backslashes in a string and pass it to str_view().


str_view("...")
str_view("\\")
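
One way to convince yourself that an escape sequence stands for a single character is to count characters. The sketch below assumes stringr is installed; the variable names are chosen only for illustration.

```r
library(stringr)

backslash    <- "\\"   # two characters typed, one backslash stored
double_quote <- "\""   # an escaped double quote
single_quote <- '\''   # an escaped single quote

# Each string holds exactly one character.
lengths_found <- str_length(c(backslash, double_quote, single_quote))
lengths_found

# writeLines() prints the raw contents, much like str_view() does.
writeLines(backslash)
```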

Exercise 11

If you find yourself dealing with a complex situation where you have many backslashes and quotes to include, it can become confusing to keep track of them.

To illustrate this problem, consider the string tricky, defined as below. Hit "Run Code."

tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""

tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""

That’s a lot of backslashes! (This is sometimes called leaning toothpick syndrome.)

Exercise 12

Run str_view() with tricky set as the argument.


str_view(tricky)

str_view(tricky)

After running the code, we should expect to see double_quote <- "\"" # or '"' single_quote <- '\'' # or "'" as the output. However, creating a string representation like this can be extremely confusing due to the excessive use of backslashes and quotes. In the next exercise, we will explore a solution to this problem.

Exercise 13

Let's modify tricky as below. Add str_view() with tricky as the argument.

tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"

tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
str_view(tricky)

tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
str_view(tricky)

To solve the issue, we utilized a raw string, which is a specific type of string literal that doesn't interpret any special characters or escape sequences.

Typically, a raw string starts with r"(, ends with )", and allows for any text representation. However, if the string contains )", alternatives like r"[]" or r"{}" can be used. Furthermore, you can add dashes to ensure unique opening and closing pairs, such as r"--()--", r"---()---", and so on. Raw strings offer flexibility to handle any text without problems.
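
As a small illustration of the delimiter variants described above, the sketch below writes the same text four ways and checks that they are identical. It assumes R 4.0.0 or later, where raw strings were introduced; the file path is made up for the example.

```r
# The same (hypothetical) Windows path, written with escapes and as raw strings.
escaped   <- "C:\\Users\\me\\notes.txt"
raw_paren <- r"(C:\Users\me\notes.txt)"
raw_brack <- r"[C:\Users\me\notes.txt]"
raw_dash  <- r"---(C:\Users\me\notes.txt)---"

# All four spellings produce the same string.
same <- c(
  identical(escaped, raw_paren),
  identical(escaped, raw_brack),
  identical(escaped, raw_dash)
)
same

# writeLines() shows the contents without escapes.
writeLines(raw_paren)
```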

Exercise 14

Besides \", \', and \\, there are a few other special characters that can be useful. The most common ones are \n for a new line and \t for a tab.

Now let's observe how str_view() handles these special characters.

To do that, let's create a new variable called x1 and set it as a vector with the following elements: "one\ntwo", "one\ttwo", "\u00b5", "\U0001f604".


x1 <- c(...)

x1 <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")

For a complete list of other special characters, see ?Quotes.

Exercise 15

Run x1.


x1
x1

Printing x1 shows the escape sequences \n and \t exactly as we typed them. Only the Unicode escapes, such as the smiley face, are displayed as the characters they stand for.

Exercise 16

Now run str_view() with x1 as the argument.


str_view(x1)
str_view(x1)

Compared with printing the strings in the console, str_view() shows what the special characters actually do: the first string is displayed with a real line break, and the tab in the second string is highlighted so it is easy to spot.

Creating many strings from data

Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame.

We’ll show you how to combine strings with str_c() and str_glue() and how you can use them with mutate().

Exercise 1

The first function we will learn is str_c().

str_c() takes any number of vectors as arguments and returns a character vector. Run str_c("x","y") and see what you get.


str_c(...,...)

str_c("x","y")

str_c() is a function in the stringr package in R that combines multiple character vectors into a single character vector. It is similar to the paste() function, but it uses tidyverse recycling and NA rules.

Exercise 2

Copy the previous code and change it to str_c("Hello ", c("Preceptor","Anish")), then run it.


str_c(..., c(...))

str_c("Hello ", c("Preceptor","Anish"))

From the result we can see that str_c() is vectorized: the single string "Hello " is recycled and combined element-wise with each name in the vector.
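
The difference in NA handling between str_c() and base R's paste0() is easiest to see side by side. This is a minimal sketch; the vector greeting_names is made up for the example.

```r
library(stringr)

greeting_names <- c("Preceptor", NA)

with_str_c  <- str_c("Hello ", greeting_names)   # a missing input stays missing
with_paste0 <- paste0("Hello ", greeting_names)  # NA is turned into the text "NA"

with_str_c
with_paste0
```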

Exercise 3

To demonstrate the usage of str_c() on a data set, let's create a tibble. Create a variable called df and assign it to tibble(name = c("Flora", "Preceptor", "Terra", NA)). This will generate a data set with string characters.


... <- tibble(...=c(...))

df <- tibble(name = c("Flora", "Preceptor", "Terra", NA))

You might be wondering what a tibble is. A tibble is a modern form of data frame that is part of the tidyverse. Tibbles print more compactly than data frames, showing the first rows, the type of each column, and the dimensions of the data set.

Exercise 4

Run df to see the data frame.


...

df

The tibble has one column named "name." The first three rows contain the names "Flora," "Preceptor," and "Terra," respectively. The fourth row is NA, which represents missing or unknown data.

Exercise 5

Let's modify the tibble by first piping the tibble df to the mutate() function, then in mutate(), we will create new column greeting and set it equal to str_c("Hi ", name, "!").


... |>
  mutate(... = ...("...",name,"..."))

df |>
  mutate(greeting = str_c("Hi ", name, "!"))

After modifying the tibble, we still have a column called name. We used the mutate() function from the "dplyr" package to create a new column called greeting. The str_c() function is used to concatenate the string "Hi " with the values in the name column.

Exercise 6

If you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue().

Let's create a variable named df1 and set it to tibble(name = c("Flora","Preceptor","Terra",NA)), which creates a data set of strings.


... <- tibble(...=c(...))

df1 <- tibble(name = c("Flora","Preceptor","Terra",NA))

You give str_glue() a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes.

Exercise 7

Let's modify the tibble by first piping the tibble df1 to the mutate() function, then in mutate(), we will create new column greeting and set it equal to str_glue("Hi {name}!").


df1 |>
  ...(greeting = ...())

df1 |>
  mutate(greeting = str_glue("Hi {name}!"))

After modifying the tibble, we still have a column called name. We used the mutate() function from the "dplyr" package to create a new column called greeting. As you can see, str_glue() currently converts missing values to the string "NA" unfortunately making it inconsistent with str_c().
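
If the missing-value behavior matters, one option is to replace NA before combining. The sketch below compares the three approaches on a tibble with the same values as df1; the names people and greetings, and the fallback word "you", are chosen only for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

people <- tibble(name = c("Flora", "Preceptor", "Terra", NA))

greetings <- people |>
  mutate(
    with_c    = str_c("Hi ", name, "!"),                   # NA propagates
    with_glue = str_glue("Hi {name}!"),                    # NA becomes the text "NA"
    fallback  = str_c("Hi ", coalesce(name, "you"), "!")   # NA replaced before combining
  )

greetings
```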

Exercise 8

str_c() and str_glue() are suitable for use with mutate() since their output matches the length of their inputs. However, if you need a function that works well with summarize() and always returns a single string, str_flatten() comes into play. It takes a character vector as input and combines each element of the vector into a single string.

Type str_flatten() and within it have a vector of strings like this c("x","y","z").


str_flatten(...)
str_flatten(c("x","y","z"))

A variation of str_flatten() is str_flatten_comma(), which is designed specifically for flattening with commas. It handles the special case of two elements, where no comma should appear even when last includes an Oxford comma.
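
A short sketch of str_flatten_comma() with three and with two elements shows the special case. It assumes stringr 1.5.0 or later, where this function is available.

```r
library(stringr)

# With three elements the Oxford comma in `last` is kept.
three <- str_flatten_comma(c("x", "y", "z"), last = ", and ")

# With two elements the comma is dropped automatically.
two <- str_flatten_comma(c("x", "y"), last = ", and ")

three
two
```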

Exercise 9

What if you want a comma between the letters? Copy the previous code and add a second argument: a comma in a string, like this ",".


str_flatten(c(...),...)
str_flatten(c("x","y","z"), ",")

The second argument is called collapse: the string to insert between each piece. It defaults to "".

Exercise 10

Let's fix the grammar of the string we got in previous exercise, copy the previous code and add last = ", and ".


str_flatten(c(...),..., last = ...)
str_flatten(c("x","y","z"), ",", last = ", and ")

The last argument is an optional string to use in place of the final separator.

Exercise 11

Let's now use str_flatten() with summarize(). Run df2 to have a quick look at the data frame.


df2
df2

The tibble has two columns named "name" and "fruit". Carmen appears with two fruits, Marvin with one, and Terence with three.

Exercise 12

Start a pipe with df2 to summarize() and within summarize(), type the column fruit and set it to str_flatten(fruit, ", ").


df2 |>
  summarize(...)
df2 |>
  summarize(fruit = str_flatten(fruit, ", "))

summarize() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input.

Exercise 13

Within summarize() add .by = name.


... |>
  summarize(...,.by = ...)
df2 |>
    summarize(fruit = str_flatten(fruit, ", "), .by = name)

The .by argument is a selection of columns to group by for just this operation, functioning as an alternative to group_by(). For details and examples, see ?dplyr_by.

The result has one row per name, with all of that person's fruits combined into a single string.

To review, mutate() either changes an existing column or adds a new one. summarize() calculates a single value (per group).
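
To make the review concrete, the sketch below runs both verbs on a copy of the fruit data and also shows the group_by() spelling of the same summary. It assumes dplyr 1.1.0 or later for the .by argument; the names fruit_df, by_arg, grouped, and per_row are made up for the example.

```r
library(dplyr)
library(stringr)
library(tibble)

fruit_df <- tribble(
  ~name,     ~fruit,
  "Carmen",  "banana",
  "Carmen",  "apple",
  "Marvin",  "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)

# summarize() returns one row per group.
by_arg <- fruit_df |>
  summarize(fruit = str_flatten(fruit, ", "), .by = name)

# The same summary written with group_by().
grouped <- fruit_df |>
  group_by(name) |>
  summarize(fruit = str_flatten(fruit, ", "))

# mutate() keeps every row and adds a column.
per_row <- fruit_df |>
  mutate(n_fruit = n(), .by = name)

by_arg
per_row
```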

Extracting data from strings

It's very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:

df |> separate_longer_delim(col, delim)
df |> separate_longer_position(col, width)
df |> separate_wider_delim(col, delim, names)
df |> separate_wider_position(col, widths)

Exercise 1

Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim() to split based on a delimiter.

Create a new tibble using tibble(), create a column x and set it to be a vector which should look like c("a,b,c", "d,e", "f") and assign it to df1.


df1 <- tibble(... = c())
df1 <- tibble(x = c("a,b,c", "d,e", "f"))

Just like with pivot_longer() and pivot_wider(), _longer functions such as separate_longer_delim() make the input data frame longer by creating new rows, and _wider functions make the input data frame wider by generating new columns.

Exercise 2

Copy the previous code and on a new line, start a pipe with df1 to separate_longer_delim(). Within the function, type the column x and set delim equal to ",".


df1 <- ...
df1 |>
  separate_longer_delim(...,... = ",")
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |>  
  separate_longer_delim(x, delim = ",")

If you are wondering what delim = "," means, the delim argument specifies the delimiter: the string that separates the pieces within each value. Here each comma marks the point where one piece ends and the next begins.

The same idea appears when reading files. For example, when reading a CSV (Comma-Separated Values) file using the read_delim() function from the readr package, you can specify delim = "," to indicate that the values in the file are separated by commas.

Exercise 3

Sometimes distinct numbers within a string are not separated by commas or spaces. In such cases, how should we properly split the string to separate these numbers? To demonstrate the issue create a new tibble(), set x as the name of the column and set x equal to a vector which looks like c("1211", "131", "21") and lastly assign it to df2.


df2 <- tibble(... = c())
df2 <- tibble(x = c("1211", "131", "21"))

As you look at the tibble, notice that there is no comma or other delimiter we could split on to make the column longer. That's where separate_longer_position() comes into play.

Exercise 4

Copy the previous code and then, on a new line, pipe with df2 to separate_longer_position(), within the function, type the column x and then set the width to 1.


df2 <- tibble(... = c())
df2 |>
  separate_longer_position(x, ... = ...)
df2 <- tibble(x = c("1211", "131", "21"))
df2 |> 
  separate_longer_position(x, width = 1)

Now you see that every string has been split into single characters, with one new row for each character. Note that the pieces are still strings, not integers.

The width = 1 argument tells separate_longer_position() how many characters each piece should contain.
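
To see the effect of the width argument, it can help to try a different value. The sketch below uses the same three strings with width = 2; the names digits and pairs are made up for the example.

```r
library(tidyr)
library(tibble)

digits <- tibble(x = c("1211", "131", "21"))

# Each string is cut into pieces two characters wide;
# a final piece may be shorter when the length is not a multiple of the width.
pairs <- digits |>
  separate_longer_position(x, width = 2)

pairs
```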

Exercise 5

What if you want to create columns when separating? Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns.

Create a tibble(), set x as the name of the column and set x equal to a vector which looks like c("a10.1.2022", "b10.2.2011", "e15.1.2015") and lastly assign it to df3.


df3 <- tibble(x = c(...))
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))

In this data set, x is made up of a code, an edition number, and a year, separated by ".". To use separate_wider_delim(), we supply the delimiter and the names in two arguments.

Exercise 6

Copy the previous code and start pipe with df3 to separate_wider_delim(), within the function, type x, then set the delim to "." and finally set the names to a vector which looks like c("code", "edition", "year").


df3 <- tibble(x = c(...))
df3 |> 
  ....(
    x,
    .... = ".",
    names = c(...)
  )
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |> 
  separate_wider_delim(
  x,
  delim = ".",
  names = c("code", "edition", "year"))

What if I don't care about the edition of the book and want to delete it?

Exercise 7

If you don't want a particular piece in the output, use NA in place of its name. Copy the previous code and change "edition" in names to NA.


df3 <- tibble(x = c(...))
df3 |> 
  ....(
    x,
    .... = ".",
    names = c(...)
  )
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |> 
  separate_wider_delim(
  x,
  delim = ".",
  names = c("code", NA, "year"))

Exercise 8

What if you want to separate them by different width in position? separate_wider_position() is the function that solves that.

Create a tibble(), set x as the name of the column and set x equal to a vector which looks like c("202215TX", "202122LA", "202325CA") and lastly assign it to df4.


df4 <- tibble(x = c(...))
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

separate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them.

Exercise 9

Copy the previous code and start a pipe with df4 to separate_wider_position(), within the function, type x, then set the widths to a vector which looks like c(year = 4, age = 2, state = 2).


df4 <- tibble(x = c(...))
... |> 
  ...(
    x,
    widths = c(...,...,...)
  )
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>  
  separate_wider_position(
  x, 
  widths = c(year = 4, age = 2, state = 2)
  )

_wider functions like the one we used above make the input data frame wider by generating new columns.
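
As mentioned before the exercise, a width without a name is matched but left out of the result. The sketch below drops the two-character age field this way; the names records and trimmed are made up for the example.

```r
library(tidyr)
library(tibble)

records <- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The unnamed width of 2 consumes the age characters without creating a column.
trimmed <- records |>
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )

trimmed
```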

Exercise 10

separate_wider_delim() requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces?

Create a tibble(), set x as the name of the column and set x equal to a vector which looks like c("1-1-1", "1-1-2", "1-3", "1-3-2", "1") and lastly assign it to df.


df <- tibble(x = c(...))
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

In response to not having the same number of pieces, there are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many.

Exercise 11

Copy the previous code and start pipe with df to separate_wider_delim(), within the function, type x, then set the delim to "-" and finally set the names to a vector which looks like c("x", "y", "z").


df <- tibble(x = c(...))
... |> 
  separate_wider_delim(
    x,
    ... = "-",
    ... = c("x", "y", "z")
  )

You’ll notice that we get an error, but the error gives us some suggestions on how to proceed. Let’s start by debugging the problem.

Exercise 12

Copy the previous code and, within separate_wider_delim(), set too_few to "debug". Assign the result of the pipe to a variable named debug, then call debug on a new line at the end.


df <- tibble(x = c(...))
debug <- ... |> 
  separate_wider_delim(
    x,
    ... = "-",
    ... = c("x", "y", "z"),
    too_few = ...
  )
...
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
debug <- df |>  
  separate_wider_delim(
    x, 
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
debug

When you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix).

x_pieces tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder isn’t useful when there are too few pieces, but we’ll see it again shortly.

Exercise 13

In other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = "align_start" and too_few = "align_end" which allow you to control where the NAs should go.

Copy the previous code and change too_few to be set to "align_start".


df <- tibble(x = c(...))
... |> 
  separate_wider_delim(
    x,
    ... = "-",
    ... = c("x", "y", "z"),
    too_few = ...
  )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
debug <- df |>  
    separate_wider_delim(
    x, 
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start"
  )
debug

You see that some of the columns are filled with NA. Because we set too_few = "align_start", the pieces that exist fill the columns from the left and the NAs go at the end.
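
The two alignment options differ only in where the NAs land. This sketch runs both on a shorter version of the data; the names pieces, at_start, and at_end are made up for the example.

```r
library(tidyr)
library(tibble)

pieces <- tibble(x = c("1-1-1", "1-3", "1"))

# Existing pieces fill from the left; NAs go on the right.
at_start <- pieces |>
  separate_wider_delim(x, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_start")

# Existing pieces fill from the right; NAs go on the left.
at_end <- pieces |>
  separate_wider_delim(x, delim = "-", names = c("x", "y", "z"),
                       too_few = "align_end")

at_start
at_end
```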

Exercise 14

The same principles apply if you have too many pieces.

Create a tibble(), set x as the name of the column and set x equal to a vector which looks like c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9") and lastly assign it to df.


df <- tibble(x = ...)
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))

Exercise 15

Copy the previous code and start pipe with df to separate_wider_delim(), within the function, type x, then set the delim to "-" and finally set the names to a vector which looks like c("x", "y", "z").


df <- tibble(x = ...)
... |> 
  separate_wider_delim(
    ...,
    delim = "...",
    names = c(...)
  )

You will get an error, along with suggestions to use too_many = "debug", too_many = "drop", or too_many = "merge". We will use "debug" first to see what is kept and what is left over, and then look at "drop" and "merge".

Exercise 16

Copy the previous code add too_many = "debug" within separate_wider_delim().


df <- tibble(x = ...)
... |> 
  separate_wider_delim(
    ...,
    delim = "...",
    names = c(...),
    too_many = "..."
  )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |>  
  separate_wider_delim(
    x, 
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "debug"
  )

When we debug the result, you can see the purpose of x_remainder: it shows what was left behind and not inserted into the columns.

Exercise 17

Copy the previous code and change too_many to "drop".


df <- tibble(x = ...)
... |> 
  separate_wider_delim(
    ...,
    delim = "...",
    names = c(...),
    too_many = "..."
  )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |>  
  separate_wider_delim(
    x, 
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "drop"
  )

As the name suggests, "drop" silently discards any pieces beyond the ones named in names.

Exercise 18

Copy the previous code and change too_many to "merge".


df <- tibble(x = ...)
... |> 
  separate_wider_delim(
    ...,
    delim = "...",
    names = c(...),
    too_many = "..."
  )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |>  
  separate_wider_delim(
    x, 
    delim = "-",
    names = c("x", "y", "z"),
    too_many = "merge"
  )

The x remainders are merged into the final column.
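
Real data often has both problems at once, and the two arguments can be combined in one call. The sketch below is one possible combination, not the only sensible one; the names mixed and result are made up for the example.

```r
library(tidyr)
library(tibble)

mixed <- tibble(x = c("1-1-1", "1-3", "1-3-5-6"))

result <- mixed |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start",  # pad short rows with NA on the right
    too_many = "merge"        # fold extra pieces into the last column
  )

result
```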

Letters

In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.

Exercise 1

How do you find the length of a string? That's where str_length() comes into play. str_length() tells you the number of letters in the string.

Type the function str_length() and type "hello world" as the argument and run it.


str_length(...)
str_length("hello world")

str_length() counts not only the letters but also the spaces between the words in a string. If you want to explore more, see ?str_length.

Exercise 2

In addition to "hello world" in str_length(), add two more values: "programming" and NA. Make sure to put all three values in a vector, like c(...,...,...).


str_length(c(...,...,...))
str_length(c("hello world", "programming", NA))

The values you'll get are 11, 11 and NA.

Exercise 3

Let's load the babynames package using library().


...
library("babynames")

babynames is a data set containing the names given to babies in the US from 1880 to 2017. If you want to explore the data set, check out popular baby names.

Exercise 4

Run babynames to have an overview of the data set.


babynames
babynames

We have five columns (year, sex, name, n, prop). n is the number of babies given that name, and prop is the proportion of babies of that sex born in that year who were given that name.

Exercise 5

To explore the data, let's use str_length() with count() to find the distribution of lengths of US baby names.

Start a pipe with babynames to count().


... |>
  count()
babynames |>  
  count()

With no arguments, count() simply returns the number of rows in the data set: 1,924,665. Let's add arguments to find the distribution of lengths of baby names.

Exercise 6

Let's create two columns to see the length of the name and also the count. Within count() type length and set it equal to str_length(name) and then type wt and set it to n.


... |>
  count(... = str_length(...),wt = ...)
babynames |>  
  count(
    length = str_length(name),
    wt = n
  )

Setting wt = n tells count() to add up the n column within each length instead of counting rows, so each name is weighted by the number of babies who received it. Looking at the results, we see that the longest name contains 15 letters.
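
The effect of wt is easier to check on a tiny table where the sums can be done by hand. The data below is invented for the example; it is not taken from babynames.

```r
library(dplyr)
library(stringr)
library(tibble)

toy <- tibble(
  name = c("Ann", "Bob", "Mary", "Ann"),
  n    = c(10, 5, 7, 3)
)

# Without wt: the number of rows for each name length.
unweighted <- toy |>
  count(length = str_length(name))

# With wt = n: the sum of the n column for each name length.
weighted <- toy |>
  count(length = str_length(name), wt = n)

unweighted
weighted
```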

Exercise 7

Now that we know the longest names have 15 letters, let's find the most popular ones. Start a pipe with babynames to filter(). To keep only the names with 15 letters, type str_length(name) and compare it to 15 using ==.


... |>
  filter(...(name)== 15)
babynames |> 
  filter(
    str_length(name) == 15
  )

In R, the == operator tests for equality between two values. It is a comparison operator that returns TRUE if the values are equal and FALSE otherwise.

Exercise 8

Now that we have all the names with 15 letters, let's continue the pipe from filter() to count(). Within count() type name, set wt to n and then set sort to TRUE.


... |>
  ...|>
  count(..., wt = ..., sort = ...)
babynames |> 
  filter(
    str_length(name) == 15
  ) |> 
  count(
    name, 
    wt = n, 
    sort = TRUE
  )

Looking at the results, the most popular 15-letter name is Franciscojavier, used 123 times. Most of the other names begin with Christopher.

Exercise 9

Now that we have learned str_length(), let's move on to str_sub().

Create a variable y and set it equal to a vector containing strings: Apple, Banana, and Pear.


y <- c(...,...,...)
y <- c("Apple", "Banana", "Pear")

You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end.

Exercise 10

To extract a substring using the str_sub() function, specify the following arguments: the first argument should be the variable y that we have defined, the second argument should be the starting position (1 in this case), and the third argument should be the ending position (3 in this case).


str_sub(y, ...,...)
str_sub(y, 1, 3)

Looking at the results, note that the start and end arguments of str_sub() are inclusive, so the length of the returned string will be end - start + 1.

Exercise 11

Let's modify the previous code: change the 2nd argument to -3 and the 3rd argument to -1.


str_sub(..., ...,...)
str_sub(y, -3, -1)

If you want to look at the end of the string, you can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.

Exercise 12

You might ask: what if the range is bigger than the actual string? Let's see. Type str_sub() with "a" as the first argument, 1 as the 2nd argument, and 5 as the 3rd argument.


str_sub("...",1,...)
str_sub("a", 1, 5)

From the results, note that str_sub() won’t fail if the string is too short: it will just return as much as possible.
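
Positive and negative positions can be mixed in one call. The sketch below trims the first and last letter from each fruit and repeats the too-short case; the names fruits, middle, and clipped are made up for the example.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# From the second character through the second-to-last character.
middle <- str_sub(fruits, 2, -2)

# Asking for more characters than exist returns what is available.
clipped <- str_sub("a", 1, 5)

middle
clipped
```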

Exercise 13

Now that we have a sense of str_sub(), let's find the first and last letter of each name. Start a pipe with babynames to mutate().


babynames |>
  mutate()
babynames |>
  mutate()

We are using mutate() because we will be adding two new columns called first and last.

Exercise 14

Within mutate(), create a new column called first and set it to str_sub(name, 1,1) to get the first letter.


... |>
  ...(... = str_...(name,1,...))
babynames |>
  mutate(
    first = str_sub(name, 1,1)
  )

In case you didn't know: in Python, indexing starts from zero, whereas in R it starts from one, so accessing elements and slicing sequences works slightly differently in the two languages.

Exercise 15

In addition to creating the first column, let's add the last column. Within mutate(), add last and set it to str_sub(name, -1,-1).


... |>
  ...(... = str_...(name,1,...),
      last = ..._sub(name, -1,-1))
babynames |>
  mutate(
    first = str_sub(name, 1,1),
    last = str_sub(name, -1,-1)
  )

When we ran the code, we see two new columns first and last which show the first and last letters of a name.

Last letter of boy names

The babynames package contains data about baby names through the years.

Using this data, let's create the following graph.

boys_p <- babynames |>
  filter(sex == 'M') |>

  # prop is supposed to be proportion of a specific name in all male baby names
  # in that year. But I am not sure it is! For example, if you sum up prop for
  # all M in a given year, it never adds to 1. It always adds to a number less
  # than 1, but greater than 0.9. My *guess* would be that this is due to
  # babynames dropping names which have very small numbers of babies with that
  # name. If true, then prop is likely correct.

  select(name, year, prop) |>

  # Add last letter of baby names. 

  mutate(last_letter = stringr::str_sub(name, -1, -1)) |>

  # frequency is the proportion of a certain last letter in all male baby names in that year. 

  summarize(frequency = sum(prop), 
            .by = c(year, last_letter)) |>
  ggplot(aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() + 

    # Only color last letters that account for more than 15% of all boy names in at least one year.

    gghighlight(max(frequency) > 0.15,
                label_key = last_letter) + 
    scale_x_continuous(guide = guide_axis(angle = 70),
                       breaks = c(1880, 1900, 1925, 1950,
                                  1975, 2000, 2017)) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_minimal() + 
    labs(x = NULL,
         y = NULL,
         subtitle = "Names ending with 'N' increased rapidly after 1950",
         title = "Last Letter of Boy Baby Names",
         caption = "Source: http://hadley.github.io/babynames")

boys_p

Exercise 1

glimpse() the babynames data set. Pay close attention to the data type of each variable.


glimpse(...)
glimpse(babynames)

We could have also used the head() function to check the first few rows of the tibble.

Exercise 2

To check for NA values, wrap is.na() in any(). The argument to is.na() should be the babynames data set.


any(is.na(...))
any(is.na(babynames))

This function returned FALSE, which indicates that there are no NA values.
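
To see what a TRUE result would look like, here is a minimal sketch with two made-up tibbles (complete and with_gap are hypothetical names, not tutorial objects). It also counts missing values per column, which can help locate them.

```r
library(tibble)

# Hypothetical data: one tibble without missing values, one with a single NA.
complete <- tibble(name = c("Flora", "Terra"), n = c(10, 20))
with_gap <- tibble(name = c("Flora", NA), n = c(10, 20))

# is.na() returns a logical matrix; any() collapses it to a single value.
has_na_complete <- any(is.na(complete))
has_na_gap <- any(is.na(with_gap))
has_na_complete
has_na_gap

# Counting NA values per column shows where they are.
na_per_column <- colSums(is.na(with_gap))
na_per_column
```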

Exercise 3

Because we are trying to analyse only boy names by last letter, start a pipe with babynames. Then use filter() to keep only the boy names by setting sex equal to 'M'.


babynames |>
  filter(... == 'M')
babynames |>
  filter(sex == 'M')

Exercise 4

Continue the pipe by selecting the name, year, and prop columns using select().


...  |>
  select(..., ..., ...)
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop)

When selecting columns, we can also combine multiple select helpers. For example, when selecting name and year columns, we can use the following code: select(starts_with("na") | ends_with("ar")).
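
The sketch below tries the combined helpers on a one-row tibble with the same column names as babynames. The tibble toy and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical one-row tibble with the same column names as babynames.
toy <- tibble(year = 2000L, sex = "M", name = "Noah", n = 5L, prop = 0.01)

# Keep columns whose names start with "na" or end with "ar".
picked <- toy |>
  select(starts_with("na") | ends_with("ar"))

names(picked)
```

Only name and year match, so the other three columns are dropped.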

Exercise 5

We want the last letter of each boy name. Therefore, create a new variable named last_letter that is equal to the last letter of each boy name. To get it, we will use the str_sub() function from the stringr package.

Add the mutate() function to add a new column: last_letter. Its value is the str_sub() function with name, -1 and -1 as its first three arguments.


... |>
  mutate(last_letter = stringr::str_sub(..., ..., -1))
babynames |>
  filter(sex == 'M') |>
  select(name, year, prop) |>
  mutate(last_letter = stringr::str_sub(name, -1, -1))

str_sub() takes three arguments: the character vector to work on, the starting position, and the ending position. Because we wanted to extract the last letter of each name, which sits at position -1, we set both the start and the end to -1.
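
A short sketch of how the start and end arguments behave, using a single example string. Negative positions count backwards from the end of the string.

```r
library(stringr)

example_name <- "Preceptor"

# Positive positions count from the start of the string.
first_three <- str_sub(example_name, 1, 3)
first_three

# Negative positions count from the end: -1 is the last character.
last_three <- str_sub(example_name, -3, -1)
last_one <- str_sub(example_name, -1, -1)
last_three
last_one
```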

Exercise 6

Add summarize() to create the variable frequency that is equal to sum() of proportions (prop) of each group. Also, set the .by argument to c(year, last_letter) because we want to know the value for each year/letter combination.


... |> 
  summarize(frequency = sum(prop), 
            .by = c(year, last_letter))
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter))
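
To make the grouping concrete, here is a minimal sketch of the same summarize() call on a tiny made-up tibble (toy and its numbers are hypothetical). It assumes a dplyr version that supports the .by argument, as the exercise itself does.

```r
library(dplyr)

# Hypothetical data: two names ending in "n" in 2000, one ending in "a",
# and one name ending in "n" in 2001.
toy <- tibble(
  year = c(2000, 2000, 2000, 2001),
  last_letter = c("n", "n", "a", "n"),
  prop = c(0.10, 0.05, 0.02, 0.20)
)

# One row per year/letter combination, with the proportions added up.
toy_freq <- toy |>
  summarize(frequency = sum(prop), .by = c(year, last_letter))

toy_freq
```

The two "n" rows for 2000 collapse into a single row, so four input rows become three.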

Exercise 7

Now, create a ggplot object, and within ggplot(aes()), set x-axis values to year, y-axis values to frequency, and colour to last_letter.


...|>
  ggplot(aes(x = ..., y = ..., colour = ...)) 
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter))

Exercise 8

Add geom_line() to the ggplot object from the previous exercise.


...  +
  geom_line()
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line()

Alternatively, we could have used another chart type, such as a scatter plot. However, for our goal, a line plot is a great option.

Exercise 9

Because we want to analyse the last letters that are prevalent, we want to color only the last letters that have been above a certain threshold.

We use 15% as the threshold. Therefore, we only want to color last letters that, in at least one year, were the last letter of more than 15% of all baby boys born in that year.

We will use the gghighlight package for this purpose. Add gghighlight() to the ggplot object. Within this function, require max() of frequency to be greater than 0.15 and, to avoid any warnings, set label_key equal to last_letter.


...+ 
  gghighlight(max(...) > ...,
                label_key = ...) 
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() +
    gghighlight(
      max(frequency) > .15,
      label_key = last_letter
    )
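
The 15% rule can also be checked outside the plot. The sketch below is not how gghighlight works internally; it is an analogous dplyr filter, applied to invented numbers in a hypothetical tibble toy_freq, that keeps a letter only if its maximum frequency in any year exceeds the threshold.

```r
library(dplyr)

# Hypothetical frequencies for two last letters over two years.
toy_freq <- tibble(
  year = c(2000, 2001, 2000, 2001),
  last_letter = c("n", "n", "a", "a"),
  frequency = c(0.10, 0.20, 0.05, 0.08)
)

# Keep letters whose maximum frequency exceeds 0.15, evaluated per letter.
highlighted <- toy_freq |>
  filter(max(frequency) > 0.15, .by = last_letter) |>
  distinct(last_letter) |>
  pull(last_letter)

highlighted
```

Here "n" passes because it reaches 0.20 in one year, while "a" never exceeds 0.08. Running the same filter on the real summarized data is one way to list which letters should end up colored.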

The default x-axis ticks are not informative enough. Therefore, we should set custom x-axis ticks.

Exercise 10

Add scale_x_continuous() and, within this function, set the breaks argument equal to a numeric vector of years. You can choose any years within the data set. We preferred 1880, 1900, 1925, 1950, 1975, 2000, and 2017.


... +
  scale_x_continuous(breaks = c(1880, 1900, ..., ...,
                                ..., 2000, 2017)) 
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() +
    gghighlight(
      max(frequency) > .15,
      label_key = last_letter
    ) +
    scale_x_continuous(
      breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017))

A rule of thumb, however, is to include the first and last years so that whoever sees your plot can understand the range of years over which the data was collected.

Exercise 11

Moreover, we want to show the y-axis as percentages instead of proportions. Therefore, add scale_y_continuous() and, within this function, set the labels argument equal to scales::percent_format().


... +
  scale_y_continuous(... = scales::...()) 
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() +
    gghighlight(
      max(frequency) > .15,
      label_key = last_letter
    ) +
    scale_x_continuous(
      breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) +
    scale_y_continuous(
      labels = scales::percent_format())

The scales package provides many functions for formatting axis breaks and labels. ggplot2 itself provides further scale functions; for example, scale_y_binned() can be used to discretize continuous position data.
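
scales::percent_format() returns a function, which ggplot2 then applies to the axis breaks. The sketch below calls that function directly on a few made-up proportions. The accuracy = 1 argument is added here only to force whole-number percentages; the exercise itself uses the defaults.

```r
# Hypothetical proportions, similar in size to the frequency values.
proportions <- c(0.05, 0.15, 0.3)

# percent_format() builds a labelling function; accuracy = 1 rounds to whole percents.
to_percent <- scales::percent_format(accuracy = 1)

percent_labels <- to_percent(proportions)
percent_labels
```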

Exercise 12

Add theme_minimal().


... +
  theme_minimal()
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() +
    gghighlight(
      max(frequency) > .15,
      label_key = last_letter
    ) +
    scale_x_continuous(
      breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) +
    scale_y_continuous(
      labels = scales::percent_format()) +
    theme_minimal()

We also could have used themes such as theme_gray(), theme_void(), and theme_bw().

Exercise 13

Finally, add caption, title, subtitle, and axis labels of your choice. Add labs() function to the plot.

Reminder: This is what your plot should look like:

boys_p

... +
  labs(...)
babynames |>
  filter( 
    sex == 'M') |>
  select(
    name, year, prop) |>
  mutate( 
    last_letter = str_sub(name, -1, -1)) |>
  summarize(
    frequency = sum(prop),
    .by = c(year, last_letter)) |>
  ggplot( 
    aes(x = year, y = frequency, colour = last_letter)) +
    geom_line() +
    gghighlight(
      max(frequency) > .15,
      label_key = last_letter
    ) +
    scale_x_continuous(
      breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) +
    scale_y_continuous(
      labels = scales::percent_format()) +
    theme_minimal() +
    labs( 
      x = "Year",
      y = "",
      title = "Last Letter of Boy Baby Names",
      subtitle = "Names ending with 'N' increased rapidly after 1950")

Summary

This tutorial covered Chapter 14: Strings from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Some important functions which we learned include: str_c(), str_glue(), str_flatten(), separate_longer_delim(), and more.





r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.