Regular expressions

library(learnr)
library(tutorial.helpers)
library(tidyverse)         # Note that tidyverse includes stringr
library(babynames)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

people <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Brandon>-N_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)

rgb <- c("red", "green", "blue")


Introduction

This tutorial covers Chapter 15: Regular expressions from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. With the help of the stringr package, we use regular expressions, a concise and powerful language for describing patterns within strings.

If you want to learn more, a good place to start is vignette("regular-expressions", package = "stringr"): it documents the full set of syntax supported by the stringr package. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.

Pattern basics

The term "regular expression" is a bit of a mouthful, so most people abbreviate it to "regex" or "regexp."

Exercise 1

Run library(tidyverse) in the Console. Copy/paste the resulting message.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

There are currently nine core packages in the Tidyverse, although that number may grow over time. stringr is the main package with which we use regular expressions.

Exercise 2

Load the babynames package with library() below. Don't forget to hit "Run Code."


library(...)
library(babynames)

There is no return value.

Exercise 3

In the Console, run library(babynames) and then look up the help page for the babynames tibble by running ?babynames. Copy/paste the Format information below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The definition of the prop variable is subtle.

Exercise 4

Run glimpse() on the babynames tibble.


glimpse(...)
glimpse(babynames)

There are almost 2 million rows!

Exercise 5

Run fruit at the Console. Copy/paste the command and the response (CP/CR).

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 15)

In addition to babynames, we will use three character vectors from the stringr package: fruit, words, and sentences.

Exercise 6

Type str_view(fruit, "berry"). Hit "Run Code." (Unless we specifically indicate that you should use the Console, you should enter commands into the exercise code block and press "Run Code.")

The first argument to str_view() is the vector which you are searching through. The second argument is the regular expression which you are searching for.


str_view(fruit, "berry")

str_view() will show only the elements of the string vector that match, surrounding each match with <>, and, where possible, highlighting the match in blue.

Exercise 7

Letters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, [, ], and ?, have special meanings and are called metacharacters.

Run str_view() with c("a", "ab", "ae", "bd", "ea", "eab") as the first argument --- this is the vector which we are searching through --- and "a." as the second argument.

A . will match any character, so "a." will match any string that contains an “a” followed by another character.


str_view(c("a", "ab", "ae", "bd", "ea", "eab"), ...)
str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")

There are three matches. Notice how the <> pull out the actual match itself, leaving irrelevant letters, like the "e" in the last match, outside.

Exercise 8

Run str_view() on fruit as the vector and "a...e" as the pattern.


str_view(..., "a...e")
str_view(fruit, "a...e")

Try to interpret the pattern before we tell you the answer . . .

The matches are any fruit which includes an "a", followed by any three characters, followed by an "e". Look at the matches to confirm that they all follow this rule.

Exercise 9

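Quantifiers control how many times a pattern can match: ? makes a pattern optional (it matches 0 or 1 times), + lets a pattern repeat (it matches at least once), and * lets a pattern be optional or repeat (it matches any number of times, including 0).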

Run str_view() on c("a", "ab", "abb") as the vector and "ab?" as the pattern.


str_view(c("a", "ab", "abb"), ...)
str_view(c("a", "ab", "abb"), "ab?")

Note how "a" matches "ab?" because the "?" makes the "b" optional.

Exercise 10

Run str_view() on c("a", "ab", "abb") as the vector and "ab+" as the pattern.


str_view(..., "ab+")
str_view(c("a", "ab", "abb"), "ab+")

"ab+" matches an "a", followed by at least one "b".

Exercise 11

Run str_view() on c("a", "ab", "abb") as the vector and "ab*" as the pattern.


str_view(c("a", "ab", "abb"), ...)
str_view(c("a", "ab", "abb"), "ab*")

ab* matches an "a", followed by any number of "b"s, including zero "b"s.
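To compare the three quantifiers side by side, here is a minimal sketch using str_extract(), which returns the matched text itself (or NA when there is no match). The variable names are only for illustration.

```r
library(stringr)

x <- c("a", "ab", "abb")

# "?" makes the "b" optional: zero or one "b"
opt <- str_extract(x, "ab?")    # "a"  "ab" "ab"

# "+" requires at least one "b"
plus <- str_extract(x, "ab+")   # NA   "ab" "abb"

# "*" allows any number of "b"s, including none
star <- str_extract(x, "ab*")   # "a"  "ab" "abb"
```

Note that "ab?" stops after the first "b" in "abb", while "ab+" and "ab*" take both.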

Exercise 12

Character classes are defined by [] and let you match a set of characters, e.g., [abcd] matches “a”, “b”, “c”, or “d”.

Run str_view() on words as the vector and "[aeiou]x[aeiou]" as the pattern.


str_view(..., "[aeiou]x[aeiou]")
str_view(words, "[aeiou]x[aeiou]")

Can you explain what is going on?

We are matching all the words which feature the pattern of any vowel, followed by an "x", followed by any vowel.

Exercise 13

You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”.

Run str_view() on words as the vector and "[^aeiou]y[^aeiou]" as the pattern.


str_view(words, "...")
str_view(words, "[^aeiou]y[^aeiou]")

The "[^aeiou]y[^aeiou]" pattern finds every instance of any non-vowel, followed by "y", followed by any non-vowel.

Exercise 14

You can use alternation, |, to pick between one or more alternative patterns.

Run str_view() on fruit as the vector and "apple|melon|nut" as the pattern.


str_view(..., "apple|melon|nut")
str_view(fruit, "apple|melon|nut")

The "apple|melon|nut" pattern matches any fruit which contains one of the three options.

Exercise 15

Run str_view() on fruit as the vector and "aa|ee|ii|oo|uu" as the pattern. This should find all the fruits with at least one doubled vowel.


str_view(fruit, ...)
str_view(fruit, "aa|ee|ii|oo|uu")

Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature.

Key functions

Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.

Exercise 1

str_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise. Run str_detect() with the string argument equal to c("a", "b", "c") and the pattern argument equal to "[aeiou]".


str_detect(string = ..., 
           pattern = ...)
str_detect(string = c("a", "b", "c"), 
           pattern = "[aeiou]")

Most of the time, we omit the argument names (string and pattern) and simply rely on "positional mapping," meaning that string is always the first argument (and pattern the second) because that is how the function is defined.

Exercise 2

Since str_detect() returns a logical vector of the same length as the initial vector, it pairs well with filter(). Pipe babynames to filter(), using str_detect(name, "x") as the argument to filter().


babynames |> 
  filter(...(name, ...))
babynames |> 
  filter(str_detect(name, "x"))

Note that name is one of the variables in babynames. This pipe reduces the almost 2 million rows in babynames to just the 16,000 or so rows in which name contains the letter x.

Exercise 3

Continue the pipe with count(), using the argument name to indicate that we want to use the name variable.


... |> 
  count(...)
babynames |> 
  filter(str_detect(name, "x")) |>
  count(name)

Notice that the rows are sorted alphabetically by name, and that n here counts rows of babynames rather than babies, so you can't yet tell which name is the most popular.

Exercise 4

Add wt = n as the second argument to count(), after name.


... |> 
  count(name, ... = n)
babynames |> 
  filter(str_detect(name, "x")) |>
  count(name, wt = n)

We need wt = n because we want to account for the role of n in indicating how many times, in a single year, a given name was used.

Exercise 5

Remember how the result wasn't sorted by popularity? Let's fix that:

Add sort = TRUE at the end of your count() function.


... |> 
  count(name, wt = n, sort = ...)
babynames |> 
  filter(str_detect(name, "x")) |>
  count(name, wt = n, sort = TRUE)

We want to determine the most popular names with an "x" in them, which is why we have sort = TRUE.

Exercise 6

We can also use str_detect() with summarize() by pairing it with sum() or mean(): sum(str_detect([your-data-string], pattern)) tells you the number of observations that match and mean(str_detect([your-data-string], pattern)) tells you the proportion that match.

Pipe babynames to summarize() with the argument prop_x = mean(str_detect(name, "x")).


babynames |> 
  summarize(prop_x = ...(str_detect(name, ...)))
babynames |> 
  summarize(prop_x = mean(str_detect(name, "x")))

The result indicates that about 0.8% of the names in babynames include the letter "x."

Exercise 7

We are interested in how this percentage has changed over time, so modify the code by adding .by = year to the call to summarize().


babynames |> 
  summarize(... = mean(str_detect(name, "x")),
            .by = ...)
babynames |> 
  summarize(prop_x = mean(str_detect(name, "x")),
            .by = year)

This gives us the proportion of names that contain an "x." If you wanted the proportion of babies with a name containing an "x," you would need to compute a weighted mean.
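The difference between the two proportions can be sketched on a tiny made-up tibble. The data in toy is hypothetical and only mimics the year, name, and n columns of babynames; the weighted version divides the number of babies with an "x" in their name by the total number of babies.

```r
library(dplyr)
library(stringr)

# Hypothetical data shaped like babynames (year, name, n)
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  name = c("Alex", "Maria", "Max", "John"),
  n    = c(10, 30, 15, 5)
)

toy_props <- toy |>
  summarize(
    # share of distinct rows (names) containing "x"
    prop_names  = mean(str_detect(name, "x")),
    # share of babies whose name contains "x", weighting each row by n
    prop_babies = sum(n * str_detect(name, "x")) / sum(n),
    .by = year
  )

toy_props
```

In both years half of the names contain an "x", yet the share of babies differs because the names are not equally common.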

Exercise 8

Continue the pipe to a call to ggplot(), with aes(x = year, y = prop_x). Add geom_line(). Don't forget that commands after ggplot() are separated by +, not |>.


... |> 
  ggplot(aes(x = ..., ... = prop_x)) + 
  geom_...()
babynames |> 
  summarize(prop_x = mean(str_detect(name, "x")),
            .by = year) |>
  ggplot(aes(x = year, y = prop_x)) +
  geom_line()

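There are two functions that are closely related to str_detect(): str_subset(), which returns a character vector containing only the strings that match, and str_which(), which returns an integer vector giving the positions of the strings that match.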
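As a quick illustration of two relatives of str_detect(), the sketch below runs str_detect(), str_subset(), and str_which() with the same pattern on a small example vector. The variable names are chosen only for this example.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# One TRUE/FALSE per element
detected <- str_detect(x, "p")   # TRUE FALSE TRUE

# Only the elements that match
matching <- str_subset(x, "p")   # "apple" "pear"

# The positions of the elements that match
positions <- str_which(x, "p")   # 1 3
```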

Exercise 9

The next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string. Run str_count() with two arguments: the vector c("apple", "banana", "pear") and the letter "p".


str_count(..., "p")
str_count(c("apple", "banana", "pear"), "p")

There are two "p"s in "apple", none in "banana", and one in "pear."

Exercise 10

Note that each match starts at the end of the previous match, i.e. regex matches never overlap.

Run str_count() on "abababa" and "aba".


str_count("abababa", ...)
str_count("abababa", "aba")

For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three.

Exercise 11

To see this more clearly, run str_view() on "abababa" and "aba".


str_view(..., "aba")
str_view("abababa", "aba")

In other words, the "second" "aba" string, which relies on the second "a" in the first "aba" string, does not count: after finding a match, the search resumes at the end of that match, so matches never overlap.
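One way to confirm where the two matches sit is to ask for their positions. This is a small sketch using str_locate_all(), which returns one matrix of start and end positions per input string; the name locs is only for illustration.

```r
library(stringr)

# Positions of every non-overlapping "aba" in the string
locs <- str_locate_all("abababa", "aba")[[1]]
locs

# Number of matches found
n_matches <- nrow(locs)
```

The first match covers characters 1 to 3 and the second covers characters 5 to 7, so the "aba" starting at character 3 is skipped.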

Exercise 12

Pipe babynames to count(name).


babynames |> 
  ...(name)
babynames |> 
  count(name)

This reduces the almost 2 million entries to just the 100,000 or so unique names. Note how often the letter "a" appears in the first ten names.

Exercise 13

Continue the pipe with mutate(), creating a new variable, vowels, which is equal to str_count(name, "[aeiou]").


babynames |> 
  count(name) |> 
  mutate(
    vowels = ...(..., "[aeiou]")
  )
babynames |> 
  count(name) |> 
  mutate(
    vowels = str_count(name, "[aeiou]"))

If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. Ignore that error for now.

Exercise 14

Add another variable creation argument to mutate(). Create consonants as the result of str_count(name, "[^aeiou]").


babynames |> 
  count(name) |> 
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = ...(name, ...)
  )
babynames |> 
  count(name) |> 
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )

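This suffers from the same problem as vowels. The capital letter "A" is not in the set aeiou, so it is counted, incorrectly, as a consonant. Among other approaches, we could fix this by adding the upper case vowels to the character class, as in str_count(name, "[aeiouAEIOU]"); by telling the regular expression to ignore case, as in str_count(name, regex("[aeiou]", ignore_case = TRUE)); or by using str_to_lower() to convert the names to lower case first.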
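Here is a minimal sketch comparing three possible fixes for the case problem: extending the character class, using regex() with ignore_case = TRUE, and lowercasing the input first. The two-element vector nm is a made-up example.

```r
library(stringr)

nm <- c("Aaban", "Eve")

# Case sensitive: the capital first letters are missed
naive <- str_count(nm, "[aeiou]")                                 # 2 1

# Fix 1: add the upper case vowels to the character class
fix_class <- str_count(nm, "[aeiouAEIOU]")                        # 3 2

# Fix 2: tell the regular expression to ignore case
fix_flag <- str_count(nm, regex("[aeiou]", ignore_case = TRUE))   # 3 2

# Fix 3: convert the strings to lower case before counting
fix_lower <- str_count(str_to_lower(nm), "[aeiou]")               # 3 2
```

All three fixes give the same counts here; which one reads best depends on whether you prefer to change the pattern or the data.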

Exercise 15

Change mutate() so that the first step is to change-in-place the variable name to be all in lower case. We do that with name = str_to_lower(name).


babynames |> 
  count(name) |> 
  mutate(
    ... = ...(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
babynames |> 
  count(name) |> 
  mutate(
    name = str_to_lower(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )

This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.

Exercise 16

As well as detecting and counting matches, we can also modify them with str_replace() and str_replace_all(). Run str_replace_all() with string equal to c("apple", "pear", "banana"), pattern equal to "[aeiou]", and replacement equal to "-".


str_replace_all(string = ..., 
                ... = "[aeiou]", 
                replacement = ...)
str_replace_all(string = c("apple", "pear", "banana"), 
                pattern = "[aeiou]", 
                replacement = "-")

We usually omit the argument names, so this code would normally be: str_replace_all(c("apple", "pear", "banana"), "[aeiou]", "-").

Exercise 17

str_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""). Run str_remove_all() on c("apple", "pear", "banana") and "[aeiou]".


str_remove_all(c("apple", "pear", "banana"), ...)
str_remove_all(c("apple", "pear", "banana"), "[aeiou]")

The commands without the _all suffix act on just the first match in each element of the vector. Try str_remove(c("apple", "pear", "banana"), "[aeiou]") for an example.

These functions are naturally paired with mutate() when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.
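As a sketch of what peeling off layers can look like, the example below cleans a made-up price column in three steps inside mutate(). The tibble messy and its formatting are assumptions made purely for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical messy data
messy <- tibble(price = c("$1,200 USD", "$85 USD", "$3,050 USD"))

clean <- messy |>
  mutate(
    # Layer 1: drop the trailing currency code (first match only is enough)
    price = str_remove(price, " USD$"),
    # Layer 2: drop every dollar sign and comma
    price = str_remove_all(price, "[$,]"),
    # Layer 3: what remains is a plain number
    price = as.numeric(price)
  )

clean
```

Inside the character class, $ is a literal dollar sign rather than an anchor, so no backslash is needed there.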

Exercise 18

The last function we’ll discuss in this section uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about previously. These functions live in the tidyr package because they operate on (columns of) data frames, rather than individual vectors.

Run people to examine the tibble which we will use.


people

We have the name, gender, and age of a bunch of people in a rather weird format. We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!

Exercise 19

To extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name. Consider:

people |> 
  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".", "_", 
      age = "[0-9]+"
    )
  )

If the match fails, you can use too_few = "debug" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().
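To see the debugging option in action, here is a small sketch with one deliberately broken row. The tibble broken is hypothetical, and the str_ok column name relies on tidyr's convention of prefixing its debug columns with the name of the input column, so treat the exact output as something to verify in your own session. tidyr also prints a warning when debug mode is active.

```r
library(tidyr)
library(tibble)

# Hypothetical input: the second row is missing its age
broken <- tibble(str = c("<Sheryl>-F_34", "<Brandon>-N_"))

debugged <- broken |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".", "_",
      age = "[0-9]+"
    ),
    # Keep going on failure and add columns describing each row's match
    too_few = "debug"
  )

debugged
```

Rows where the full sequence of patterns matched are flagged as OK, which makes it easy to filter down to the problem rows.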

Pattern details

Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.

Exercise 1

In order to match a literal ., you need an escape which tells the regular expression to match metacharacters literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.", as the following example shows. Run this code:

# To create the regular expression \., we need to use \\.
dot <- "\\."

# But the expression itself only contains one \
str_view(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

In this section, we’ll usually write regular expressions without quotes, like \.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like "\\.".

Exercise 2

If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \.

x <- "a\\b"
str_view(x)
str_view(x, "\\\\")

In other words, to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

Exercise 3

Probably better is to use the raw strings you learned about previously. That lets you avoid one layer of escaping. Run this code.

x <- "a\\b"
str_view(x, r"{\\}")

str_view() highlights the single backslash which is part of the x variable. But, to match that single backslash, we need a raw string pattern with two backslashes.
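The two layers of escaping can be checked directly. This sketch counts the characters in the string and confirms that the four-backslash regular string and the two-backslash raw string are the same pattern. Raw strings need R 4.0.0 or later; the variable names are only for illustration.

```r
library(stringr)

x <- "a\\b"

# The string holds three characters: a, one backslash, b
n_chars <- str_length(x)

# Regular string: four backslashes typed, two in the regex, one matched
regular_match <- str_detect(x, "\\\\")

# Raw string: two backslashes typed, two in the regex, one matched
raw_match <- str_detect(x, r"{\\}")

# Both spellings produce exactly the same pattern string
same_pattern <- identical("\\\\", r"{\\}")
```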

Exercise 4

If you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], ... all match the literal values. Run this code.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")

Regular expressions require care and attention. That last example, which uses ".[*]c", means any character (the .), followed by a * (written as a character class), followed by a "c". Only "a*c" matches this pattern.

Exercise 5

By default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end. Run str_view() with fruit as the first argument and "^a" as the second.


str_view(fruit, ...)
str_view(fruit, "^a")

Although there are many fruits that include the letter "a," only three begin with the letter "a."

Exercise 6

Run str_view() with fruit as the first argument and "a$" as the second.


str_view(fruit, ...)
str_view(fruit, "a$")

Note how we use "^a" with the ^ at the front of the pattern, indicating that "a" belongs at the start of the string, and "a$", with the $ at the end of the pattern, indicating that "a" belongs at the end of the string.

Exercise 7

To force a regular expression to match only the full string, anchor it with both ^ and $. Run str_view() twice, both times with fruit as the first argument. In one, "apple" is the pattern. In another, "^apple$" is the pattern.


str_view(..., "apple")
str_view(fruit, ...)
str_view(fruit, "apple")
str_view(fruit, "^apple$")

Are you getting overwhelmed yet? No worries. The rise of ChatGPT and other AI tools makes the creation of regular expressions to do exactly what you want much easier.

Exercise 8

You can also match the boundary between words (i.e. the start or end of a word) with \b. This can be particularly useful when using RStudio’s find and replace tool. Add str_view(x, "sum") to this exercise block.

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(..., "sum")
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")

Every element of our vector matches the pattern.

Exercise 9

Add str_view(x, "\\bsum\\b") to this exercise block.

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, ...)
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "\\bsum\\b")

This is not easy. \b means a word boundary, which includes the implicit start of the words "summary()," "summarize()," and "sum()." The first two, however, do not match the \b at the end of the pattern, because "sum" is followed by another letter. "rowsum()" fails at the front instead, because "sum" is preceded by a letter. Yet "sum()" does match because (, not being a letter, counts as a boundary. Finally, we need to escape the \ for both usages of \b, leading to \\b at both the front and the back.
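The same comparison can be made with str_detect(), which gives one TRUE or FALSE per element. This sketch also shows the raw string spelling of the pattern, which avoids doubling the backslashes (R 4.0.0 or later); the result names are only for illustration.

```r
library(stringr)

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")

# Without boundaries every element contains "sum"
plain <- str_detect(x, "sum")               # TRUE TRUE TRUE TRUE

# With a boundary on each side only the standalone word matches
bounded <- str_detect(x, "\\bsum\\b")       # FALSE FALSE FALSE TRUE

# Same pattern written as a raw string
bounded_raw <- str_detect(x, r"(\bsum\b)")  # FALSE FALSE FALSE TRUE
```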

Exercise 10

When used alone, anchors will produce a zero-width match. This means that they match a position in the string rather than any of its characters (for example, "$" matches the position at the very end of the string).

Run str_view() on the string "abc" and the patterns c("$", "^", "\\b").


str_view("...", c("$", "^", "\\b"))
str_view("abc", c("$", "^", "\\b"))

Note how we have a single string but three different patterns, each of which does match the string. The display, with its zero-width brackets --- <> --- indicates that the matches do not involve the contents of the string.

Exercise 11

The previous example helps you understand what happens when you replace a standalone anchor. Run str_replace_all() on the string "abc" and the patterns c("$", "^", "\\b"), using "--" as the replacement.


str_replace_all("abc", ..., "--")
str_replace_all("abc", c("$", "^", "\\b"), "--")

Even though the input is a single string, the three patterns generate a vector of three elements as the output.

Pattern control

It’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so-called regex flags and match various types of fixed strings, as described below.

Exercise 1

Notice the character vector bananas in the code box.

Below it, use str_view() on bananas to find the pattern "banana".

bananas <- c("banana", "Banana", "BANANA")
bananas <- c("banana", "Banana", "BANANA")
str_...(bananas, "...")
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")

Notice how only one of the entries matched, even though all three entries are the same word. This is because, by default, regular expressions are case sensitive.

Exercise 2

What if we didn't want our search to be case sensitive? Well, enter the world of regex() flags, which allow us to adjust how the pattern is matched.

Copy the previous code, and surround "banana" with regex(). Then, add the argument ignore_case = TRUE.


bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, regex("banana", ignore_case = ...))
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, regex("banana", ignore_case = TRUE))

Exercise 3

If you’re doing a lot of work with multiline strings (i.e. strings that contain \n), dotall may also be useful.

Note that anything after \n is in a new line, so x would look something like this:

Hello
World

First, run the following code in the box. Note that .* means any character, repeated any number of times, so you might expect it to match the whole string.

x <- "Hello\nWorld"

str_match(x, ".*")

Notice how, instead of matching everything, it only matches the first line! By default, . does not match \n.

Exercise 4

To let . match every character, including \n, we need to set the argument dotall.

Notice how the following code has the dotall argument inside regex(). Run it and notice how the match now spans both lines.

str_match(x, regex(".*", dotall = TRUE))

Exercise 5

Similarly, we can utilize the multiline argument to make ^ and $ match the start and end of each line rather than the start and end of the complete string:

y <- "Line 1\nLine 2\nLine 3"

str_view(y, "^Line")

str_view(y, regex("^Line", multiline = TRUE))

Notice how the first call only matches Line in Line 1, whereas the second call matches it in all 3 lines.
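The effect of the dotall and multiline flags can also be checked without looking at highlighted output. This is a minimal sketch using str_extract() and str_count() on the same two strings; the result names are only for illustration.

```r
library(stringr)

x <- "Hello\nWorld"
y <- "Line 1\nLine 2\nLine 3"

# By default "." stops at the line break
first_line <- str_extract(x, ".*")                        # "Hello"

# With dotall = TRUE "." also matches "\n"
everything <- str_extract(x, regex(".*", dotall = TRUE))  # "Hello\nWorld"

# By default "^" only matches the start of the whole string
starts_default <- str_count(y, "^Line")                   # 1

# With multiline = TRUE "^" matches the start of every line
starts_multi <- str_count(y, regex("^Line", multiline = TRUE))  # 3
```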

Exercise 6

On the other hand, if you don't want to follow the regular expression rules, you can use fixed() instead of regex().

Run the following code:

str_view(c("", "a", "."), stringr::fixed("."))

Notice how it matched only the literal ., without any backslashes or other regular expression trickery.

Note that in this case, we used stringr:: to tell R to get the function from the stringr package. We did this because we already have another package loaded which also has a fixed() function that works very differently from this one. To make sure that we use the correct function, we put the package name in front of it.
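To make the contrast concrete, the sketch below runs the same "." pattern twice, once as a regular expression and once wrapped in stringr::fixed(). The result names are only for illustration.

```r
library(stringr)

x <- c("", "a", ".")

# As a regular expression, "." matches any single character
as_regex <- str_detect(x, ".")                  # FALSE TRUE TRUE

# As a fixed string, "." only matches a literal period
as_fixed <- str_detect(x, stringr::fixed("."))  # FALSE FALSE TRUE
```

The empty string matches neither pattern, because both need at least one character.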

Exercise 7

fixed() also gives you the ability to ignore case, just like regex().

Run the following code to see this in action.

str_view("x X", "X")

str_view("x X", stringr::fixed("X", ignore_case = TRUE))

Notice how the first call highlights only the upper case X, while the second call highlights both the x and the X, showing that the match is no longer case sensitive.

Practice

What if we wanted to find all sentences that mention a color?

The basic idea is simple: we just combine alternation with word boundaries, like so:

str_view(sentences, "\\b(red|green|blue)\\b")

Exercise 1

As the number of colors grows, it would quickly get tedious to construct this pattern by hand.

Let's store the colors in a vector. We’d just need to create the pattern from the vector using str_c() and str_flatten().

In the background, we have stored the following variable:

rgb <- c("red", "green", "blue")

Run the following code, which builds the pattern from that vector:

str_c("\\b(", str_flatten(rgb, "|"), ")\\b")

In this code, str_flatten() turns our rgb vector into the single string red|green|blue.

You might be wondering why we had to do it this way. This will become apparent in the following steps.
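As a small check of what the generated pattern does, this sketch builds it from the color vector and tries it on a few made-up sentences. The vector test_sentences is hypothetical and is not part of stringr.

```r
library(stringr)

rgb <- c("red", "green", "blue")

# Join the colors with "|" and wrap them in word boundaries
pattern <- str_c("\\b(", str_flatten(rgb, "|"), ")\\b")

# writeLines() shows the regex itself, without the string escapes
writeLines(pattern)

# Hypothetical sentences to try the pattern on
test_sentences <- c(
  "The red barn stood alone.",
  "He was bored all day.",
  "A blue sky greeted us.",
  "They planted evergreen trees."
)

has_color <- str_detect(test_sentences, pattern)  # TRUE FALSE TRUE FALSE
```

The word boundaries are what keep "bored" and "evergreen" from counting as color words.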

Exercise 2

We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots.

Run str_view() on colors().


str_view(...())
str_view(colors())

Exercise 3

Notice how the list had multiple versions of the same color. Let's remove those numbered variants with the following code:

cols <- colors()
cols <- cols[!str_detect(cols, "\\d")]
str_view(cols)

Exercise 4

Now let's turn this into one giant pattern. We assigned the result to the variable pattern for you, so you don't have to type it out.

Below the code, run str_view() on sentences, setting the pattern as the variable pattern.

pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")

..._view(..., pattern)
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")

str_view(sentences, pattern)

Do you see now why we built pattern the way we did? We had a vector of colors, but needed to turn it into a single regular expression before we could search with it.

Regular expressions in other places

Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.

Exercise 1

matches(), a selection helper available once the Tidyverse is loaded, selects the variables whose names match a pattern.

In this example, let's look for all columns in penguins (from the palmerpenguins package) whose names match the pattern "bill".

Below the code, pipe penguins to select(), and within select(), insert matches("bill").

library(palmerpenguins)
library(palmerpenguins)

penguins |>
  ...(matches("..."))
library(palmerpenguins)

penguins |>
  select(matches("bill"))

Notice how it only returned the columns that have the pattern "bill" in their names. This is very useful when handling large datasets with many columns!
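Because matches() takes a regular expression, anchors work in column selection too. The sketch below uses the built-in iris data frame (an assumption made so that the example runs without palmerpenguins) to select columns by the start or end of their names.

```r
library(dplyr)

# Columns whose names start with "Petal"
petal_cols <- iris |>
  select(matches("^Petal")) |>
  names()

# Columns whose names end with "Length"
length_cols <- iris |>
  select(matches("Length$")) |>
  names()

petal_cols
length_cols
```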

Exercise 2

apropos() is also a very useful function that comes from Base R itself (no packages). It searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function.

In the Console, type apropos("replace"). This searches for all functions that have "replace" in their name. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Do you see how useful this is?!! If you forget a function and need it on the fly, all you have to recall is one word from it, and this function will find it for you!

Exercise 3

The pattern argument of list.files(), which also comes from Base R, lets you look for files that have the given pattern in their name.

Just for fun, type list.files(pattern = "regular") into the Console. This will most likely return character(0), which means that it didn't find anything. Still, CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)
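For a result that is not empty, here is a self-contained sketch that creates a few empty files in a temporary directory and then lists only those ending in .Rmd. The file names and the demo_dir variable are made up for illustration.

```r
# Create a scratch directory with three empty, hypothetical files
demo_dir <- tempfile("regex-demo-")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("report.Rmd", "notes.txt", "analysis.Rmd"))))

# "\\.Rmd$" means a literal dot, then "Rmd", at the end of the file name
rmd_files <- list.files(demo_dir, pattern = "\\.Rmd$")
rmd_files

# Clean up the scratch directory
unlink(demo_dir, recursive = TRUE)
```

Escaping the dot matters here: an unescaped "." would match any character before "Rmd".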

Summary

This tutorial covered Chapter 15: Regular expressions from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. With the help of the stringr package, we used regular expressions, a concise and powerful language for describing patterns within strings.

If you want to learn more, a good place to start is vignette("regular-expressions", package = "stringr"): it documents the full set of syntax supported by the stringr package. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.





r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.