knitr::opts_chunk$set(echo = TRUE, fig.height = 1, fig.width = 5)
library("learnr")
library("tidyverse")
theme_set(theme_classic())
tutorial_options(exercise.cap = "Exercise")
In this tutorial, you will learn how to manipulate character (text) vectors with the stringr package.
Consider this vector of diatom species names (which could be a column in a data.frame or tibble).
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata") # a vector
Each element in the character vector is known as a string.
We might want to detect which strings have a particular pattern, or replace, remove or extract part of the text.
We can do this with the stringr package, which is loaded when tidyverse is loaded.
library("tidyverse") # load stringr, ggplot2, dplyr etc.
This tutorial starts with detecting or replacing fixed patterns and then shows how you can use regular expressions to extract varying patterns.
str_detect
We might want to detect which of the species are in the genus Navicula. We can do this with str_detect (the base R equivalent is grepl).
str_detect(string = diatoms, pattern = "Navicula")
This returns a logical vector: TRUE where the string includes the pattern "Navicula", FALSE otherwise.
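Because the result is a logical vector, it can be used for subsetting or counting. The following is a minimal sketch that reuses the diatoms vector from above; the name is_navicula is only illustrative, and str_subset is stringr's shortcut for the subsetting step.

```r
library("stringr")

diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")

# one TRUE/FALSE per string
is_navicula <- str_detect(string = diatoms, pattern = "Navicula")

# use the logical vector to keep only the matching strings
diatoms[is_navicula]

# str_subset() detects and subsets in one step
str_subset(string = diatoms, pattern = "Navicula")

# TRUE counts as 1, so sum() gives the number of matching strings
sum(is_navicula)
```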
str_detect with filter
If the vector diatoms were a column in a tibble (or data.frame), we could use this test in filter to select rows.
diatom_df <- tibble(species = diatoms, count = c(27, 3, 46))
diatom_df
diatom_df %>% filter(str_detect(string = species, pattern = "Navicula"))
The code
str_detect(string = diatoms, pattern = "Navicula")
works just fine when we know exactly what we are searching for. Sometimes what you want to detect is not a fixed pattern but something more general. Perhaps we want to detect everything that starts with an "N", but ignore any other "N". For this type of problem, regular expressions are a very powerful tool.
Regular expressions (often shortened to regex) are sequences of special metacharacters and literal characters. They can be thought of as an extension of wildcards for searching.
Characters such as a, Z, 3 and _ are literal characters. We can use them as we already have.
Metacharacters are like wildcards in that they can match different literal characters.
. matches any character (except a line break).
\d matches any digit. \D matches anything that is not a digit.
\s matches any whitespace. \S matches anything that is not whitespace.
\w matches any alphanumeric character or underscore. \W matches anything else.
Because the \ is itself a special character in R strings, it needs to be escaped with another backslash. So to match a whitespace character, use \\s.
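The doubled backslash exists only in the R source code. As a small sketch of this, writeLines prints the string as the regex engine receives it; the variable name pattern is just for illustration.

```r
library("stringr")

# in R source we type two backslashes ...
pattern <- "\\s"

# ... but the string holds a single backslash followed by "s"
writeLines(pattern)
nchar(pattern)

# detect which strings contain whitespace
str_detect(string = c("Navicula elkab", "Navicula_elkab"), pattern = pattern)
```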
Sometimes it is useful to make our own set of characters. We can do this with square brackets.
[aeiou] matches vowels.
[^aeiou] matches anything but vowels.
[a-z] matches lower case letters.
[a-zA-Z] matches upper or lower case letters.
The vertical line | matches the group of characters either before or after it. So "Navicula|Nitzschia" will match either genus.
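A short sketch of alternation and character sets applied to the original diatoms vector; the particular sets chosen here are only for illustration.

```r
library("stringr")

diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")

# alternation: either genus
str_detect(string = diatoms, pattern = "Navicula|Nitzschia")

# character set: an underscore followed by "e" or "g"
str_detect(string = diatoms, pattern = "_[eg]")

# negated set: an underscore followed by anything except "p"
str_detect(string = diatoms, pattern = "_[^p]")
```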
Detect which diatom name includes a number.
diatoms <- c("Navicula sp2", "Nitzschia palea", "Aulacoseira granulata")
str_detect()
diatoms <- c("Navicula sp2", "Nitzschia palea", "Aulacoseira granulata")
str_detect(string = diatoms, pattern = "\\d")
If we want to match a repeating series of characters, we can control how many times something gets matched by following it with a quantifier.
? : Zero or one time
+ : One or more times
* : Any number of times (including zero)
{2} : Exactly twice
{2,4} : Between two and four times
{0,4} : At most four times
{2,} : At least twice
So to match either "palaeoecology" or "paleoecology", we can follow the second "a" with a "?".
# match both palaeoecology and paleoecology
str_detect(c("palaeoecology", "paleoecology"), pattern = "pala?eo")
To match a four digit sequence we can use "\\d{4}".
# detect year from code
str_detect(c("x2020", "20.20"), pattern = "\\d{4}")
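To compare the quantifiers side by side, here is a minimal sketch on a made-up vector of codes (the vector is hypothetical, not part of the tutorial data).

```r
library("stringr")

# hypothetical codes: an "x" followed by zero or more digits
codes <- c("x1", "x20", "x2020", "x")

# "+" needs at least one digit after the "x"
str_detect(codes, pattern = "x\\d+")

# "*" also accepts no digits at all
str_detect(codes, pattern = "x\\d*")

# {2,4} needs between two and four digits
str_detect(codes, pattern = "x\\d{2,4}")
```

Note that these patterns are not anchored, so they only require the match to occur somewhere in the string.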
Detect which diatoms have a word with at least 10 characters.
diatoms <- c("Navicula elkab", "Nitzschia palea", "Aulacoseira granulata")
str_detect()
diatoms <- c("Navicula elkab", "Nitzschia palea", "Aulacoseira granulata")
str_detect(string = diatoms, pattern = "\\w{10,}")
You can use anchors so that matches are only made at the start or end of a string.
^A matches "A" but only at the start of a string.
A$ matches "A" but only at the end of a string.
Detect which diatoms end in an "a".
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")
str_detect()
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")
str_detect(string = diatoms, pattern = "a$")
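The start anchor works the same way, and handles the earlier problem of detecting names that start with an "N" while ignoring any other "N". A small sketch; the labels vector with a "cf_" prefix is a hypothetical example.

```r
library("stringr")

# hypothetical labels, one with an "N" that is not at the start
labels <- c("Navicula_elkab", "cf_Nitzschia_palea", "Aulacoseira_granulata")

# "N" anywhere in the string
str_detect(string = labels, pattern = "N")

# "N" only at the start of the string
str_detect(string = labels, pattern = "^N")

# both anchors together: starts with "N" and ends with "a"
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")
str_detect(string = diatoms, pattern = "^N.*a$")
```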
If you want to detect a literal "." then there is a problem as "." is a metacharacter.
We need to escape the metacharacters {}[]()^$.|*+? and \ with two backslashes to treat them as literals.
has_dot <- c("Navicula.elkab", "Navicula radiosa")
str_detect(string = has_dot, pattern = "\\.")
If we don't want to use any metacharacters as metacharacters, it is easier to use a helper function.
has_dot <- c("Navicula.elkab", "Navicula radiosa")
str_detect(string = has_dot, pattern = coll("."))
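stringr also provides fixed(), which compares the pattern literally, byte by byte, whereas coll() uses locale-aware collation rules. As a sketch, for a plain ASCII pattern like "." both helpers should agree with the escaped regular expression:

```r
library("stringr")

has_dot <- c("Navicula.elkab", "Navicula radiosa")

# literal match with fixed(): no escaping needed
str_detect(string = has_dot, pattern = fixed("."))

# same result as the escaped regular expression
str_detect(string = has_dot, pattern = "\\.")
```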
What happens if you forget to escape a metacharacter?
has_dot <- c("Navicula.elkab", "Navicula radiosa")
str_detect(string = has_dot, pattern = ".")
Why?
regexplain
Writing regular expressions is tricky.
Fortunately there are RStudio addins from the regexplain package that can help you write them.
You will find these in the Addins menu.
Note that in the regexplain addins, the backslashes in the regular expressions are not doubled.
str_replace
We can replace characters in some text using str_replace. So to replace the underscore in diatoms with a space, we could use
str_replace(diatoms, pattern = "_", replacement = " ")
This will replace the first underscore in each element.
If there were several underscores and we want to replace them all, we can use str_replace_all.
If we want to remove some characters, we can either use str_replace and set replacement = "", or use str_remove.
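The difference between these functions shows up when a string has more than one underscore. A minimal sketch on a hypothetical vector x:

```r
library("stringr")

# hypothetical names, the first with two underscores
x <- c("Navicula_cf_elkab", "Nitzschia_palea")

# only the first underscore in each string is replaced
str_replace(x, pattern = "_", replacement = " ")
# "Navicula cf_elkab" "Nitzschia palea"

# every underscore is replaced
str_replace_all(x, pattern = "_", replacement = " ")
# "Navicula cf elkab" "Nitzschia palea"

# str_remove() drops the first match, str_remove_all() drops all of them
str_remove(x, pattern = "_")
str_remove_all(x, pattern = "_")
```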
str_replace with mutate
We can use str_replace on a column in a tibble with mutate.
diatom_df %>% mutate(species = str_replace(string = species, pattern = "_", replacement = " "))
Sometimes you want to extract some characters from a string.
For example, we might want to extract the year embedded in a file name
filename <- "c:/pond_2020.xls"
There are two ways to do this.
The first is to use capture groups in str_replace.
Capture groups are parts of the pattern surrounded by parentheses.
str_replace(filename, pattern = ".*_(\\d{4})\\.xls$", replacement = "\\1")
The pattern ".*_(\\d{4})\\.xls$" will match strings that have any number of any characters followed by an underscore, followed by four digits, followed by ".xls", which has to be the end of the string because of the $.
The \\1 in the replacement refers to the first (and only) capture group, so the whole match is replaced by the year.
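With more than one capture group, the groups are numbered from left to right and can be reused in any order. As an illustrative sketch (the output format here is arbitrary), this swaps genus and species in the diatoms vector:

```r
library("stringr")

diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")

# group 1 = genus (letters before the underscore)
# group 2 = species (lower case letters after the underscore)
str_replace(
  string = diatoms,
  pattern = "^([A-Za-z]+)_([a-z]+)$",
  replacement = "\\2 (\\1)"
)
# "elkab (Navicula)" "palea (Nitzschia)" "granulata (Aulacoseira)"
```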
str_extract
The second solution is to use str_extract.
str_extract(filename, pattern = "\\d{4}")
It is usually possible to use either of these methods.
Here, str_extract is easier.
If there were multiple numeric sequences and we wanted a specific one, then capture groups might be easier.
One important difference is that if the match fails, str_extract will return NA, whereas str_replace will return the original string.
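A small sketch of this difference, using one file name that matches and one hypothetical file name that does not:

```r
library("stringr")

# the second file name is made up and contains no year
files <- c("c:/pond_2020.xls", "c:/pond_notes.txt")

# no match gives NA
str_extract(files, pattern = "\\d{4}")
# "2020" NA

# no match leaves the string unchanged
str_replace(files, pattern = ".*_(\\d{4})\\.xls$", replacement = "\\1")
# "2020" "c:/pond_notes.txt"
```

The NA from str_extract is easier to spot than an unchanged string, which may be worth considering when cleaning many file names at once.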
Extract the habitat type from these filenames.
filenames <- c("c:/pond_2020.xls", "d:/data/marsh_2018.xls")
str_replace(filenames, pattern = ".*/(\\w*)_\\d{4}\\.xls$", replacement = "\\1")
# or extract the word followed by an "_", then remove the "_"
str_extract(filenames, pattern = "\\w*_") %>% str_remove(pattern = "_")
stringr function
You can change the case of a character vector with str_to_upper and str_to_lower.
Changing the case can often solve many formatting errors.
mess <- c("ponds", "Ponds", "PONDS")
str_to_lower(mess)
Make the diatoms vector upper case.
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")
diatoms <- c("Navicula_elkab", "Nitzschia_palea", "Aulacoseira_granulata")
str_to_upper(diatoms)
One very common problem with raw data is extra spaces.
"Navicula" is not equal to "Navicula " or " Navicula", but it may be impossible to see the difference.
str_trim will remove whitespace from the start or end of a string.
str_trim(c("Navicula ", " Navicula"))
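str_trim has a side argument to trim only one end, and the related str_squish also collapses repeated spaces inside the string. A minimal sketch on a hypothetical vector x:

```r
library("stringr")

# hypothetical messy names: trailing, leading and doubled spaces
x <- c("Navicula ", " Navicula", "Navicula  elkab")

# trim both ends (the default); internal spaces are untouched
str_trim(x)

# trim only the left-hand side
str_trim(x, side = "left")

# trim both ends and reduce internal runs of whitespace to one space
str_squish(x)
```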
str_wrap puts a line return (\n) into long lines of text.
This is useful when making a figure if captions or labels are too long.
str_wrap("The quick brown fox jumps over the lazy dog.", width = 30)
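Printing the result at the console shows the \n as an escape sequence; cat displays the actual line breaks. A short sketch, with txt and wrapped as illustrative names:

```r
library("stringr")

txt <- "The quick brown fox jumps over the lazy dog."

# insert line returns so that no line is longer than about 30 characters
wrapped <- str_wrap(txt, width = 30)

# cat() shows the text as it would appear in a caption or label
cat(wrapped)

# the words are unchanged: turning the line returns back into spaces
# recovers the original text
str_replace_all(wrapped, pattern = "\n", replacement = " ")
```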