knitr::opts_chunk$set(echo = TRUE, fig.height = 1, fig.width = 5)
library("learnr")
library("tidyverse")
theme_set(theme_classic())

tutorial_options(exercise.cap = "Exercise")

Text manipulation

In this tutorial, you will learn how to manipulate character (text) vectors with the stringr package.

Consider this vector of diatom species names (which could be a column in a data.frame or tibble):

diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata") # a vector

Each element in the character vector is known as a string. We might want to detect which strings have a particular pattern, or replace, remove or extract part of the text. We can do this with the stringr package which is loaded when tidyverse is loaded.

library("tidyverse") # load stringr, ggplot2, dplyr etc

This tutorial starts with detecting or replacing fixed patterns and then shows how you can use regular expressions to extract varying patterns.

Detecting a pattern

Using str_detect

We might want to detect which of the species are in the genus Navicula. We can do this with str_detect (the base R equivalent is grepl).

str_detect(string = diatoms, pattern = "Navicula")

This returns a logical vector: TRUE where the element of the character vector includes the pattern "Navicula", FALSE otherwise.
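Because the result is an ordinary logical vector, it can be used to subset the original vector or to count matches. A minimal sketch, reusing the diatoms vector from above (the name is_navicula is only illustrative):

```r
library(stringr)

diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata")

# one TRUE/FALSE per element of diatoms
is_navicula <- str_detect(string = diatoms, pattern = "Navicula")
is_navicula            # TRUE FALSE FALSE

# keep only the matching elements
diatoms[is_navicula]   # "Navicula_elkab"

# count the matches (TRUE counts as 1)
sum(is_navicula)       # 1
```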

Using str_detect with filter

If the vector diatoms was a column in a tibble (or data.frame), we can use this test in a filter to select rows.

diatom_df <- tibble(species = diatoms, 
                    count = c(27, 3, 46))
diatom_df

diatom_df %>% 
  filter(str_detect(string = species, pattern = "Navicula"))

Regular expressions

The problem

The code

str_detect(string = diatoms, pattern = "Navicula")

works just fine when we know exactly what we are searching for. Sometimes what you want to detect is not a fixed pattern but something more general. Perhaps we want to detect everything that starts with an "N", but ignore any other "N". For this type of problem, regular expressions are a very powerful tool.

Regular expressions (often shortened to regex) are sequences of special metacharacters and literal characters. They can be thought of as an extension of wildcards for searching.

Literal characters

Characters such as a, Z, 3, _ are literal characters. We can use them as we already have.

Metacharacters

Metacharacters are like wildcards in that they can match different literal characters.

Because the backslash is itself a special character in R strings, it needs to be escaped with another backslash. So to match a white space character, use \\s.
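As a small illustration of a few common metacharacters, here is a sketch using a made-up vector x (not part of the tutorial data). It uses "." for any single character, \\d for a digit and \\s for white space.

```r
library(stringr)

# hypothetical example strings
x <- c("Navicula sp2", "Nitzschia_palea", "N")

# the R string "\\s" holds the two characters \s, which is what the regex engine sees
writeLines("\\s")                # prints \s

str_detect(x, pattern = "\\s")   # white space:                   TRUE FALSE FALSE
str_detect(x, pattern = "\\d")   # any digit:                     TRUE FALSE FALSE
str_detect(x, pattern = "N.")    # "N" followed by any character: TRUE TRUE FALSE
```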

Sometimes it is useful to make our own set of characters. We can do this with square brackets.
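A short sketch of character sets, again with a made-up vector x: a set matches any one of the characters listed, a hyphen gives a range, and a leading ^ inside the brackets negates the set.

```r
library(stringr)

# hypothetical example strings
x <- c("Navicula sp2", "Nitzschia palea", "xyz")

str_detect(x, pattern = "[aeiou]")  # any lower-case vowel:          TRUE TRUE FALSE
str_detect(x, pattern = "[0-9]")    # any digit in the range 0 to 9: TRUE FALSE FALSE
str_detect(x, pattern = "[^a-z]")   # anything except a-z:           TRUE TRUE FALSE
```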

The vertical line | matches the group of characters either before or after it. So "Navicula|Nitzschia" will match either genus.
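For example, with the diatoms vector from the start of the tutorial. The second call is an extra sketch showing that parentheses can limit the alternation to part of the pattern; the words in it are only illustrative.

```r
library(stringr)

diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata")

# match either genus
str_detect(diatoms, pattern = "Navicula|Nitzschia")   # TRUE TRUE FALSE

# parentheses restrict the alternatives to "ae" or "e"
str_detect(c("palaeoecology", "paleoecology", "palynology"),
           pattern = "pal(ae|e)o")                    # TRUE TRUE FALSE
```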

Your turn

Detect which diatom names include a number.

diatoms <- c("Navicula sp2",
             "Nitzschia palea",
             "Aulacoseira granulata") 
str_detect()
diatoms <- c("Navicula sp2",
             "Nitzschia palea",
             "Aulacoseira granulata") 

str_detect(string = diatoms, pattern = "\\d")

Repeats

If we want to match a repeating series of characters, we can control how many times something gets matched by following it with a quantifier.
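The standard quantifiers are ? (zero or one), * (zero or more), + (one or more), {n} (exactly n), {n,} (n or more) and {n,m} (between n and m). A minimal sketch of their effect on the letter "a", using a made-up vector x:

```r
library(stringr)

# hypothetical strings with 0, 1, 2 and 3 "a"s between "c" and "t"
x <- c("ct", "cat", "caat", "caaat")

str_detect(x, pattern = "ca?t")      # zero or one "a":  TRUE TRUE FALSE FALSE
str_detect(x, pattern = "ca*t")      # zero or more "a": TRUE TRUE TRUE TRUE
str_detect(x, pattern = "ca+t")      # one or more "a":  FALSE TRUE TRUE TRUE
str_detect(x, pattern = "ca{2}t")    # exactly two "a":  FALSE FALSE TRUE FALSE
str_detect(x, pattern = "ca{2,}t")   # two or more "a":  FALSE FALSE TRUE TRUE
str_detect(x, pattern = "ca{1,2}t")  # one or two "a":   FALSE TRUE TRUE FALSE
```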

So to match either "palaeoecology" or "paleoecology", we can follow the "a" with a "?".

# match both palaeoecology and paleoecology
str_detect(c("palaeoecology", "paleoecology"), pattern = "pala?eo")

To match a four digit sequence we can use "\\d{4}".

# detect year from code
str_detect(c("x2020", "20.20"), pattern = "\\d{4}")

Your turn

Detect which diatoms have a word with at least 10 characters

diatoms <- c("Navicula elkab",
             "Nitzschia palea",
             "Aulacoseira granulata") 
str_detect()
diatoms <- c("Navicula elkab",
             "Nitzschia palea",
             "Aulacoseira granulata") 

str_detect(string = diatoms, pattern = "\\w{10,}")

Anchors

You can use anchors so that matches are only made at the start or end of a string.
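The two anchors are ^ for the start of the string and $ for the end. A minimal sketch with two made-up labels shows how they narrow an otherwise identical pattern:

```r
library(stringr)

# hypothetical labels with an "N" at the start or at the end
x <- c("N_palea", "palea_N")

str_detect(x, pattern = "N")    # "N" anywhere:         TRUE TRUE
str_detect(x, pattern = "^N")   # "N" at the start:     TRUE FALSE
str_detect(x, pattern = "N$")   # "N" at the very end:  FALSE TRUE
```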

Your turn

Detect which diatoms end in an "a".

diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata") 
str_detect()
diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata") 

str_detect(string = diatoms, pattern = "a$")

Escaping metacharacters

If you want to detect a literal "." then there is a problem as "." is a metacharacter. We need to escape metacharacters {}[]()^$.|*+? and \ with two backslashes to treat them as literals.

has_dot <- c("Navicula.elkab", "Navicula radiosa")

str_detect(string = has_dot, pattern = "\\.")

If we don't want to use any metacharacters as metacharacters, it is easier to wrap the pattern in a helper function such as coll() or fixed(), which treat the pattern as literal text.

has_dot <- c("Navicula.elkab", "Navicula radiosa")

str_detect(string = has_dot, pattern = coll("."))

Your turn

What happens if you forget to escape a metacharacter?

has_dot <- c("Navicula.elkab", "Navicula radiosa")
str_detect(string = has_dot, pattern = ".")

Why?

Help from regexplain

Writing regular expressions is tricky. Fortunately, there are RStudio addins from the regexplain package that can help you write them. You will find these in the Addins menu.

Note that in the regexplain addins, the backslashes in the regular expressions are not doubled.

Replacing text

Using str_replace

We can replace characters in some text using str_replace. So to replace the underscore in diatoms with a space, we could use

str_replace(diatoms, pattern = "_", replacement = " ")

This will replace the first underscore in each element. If there were several underscores and we want to replace them all, we can use str_replace_all. If we want to remove some character, we can either use str_replace and set replacement = "", or use str_remove.
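The differences between these functions can be sketched with a made-up name that contains two underscores:

```r
library(stringr)

# hypothetical name with two underscores
x <- "Navicula_sp_2"

str_replace(x, pattern = "_", replacement = " ")      # first only: "Navicula sp_2"
str_replace_all(x, pattern = "_", replacement = " ")  # all:        "Navicula sp 2"
str_remove(x, pattern = "_")                          # first only: "Naviculasp_2"
str_remove_all(x, pattern = "_")                      # all:        "Naviculasp2"
```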

Using str_replace with mutate

We can use str_replace on a column in a tibble within a mutate:

diatom_df %>% 
  mutate(species = str_replace(string = species,
                               pattern = "_", 
                               replacement  = " "))

Extracting characters

Sometimes you want to extract some characters from a string.

For example, we might want to extract the year embedded in a file name:

filename <- "c:/pond_2020.xls"

There are two ways to do this.

The first is to use capture groups in str_replace. Capture groups are groups of characters in the pattern surrounded by round brackets (parentheses).

str_replace(filename, pattern = ".*_(\\d{4})\\.xls$", replacement = "\\1")

The pattern ".*_(\\d{4})\\.xls$" will match strings that have any number of any characters, followed by an underscore, followed by four digits, followed by ".xls", which has to be at the end of the string because of the $. The \\1 in the replacement will return the first (and only) capture group.

Using str_extract

The second solution is to use str_extract.

str_extract(filename, pattern = "\\d{4}")

It is usually possible to use either of these methods. Here, str_extract is easier. If there were multiple numeric sequences and we wanted a specific one, then capture groups might be easier. One important difference is that if the match fails, str_extract will return NA, whereas str_replace will return the original string.
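Both points can be sketched with two made-up file names (multi and no_year are hypothetical, not part of the tutorial data). The first has two numeric sequences; the second has no year at all.

```r
library(stringr)

# hypothetical file names
multi <- "c:/site12_pond_2020.xls"
no_year <- "notes.txt"

# str_extract returns the first run of digits, which is the site number here
str_extract(multi, pattern = "\\d+")                                    # "12"

# a capture group can pick out the digits just before ".xls"
str_replace(multi, pattern = ".*_(\\d+)\\.xls$", replacement = "\\1")   # "2020"

# when nothing matches, the two functions behave differently
str_extract(no_year, pattern = "\\d{4}")                                # NA
str_replace(no_year, pattern = ".*_(\\d{4})\\.xls$",
            replacement = "\\1")                                        # "notes.txt"
```

The silent return of the original string is worth checking for when str_replace is used for extraction, since a failed match does not produce an NA to flag the problem.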

Your turn

Extract the habitat type from these filenames.

filenames <- c("c:/pond_2020.xls", "d:/data/marsh_2018.xls")
str_replace(filenames, pattern = ".*/(\\w*)_\\d{4}\\.xls$", replacement = "\\1")
# or extract word followed by an "_", then remove "_"
str_extract(filenames, pattern = "\\w*_") %>% 
  str_remove(pattern = "_")

Some other useful stringr functions

Changing case

You can change the case of a character vector with str_to_upper and str_to_lower. Converting everything to one case often resolves inconsistencies in how values were entered.

mess <- c("ponds", "Ponds", "PONDS")
str_to_lower(mess)
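One way to see the effect is to count the distinct values before and after changing case. A small sketch using the mess vector from above:

```r
library(stringr)

mess <- c("ponds", "Ponds", "PONDS")

# before: three values that R treats as different
unique(mess)                 # "ponds" "Ponds" "PONDS"

# after: a single consistent value
unique(str_to_lower(mess))   # "ponds"
```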

Your turn

Make the diatoms vector upper case.

diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata") 
diatoms <- c("Navicula_elkab",
             "Nitzschia_palea",
             "Aulacoseira_granulata") 
str_to_upper(diatoms)

Trimming white space

One very common problem with raw data is extra spaces.

"Navicula" is not equal to "Navicula " or " Navicula", but it may be impossible to see the difference. By default, str_trim will remove whitespace from both the start and the end of a string.

str_trim(c("Navicula ", " Navicula"))
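The sketch below checks the inequality directly and shows the side argument of str_trim, which restricts trimming to one end (the vector name x is only illustrative).

```r
library(stringr)

x <- c("Navicula ", " Navicula")

# the stray spaces make the comparison fail
x == "Navicula"               # FALSE FALSE

# after trimming, both elements match
str_trim(x) == "Navicula"     # TRUE TRUE

# trim only the start of each string; the trailing space is kept
str_trim(x, side = "left")    # "Navicula " "Navicula"
```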

Wrapping long strings

str_wrap puts a line break (\n) into long lines of text. This is useful when making a figure if captions or labels are too long.

str_wrap("The quick brown fox jumps over the lazy dog.", width = 30)
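The result is still a single string. The line breaks only become visible when the string is printed with writeLines (or cat) or drawn as a label. A minimal sketch, with wrapped as an illustrative name:

```r
library(stringr)

wrapped <- str_wrap("The quick brown fox jumps over the lazy dog.", width = 30)

# still one string, now containing "\n"
length(wrapped)               # 1
str_detect(wrapped, "\n")     # TRUE

# show the text with the line breaks applied
writeLines(wrapped)
```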


Bio302-UiB/data-handling documentation built on Dec. 6, 2020, 12:15 p.m.