library(stringr)
knitr::opts_chunk$set(
  comment = "#>", 
  collapse = TRUE
)

There are four main families of functions in stringr:

  1. Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.

  2. Whitespace tools to add, remove, and manipulate whitespace.

  3. Locale-sensitive operations whose behaviour varies from locale to locale.

  4. Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.

Getting and setting individual characters

You can get the length of the string with str_length():

str_length("abc")

This is now equivalent to the base R function nchar(). Previously it was needed to work around issues with nchar() such as the fact that it returned 2 for nchar(NA). This has been fixed as of R 3.3.0, so it is no longer so important.
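As a small illustration of the point above, the sketch below applies str_length() to a vector that contains an empty string and a missing value. The vector `words` is made up for this example; the nchar() call is included only for comparison and assumes R 3.3.0 or later.

```r
library(stringr)

# Hypothetical input: an ordinary string, an empty string, and a missing value
words <- c("abc", "", NA)

# str_length() is vectorised and propagates missing values
lens <- str_length(words)
lens

# For comparison: base R on the same character vector
nchar(words)
```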

You can access individual characters using str_sub(). It takes three arguments: a character vector, a starting position and an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.

x <- c("abcdef", "ghifjk")

# The 3rd letter
str_sub(x, 3, 3)

# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
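Negative positions and silent truncation are easy to check directly. This is a minimal sketch that reuses the same example vector; the variable names `last_two` and `clipped` are only for illustration.

```r
library(stringr)

x <- c("abcdef", "ghifjk")

# Negative positions count back from the right-hand end of each string
last_two <- str_sub(x, -2, -1)
last_two   # "ef" "jk"

# An end position beyond the end of the string is silently truncated
clipped <- str_sub(x, 4, 100)
clipped    # "def" "fjk"
```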

You can also use str_sub() to modify strings:

str_sub(x, 3, 3) <- "X"
x

To duplicate individual strings, you can use str_dup():

str_dup(x, c(2, 3))

Whitespace

Three functions add, remove, or modify whitespace:

  1. str_pad() pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.

    x <- c("abc", "defghi")
    str_pad(x, 10)
    str_pad(x, 10, "both")

    (You can pad with other characters by using the pad argument.)

    str_pad() will never make a string shorter:

    str_pad(x, 4)

    So if you want to ensure that all strings are the same length (often useful for print methods), combine str_pad() and str_trunc():

    x <- c("Short", "This is a long string")

    x %>%
      str_trunc(10) %>%
      str_pad(10, "right")

  2. The opposite of str_pad() is str_trim(), which removes leading and trailing whitespace:

    x <- c(" a ", "b ", " c")
    str_trim(x)
    str_trim(x, "left")

  3. You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text so that the length of each line is as similar as possible.

    jabberwocky <- str_c(
      "`Twas brillig, and the slithy toves ",
      "did gyre and gimble in the wabe: ",
      "All mimsy were the borogoves, ",
      "and the mome raths outgrabe. "
    )
    cat(str_wrap(jabberwocky, width = 40))
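The pad argument mentioned in the first item, and the truncate-then-pad combination, can be sketched as follows. The vectors `ids` and `labels` are invented for the example, and zero-padding identifiers is just one plausible use.

```r
library(stringr)

# Hypothetical identifiers of differing widths
ids <- c("7", "42", "123")

# Left-pad with zeros (instead of spaces) to a width of 5
padded <- str_pad(ids, width = 5, side = "left", pad = "0")
padded   # "00007" "00042" "00123"

# Truncate first, then pad, so every string ends up exactly 10 characters wide
labels <- c("Short", "This is a long string")
fixed_width <- str_pad(str_trunc(labels, 10), 10, side = "right")
str_length(fixed_width)   # 10 10
```

Truncating before padding matters here: str_pad() on its own would leave the longer string at its original length.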

Locale sensitive

A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These are the case transformation functions:

x <- "I like horses."
str_to_upper(x)
str_to_title(x)

str_to_lower(x)
# Turkish has two sorts of i: with and without the dot
str_to_lower(x, "tr")

And string ordering and sorting:

x <- c("y", "i", "k")
str_order(x)

str_sort(x)
# In Lithuanian, y comes between i and k
str_sort(x, locale = "lt")

The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two-letter ISO-639-1 language code (like "en" for English or "zh" for Chinese), and optionally an ISO-3166 country code (like "en_GB" vs "en_US"). You can see a complete list of available locales by running stringi::stri_locale_list().
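To make the effect of the locale argument concrete, here is a short sketch that passes the locale explicitly rather than relying on the default. The variable names are illustrative only, and the results assume the ICU locale data that ships with stringi.

```r
library(stringr)

# Upper-casing a lower-case "i" under English and Turkish rules
upper_en <- str_to_upper("i", locale = "en")
upper_tr <- str_to_upper("i", locale = "tr")   # dotted capital I (U+0130)
c(upper_en, upper_tr)

# The same three letters sorted under English and Lithuanian rules
sorted_en <- str_sort(c("y", "i", "k"), locale = "en")
sorted_lt <- str_sort(c("y", "i", "k"), locale = "lt")
sorted_en
sorted_lt
```

Spelling out the locale in scripts that depend on a particular language avoids surprises if the data are later processed with different assumptions.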

Pattern matching

The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.

Tasks

Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern to match. stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings. I'll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:

strings <- c(
  "apple", 
  "219 733 8965", 
  "329-293-8753", 
  "Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
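The tasks listed above map onto functions such as str_detect(), str_count(), str_extract(), str_replace() and str_split(). The sketch below runs each of them on the same strings and pattern; the result names and the "XXX-XXX-XXXX" mask are chosen just for this illustration.

```r
library(stringr)

strings <- c(
  "apple",
  "219 733 8965",
  "329-293-8753",
  "Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Detect: does each string contain a phone number?
has_phone <- str_detect(strings, phone)
has_phone     # FALSE TRUE TRUE TRUE

# Count: how many phone numbers are in each string?
n_phone <- str_count(strings, phone)
n_phone       # 0 1 1 2

# Extract: the first match in each string (NA where there is none)
first_phone <- str_extract(strings, phone)
first_phone

# Replace: mask the first match in each string
masked <- str_replace(strings, phone, "XXX-XXX-XXXX")
masked

# Split: break each string at its matches (returns a list)
pieces <- str_split(strings, phone)
pieces
```

Most of these functions also have an `_all` variant (for example str_extract_all()) that works on every match rather than only the first.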

Engines

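There are four main engines that stringr can use to describe patterns: regular expressions (the default), fixed bytewise matching with fixed(), locale-sensitive character matching with coll(), and text boundary analysis with boundary().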

Fixed matches

fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited "pattern", but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent:

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

They render identically, but because they're defined differently, fixed() doesn't find a match. Instead, you can use coll(), defined next, to respect human character comparison rules:

str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
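Where fixed() is most clearly useful is when the pattern contains characters that would otherwise be read as regular expression syntax. A minimal sketch, using a made-up vector `files`:

```r
library(stringr)

# Hypothetical strings: one contains a literal full stop, one does not
files <- c("a.b", "axb")

# Interpreted as a regular expression, "." matches any character
as_regex <- str_detect(files, "a.b")
as_regex   # TRUE TRUE

# Wrapped in fixed(), "." only matches a literal full stop
as_fixed <- str_detect(files, fixed("a.b"))
as_fixed   # TRUE FALSE
```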

Collation search

coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case-insensitive matching. Collation rules differ around the world, so you'll also need to supply a locale parameter.

i <- c("I", "İ", "i", "ı")
i

str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))

The downside of coll() is speed; because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared to regex() and fixed(). Note that while both fixed() and regex() have ignore_case arguments, they perform a much simpler comparison than coll().
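For plain ASCII text the three engines agree on case-insensitive matching, so the cheaper ones are usually enough. The sketch below shows the ignore_case argument on each; the vector `fruit_names` is invented for the example.

```r
library(stringr)

# Hypothetical ASCII-only input
fruit_names <- c("Apple", "APPLE", "apple", "banana")

# Case-insensitive matching with the regex and fixed engines
via_regex <- str_detect(fruit_names, regex("apple", ignore_case = TRUE))
via_fixed <- str_detect(fruit_names, fixed("apple", ignore_case = TRUE))
via_regex   # TRUE TRUE TRUE FALSE
via_fixed   # TRUE TRUE TRUE FALSE

# The collation engine with the same option, for comparison
via_coll <- str_detect(fruit_names, coll("apple", ignore_case = TRUE))
via_coll
```

The differences only show up with characters whose case mapping depends on the language, such as the Turkish i in the example above.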

Boundary

boundary() matches boundaries between characters, lines, sentences or words. It's most useful with str_split(), but can be used with all pattern matching functions.

x <- "This is a sentence."
str_split(x, boundary("word"))
str_count(x, boundary("word"))
str_extract_all(x, boundary("word"))

By convention, "" is treated as boundary("character"):

str_split(x, "")
str_count(x, "")
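The other boundary types work the same way. As a sketch, here is boundary("sentence") alongside boundary("word") on a slightly longer, made-up input:

```r
library(stringr)

# Hypothetical two-sentence input
text <- "This is a sentence. This is another one."

# Count words and sentences
n_words <- str_count(text, boundary("word"))
n_sentences <- str_count(text, boundary("sentence"))
n_words       # 8
n_sentences   # 2

# Split the text into its sentences (returns a list with one element per input)
sentences <- str_split(text, boundary("sentence"))
sentences
```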


UBC-MDS/Karl documentation built on May 22, 2019, 1:53 p.m.