Strings

This section covers basic string manipulation in R. See Chapter 11 in R for Data Science for more complete coverage. The first part of this chapter will probably look familiar, but the rest will focus on regular expressions. Regular expressions (sometimes referred to as regex or regexp) are a concise language for describing patterns in strings.

While base R has some string functions, they tend to be inconsistent in the order of their parameters, which makes them hard to remember, so you end up needing the help files. As with most of this course we will focus on the Tidyverse, and in particular the stringr package. Recent tidyverse releases (1.2.0 and later) attach stringr as part of the core tidyverse; with older versions you have to load it explicitly with library(stringr).

library(tidyverse)
library(stringr)

Basics

As with most languages, you create a string by putting the text in single or double quotes, for example x <- "This is my string!", and you can create a vector of strings with c(). There are a handful of special characters, but the most common are the newline character "\n" and the tab character "\t". You may also see strings like \u00AE or \u00b5. These are Unicode escapes: a way of writing non-English characters or special symbols that works on all platforms.

x <- "This is my registered\u00AE drug"
x
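As a small sketch of how the special characters behave (the string below is made up for illustration), compare the printed representation of a string with its raw contents:

```r
library(stringr)

# A made-up string containing a tab, a newline, and two Unicode escapes
x <- "Dose:\t5\u00b5g\nDrug\u00AE"

print(x)       # printed representation: tab and newline are shown as escapes
writeLines(x)  # raw contents: the tab and newline are actually rendered

# Each Unicode escape is a single character, however many keystrokes it took
str_length("\u00AE")
```

`writeLines()` is usually the quickest way to check what a string really contains.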

One nice feature of the stringr package is that all of its string functions begin with str_. This makes it easy to find useful string processing functions with autocomplete in RStudio.

```{block2, type='rmdtip'}
See the tidyverse package `glue`, which can glue strings to data in R. It provides small, fast, dependency-free interpreted string literals.
```

### Helpful Basic String Functions

-    `str_c()`        joins multiple strings together into a single string (similar to `paste` or `paste0` from base R)
-    `str_to_upper()` converts the string to all uppercase characters
-    `str_to_lower()` converts the string to all lowercase characters
-    `str_to_title()` capitalizes the first character of each word in the string
-    `str_length()`   the length of a string
-    `str_wrap()`     wrap strings into nicely formatted paragraphs
-    `str_trim()`     trim white space from a string
-    `str_pad()`      pad a string
-    `str_order()`    order a character vector
-    `str_sort()`     sort a character vector
-    `str_sub()`      extract and replace substrings from a character vector
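A few of the functions listed above in action. This is only a sketch: the vector `x` is invented for illustration.

```r
library(stringr)

x <- c("  apple ", "kiwi", "dragon fruit")

trimmed <- str_trim(x)       # "apple" "kiwi" "dragon fruit"
str_length(trimmed)          # 5 4 12
str_to_title(trimmed)        # "Apple" "Kiwi" "Dragon Fruit"

# Pad on the left to a minimum width of 8; longer strings are left unchanged
str_pad(trimmed, width = 8, side = "left", pad = ".")

str_sub(trimmed, 1, 3)       # first three characters: "app" "kiw" "dra"
str_sort(trimmed)            # the sorted values
str_order(trimmed)           # the integer positions that would sort the vector
```

Note that `str_sort()` returns the sorted strings, while `str_order()` returns the index vector you would use to sort something else in the same order.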

```{block type='rmdcaution'}
While `str_c()` looks and acts very similar to the base R functions `paste()` and `paste0()`, they behave very differently with regard to missing data. `paste()` and `paste0()` replace missing values with the text "NA", while `str_c()` propagates the missing value.
```

str_c("This", "is", "a", "string", NA, "with", "a", "missing", "value")
paste("This", "is", "a", "string", NA, "with", "a", "missing", "value")
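If you want `str_c()` to treat missing values the way `paste()` does, one option is to convert them first with `str_replace_na()`. A minimal sketch, with a made-up vector `parts`:

```r
library(stringr)

parts <- c("dose", NA, "mg")

str_c(parts, collapse = " ")                  # NA: the missing value propagates
str_c(str_replace_na(parts), collapse = " ")  # "dose NA mg"
paste(parts, collapse = " ")                  # "dose NA mg"
```

Which behavior you want depends on the data: propagating `NA` makes missing inputs visible, while the literal text "NA" can silently end up in labels or file names.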

Notice that str_c and paste both have a sep and a collapse argument. While these appear to be the same, they are not. The sep argument is the string inserted between the arguments passed to str_c, while collapse is the string used to join the elements of the resulting character vector into a single string (a character vector of length one).

str_c("This", "is", "not", "a", "character", "vector", sep = "_")
x <- c("This", "is", "a", "character", "vector")
x
str_c(x, sep = "_")
str_c(x, collapse = "_")
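The two arguments can also be combined. In this sketch (the inputs are invented for illustration), `sep` joins the arguments element by element and `collapse` then joins the resulting elements:

```r
library(stringr)

letters3 <- c("a", "b", "c")

# sep works across arguments, element by element
str_c(letters3, 1:3, sep = "-")                   # "a-1" "b-2" "c-3"

# collapse then joins the elements into one string
str_c(letters3, 1:3, sep = "-", collapse = ", ")  # "a-1, b-2, c-3"

# With a single argument there is nothing for sep to separate
str_c(letters3, sep = "-")                        # "a" "b" "c"
```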

Regular Expressions

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings. Patterns can be a bit tricky to wrap your head around at first, since the most common case is simply looking for a sequence of letters, but patterns are much more general. Think about how you would tell a computer to look for any phone number or any email address.

```{block type='rmdnote'}
Just about every modern programming language has the ability to use regular expressions. In SAS, regular expressions are implemented in the PRX series of functions such as: PRXMATCH, PRXSEARCH, PRXPARSE, PRXCHANGE.
```

Regexps are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you understand them, you’ll find them extremely useful. Any non-trivial regular expression looks like a cat walked across your keyboard.

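```{block2, type='rmdtip'}
To learn regular expressions, use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. (In stringr 1.5.0 and later, `str_view()` shows every match and `str_view_all()` is deprecated.)
```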

Basic Matches

We’ll start with very simple regular expressions and then gradually get more and more complicated. Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions. We'll use the built-in fruit data set that ships with stringr.

fruit

One of the most common uses of regular expressions is to find strings that contain an exact sequence of letters.

str_subset(fruit, "berry")

```{block2, type='rmdnote'}
`str_subset()` keeps strings matching a pattern. We will use it in the examples of how to use patterns, but the patterns work with any function that takes a pattern.
```

This works great for quite a few cases, but what if you wanted to find just "grape" and not "grapefruit"?

```r
str_subset(fruit, "grape")
```

Special Characters

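Once you get beyond basic substring matching you need some helpers. The most common ones are:

* `^` matches the start of the string
* `$` matches the end of the string
* `.` matches any single character (except a newline)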

So to match "grape" but not "grapefruit", you could do this in a couple of ways.

str_subset(fruit, "grape$")   # match all words that end in "grape"
str_subset(fruit, "^grape$")  # match only words that are exactly "grape" (start AND end)

Find the fruits that contain an "a" somewhere after the first character (a fruit whose only "a" is its first letter is excluded).

str_subset(fruit, ".a")

If "." matches any character, how do you match the character "."?

You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behavior. Like strings, regexps use the backslash, `\`, to escape special behavior. So to match a literal `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`.

If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`. You need four backslashes to match one!
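The two layers of escaping are easier to see in code. In this sketch (the test vector is invented for illustration), `writeLines()` shows the regular expression the engine actually receives:

```r
library(stringr)

x <- c("a.c", "abc", "a\\c")  # the third element is the three characters a\c

writeLines("\\.")         # prints \.  (the regex that matches a literal dot)
str_detect(x, "a.c")      # TRUE TRUE TRUE: an unescaped . matches any character
str_detect(x, "a\\.c")    # TRUE FALSE FALSE: only the literal dot matches

writeLines("\\\\")        # prints \\  (the regex that matches a literal backslash)
str_detect(x, "\\\\")     # FALSE FALSE TRUE
```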

Character Classes and Alternatives

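There are a number of special patterns that match more than one character. You’ve already seen `.`, which matches any character apart from a newline. There are many other useful tools:

* `\d` matches any digit
* `\s` matches any whitespace (e.g. space, tab, newline)
* `[abc]` matches a, b, or c
* `[^abc]` matches anything except a, b, or c
* `abc|xyz` matches "abc" or "xyz" (alternation)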

```{block2, type='rmdimportant'}
Remember, to create a regular expression containing `\d`, you need to escape the `\` for the string, so you type `"\\d"`.
```

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single meta character in a regex. Many people find this more readable.
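A short sketch of both ideas, using invented test vectors: a single-character class in place of a backslash escape, and alternation inside parentheses.

```r
library(stringr)

x <- c("3.14", "314", "a+b", "ab")

# These two patterns are equivalent ways to match a literal dot
str_detect(x, "[.]")    # TRUE FALSE FALSE FALSE
str_detect(x, "\\.")    # TRUE FALSE FALSE FALSE

# The same trick works for other metacharacters, such as +
str_detect(x, "[+]")    # FALSE FALSE TRUE FALSE

# Alternation: parentheses limit what the | applies to
str_subset(c("grey", "gray", "grow"), "gr(e|a)y")   # "grey" "gray"
```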

### Repetition

The next step up in power involves controlling how many times a pattern matches:

* `?`: zero or one
* `*`: zero or more
* `+`: one or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m (written with an explicit 0, since the ICU engine behind stringr may not accept the `{,m}` shorthand)
* `{n,m}`: between n and m
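To see how the quantifiers differ, here is a sketch on a small invented vector. Each quantifier applies to the character immediately before it, in this case the "u":

```r
library(stringr)

x <- c("color", "colour", "colouur", "colr")

str_detect(x, "colou?r")      # TRUE TRUE FALSE FALSE   (zero or one u)
str_detect(x, "colou*r")      # TRUE TRUE TRUE FALSE    (zero or more u)
str_detect(x, "colou+r")      # FALSE TRUE TRUE FALSE   (one or more u)
str_detect(x, "colou{2}r")    # FALSE FALSE TRUE FALSE  (exactly two u)

# Quantifiers are greedy: they take as many repetitions as allowed
str_extract("aaaaa", "a{2,3}")   # "aaa"
```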

Find all the fruits with three or more "e"s.
```r
str_subset(fruit, ".*e.*e.*e")

# or more succinctly
str_subset(fruit, "(.*e){3,}")
```

Grouping

Parentheses are a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.

For example, find all the fruits with a double letter

str_subset(fruit, "([:alpha:])\\1")

Find all fruits that have a repeated pair of letters.

str_subset(fruit, "(..)\\1")
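Capturing groups are also useful for pulling pieces out of a string, not just for backreferences. As a sketch, `str_match()` returns the full match and each group as columns of a character matrix; the year-quarter strings below are invented for illustration.

```r
library(stringr)

yq <- c("2017Q4", "2018Q1", "no match")

# Group 1 captures the four-digit year, group 2 the quarter digit
m <- str_match(yq, "(\\d{4})Q(\\d)")
m          # column 1: full match, column 2: group 1, column 3: group 2

m[, 2]     # "2017" "2018" NA
m[, 3]     # "4"    "1"    NA
```

Strings that do not match produce a row of `NA`, so the output always lines up with the input.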

Helpful String Functions

Don’t forget that you’re in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it’s often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.

Detect Matches

What proportion of the fruits have double letters?

str_detect(fruit, "([:alpha:])\\1") %>% mean()

What is the distribution of vowels in fruits?

str_count(fruit, "[aeiou]") %>% hist()

```{block2, type='rmdtip'}
`str_detect()` has a natural fit with `filter()` to subset data sets.
```
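A minimal sketch of that combination, assuming dplyr is available. The tibble `fruit_tbl` and its column `name` are hypothetical names chosen for this example.

```r
library(dplyr)
library(stringr)

# Put the built-in fruit vector into a one-column tibble
fruit_tbl <- tibble(name = fruit)

# Keep only the rows whose name ends in "berry"
berries <- fruit_tbl %>%
  filter(str_detect(name, "berry$"))

berries
```

Because `str_detect()` returns one logical value per row, it can be dropped straight into `filter()` alongside any other condition.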

### Extract Matches

-    `str_subset()`       keep strings matching a pattern
-    `str_extract()`      extract first matching patterns from a string.
-    `str_extract_all()`  extract all matching patterns from a string.
-    `str_which()`        find the positions of strings matching a pattern (similar to base R `grep()`)

Which fruits have a color in their name?
```r
# create our regular expression from a character vector
colors <- c("red", "orange", "yellow", "green", "blue", "purple", "black")
color_match <- str_c(colors, collapse = "|")
color_match

str_subset(fruit, color_match)
```

Pull out the vowels in all the fruits

str_extract(fruit, "[aeiou]")
str_extract_all(fruit, "[aeiou]") %>% head()

Replacing Matches

```{block type='rmdnote'}
These functions are similar to the base R functions `sub()` and `gsub()`.
```

Replace vowels with a hyphen
```r
str_replace(fruit, "[aeiou]", "-") %>% head()
str_replace_all(fruit, "[aeiou]", "-") %>% head()
```

Also, str_replace_all() can perform multiple replacements by supplying a named vector.
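For example, in this sketch the names of the vector are the patterns and the values are the replacements (the input strings are invented for illustration):

```r
library(stringr)

x <- c("1 apple", "2 pears", "3 kiwis")

# names = patterns to find, values = replacement text
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
# "one apple" "two pears" "three kiwis"
```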

Instead of replacing with a fixed string you can use backreferences to insert components of the match.

If the fruit name consists of two or more words, swap the first and second words.

str_replace(fruit, "([^ ]+) ([^ ]+)", "\\2 \\1")

Other String Functions

Split the fruits up by word

str_split(fruit, " ")  %>% head(., 10)  # split by a space
str_split(fruit, boundary("word"))  %>% head(., 10) # split by word boundary
str_split(fruit, " ", n = 2, simplify = TRUE)   # only split into 2 pieces

stringr is built on top of the stringi package. stringr is useful when you’re learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 234 functions to stringr’s 46.

If you find yourself struggling to do something in stringr, it’s worth taking a look at stringi. The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: str_ vs. stri_.
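As a small sketch of the translation between the two packages, `stri_length()` is the stringi counterpart of `str_length()`, and `stri_reverse()` is one example of a stringi function that is used here directly because this sketch assumes stringr has no equivalent:

```r
library(stringr)
library(stringi)

x <- c("kiwi", "fig")

str_length(x)            # stringr: 4 3
stri_length(x)           # stringi: 4 3

# Reversing a string with stringi
stri_reverse("stringr")  # "rgnirts"
```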

Exercises

Zip Codes

  1. Create a function that takes a vector of numeric zip codes and converts them to three-digit zip codes, padding with 0's as necessary.
    • Test your function with c(1, 11, 111), which should return c("001", "011", "111").
  2. How would you make the above function more generic? What happens if you get a 5-digit zip code? What if the user wants to return a 4-digit zip code? Extend your function to handle these cases.
    • Test your function with c(1, 11, 111, 1111, 11111). (Note: Think about how a 1 changes based on the assumption of a 5-digit zip code vs a 3-digit zip code.)

```{block2, type='rmdtip'}
There are several ways to accomplish the above. The following `stringr` functions may be useful: `str_pad()`, `str_length()`, `str_extract()`, and `str_trunc()`. However, this is not an exhaustive list and there may be other functions which are useful.
```

**Files**  

1.  Create a function that returns the latest (most recent) input file name for your data set, regardless of which year-quarter it is. **See tip below**. 
1.  Extend the above to return a specific year-quarter's data file. Note: depending on your naming convention this may be simple or difficult.  It may be easier after you work the **Year-Quarter** exercises below.
1.  Create an error message for the above if they enter a year-quarter in which there is no data.

```{block2, type='rmdtip'}
R has many functions to help with system files.  See the help for `dir()`, `list.files()`, `list.dirs()`, `file.exists()`, `basename()`, `dirname()`, `tools::file_path_sans_ext()`, and many many more.
```
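A brief sketch of how a few of these functions behave. The directory and the file names (`claims_2017Q4.csv` and so on) are hypothetical, created in a temporary folder just for this illustration; your own naming convention will differ.

```r
# Create a scratch directory with a few empty, hypothetically named files
tmp <- tempfile("strings_demo_")
dir.create(tmp)
file.create(file.path(tmp, c("claims_2017Q4.csv", "claims_2018Q1.csv", "notes.txt")))

# pattern is a regular expression: keep only names ending in .csv
csv_files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

basename(csv_files)                               # file names without the directory
tools::file_path_sans_ext(basename(csv_files))    # ... and without the extension

file.exists(file.path(tmp, "claims_2019Q1.csv"))  # FALSE: no such file was created
```

Note that the `pattern` argument of `list.files()` takes a regular expression, so everything from the earlier sections applies here too.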

Year-Quarter

  1. Create a function that takes a year-quarter in numeric format and returns the year quarter in a character format with the Q (Ex 20174 becomes 2017Q4).
  2. How would you modify the above to return 2017 Q4? How about 2017-Q4?
  3. How would you modify the above to return 4q17 or 4Q17?
  4. Create a function that does the inverse of the above. Ex. "2017Q4" or "2017 Q4" becomes a numeric 20174.
  5. Create a function that takes a character year-quarter (ex. 2018Q3) and returns the quarter n quarters before it. Example: two quarters previous to 2018Q3 is 2018Q1. Hint: Use your function from part 1 in conjunction with your function from the exercise in the Function Basics chapter.

Read function

  1. Create a function to make your column names syntactic.
    • make column names lower case
    • remove any leading non-alpha characters
    • replace special characters and spaces with underscores
  2. Modify your read_* function to:
    • default the file name to the most recent file
    • create syntactic column names
    • create a complete 3-digit zip code (if your data has zip data)
    • perform any string manipulation your columns need



DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.