library(learnr) library(tutorial.helpers) library(tidyverse) library(babynames) library(gghighlight) knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 60, tutorial.storage = "local") x <- "n" x1 <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604") set.seed(9) df <- tibble(name = c("Flora", "Preceptor", "Terra", NA)) df1 <- tibble(name = c("Flora", "Preceptor", "Terra", NA)) df2 <- tribble( ~ name, ~ fruit, "Carmen", "banana", "Carmen", "apple", "Marvin", "nectarine", "Terence", "cantaloupe", "Terence", "papaya", "Terence", "mandarin" ) y <- c("Apple", "Banana", "Pear") x2 <- "text\nEl Ni\xf1o was particularly bad this year" x3 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd" u <- c("\u00fc", "u\u0308") tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
This tutorial covers Chapter 14: Strings from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
You will learn about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Some important functions which we will learn include:
str_c()
,
str_glue()
,
str_flatten()
,
separate_longer_delim()
, and more.
We’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes (").
Type this is a string
in double quotes and assign it to a variable named string1
using <-
.
string1 <- "..."
string1 <- "This is a string"
Now there’s no difference in behavior between the single quotes and double quotes, but in the interests of consistency, the Tidyverse style guide recommends using "
, unless the string itself contains double quotes.
Now type hello world
with one double quote, the other missing, and assign to str2
.
str2 <- ...
str2 <- "hello world"
Because the closing quote is missing, this code produces an error. In the console, if you forget to close a quote, you’ll see +
: the continuation prompt. If this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.
Now type hello "world"
in double quotes and use another pair of double quotes on world
and assign to str3
.
str3 <- "..."
The code str3 <- "hello "world""
doesn't work because of a syntax error. Without an escape character, R interprets the second double quote as the end of the string, so the rest of the line can no longer be parsed.
In the previous exercise, we encountered a syntax error when trying to assign the string str3 <- "hello "world""
in R. To fix this error, change the outer double quotes to single quotes and keep the inner double quotes intact. The inner double quotes then become part of the string.
str3 <- 'hello "world"'
str3 <- 'hello "world"'
Likewise, to include single quotes in a string, use double quotes as the outer quotes.
Now create a new variable x
and set it to the string n
using <-
, writing the string in double quotes. Then, on a new line, print x.
x <- "n" x
x <- "n" x
R prints the string as a character vector, which is why the output shows "n" in quotes rather than a bare n.
To view only the string and not the quotes, let's load the stringr
library using library()
.
library(...)
library(stringr)
The stringr package is part of the tidyverse, so loading the tidyverse with library(tidyverse) also loads stringr.
To utilize the variable x
that we previously set, let's type str_view()
and specify x
as the argument.
str_view(...)
str_view(x)
The printed representation of a string is not the same as the string itself because the printed representation shows the escapes. To see the raw contents of the string, we use str_view().
To perform the same action with single quotes, let's create a new variable y
and assign it the value "'"
using the <-
operator. Then, on a new line, type str_view()
and pass y
as the argument.
y <- ... str_view(...)
y <- "'" str_view(y)
There's another way to include quotes in a string. To include a literal single or double quote, you can use the backslash \
to escape it. For example, "\""
will return '"'
.
Let's create a string with one single quote and assign it to the variable single_quote
. Then, on a new line, type str_view()
and pass single_quote
as the argument.
single_quote <- '\'' str_view(...)
single_quote <- '\'' str_view(single_quote)
You might wonder what to do if you want to include a literal backslash in your string. We'll see how in the next exercise.
To include a literal backslash in your string, escape it with another backslash: two backslashes in the source produce one backslash in the string. Type two backslashes in a string and pass it to str_view()
.
str_view("...")
str_view("\\")
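The escapes covered so far can be checked with a short sketch. It assumes the stringr package is installed, and the variable names are only illustrative. Each escaped value turns out to be a single character, and writeLines() from base R prints the raw contents without the escapes.

```r
library(stringr)

# Each of these strings holds exactly one character.
backslash    <- "\\"   # two backslashes in the source, one in the string
double_quote <- "\""   # escaped double quote
single_quote <- '\''   # escaped single quote

# Count the characters actually stored in each string.
escaped_lengths <- str_length(c(backslash, double_quote, single_quote))
print(escaped_lengths)

# writeLines() shows the raw contents, one string per line.
writeLines(c(backslash, double_quote, single_quote))
```

The escape character is part of how the string is typed, not part of the string itself, which is why every length is 1.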
If you find yourself dealing with a complex situation where you have many backslashes and quotes to include, it can become confusing to keep track of them.
To illustrate this problem, consider the string tricky
, defined as below. Hit "Run Code."
tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
That’s a lot of backslashes! (This is sometimes called leaning toothpick syndrome.)
Run str_view()
with tricky
set as the argument.
str_view(tricky)
str_view(tricky)
After running the code, we should expect to see double_quote <- "\"" # or '"' single_quote <- '\'' # or "'"
as the output. However, creating a string representation like this can be extremely confusing due to the excessive use of backslashes and quotes. In the next exercise, we will explore a solution to this problem.
Let's modify tricky
as below. Add str_view()
with tricky
as the argument.
tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")" str_view(tricky)
tricky <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")" str_view(tricky)
To solve the issue, we utilized a raw string, which is a specific type of string literal that doesn't interpret any special characters or escape sequences.
Typically, a raw string starts with r"(, ends with )"
, and allows for any text representation. However, if the string contains )", alternatives like r"[]"
or r"{}"
can be used. Furthermore, you can add dashes to ensure unique opening and closing pairs, such as r"--()--"
, r"---()---"
, and so on. Raw strings offer flexibility to handle any text without problems.
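As a small sketch of the raw string syntax described above, the example below writes the same text with and without a raw string. It assumes R 4.0.0 or later, where raw strings were introduced; the file path and variable names are made up for illustration.

```r
# The same Windows-style path written two ways (the path is a made-up example).
path_escaped <- "C:\\Users\\preceptor\\data.csv"
path_raw     <- r"(C:\Users\preceptor\data.csv)"

# Both spellings produce exactly the same string.
same_path <- identical(path_escaped, path_raw)
print(same_path)

# When the text itself contains )", dashes make the delimiters unique.
with_paren_quote <- r"---(a )" b)---"
writeLines(with_paren_quote)
```

The dashes only need to be numerous enough that the closing sequence does not occur inside the text.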
Besides \"
, \'
, and \\
, there are a few other special characters that can be useful. The most common ones are \n
for a new line and \t
for a tab.
Now let's observe how str_view()
handles these special characters.
To do that, let's create a new variable called x1
and set it as a vector with the following elements: "one\ntwo"
, "one\ttwo"
, "\u00b5"
, "\U0001f604"
.
x1 <- c(...)
x1 <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
For a complete list of other special characters, see the help page ?Quotes.
Run x1
.
x1
x1
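The output looks much like what we typed: escapes such as \n and \t are printed as escapes rather than as a new line or a tab. On a system that supports UTF-8, only the Unicode escapes, such as the smiley face, are displayed as the characters themselves.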
Now run str_view()
with x1
as the argument.
str_view(x1)
str_view(x1)
Compared to printing the strings in the console, we now see the first string broken across two lines and the tab in the second string marked explicitly, which shows that str_view
displays what the special characters actually produce.
Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame.
We’ll show you how to combine strings with str_c()
and str_glue()
and how you can use them with mutate()
.
The first function we will learn is str_c()
.
str_c()
takes any number of vectors as arguments and returns a character vector, so run str_c("x","y")
and see what you will get.
str_c(...,...)
str_c("x","y")
str_c()
is a function in the stringr
package in R that combines multiple character vectors into a single character vector. It is similar to the paste()
function, but it uses tidyverse
recycling and NA rules.
Copy the previous code, change the call to str_c("Hello ", c("Preceptor","Anish"))
and run it.
str_c(..., c(...))
str_c("Hello ", c("Preceptor","Anish"))
From the result, we can see that str_c()
is vectorized: the single string "Hello " is recycled and combined element-wise with each name in the second vector.
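The recycling and NA rules mentioned above can be sketched as follows. The example assumes stringr and dplyr are installed; coalesce() comes from dplyr and is used here only as one possible way to replace a missing value before combining.

```r
library(stringr)
library(dplyr)  # for coalesce()

# A length-one string is recycled against a longer vector.
greeting <- str_c("Hello ", c("Preceptor", "Anish"))
print(greeting)

# A missing input makes the whole result for that element missing.
with_missing <- str_c("Hi ", c("Flora", NA), "!")
print(with_missing)

# Replace NA first if a fallback value is preferred.
with_default <- str_c("Hi ", coalesce(c("Flora", NA), "you"), "!")
print(with_default)
```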
To demonstrate the usage of str_c()
on a data set, let's create a tibble. Create a variable called df
and assign it to tibble(name = c("Flora", "Preceptor", "Terra", NA))
. This will generate a data set with string characters.
... <- tibble(...=c(...))
df <- tibble(name = c("Flora", "Preceptor", "Terra", NA))
You might be wondering what a tibble is. A tibble is a modern form of data frame provided by the tidyverse. Tibbles print more compactly than base data frames, showing the size of the data set, the type of each column, and only the first few rows.
Run df
to see the data frame.
...
df
The tibble has one column named "name." The first three rows contain the names "Flora," "Preceptor," and "Terra," respectively. However, the fourth row has an NA value, which typically represents missing or unknown data.
Let's modify the tibble by first piping the tibble df
to the mutate()
function, then in mutate()
, we will create new column greeting
and set it equal to str_c("Hi ", name, "!")
.
... |> mutate(... = ...("...",name,"..."))
df |> mutate(greeting = str_c("Hi ", name, "!"))
After modifying the tibble, we still have a column called name
. We used the mutate()
function from the "dplyr" package to create a new column called greeting
. The str_c()
function is used to concatenate the string "Hi " with the values in the name
column.
If you are mixing many fixed and variable strings with str_c()
, you’ll notice that you type a lot of "
s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue()
.
Let's create a variable named df1
and set it to tibble(name = c("Flora","Preceptor","Terra",NA))
, which creates a data set with the string characters.
... <- tibble(...=c(...))
df1 <- tibble(name = c("Flora","Preceptor","Terra",NA))
You give str_glue()
a single string that has a special feature: anything inside {}
will be evaluated as if it were outside of the quotes.
Let's modify the tibble by first piping the tibble df1
to the mutate()
function, then in mutate()
, we will create new column greeting
and set it equal to str_glue("Hi {name}!")
.
df1 |> ...(greeting = ...())
df1 |> mutate(greeting = str_glue("Hi {name}!"))
After modifying the tibble, we still have a column called name
. We used the mutate()
function from the "dplyr" package to create a new column called greeting
. As you can see, str_glue()
currently converts missing values to the string "NA", which unfortunately makes it inconsistent with str_c()
.
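The difference in how the two functions treat missing values can be seen side by side in a minimal sketch. It assumes stringr is installed; the vector name is illustrative, and as.character() is used only to turn the glue output into a plain character vector for comparison.

```r
library(stringr)

name <- c("Flora", NA)

# str_c() propagates the missing value.
from_c <- str_c("Hi ", name, "!")
print(from_c)

# str_glue() turns the missing value into the text "NA".
from_glue <- as.character(str_glue("Hi {name}!"))
print(from_glue)

# Doubling the braces produces literal braces in the output.
literal_braces <- as.character(str_glue("{{Hi}} {name[1]}!"))
print(literal_braces)
```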
str_c()
and str_glue()
are suitable for use with mutate()
since their output matches the length of their inputs. However, if you need a function that works well with summarize()
and always returns a single string, str_flatten()
comes into play. It takes a character vector as input and combines each element of the vector into a single string.
Type str_flatten()
and within it have a vector of strings like this c("x","y","z")
.
str_flatten(...)
str_flatten(c("x","y","z"))
One variation of str_flatten()
is str_flatten_comma()
, which is designed specifically for flattening with commas. It automatically recognizes whether last uses the Oxford comma and handles the special case of 2 elements.
What if you want a comma between the letters? Copy the previous code and add a second argument: a comma in a string, like this ","
.
str_flatten(c(...),...)
str_flatten(c("x","y","z"), ",")
The second argument, the comma string we just added, is called collapse
: the string to insert between each piece. It defaults to ""
.
Let's fix the grammar of the string we got in previous exercise, copy the previous code and add last = ", and "
.
str_flatten(c(...),..., last = ...)
str_flatten(c("x","y","z"), ",", last = ", and ")
The last
argument is an optional string to use in place of the final separator.
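The three behaviors discussed in this section can be compared in one short sketch. It assumes a version of stringr that provides the last argument and str_flatten_comma() (1.5.0 or later, as far as we know); the variable names are illustrative.

```r
library(stringr)

letters3 <- c("x", "y", "z")

# Default collapse is the empty string.
plain <- str_flatten(letters3)
print(plain)

# collapse goes between pieces; last replaces the final separator.
oxford <- str_flatten(letters3, ", ", last = ", and ")
print(oxford)

# With only two elements, str_flatten_comma() drops the Oxford comma.
pair <- str_flatten_comma(c("x", "y"), last = ", and ")
print(pair)
```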
Let's now use str_flatten()
with summarize()
. Run df2
to have a quick look at the data frame.
df2
df2
The tibble has two columns named "name" and "fruit". Carmen has two fruits, Marvin has one, and Terence has three.
Start a pipe with df2
to summarize()
and within summarize()
, type the column fruit
and set it to str_flatten(fruit, ", ")
.
df2 |> summarize(...)
df2 |> summarize(fruit = str_flatten(fruit, ", "))
summarize()
creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input.
Within summarize()
add .by = name
.
... |> summarize(...,.by = ...)
df2 |> summarize(fruit = str_flatten(fruit, ", "), .by = name)
The .by
argument is a selection of columns to group by for just this operation, functioning as an alternative to group_by()
. For details and examples, see ?dplyr_by
.
The result has one row per name, with all of that person's fruits combined into a single string.
To review, mutate()
either changes an existing column or adds a new one. summarize()
calculates a single value (per group).
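To make the relationship between .by and group_by() concrete, here is a small sketch on a made-up tibble. It assumes dplyr 1.1.0 or later (where .by was added), plus tibble and stringr; the data and object names are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

# A made-up tibble with repeated names.
fruits <- tibble(
  name  = c("Carmen", "Carmen", "Marvin"),
  fruit = c("banana", "apple", "nectarine")
)

# Per-operation grouping with .by.
by_arg <- fruits |>
  summarize(fruit = str_flatten(fruit, ", "), .by = name)

# The same summary using group_by().
grouped <- fruits |>
  group_by(name) |>
  summarize(fruit = str_flatten(fruit, ", "))

print(by_arg)
```

Both versions return one row per name; .by simply avoids leaving the result grouped.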
It's very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:
df |> separate_longer_delim(col, delim)
df |> separate_longer_position(col, width)
df |> separate_wider_delim(col, delim, names)
df |> separate_wider_position(col, widths)
Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim()
to split based on a delimiter.
Create a new tibble using tibble()
, create a column x
and set it to be a vector which should look like c("a,b,c", "d,e", "f")
and assign it to df1
.
df1 <- tibble(... = c())
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
Just like with pivot_longer()
and pivot_wider()
, _longer
functions such as separate_longer_delim()
make the input data frame longer by creating new rows, and _wider
functions make the input data frame wider by generating new columns.
Copy the previous code and on a new line, start a pipe with df1
to separate_longer_delim()
. Within the function, type the column x
and set delim
equal to ","
.
df1 <- ... df1 |> separate_longer_delim(...,... = ",")
df1 <- tibble(x = c("a,b,c", "d,e", "f")) df1 |> separate_longer_delim(x, delim = ",")
If you are wondering what delim = ","
means, the delim
argument specifies the string that separates the pieces inside each value of x. Here, every comma marks a split point.
The same idea appears elsewhere in the tidyverse. For example, when reading a CSV (Comma-Separated Values) file using the read_delim()
function from the readr package, you can specify delim = ","
to indicate that the values in the file are separated by commas.
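As a sketch of how separate_longer_delim() treats the other columns, the example below adds an id column to a made-up tibble. It assumes tidyr 1.3.0 or later and tibble; the column names are illustrative.

```r
library(tidyr)
library(tibble)

# A made-up tibble: each row holds a variable number of items.
orders <- tibble(id = 1:2, items = c("a,b,c", "d"))

# Split items at each comma; the id value is repeated for every new row.
long <- orders |>
  separate_longer_delim(items, delim = ",")

print(long)
```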
Sometimes distinct numbers within a string are not separated by commas or spaces. In such cases, how should we properly split the string to separate these numbers? To demonstrate the issue create a new tibble()
, set x
as the name of the column and set x
equal to a vector which looks like c("1211", "131", "21")
and lastly assign it to df2
.
df2 <- tibble(... = c())
df2 <- tibble(x = c("1211", "131", "21"))
As you look at the tibble, notice that there are no commas to split on, so we can't use a delimiter to make the column longer. That's where separate_longer_position()
comes into play.
Copy the previous code and then, on a new line, pipe with df2
to separate_longer_position()
, within the function, type the column x
and then set the width
to 1.
df2 <- tibble(... = c()) df2 |> separate_longer_position(x, ... = ...)
df2 <- tibble(x = c("1211", "131", "21")) df2 |> separate_longer_position(x, width = 1)
Now you see that every string has been split into one-character pieces, with a new row for each piece. The values are still strings, not integers.
Here, the width = 1
argument sets the number of characters in each piece, so every string is cut into one-character chunks.
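The effect of the width argument is easier to see with a value other than 1. The sketch below uses made-up four-character codes and assumes tidyr 1.3.0 or later and tibble.

```r
library(tidyr)
library(tibble)

# Made-up codes, each exactly four characters long.
codes <- tibble(x = c("1211", "2130"))

# width = 2 cuts every string into two-character pieces.
pairs <- codes |>
  separate_longer_position(x, width = 2)

print(pairs)
```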
What if you want to create columns when separating? Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns.
Create a tibble()
, set x
as the name of the column and set x
equal to a vector which looks like c("a10.1.2022", "b10.2.2011", "e15.1.2015")
and lastly assign it to df3
.
df3 <- tibble(x = c(...))
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
In this data set, x is made up of a code, an edition number, and a year, separated by "."
. To use separate_wider_delim()
, we supply the delimiter and the names in two arguments.
Copy the previous code and start pipe with df3
to separate_wider_delim()
, within the function, type x
, then set the delim
to "."
and finally set the names
to a vector which looks like c("code", "edition", "year")
.
df3 <- tibble(x = c(...)) df3 |> ....( x, .... = ".", names = c(...) )
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015")) df3 |> separate_wider_delim( x, delim = ".", names = c("code", "edition", "year"))
What if you don't care about the edition and want to leave it out?
To omit a column from the output, set its name to NA
. Copy the previous code and change edition
in names to NA
.
df3 <- tibble(x = c(...)) df3 |> ....( x, .... = ".", names = c(...) )
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015")) df3 |> separate_wider_delim( x, delim = ".", names = c("code", NA, "year"))
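A compact sketch of dropping a piece with NA, using a shortened copy of the data above. It assumes tidyr 1.3.0 or later and tibble; the object names are illustrative.

```r
library(tidyr)
library(tibble)

books <- tibble(x = c("a10.1.2022", "b10.2.2011"))

# The middle piece (the edition) is named NA, so it is left out.
wide <- books |>
  separate_wider_delim(x, delim = ".", names = c("code", NA, "year"))

print(wide)
```

Note that the new columns are character columns, so year would still need converting before doing arithmetic with it.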
What if you want to split a string into pieces of different widths by position? separate_wider_position()
is the function that solves that.
Create a tibble()
, set x
as the name of the column and set x
equal to a vector which looks like c("202215TX", "202122LA", "202325CA")
and lastly assign it to df4
.
df4 <- tibble(x = c(...))
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
separate_wider_position()
works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them.
Copy the previous code and start a pipe with df4
to separate_wider_position()
, within the function, type x
, then set the widths
to a vector which looks like c(year = 4, age = 2, state = 2)
.
df4 <- tibble(x = c(...)) ... |> ...( x, widths = c(...,...,...) )
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) df4 |> separate_wider_position( x, widths = c(year = 4, age = 2, state = 2) )
_wider
functions like the one we used above make the input data frame wider by generating new columns.
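The remark about omitting values by not naming them can be sketched as follows, using a shortened copy of the data above. It assumes tidyr 1.3.0 or later and tibble; leaving the middle width unnamed is our reading of the documented "partially named" widths vector.

```r
library(tidyr)
library(tibble)

records <- tibble(x = c("202215TX", "202122LA"))

# The unnamed width of 2 still consumes two characters,
# but no column is created for it.
wide <- records |>
  separate_wider_position(x, widths = c(year = 4, 2, state = 2))

print(wide)
```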
separate_wider_delim()
requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces?
Create a tibble()
, set x
as the name of the column and set x
equal to a vector which looks like c("1-1-1", "1-1-2", "1-3", "1-3-2", "1")
and lastly assign it to df
.
df <- tibble(x = c(...))
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
In response to not having the same number of pieces, there are two possible problems, too few or too many pieces, so separate_wider_delim()
provides two arguments to help: too_few
and too_many
.
Copy the previous code and start pipe with df
to separate_wider_delim()
, within the function, type x
, then set the delim
to "-"
and finally set the names to a vector which looks like c("x", "y", "z")
.
df <- tibble(x = c(...)) ... |> separate_wider_delim( x, ... = "-", ... = c("x", "y", "z") )
You’ll notice that we get an error, but the error gives us some suggestions on how you might proceed. Let’s start by debugging the problem.
Copy the previous code and within separate_wider_delim()
add too_few
to be "debug"
. Also assign the result of the pipe to a variable named debug
, then print it on a new line by typing debug
.
df <- tibble(x = c(...)) debug <- ... |> separate_wider_delim( x, ... = "-", ... = c("x", "y", "z"), too_few = ... ) ...
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1")) debug <- df |> separate_wider_delim( x, delim = "-", names = c("x", "y", "z"), too_few = "debug" ) debug
When you use the debug mode, you get three extra columns added to the output: x_ok
, x_pieces
, and x_remainder
(if you separate a variable with a different name, you’ll get a different prefix).
x_pieces
tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder
isn’t useful when there are too few pieces, but we’ll see it again shortly.
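One way to use the debug columns is to filter on x_ok to find the inputs that failed. The sketch below assumes tidyr 1.3.0 or later, dplyr, and tibble. The new column names a, b, and c are chosen here only so they do not clash with the input column x, and suppressWarnings() hides the notice that debug mode prints.

```r
library(tidyr)
library(dplyr)
library(tibble)

pieces <- tibble(x = c("1-1-1", "1-3", "1"))

# Debug mode adds x_ok, x_pieces, and x_remainder to the output.
debugged <- suppressWarnings(
  pieces |>
    separate_wider_delim(
      x,
      delim = "-",
      names = c("a", "b", "c"),
      too_few = "debug"
    )
)

# Keep only the rows that did not split into three pieces.
problems <- debugged |> filter(!x_ok)
print(problems)
```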
In other cases, you may want to fill in the missing pieces with NA
s and move on. That’s the job of too_few
= "align_start"
and too_few
= "align_end"
which allow you to control where the NA
s should go.
Copy the previous code and change too_few
to be set to "align_start"
.
df <- tibble(x = c(...)) ... |> separate_wider_delim( x, ... = "-", ... = c("x", "y", "z"), too_few = ... )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1")) debug <- df |> separate_wider_delim( x, delim = "-", names = c("x", "y", "z"), too_few = "align_start" ) debug
You see that some of the columns are filled with NA
since we put too_few = "align_start"
.
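The difference between the two alignment options shows up most clearly on a single short input. This is a minimal sketch assuming tidyr 1.3.0 or later and tibble; the column names a, b, and c are illustrative.

```r
library(tidyr)
library(tibble)

short <- tibble(x = "1-3")

# align_start fills from the left, so the NA lands in the last column.
at_start <- short |>
  separate_wider_delim(x, delim = "-", names = c("a", "b", "c"),
                       too_few = "align_start")

# align_end fills from the right, so the NA lands in the first column.
at_end <- short |>
  separate_wider_delim(x, delim = "-", names = c("a", "b", "c"),
                       too_few = "align_end")

print(at_start)
print(at_end)
```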
The same principles apply if you have too many pieces.
Create a tibble()
, set x
as the name of the column and set x
equal to a vector which looks like c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9")
and lastly assign it to df
.
df <- tibble(x = ...)
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
Copy the previous code and start pipe with df
to separate_wider_delim()
, within the function, type x
, then set the delim
to "-"
and finally set the names to a vector which looks like c("x", "y", "z")
.
df <- tibble(x = ...) ... |> separate_wider_delim( ..., delim = "...", names = c(...) )
You will get an error, along with suggestions to use too_many = "debug"
or too_many = "drop" or "merge"
. We will use "debug" first to see what is kept and what is left over, and then look at "drop" and "merge".
Copy the previous code and add too_many = "debug"
within separate_wider_delim()
.
df <- tibble(x = ...) ... |> separate_wider_delim( ..., delim = "...", names = c(...), too_many = "..." )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9")) df |> separate_wider_delim( x, delim = "-", names = c("x", "y", "z"), too_many = "debug" )
When we debug the result, you can see the purpose of x_remainder
: it shows what was left behind and not inserted into the columns.
Copy the previous code and change too_many
to "drop".
df <- tibble(x = ...) ... |> separate_wider_delim( ..., delim = "...", names = c(...), too_many = "..." )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9")) df |> separate_wider_delim( x, delim = "-", names = c("x", "y", "z"), too_many = "drop" )
As the word drop
suggests, the extra pieces are discarded.
Copy the previous code and change too_many
to "merge".
df <- tibble(x = ...) ... |> separate_wider_delim( ..., delim = "...", names = c(...), too_many = "..." )
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9")) df |> separate_wider_delim( x, delim = "-", names = c("x", "y", "z"), too_many = "merge" )
The leftover pieces
are merged into the final column, with the delimiter kept between them.
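The two options can be compared on a single over-long input. This sketch assumes tidyr 1.3.0 or later and tibble; the column names a, b, and c are illustrative.

```r
library(tidyr)
library(tibble)

long_input <- tibble(x = "1-3-5-6")

# drop: the fourth piece is thrown away.
dropped <- long_input |>
  separate_wider_delim(x, delim = "-", names = c("a", "b", "c"),
                       too_many = "drop")

# merge: the extra piece stays attached to the last column.
merged <- long_input |>
  separate_wider_delim(x, delim = "-", names = c("a", "b", "c"),
                       too_many = "merge")

print(dropped)
print(merged)
```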
In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.
How do you find the length of a string? That's where str_length()
comes into play. str_length()
tells you the number of characters in the string.
Type the function str_length()
and type "hello world"
as the argument and run it.
str_length(...)
str_length("hello world")
str_length()
counts not only the letters but also the spaces between the words in a string. If you want to explore more, check out the documentation for str_length()
.
In addition to "hello world"
in str_length()
, add two more values which are "programming"
and NA
. Make sure to put these three values in a vector like c(...,...,...)
.
str_length(c(...,...,...))
str_length(c("hello world", "programming", NA))
The values you'll get are 11, 11 and NA.
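A quick sketch of the edge cases, assuming stringr is installed: spaces count as characters, an empty string has length 0, and a missing value stays missing.

```r
library(stringr)

lens <- str_length(c("hello world", "programming", "", NA))
print(lens)
```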
Let's load the babynames package using library()
.
...
library("babynames")
The babynames
data set contains names given to US babies from 1880 to 2017. If you want to explore the data set, check out popular baby names.
Run babynames
to have an overview of the data set.
babynames
babynames
We have five columns (year, sex, name, n, prop). The n
means the count of the name and the prop
is the proportion or fraction of individuals with that name out of the total number of individuals.
To explore the data, let's combine str_length() with count()
to find the distribution of lengths of US baby names.
Start a pipe with babynames to count()
.
... |> count()
babynames |> count()
We see that the data set has 1,924,665 rows. Let's extend the call to find the distribution of lengths of baby names.
Let's create two columns to see the length of the name and also the count. Within count()
type length
and set it equal to str_length(name)
and then type wt
and set it to n
.
... |> count(... = str_length(...),wt = ...)
babynames |> count( length = str_length(name), wt = n )
What wt = n
does is specify that the existing n column should be used as a weight, so each name is counted by the number of babies given that name instead of once per row. Looking at the results, we see that the longest names contain 15 letters.
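The effect of wt can be sketched on a tiny made-up table, where the arithmetic is easy to follow by hand. The example assumes dplyr, stringr, and tibble; the names and counts are invented for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up data: n is the number of babies given each name.
toy <- tibble(
  name = c("Ann", "Bob", "Mary", "Ann"),
  n    = c(10, 5, 7, 3)
)

# Without wt, count() tallies rows per name length.
row_counts <- toy |> count(length = str_length(name))

# With wt = n, count() sums n per name length instead.
weighted <- toy |> count(length = str_length(name), wt = n)

print(row_counts)
print(weighted)
```

Three rows have three-letter names, but those rows represent 10 + 5 + 3 = 18 babies, which is what the weighted count reports.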
Now that we know that the longest names have 15 letters, let's find the most popular of them. Start a pipe with babynames
to filter()
. We want only the names with 15 letters, so type str_length(name)
and test whether it is equal to 15 using ==
.
... |> filter(...(name)== 15)
babynames |> filter( str_length(name) == 15 )
In R, the ==
operator is used to test for equality between two values. It is a comparison operator that returns a logical value of TRUE if the values are equal and FALSE otherwise.
Now that we have all the names with 15 letters, let's continue the pipe from filter()
to count()
. Within count()
type name
, set wt
to n
and then set sort
to TRUE.
... |> ...|> count(..., wt = ..., sort = ...)
babynames |> filter( str_length(name) == 15 ) |> count( name, wt = n, sort = TRUE )
Looking at the results, the most popular 15-letter name is Franciscojavier, used 123 times. Many of the other names contain Christopher.
Now that we have learned str_length()
, let's move on to learning about str_sub()
.
Create a variable y
and set it equal to a vector containing strings: Apple, Banana, and Pear.
y <- c(...,...,...)
y <- c("Apple", "Banana", "Pear")
You can extract parts of a string using str_sub(string, start, end)
, where start and end are the positions where the substring should start and end.
To extract a substring using the str_sub()
function, specify the following arguments: the first argument should be the variable y
that we have defined, the second argument should be the starting position (1 in this case), and the third argument should be the ending position (3 in this case).
str_sub(y, ...,...)
str_sub(y, 1, 3)
Looking at the results, note that when using str_sub()
, the start and end arguments are inclusive, so the length of the returned string will be end - start + 1.
Let's modify the previous code: change the 2nd argument to -3
and the 3rd argument to -1
.
str_sub(..., ...,...)
str_sub(y, -3, -1)
If you want to look at the end of the string, you can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.
You might ask what happens if the range is longer than the string. Let's see: type str_sub()
and in the first argument type "a", then type 1
for the 2nd argument and 5
for the 3rd argument.
str_sub("...",1,...)
str_sub("a", 1, 5)
Seeing the results, we have to note that str_sub()
won’t fail if the string is too short: it will just return as much as possible.
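The three behaviors of str_sub() seen so far can be collected in one sketch. It assumes stringr is installed; the vector is named fruits here only to keep the example self-contained.

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Positive positions count from the start; both ends are inclusive.
first_three <- str_sub(fruits, 1, 3)

# Negative positions count back from the end.
last_three <- str_sub(fruits, -3, -1)

# A range longer than the string returns as much as is available.
too_long <- str_sub("a", 1, 5)

print(first_three)
print(last_three)
print(too_long)
```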
Now that we have a sense of str_sub()
, let's find the first and last letter of each name.
Start a pipe with babynames
to mutate()
.
babynames |> mutate()
babynames |> mutate()
We are using mutate()
because we will be adding two new columns called first and last.
Within mutate()
, create a new column called first
and set it to str_sub(name, 1,1)
to get the first letter.
... |> ...(... = str_...(name,1,...))
babynames |> mutate( first = str_sub(name, 1,1) )
In case you didn't know: in Python, indexing starts from zero, whereas in R, indexing starts from one, so the positions used to access elements and slices differ by one between the two languages.
In addition to creating the first
column , let's add the last
column. Within mutate()
, add last
and set it to str_sub(name, -1,-1)
.
... |> ...(... = str_...(name,1,...), last = ..._sub(name, -1,-1))
babynames |> mutate( first = str_sub(name, 1,1), last = str_sub(name, -1,-1) )
When we run the code, we see two new columns first
and last
which show the first and last letters of a name.
The babynames package contains data about baby names through the years.
Using this data, let's create the following graph.
boys_p <- babynames |>
  filter(sex == 'M') |>
  # prop is supposed to be proportion of a specific name in all male baby names
  # in that year. But I am not sure it is! For example, if you sum up prop for
  # all M in a given year, it never adds to 1. It always adds to a number less
  # than 1, but greater than 0.9. My *guess* would be that this is due to
  # babynames dropping names which have very small numbers of babies with that
  # name. If true, then prop is likely correct.
  select(name, year, prop) |>
  # Add last letter of baby names.
  mutate(last_letter = stringr::str_sub(name, -1, -1)) |>
  # frequency is the proportion of a certain last letter in all male baby names in that year.
  summarize(frequency = sum(prop), .by = c(year, last_letter)) |>
  ggplot(aes(x = year, y = frequency, colour = last_letter)) +
  geom_line() +
  # only color last letters that appear in more than 15% of all boy names in any year.
  gghighlight(max(frequency) > 0.15, label_key = last_letter) +
  scale_x_continuous(guide = guide_axis(angle = 70),
                     breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(x = NULL,
       y = NULL,
       subtitle = "Names ending with 'N' increased rapidly after 1950",
       title = "Last Letter of Boy Baby Names",
       caption = "Source: http://hadley.github.io/babynames")

boys_p
glimpse()
the babynames
data set. Pay close attention to the data type of each variable.
glimpse(...)
glimpse(babynames)
We could have also used the head()
function to check the first few rows of the tibble.
To check for NA
values, use the any()
function, passing it a call to is.na()
, which in turn takes the babynames data set as its argument.
any(is.na(...))
any(is.na(babynames))
This function returned FALSE
, which indicates that there are no NA values.
Because we are trying to analyze only boy names by last letter, start a pipe with babynames
and use filter()
to keep only the rows for boy names by setting sex
equal to 'M'
.
babynames |> filter(... == 'M')
babynames |> filter(sex == 'M')
Continue the pipe by selecting the name
, year
, and prop
columns using select()
.
... |> select(..., ..., ...)
babynames |> filter( sex == 'M') |> select( name, year, prop)
When selecting columns, we can also combine multiple select helpers. For example,
when selecting name
and year
columns, we can use the following code:
select(starts_with("na") | ends_with("ar"))
.
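To see what the combined helpers pick, here is a small sketch on a made-up tibble that only borrows the column names of babynames (the values in toy are invented for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical toy tibble with the same column names as babynames.
toy <- tibble(
  year = c(1900, 1901),
  sex  = c("M", "M"),
  name = c("Abel", "Boris"),
  n    = c(10L, 20L),
  prop = c(0.1, 0.2)
)

# `|` takes the union of the two helpers: columns whose names start
# with "na" or end with "ar".
picked <- toy |> select(starts_with("na") | ends_with("ar"))

names(picked)
```

Only the name and year columns survive; sex, n, and prop match neither helper.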
We want the last letter of each boy name, which we can extract with the str_sub() function from the stringr package.
Add mutate() to create a new column named last_letter. Its value is a call to str_sub() with name, -1, and -1 as its three arguments.
... |> mutate(last_letter = stringr::str_sub(..., ..., -1))
babynames |> filter(sex == 'M') |> select(name, year, prop) |> mutate(last_letter = stringr::str_sub(name, -1, -1))
str_sub() takes three arguments: the character vector to work on, the start position, and the end position. Negative positions count backward from the end of the string, so -1 refers to the last character. Setting both the start and the end to -1 therefore extracts just the last letter of each name.
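A short sketch of how positive and negative positions behave, using three of the names from this tutorial's setup data (the variable names below are chosen only for illustration):

```r
library(stringr)

# A few names from the tutorial's setup data.
names_vec <- c("Flora", "Preceptor", "Terra")

# Positive positions count from the start of the string.
first_letter <- str_sub(names_vec, 1, 1)

# Negative positions count from the end: -1 is the last character.
last_letter <- str_sub(names_vec, -1, -1)

# A range ending at -1 keeps everything from that position onward.
last_three <- str_sub(names_vec, -3, -1)

first_letter  # "F" "P" "T"
last_letter   # "a" "r" "a"
last_three    # "ora" "tor" "rra"
```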
Add summarize()
to create the variable frequency
that is equal to sum()
of proportions (prop
) of each group. Also, set the .by
argument to c(year, last_letter)
because we want to know the value for each year/letter combination.
... |> summarize(frequency = sum(prop), .by = c(year, last_letter))
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter))
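To make the grouping concrete, here is a minimal sketch on made-up data (the names and proportions in toy are hypothetical, chosen so the sums are easy to check by hand). It assumes dplyr 1.1.0 or later, where summarize() accepts the .by argument:

```r
library(dplyr)
library(tibble)

# Hypothetical proportions, chosen only to make the arithmetic easy to check.
toy <- tribble(
  ~name,   ~year, ~prop, ~last_letter,
  "Aiden", 2000,  0.10,  "n",
  "Ethan", 2000,  0.05,  "n",
  "Noah",  2000,  0.20,  "h",
  "Aiden", 2001,  0.30,  "n"
)

# One output row per year/last_letter combination; prop is summed within each.
by_letter <- toy |>
  summarize(frequency = sum(prop), .by = c(year, last_letter))

by_letter
```

The two 2000 names ending in "n" collapse into a single row with frequency 0.15, and the result is an ordinary ungrouped tibble, so no ungroup() call is needed afterward.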
Now, create a ggplot object, and within ggplot(aes())
, set x-axis values to year
, y-axis values to frequency
, and colour to last_letter.
...|> ggplot(aes(x = ..., y = ..., colour = ...))
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter))
Add geom_line()
to the ggplot object from the previous exercise.
... + geom_line()
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line()
Alternatively, we could have used another chart type, such as a scatter plot. However, a line plot suits our goal because it shows how the frequency of each last letter changes over time.
Because we want to focus on the most prevalent last letters, we color only the letters whose frequency has exceeded a certain threshold.
We use 15% as the threshold. That is, we color a last letter only if, in at least one year, more than 15% of all baby boys born that year had a name ending in that letter.
We will use the gghighlight package for this purpose. Add gghighlight() to the ggplot object. Within this function, require max(frequency) to be greater than 0.15, and to avoid any warnings, set label_key equal to last_letter.
...+ gghighlight(max(...) > ..., label_key = ...)
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line() + gghighlight( max(frequency) > .15, label_key = last_letter )
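The condition max(frequency) > 0.15 is evaluated once per line, that is, once per last letter. The sketch below reproduces that per-group test with dplyr on made-up data to show which letters would be highlighted. It is only an illustration of the logic, not gghighlight's internal implementation, and the values in toy are hypothetical:

```r
library(dplyr)
library(tibble)

# Hypothetical frequencies for three last letters over two years.
toy <- tribble(
  ~year, ~last_letter, ~frequency,
  2000,  "n",          0.10,
  2001,  "n",          0.30,
  2000,  "h",          0.12,
  2001,  "h",          0.14,
  2000,  "e",          0.20,
  2001,  "e",          0.05
)

# One row per letter: does its peak frequency ever exceed the threshold?
highlighted <- toy |>
  summarize(peak = max(frequency), .by = last_letter) |>
  filter(peak > 0.15) |>
  pull(last_letter)

highlighted  # "n" "e"
```

The letter "h" never rises above 15% in this toy data, so its line would be drawn in grey.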
The default x-axis ticks are not informative enough, so we set custom ones.
Add scale_x_continuous(), and within this function, set the breaks argument equal to a numeric vector of years. You can choose any years within the data set. We chose 1880, 1900, 1925, 1950, 1975, 2000, and 2017.
... + scale_x_continuous(breaks = c(1880, 1900, ..., ..., ..., 2000, 2017))
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line() + gghighlight( max(frequency) > .15, label_key = last_letter ) + scale_x_continuous( breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017))
A good rule of thumb, however, is to include the first and last years so that readers can see the range of years the data covers.
We also want the y-axis to show percentages instead of proportions. Therefore, add scale_y_continuous(), and within this function, set the labels argument equal to scales::percent_format().
... + scale_y_continuous(... = scales::...())
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line() + gghighlight( max(frequency) > .15, label_key = last_letter ) + scale_x_continuous( breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) + scale_y_continuous( labels = scales::percent_format())
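percent_format() does not format anything by itself. It returns a function, which ggplot2 then applies to the axis breaks. A small sketch of that behavior outside a plot follows. The accuracy = 1 argument is an addition made here so the rounding is explicit, since the tutorial's code relies on the defaults, and to_percent is a name chosen for illustration:

```r
library(scales)

# percent_format() returns a labelling function rather than labels.
to_percent <- percent_format(accuracy = 1)

# Applying it to proportions gives the strings shown on the axis.
labels <- to_percent(c(0, 0.05, 0.15, 0.2))

labels  # "0%" "5%" "15%" "20%"
```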
The scales package provides many functions for formatting axis labels, such as percent_format(). Separately, ggplot2 itself provides scale_y_binned(), which can be used to discretize continuous position data.
Add theme_minimal()
.
... + theme_minimal()
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line() + gghighlight( max(frequency) > .15, label_key = last_letter ) + scale_x_continuous( breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) + scale_y_continuous( labels = scales::percent_format()) + theme_minimal()
We also could have used themes such as theme_gray()
, theme_void()
, and theme_bw()
.
Finally, add labs() to the plot to set a caption, title, subtitle, and axis labels of your choice.
Reminder: This is what your plot should look like
boys_p
... + labs(...)
babynames |> filter( sex == 'M') |> select( name, year, prop) |> mutate( last_letter = str_sub(name, -1, -1)) |> summarize( frequency = sum(prop), .by = c(year, last_letter)) |> ggplot( aes(x = year, y = frequency, colour = last_letter)) + geom_line() + gghighlight( max(frequency) > .15, label_key = last_letter ) + scale_x_continuous( breaks = c(1880, 1900, 1925, 1950, 1975, 2000, 2017)) + scale_y_continuous( labels = scales::percent_format()) + theme_minimal() + labs( x = "Year", y = "", title = "Last Letter of Boy Baby Names", subtitle = "Names ending with 'N' increased rapidly after 1950")
This tutorial covered Chapter 14: Strings from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
You learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Some important functions which we learned include:
str_c()
,
str_glue()
,
str_flatten()
,
separate_longer_delim()
, and more.