library(learnr) knitr::opts_chunk$set(echo = TRUE, tidy.opts = list(width.cutoff = 60), tidy = "styler", warning = TRUE, comment = "") options(width = 80) source("waffles.R") x <- "Pawnee rules!" y <- c("Pawnee rules", "Eagleton drools") z <- c("Pawnee rules and Eagleton drools.", "I love friends, waffles, and work.") keep_words <- function(words) { words[nchar(words) > 0] } is_special_ending <- function(ending) { is_es <- all(ending == c("e", "s")) is_ed <- all(ending == c("e", "d")) is_e_not_le <- ending[2] == "e" & ending[1] != "l" is_es | is_ed | is_e_not_le } rm_special_endings <- function(word_letters) { word_tail <- tail(word_letters, n = 2) if (is_special_ending(word_tail)) { if (word_tail[2] == "e") { word_letters[-length(word_letters)] } else { head(word_letters, n = -2) } } else { word_letters } } is_vowel <- function(letter) { letter %in% c("a", "e", "i", "o", "u", "y") } count_syllables <- function(word) { word_letters <- unlist(strsplit(word, split = "")) if (length(word_letters) <= 3) { 1 } else { word_letters <- rm_special_endings(word_letters) word_vowels <- is_vowel(word_letters) sum(word_vowels) - sum(diff(which(word_vowels)) == 1) } } x <- "Pawnee rules" y <- c("Pawnee rules", "Eagleton drools") reading_ease <- function(text) { 96.76339 } z <- c("Pawnee rules and Eagleton drools.", "I love friends, waffles, and work.") word_z <- strsplit(z, split = " ") source("../../setup.R")
## https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.Rmd library(knitr) hook_output = knit_hooks$get('output') knit_hooks$set(output = function(x, options) { # this hook is used only when the linewidth option is not NULL if (!is.null(n <- options$linewidth)) { x = knitr:::split_lines(x) # any lines wider than n should be wrapped if (any(nchar(x) > n)) x = strwrap(x, width = n) x = paste(x, collapse = '\n') } hook_output(x, options) }) source("waffles.R")
```{js, echo=FALSE} $(function() { $('.ace_editor').each(function( index ) { ace.edit(this).setFontSize("20px"); }); })
\newtheorem{question}{Question} \begin{center} All rights reserved, Michael Tsiang, 2019. \includegraphics[width=0.1\textwidth]{by-nc-sa.png} Acknowledgements: Miles Chen and Jake Kramer \end{center} \newpage ## Learning Objectives {-} After studying this chapter, you should be able to: * Perform basic string manipulation in R * Perform basic pattern matching with `grep()`, `grepl()`, and `gsub()` * Interpret and use basic regular expressions * Calculate the Flesch reading ease score \vspace{-0.1in} ## Introduction Most of statistical computing involves working with numeric data. However, many modern applications have considerable amounts of data in the form of text. There are whole areas of statistics and machine learning devoted to organizing and interpreting text-based data, such as textual data analysis, linguistic analysis, text mining, sentiment analysis, and natural language processing (NLP). For more information and resources: - <https://cran.r-project.org/web/views/NaturalLanguageProcessing.html> - <https://www.tidytextmining.com/> Text-based analyses are beyond the scope of this course. However, even in non-text-based analyses, working with data in R often requires processing of characters, such as in row/column names, dates, monetary quantities, longitude/latitude, etc. Other common scenarios involving characters: - Removing a given character in the names of your variables - Changing the level(s) of a categorical variable - Replacing a given character in a dataset - Converting labels to upper or lower case - Extracting a regular pattern of characters from a large text file - Parsing input from an XML or HTML file A basic understanding of character (or string) manipulation and regular expressions can be a valuable skill for any statistical analysis. We will discuss the most common syntax and functions for string manipulation in base R and introduce basic regular expressions in R. For more information and resources: Books and Articles - Gaston Sanchez's "Handling Strings with R": <https://www.gastonsanchez.com/r4strings/> - Garrett Grolemund and Hadley Wickham's "R for Data Science": <http://r4ds.had.co.nz/strings.html> - <https://en.wikibooks.org/wiki/R_Programming/Text_Processing> - <https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html> Cheat Sheets for `stringr` and Regular Expressions - <https://github.com/rstudio/cheatsheets/raw/master/strings.pdf> - <https://www.cheatography.com//davechild/cheat-sheets/regular-expressions/pdf/> Sites for Testing Regular Expressions - <https://regex101.com/> - <https://regexr.com/> ## Characters in R <!-- \vspace{-0.1in} --> ### Basic Definitions Symbols in R that represent text or words are called **characters**. A **string** is a character variable that contains one or more characters, but we often will use "character" and "string" interchangeably. Values that are stored as characters have base type **`character`** and are typically printed with quotation marks. ```r x <- "Pawnee rules" x
Use the typeof()
function to determine the type of object x
is.
Characters can be created using single or double quotation marks, but double quotation marks are almost universally preferred.
While it is possible to use single quotation marks within double quotation marks and vice versa, you should not. You cannot directly insert single quotes within single quotes or double quotes within double quotes.
The double quotation inside a string is a special character, so inserting it within double quotes requires a backslash `\" to escape the special property of the character.
"This is the 'R' Language" "This is the \"R\" Language"
"This is an "error""
Fix the error below:
"This is an "error""
The character()
function creates a character vector of a specified length. The default value in each element of the vector is the empty character ""
.
Make a length-5 character vector using the character()
function.
character(5)
character(5)
Note: The empty character ""
is not the same as character(0)
.
question("What is the length of the vector ''?", answer("0"), answer("1", correct = TRUE), answer("NA"), answer("2"), random_answer_order = TRUE, allow_retry = TRUE)
paste()
FunctionThe paste()
function is one of the most important functions for creating and building strings.
The paste()
function inputs one or more R objects, converts them to character
, and then concatenates (pastes) them to form one or several character strings.
The basic syntax is:
paste(..., sep=" ", collapse = NULL)
...
argument means the input can be any number of objects.sep
argument specifies the separator between characters after pasting. The default is a single whitespace " "
.collapse
argument specifies characters to separate the result.paste("I ate some", pi, "and it was deloicious.") paste("Bears", "Beets", "Battlestar Galactica", sep=", ") paste("h", c("a", "e", "o"), sep = "") # No collapsing paste("h", c("a", "e", "o"), sep = "", collapse = ", and ")
Note: Vectors of different lengths will be recycled in the usual way, but partial recycling will not throw a warning.
There are several functions to print strings:
print()
is for generic printing.noquote()
is for printing without quotation marks.cat()
is for concatenation.format()
is for (pretty) printing with special formatting.print()
and noquote()
FunctionsThe print()
function (technically the print.default()
method) has an optional logical quote
argument that specifies whether to print characters with or without quotation marks.
A similar output can be produced using noquote()
.
Using print()
display x
without quotation marks.
print(x, quote = FALSE) noquote(x)
print(x, quote = FALSE)
Do the same with noquote()
Side Note: While the output appears identical, the commands are not the same. The noquote()
function outputs a noquote
class object, which is then inputted into the print.noquote()
method.
cat()
FunctionThe cat()
funtion concatenates multiple character vectors into a single vector, adds a specified separator, and prints the result (without quotations).
cat(x, "Eagleton drools", sep = ", ")
The printing is slightly different from that of noquote()
. In particular, the printed output does not have the vector index, and the cat()
function returns an invisible NULL
(meaning assigning the printed output to a variable does not work).
One benefit of cat()
is that the printed output can be saved to an external file using the file
argument:
dir() cat(x, "Eagleton drools", sep=", ", file = "pawnee.txt") dir()
When file
is specified, an optional logical argument append
specifies whether the result should be appended to or overwrite an existing file.
Side Note: There are a few other optional arguments that are useful for longer text strings. Consult the R documentation for more information.
format()
FunctionThe format()
function formats an R object for "pretty" printing.
Some useful arguments used in format()
:
width
specifies the (minimum) width of strings produced.trim
specifies whether there should be no padding with spaces (TRUE
).justify
controls how padding takes place for strings. Takes the values "left"
, "right"
, "centre"
, or "none"
.For controling the printing of numbers:
digits
specifies the number of significant digits to use.nsmall
specifies the minimum number of digits to the right of the decimal point to include.scientific
specifies whether to use scientific notation (TRUE
) or standard notation (FALSE
).format(1 / (1:5), digits = 2) format(1 / (1:5), digits = 2, scientific = TRUE) format(c("Pawnee", "rules", "Eagleton", "drools"), width=10, justify = "left")
question("How would I use format() to print 100 / 7 in scientific notation with three significant digits?", answer("format(100 / 7, digits = 3, scientific = TRUE)", correct = TRUE), answer("format(100 / 7, nsmall = 3, scientific = TRUE)"), answer("format(100 / 7, digits = 3)"), answer("format(100 / 7, nsmall = 3)"))
There are many functions in base R for basic string manipulation.
Function | Description
---------|------------
nchar()
| Returns number of characters
tolower()
| Converts to lower case
toupper()
| Converts to upper case
casefold()
| Wrapper for tolower()
and toupper()
chartr()
| Translates characters
abbreviate()
| Abbreviates characters
substr()
| Extracts substrings of a character vector
strsplit()
| Splits strings into substrings
The best way to understand how these functions work is to try them on simple examples and see how the input character vector changes.
nchar()
FunctionThe nchar()
function inputs a character vector and outputs the number of (human-readable) characters contained in each entry of the vector.
y <- c("Pawnee rules", "Eagleton drools") nchar(y)
The process of coverting all characters into the same case (upper-case or lower-case) is called case folding. The tolower()
and toupper()
functions convert upper-case letters in a character vector to lower-case, or vice-versa.
tolower(y) toupper(y)
The casefold()
function is a wrapper for tolower()
and toupper()
that uses a logical argument upper
to determine which function to use.
casefold(y, upper = TRUE)
Question: What is the default value of the upper
argument in casefold()
? Equivalently, does casefold()
use tolower()
or toupper()
by default?
chartr()
FunctionThe chartr()
function performs character translation. The old
argument specifies the characters to be translated, and the new
argument specifies the translations.
For example, the command below translates p
into P
and e
into E
.
chartr(old = "pe", new = "PE", y)
We can also translate to and from non-alphabetic characters. For example, the command below translates a
into #
, e
into ?
, and o
into !
.
chartr("aeo", "#?!", y)
question("How can I change Mike to Jake using chartr()?", answer("chartr(old = 'Mike', new = 'Jake', 'Mike')", correct = TRUE), answer("chartr(old = 'Mi', new = 'Ja', 'Jake')"), answer("chartr(old = 'Mi', new = 'Ja', 'Mike')", correct = TRUE), answer("chartr(old = 'Mike', new = 'ja', 'Mike')"), answer("chartr(old = 'Jake', new = 'Mike', 'Jake')"), allow_retry = TRUE, random_answer_order = TRUE)
substr()
FunctionThe substr()
function inputs a character vector and extracts a substring (i.e., a subset of the original character values) starting from the start
position and ending with the stop
position.
## Extract the 2nd to 9th characters of `x` substr(x, start = 2, stop = 9) ## Extract the 3rd to 5th characters of each value in `y` substr(y, start = 3, stop = 5)
strsplit()
FunctionThe strsplit()
function splits elements of a character vector x
into substrings based on the pattern specified in the split
argument. The output of strsplit()
is always a list object with the same length as the input vector. In particular, the i
th component of the output list contains the vector of splits of x[i]
.
For example, the command below splits the character Pawnee rules
by the letter l
.
strsplit(x, split = "l")
To separate a sentence into separate words, we can split by the single space character " "
.
z <- c("Pawnee rules and Eagleton drools.", "I love friends, waffles, and work.") word_z <- strsplit(z, split = " ") word_z
Note: The i
th component in word_z
contains the words in the i
th sentence of z
. To combine all values in separate components of a list into a single vector, we can use the unlist()
function to remove the list structure.
z <- c("Pawnee rules and Eagleton drools.", "I love friends, waffles, and work.") word_z <- strsplit(z, split = " ")
unlist(word_z)
%in%
OperatorOne main application of string manipulation is pattern matching. Finding patterns in text are useful for data validation, data scraping, text parsing, filtering search results, etc.
A first tool for pattern/value matching is the %in%
operator. The %in%
operator is a vectorized binary operator that checks each value in the vector on the left and returns TRUE
if the entry matches one of the values on the right and FALSE
otherwise. The output of the %in%
operator is always a logical vector.
1:10 %in% c(5, 7, 9) ucla <- c("u", "c", "l", "a") letters %in% ucla letters[letters %in% ucla]
Question: How would you write a function is_vowel()
to find the vowels ("a"
, "e"
, "i"
, "o"
, "u"
, "y"
) in a character vector containing single letters?
is_vowel <- function(letter) { }
Use the %in% operator as described above
is_vowel <- function(letter) { letter %in% c("a", "e", "i", "o", "u", "y") }
is_vowel <- function(letter) { letter %in% c("a", "e", "i", "o", "u", "y") }
grep()
and grepl()
FunctionsThe grep()
and grepl()
functions searches for matches to a pattern in an input character vector. The syntax and use are largely the same, but the return output is different. The basic syntax is:
grep(pattern, x) grepl(pattern, x)
pattern
argument is the character string to be matched in the character vector x
. The pattern
can be the literal character(s) to match or a regular expression.x
argument is the input character vector where matches are to be found.ignore.case
argument that specifies whether the pattern to match is case sensitive or not.The command grep(pattern, x)
returns a numeric vector of the indices of the entries of x
that contain a match to pattern
. The command grepl(pattern, x)
returns a logical vector of whether each entry of x
contains a match (TRUE
) or not (FALSE
).
test <- c("April", "and", "Andy", "love", "Champion", "and", "Lil'", "Sebastian") grep(pattern = "a", test) grepl(pattern = "a", test)
gsub()
FunctionThe gsub()
function finds and replaces patterns in an input character vector. The basic syntax is:
gsub(pattern, replacement, x)
The pattern
, x
, and optional arguments of gsub()
are identical to those found in grep()
and grepl()
. While grep()
and grepl()
identify pattern matches, gsub()
replaces the pattern
match by the replacement
argument.
test <- c("April", "and", "Andy", "love", "Champion", "and", "Lil'", "Sebastian")
gsub(pattern = "A", replacement = "a", test) gsub(pattern = "a", replacement = "x", test)
\newpage
For more complicated patterns, we need more tools to efficiently specify the pattern to match.
A regular expression (or regex) is a set of symbols that describes a text pattern. More formally, a regular expression is a pattern that describes a set of strings.
Regular expressions are a formal language in their own right in the sense that the symbols have a defined set of rules to specify the desired patterns. Most programming languages, including R, can use and implement regular expressions. The best way to learn the syntax and become fluent with regular expressions is to practice.
Some common applications of regular expressions:
We will not cover a full treatment of regular expressions in R (it is typically covered in detail in Stats 102A). For more information, refer to the ?regex
help documentation or one of the references at the beginning of this chapter.
We will introduce a few basic regular expressions in the next section to allow us to compute a readability measure of the difficulty level of a passage in English is to understand.
The Flesch reading ease score is a numeric measure of English readability, i.e., the ease with which a reader can understand text.
The formula to compute the Flesch reading ease ($\mathrm{RE}$) score is $$ \begin{aligned} \mathrm{RE} &= 206.835 - \left(1.015 \times \frac{\hbox{total words}}{\hbox{total sentences}}\right) - \left(84.6 \times \frac{\hbox{total syllables}}{\hbox{total words}}\right) \ &= 206.835 - (1.015 \times \mathrm{ASL}) - (84.6 \times \mathrm{ASW}) \end{aligned} $$ where $\mathrm{ASL}$ is the average sentence length and $\mathrm{ASW}$ is the average number of syllables per word.
The reading ease score $\mathrm{RE}$ is usually a number between 0 and 100, though there are some exceptions for non-standard words/sentences. Higher values of $\mathrm{RE}$ indicate text that is easier to read, and lower values indicate text that is more difficult to read.
More information on the Flesch reading ease score:
From the Readability Formulas site:
"Though simple it might seem, the Flesch Reading Ease Formula has certain ambiguities. For instance, periods, explanation [sic] points, colons and semicolons serve as sentence delimiters; each group of continuous non-blank characters with beginning and ending punctuation removed counts as a word; each vowel in a word is considered one syllable subject to: (a) -es, -ed and -e (except -le) endings are ignored; (b) words of three letters or shorter count as single syllables; and (c) consecutive vowels count as one syllable."
We want to write a function in R that inputs an English passage and outputs the reading ease score of the passage.
As an example, we will compute the Flesch reading ease score of the following passage:
"We need to remember what's important in life: friends, waffles, work. Or waffles, friends, work. Doesn't matter, but work is third." -- Leslie Knope (Parks and Recreation)
For convenience, the command to create a waffles
object containing this passage is in the waffles.R
file on CCLE.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]]
source("waffles.R") waffles
The primary components of the Flesch reading score formula are: sentences, words, and syllables. The main steps to compute the reading score are:
The waffles
object is a single character value (i.e., a character vector of length 1) that contains the entire passage. We want to separate the text passage into individual sentences.
To find the individual sentences, we need to split the text string based on "end of sentence" punctuation. The sentence delimiters, i.e., what symbols represent the end of a sentence, we want to consider are periods (.
), exclamation points (!
), question marks (?
), colons (:
), and semicolons (;
).
A regular expression that represents the pattern of "any sentence delimiter" would be [.!?:;]
. In the context of regular expressions, the square brackets define a character set, which means any single character that is contained within the brackets will match the pattern. The regular expression for "any vowel" would be [aeiouy]
.
We will use this regular expression to split the sentences in the waffles
object into separate characters.
strsplit(waffles, split = "[.!?:;]")
Remember that the output of strsplit()
is always a list. Since the waffles
object was a single character value, then the output of strsplit(waffles, split = "[.!?:;]")
has a single component. To continue processing the text, we will extract the character vector inside.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]] waffles_sentences
Within each sentence, the capitalization and punctuation are not important to the Flesch reading ease formula (since it only counts syllables, words, and sentences, not capital letters, commas, or apostrophes). We can thus prepare our sentences for splitting into words by converting all letters to lower case and removing all remaining punctuation.
We can apply the tolower()
function to perform the case folding.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]]
waffles_sentences <- tolower(waffles_sentences) waffles_sentences
To remove the remaining punctuation, we can use the regular expression [[:punct:]]
that represents the pattern of any single punctuation symbol. We will replace the punctuation by the empty character ""
.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]] waffles_sentences <- tolower(waffles_sentences)
waffles_sentences <- gsub(pattern = "[[:punct:]]", replacement = "", waffles_sentences) waffles_sentences
Question: Why did we not use [[:punct:]]
to remove the sentence delimiters? Why is it safe to match all punctuation at this step, including sentence delimiters?
We now have a vector of sentences that have been processed to remove cases and punctuation. We are ready to move on to counting words!
At this stage in the notes, each entry in the waffles_sentences
vector is a sentence that we want to split into words. Since we have removed all punctuation already, the only character that separates words is the single whitespace character " "
. We thus can use strsplit()
again, splitting based on " "
.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]] waffles_sentences <- tolower(waffles_sentences) waffles_sentences <- gsub(pattern = "[[:punct:]]", replacement = "", waffles_sentences)
waffles_words <- strsplit(waffles_sentences, split = " ") waffles_words
Caution: Notice the output! By splitting based on the whitespace " "
, we have leading empty characters in all components of waffles_words
except for the first component. This is due to the space after the end of a sentence. We do not want to count the empty character ""
as a word in our reading ease formula, so we need to remove them.
Within each component of the waffles_words
list, we want to only keep the character values that have a nonzero number of characters. This can be done in one line, but we will create a helper function for clarity.
keep_words <- function(words) { words[nchar(words) > 0] }
The keep_words()
function inputs a vector of words and returns only the words that have a positive number of characters. We can now use lapply()
to apply the keep_words()
function to each component of the waffles_words
list.
waffles_sentences <- strsplit(waffles, split = "[.!?:;]")[[1]] waffles_sentences <- tolower(waffles_sentences) waffles_sentences <- gsub(pattern = "[[:punct:]]", replacement = "", waffles_sentences) waffles_words <- strsplit(waffles_sentences, split = " ")
waffles_words <- lapply(waffles_words, keep_words) waffles_words
Note: At this point in the notes, you are able to compute the $\mathrm{ASL}$ (average sentence length) value in the reading ease formula.
Question: How would you compute the $\mathrm{ASL}$ for the waffles
text using the waffles_words
object?
The last piece of the reading ease formula we need to compute is the number of syllables in a word.
Formally, each vowel sound, with or without consonants, defines a syllable within a word. For the purposes of the reading ease formula, we will count the syllables by the number of vowels in a word, subject to three rules:
(a) Words of three letters or shorter count as single syllables (b) -es, -ed and -e (except -le) endings are ignored (c) Consecutive vowels count as one syllable
Side Note: These rules will inevitably yield different numbers of syllables than you may expect due to a myriad of exceptions in English spelling and pronunciation. We will follow these rules to be consistent with the way the Flesch reading ease score is calculated, even if it may not be 100% accurate in all cases.
Before tackling all of the words in the entire text, we first want to know how to count the syllables in a single word.
As a first step, we need to separate each word into its component letters. To split a word into letters, we can again use strsplit()
, now splitting by the empty character ""
.
tom_letters <- unlist(strsplit("tom", split = "")) tom_letters horses_letters <- unlist(strsplit("horses", split = "")) horses_letters eagleton_letters <- unlist(strsplit("eagleton", split = "")) eagleton_letters
Note: Remember that the output of strsplit()
is always a list.
The first rule: Words of three letters or shorter count as single syllables.
Regardless of how many vowels are in a short word (at most 3 letters long), the Flesch reading ease formula counts the entire word as one syllable. Thus, before counting vowels, we simply need to check whether there are at most 3 letters in the word.
The tom_letters
vector contains the individual letters of the word "tom"
. The number of letters in the word is the length of the tom_letters
vector.
tom_letters <- unlist(strsplit("tom", split = ""))
length(tom_letters)
Since length(tom_letters)
is 3, then tom
is one syllable.
For longer words, such as "horses"
and "eagleton"
, we will need to consider the other rules and count the vowels.
The second rule: -es, -ed and -e (except -le) endings are ignored.
Before we count the syllables in words longer than 3 letters, we need to first ignore the special word endings: -es, -ed, and -e, unless the ending is -le.
The horses_letters
vector contains the individual letters of the word "horses"
. We can use the tail()
function to extract the last two letters of the word.
horses_letters <- unlist(strsplit("horses", split = ""))
horses_tail <- tail(horses_letters, n = 2) horses_tail
We can write a helper function is_special_ending()
that inputs a vector of two letters (that represent the last two letters of a word) and returns TRUE
if the word ends in a special ending (-es, -ed, or -e except -le) and FALSE
otherwise.
is_special_ending <- function(ending) { is_es <- all(ending == c("e", "s")) is_ed <- all(ending == c("e", "d")) is_e_not_le <- ending[2] == "e" & ending[1] != "l" is_es | is_ed | is_e_not_le }
horses_letters <- unlist(strsplit("horses", split = "")) horses_tail <- tail(horses_letters, n = 2)
is_special_ending(horses_tail)
Since the word ends in -es, we will remove the word ending and count the syllables in the remaining "word" "hors"
.
rm_special_endings <- function(word_letters) { word_tail <- tail(word_letters, n = 2) if (is_special_ending(word_tail)) { if (word_tail[2] == "e") { word_letters[-length(word_letters)] } else { head(word_letters, n = -2) } } else { word_letters } }
horses_letters <- unlist(strsplit("horses", split = "")) eagleton_letters <- unlist(strsplit("eagleton", split = ""))
rm_special_endings(horses_letters) rm_special_endings(eagleton_letters)
Note: Notice that rm_special_endings()
will remove two letters from words ending in -es or -ed but only one letter from words ending in -e other than -le. If there is no special ending, the function will simply leave the input letters unchanged.
\newpage
The third rule: Consecutive vowels count as one syllable.
Once the first two rules are accounted for, we next need to be able to identify which letters are vowels. As an example, we will use the character "eagleton"
.
We can write a helper function is_vowel()
that inputs a vector of letters and returns TRUE
for each vowel and FALSE
otherwise.
is_vowel <- function(letter) { letter %in% c("a", "e", "i", "o", "u", "y") }
eagleton_letters <- unlist(strsplit("eagleton", split = ""))
eagleton_vowels <- is_vowel(eagleton_letters) eagleton_vowels
If every vowel counted as one syllable, then sum(is_vowel(eagleton_letters))
would be the number of syllables in the word. However, we need to count consecutive vowels as a single syllable. For example, the word "eagleton"
has two consecutive vowels e
and a
that count as one syllable instead of two. How do we identify when two vowels are next to each other?
There are many ways to account for consecutive vowels. One way is to consider the numeric indices of the vowels: Notice that letters that are consecutive vowels will appear as consecutive TRUE
values in the output of is_vowel()
. This translates into consecutive indices of the TRUE
values.
eagleton_letters <- unlist(strsplit("eagleton", split = "")) eagleton_vowels <- is_vowel(eagleton_letters)
which(eagleton_vowels)
Based on the numeric indices alone, can you tell which vowels are consecutive?
Consecutive vowels will have consecutive indices! One trick to find consecutive indices is to find the consecutive differences with the diff()
function.
diff(which(eagleton_vowels))
Each consecutive difference of 1 indicates consecutive vowels. So the total number of syllables in the word is $$\hbox{(number of syllables)} = \hbox{(number of vowels)} - \hbox{(number of consecutive differences of 1 in the vowel indices)}.$$
We now have all the components to compute the syllables of a word.
count_syllables <- function(word) { word_letters <- unlist(strsplit(word, split = "")) if (length(word_letters) <= 3) { 1 } else { word_letters <- rm_special_endings(word_letters) word_vowels <- is_vowel(word_letters) sum(word_vowels) - sum(diff(which(word_vowels)) == 1) } }
\newpage
count_syllables("tom") count_syllables("horses") count_syllables("eagleton") count_syllables("pneumonoultramicroscopicsilicovolcanoconiosis")
Note: At this point in the notes, you are able to compute the $\mathrm{ASW}$ (average number of syllables per word) value in the reading ease formula.
Question: How would you compute the $\mathrm{ASW}$ for the waffles
text using the waffles_words
object?
Success! We have found a way to compute each component of the Flesch reading score formula.
Your task is to use the workflow and helper functions explained and provided in this section to write a reading_ease()
function that will compute the Flesch reading ease score for an input text passage.
reading_ease <- function(text) { 96.76339 }
waffles
reading_ease(waffles)
For the waffles
text, the Flesch reading score is r reading_ease(waffles)
. Use this value to verify that you have implemented your function correctly.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.