Character String Basics

v <- names(precip)[3:8]
v

# replacement values are recycled (with a warning) when there are fewer of them than positions
replace(v, list = 1:3, c('replacement 1', 'replacement 2'))


metacharacters


Combining And Splitting Strings

paste('some', 'text')

paste(c('some', 'more'), 'text')

paste(c('some', 'text'), collapse = '_')


substr('some text', start = 3, stop = 8)


strsplit('even_more_text', split = '_')
strsplit(v, split = ' ')
text <- "In the Lenin Barracks in Barcelona, the day before I joined the militia, I saw an Italian militiaman standing in front of the officers' table.

He was a tough-looking youth of twenty-five or six, with reddish-yellow hair and powerful shoulders. His peaked leather cap was pulled fiercely over one eye. He was standing in profile to me, his chin on his breast, gazing with a puzzled frown at a map which one of the officers had open on the table. Something in his face deeply moved me. It was the face of a man who would commit murder and throw away his life for a friend--the kind of face you would expect in an Anarchist, though as likely as not he was a Communist. There were both candour and ferocity in it; also the pathetic reverence that illiterate people have for their supposed superiors. Obviously he could not make head or tail of the map; obviously he regarded map-reading as a stupendous intellectual feat. I hardly know why, but I have seldom seen anyone--any man, I mean--to whom I have taken such an immediate liking. While they were talking round the table some remark brought it out that I was a foreigner."

# split text on period followed by any character
sentences <- strsplit(text, split = '\\..')[[1]]
sentences

sapply(strsplit(sentences, ' '), length)
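
A single character after the period does not cover the paragraph break in the text, where the period is followed by two newlines. One possible refinement, shown here as a sketch on a short made-up string (sample_text is a stand-in, not part of the original example), is to split on a period followed by any run of whitespace:

```r
# hypothetical two-paragraph sample standing in for the longer text
sample_text <- "First sentence here. Second one.\n\nThird sentence, after a break."

# split on a literal period followed by one or more whitespace characters
parts <- strsplit(sample_text, split = '\\.\\s+')[[1]]
parts

# count the words in each sentence, as before
word_counts <- sapply(strsplit(parts, ' '), length)
word_counts
```

The final sentence keeps its period because nothing follows it, so the pattern does not match there.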




Matching Strings

vv <- c('word 1', 'word 2', 'word3', 'Word 4', 'Words', 'word')

startsWith(vv, 'word')

endsWith(vv, 's')
# grep returns indices of matching vector elements
grep('word', vv)

# it can also return the matching items
grep('word', vv, value = TRUE)

# or the non-matching ones
grep('word', vv, value = TRUE, invert = TRUE)

# character case can be ignored
grep('word', vv, value = TRUE, ignore.case = TRUE)

# grepl returns a logical vector based on the presence or absence of a match
grepl('word', vv)

# both functions can ignore case
grepl('word', vv, ignore.case = TRUE)




Regular Expressions

The first argument of grep and grepl specifies the pattern that is matched against the elements of a character vector. Normally every character is matched exactly, but it is possible to loosen the matching process. Regular expressions (often abbreviated as regex) are a way of generalizing the search pattern so that non-exact matches are identified. Matches can occur on alternative characters, repeated characters and more.

This generalization is achieved by using wildcards, which are put into the pattern as metacharacters. The most often used wildcards are:
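
A few of the common metacharacters can be tried out on a small vector. This is only a sketch: the vector x and the result names are made up for the illustration, and the selection of metacharacters is not exhaustive.

```r
# hypothetical sample vector, used only for this illustration
x <- c('cat', 'cot', 'coat', 'ct', 'cart', 'scat')

# . matches any single character
any_one <- grep('c.t', x, value = TRUE)
# * allows the preceding character zero or more times
zero_or_more <- grep('co*t', x, value = TRUE)
# + requires the preceding character at least once
one_or_more <- grep('co+t', x, value = TRUE)
# ? makes the preceding character optional
optional <- grep('co?t', x, value = TRUE)
# ^ and $ anchor the pattern to the start and the end of the string
anchored <- grep('^c.*t$', x, value = TRUE)
# [] matches any one of the listed characters
one_of <- grep('c[ao]t', x, value = TRUE)
# | separates alternative patterns
either <- grep('coat|cart', x, value = TRUE)

list(any_one, zero_or_more, one_or_more, optional, anchored, one_of, either)
```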

There are other ways of specifying character classes, e.g.:
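
As a rough illustration, a class of characters can be written as a range in brackets, as a POSIX named class, or as a backslash shorthand. The vector y and the result names below are made up for this sketch.

```r
# hypothetical sample vector, used only for this illustration
y <- c('abc', 'ABC', '123', 'a1', ' ', 'a_b')

# three ways of asking "does the string contain a digit?"
by_range     <- grepl('[0-9]', y)
by_posix     <- grepl('[[:digit:]]', y)
by_shorthand <- grepl('\\d', y)

# other named classes: upper-case letters and whitespace
has_upper <- grepl('[[:upper:]]', y)
has_space <- grepl('[[:space:]]', y)

# a leading ^ inside brackets negates the class: strings without any digit
no_digits <- grepl('^[^0-9]+$', y)

data.frame(y, by_range, has_upper, has_space, no_digits)
```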

Whenever a metacharacter is to be taken literally, it must be escaped with a backslash. Importantly, in an R string the backslash is itself an escape character, so it must be doubled for a single backslash to reach the regex engine. And so . means "any character" but \\. means "a literal period"; a lone \. is rejected by R as an unrecognized escape.
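
The effect of escaping can be checked on a small example. The vector z is hypothetical; fixed = TRUE is shown as an alternative that switches regex interpretation off altogether.

```r
# hypothetical file names, used only for this illustration
z <- c('file.txt', 'filetxt', 'file_txt')

# unescaped period: matches any character in that position
unescaped <- grepl('file.txt', z)

# escaped period: matches only a literal period
escaped <- grepl('file\\.txt', z)

# fixed = TRUE treats the whole pattern as a literal string
literal <- grepl('file.txt', z, fixed = TRUE)

# the string '\\.' holds two characters: one backslash and one period
pattern_length <- nchar('\\.')

rbind(unescaped, escaped, literal)
```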

Some examples:


vv

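# starts with a lowercase w
grep('^w', vv, value = TRUE)

# starts with either w or W
grep('^[wW]', vv, value = TRUE)

# the whole string is exactly "Ws"
grep('^Ws$', vv, value = TRUE)

# W, exactly one character of any kind, then s
grep('^W.s$', vv, value = TRUE)

# W, one or more characters, then s
grep('^W.+s$', vv, value = TRUE)

# W, one or more characters, then an optional s
grep('^W.+s?$', vv, value = TRUE)

# ends with a space followed by a digit
grep(' \\d$', vv, value = TRUE)

# ends with a digit preceded by zero or more spaces
grep(' *\\d$', vv, value = TRUE)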


replacing patterns

The sub function searches a string for matches to a regular expression just like grep does. However, rather than reporting matches, it replaces them with a replacement string. The replacement is a plain string, not a regex: metacharacters such as . or * in it are inserted literally.

days <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

grep('day', days, value = TRUE)

sub('day', 'DAY', days)

sub only acts on the first occurrence of the pattern in each string. To replace all occurrences use gsub.

sub('e', '.E.', 'Wednesday')

gsub('e', '.E.', 'Wednesday')
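
The same contrast can be seen on a whole vector, since both functions work element by element. The sketch below uses made-up version strings; it also shows that a period in the replacement is inserted as is.

```r
# hypothetical version strings, used only for this illustration
versions <- c('v1.2.3', 'v10.0.1')

# sub replaces only the first literal period in each element
first_only <- sub('\\.', '-', versions)

# gsub replaces every literal period
all_matches <- gsub('\\.', '-', versions)

# the replacement '.' is a plain period, not the "any character" wildcard
digits_masked <- gsub('\\d+', '.', versions)

rbind(first_only, all_matches, digits_masked)
```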


backreferences

Characters in a regular expression can be grouped using parentheses: (Wednes)(day). The ability to recall these groups individually is called backreferencing. In the replacement string the groups are recalled as \\1, \\2 and so on, numbered in the order in which they open. Backreferences allow for preserving a partial match during substitution, whereas normally the entire match is replaced.

sub('day', 'DAY', 'Wednesday')

sub('.*day', 'DAY', 'Wednesday')

sub('(.*)(day)', 'DAY', 'Wednesday')

sub('(.*)(day)', '\\1\\2', 'Wednesday')

sub('(.*)(day)', '\\1', 'Wednesday')

sub('(.*)(day)', '\\1DAY', 'Wednesday')
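
Recalled groups can also be reordered, which is handy for rearranging parts of a string. The following is a sketch on made-up inputs (people and dates are hypothetical names):

```r
# hypothetical inputs, used only for this illustration
people <- c('Orwell, George', 'Huxley, Aldous')
dates <- '2022-08-06'

# capture surname and first name, then emit them in reverse order
first_last <- sub('^(\\w+), (\\w+)$', '\\2 \\1', people)
first_last

# capture year, month and day, then rearrange them as day.month.year
day_first <- sub('(\\d{4})-(\\d{2})-(\\d{2})', '\\3.\\2.\\1', dates)
day_first
```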

Regular expressions are a powerful tool for searching for patterns within text.
They also allow for full and partial replacement of the matching patterns.




olobiolo/Rdlazer documentation built on Aug. 6, 2022, 11:37 a.m.