These functions can be used to 'clean' one or more sources. Cleaning consists of two operations: splitting the source at utterance markers, and conducting search-and-replace operations using regular expressions.
clean_source(input, outputFile = NULL,
  replacementsPre = list(c("([^\\.])(\\.\\.)([^\\.])",
  "\\1.\\3"), c("([^\\.])(\\.\\.\\.\\.+)([^\\.])",
  "\\1...\\3"), c("(\\s*\\r?\\n){3,}", "\n")),
  extraReplacementsPre = NULL,
  utteranceSplits = c("([\\?\\!]+\\s?|…\\s?|[[:alnum:]\\s?]\\.(?!\\.\\.)\\s?)"),
  utteranceMarker = "\n",
  replacementsPost = list(c("([^\\,]),([^\\s])", "\\1, \\2")),
  extraReplacementsPost = NULL, removeNewlines = FALSE,
  encoding = "UTF-8")
input |
For |
outputFile |
If not |
replacementsPre, replacementsPost |
Each is a list of two-element vectors,
where the first element in each vector contains a regular expression to search for
in the source(s), and the second element contains the replacement (these are passed
as |
extraReplacementsPre, extraReplacementsPost |
To perform more replacements
than the default set, these can be conveniently specified in |
utteranceSplits |
This is a vector of regular expressions that specify where to
insert breaks between utterances in the source(s). Such breaks are specified using
|
utteranceMarker |
How to specify breaks between utterances in the source(s). The
ROCK convention is to use a newline ( |
removeNewlines |
Whether to remove all newline characters from the source before starting to clean them. |
encoding |
The encoding of the source(s). |
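The two-element format used by replacementsPre and replacementsPost can be tried out with base R alone. The following is a minimal sketch, not the package's own implementation: applyReplacements is a hypothetical helper, and the use of gsub() with perl = TRUE is an assumption made here because the default patterns rely on Perl-style classes such as \s.

```r
# Sketch only: apply lists of c(pattern, replacement) pairs with base R.
# Assumption: gsub() with perl = TRUE; this page does not show how
# clean_source() applies the pairs internally.
replacementsPre <- list(
  c("([^\\.])(\\.\\.)([^\\.])", "\\1.\\3"),          # double period -> single
  c("([^\\.])(\\.\\.\\.\\.+)([^\\.])", "\\1...\\3"), # 4+ periods -> three
  c("(\\s*\\r?\\n){3,}", "\n")                       # 3+ newlines -> one
)
replacementsPost <- list(
  c("([^\\,]),([^\\s])", "\\1, \\2")                 # add space after comma
)

# Hypothetical helper, not part of the package.
applyReplacements <- function(x, replacements) {
  for (pair in replacements) {
    # pair[1] is the regular expression, pair[2] the replacement
    x <- gsub(pair[1], pair[2], x, perl = TRUE)
  }
  x
}

applyReplacements("Wait.. what", replacementsPre)
applyReplacements("it's..... Nice", replacementsPre)
applyReplacements("a\n\n\nb", replacementsPre)
applyReplacements("yes,no", replacementsPost)
```

Additional pairs in the same format could presumably be supplied through extraReplacementsPre and extraReplacementsPost, leaving the defaults untouched.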
When called with its default arguments, the following will happen:
Double periods (..) will be replaced with single periods (.).
Four or more periods (.... or .....) will be replaced with three periods.
Three or more newline characters will be replaced by one newline character (which will become more if the sentence before that character marks the end of an utterance).
All sentences will become separate utterances (in a semi-smart manner; specifically, breaks in speaking, if represented by three periods, are not considered sentence ends, whereas ellipses ("…" or Unicode U+2026, see the example) are).
If there are commas without a space following them, a space will be inserted.
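The "semi-smart" sentence splitting described above can be explored with the default utteranceSplits pattern in base R. This is a sketch under stated assumptions rather than the package's code: markUtterances is a hypothetical helper, and it assumes the marker is appended after each match using gsub() with perl = TRUE (required here because the pattern contains a lookahead).

```r
# Sketch only: mark utterance ends by appending the marker after each match.
# Assumptions: gsub() with perl = TRUE and "\\1" + marker as the replacement;
# the package's internals may differ.
utteranceSplits <- c("([\\?\\!]+\\s?|\u2026\\s?|[[:alnum:]\\s?]\\.(?!\\.\\.)\\s?)")
utteranceMarker <- "\n"

# Hypothetical helper, not part of the package.
markUtterances <- function(x, splits = utteranceSplits,
                           marker = utteranceMarker) {
  for (pattern in splits) {
    x <- gsub(pattern, paste0("\\1", marker), x, perl = TRUE)
  }
  x
}

marked <- markUtterances("Do you like it? Well... no. Yes.")

# Split on the marker and trim whitespace to inspect the utterances.
utterances <- trimws(strsplit(marked, utteranceMarker, fixed = TRUE)[[1]])
print(utterances)
```

In this sketch the pause written as three periods keeps "Well... no." together as one utterance, because the lookahead rejects a period that is followed by two more periods, while the question mark and the final period each end an utterance.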
A character vector for clean_source, or a list of character vectors for clean_sources.
exampleSource <-
"Do you like icecream?
Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."
### Default settings:
cat(clean_source(exampleSource));
### First remove existing newlines:
cat(clean_source(exampleSource,
                 removeNewlines=TRUE));