cleaning_sources: Cleaning sources

Description

These functions can be used to 'clean' one or more sources. Cleaning consists of two operations: splitting the source at utterance markers, and performing search-and-replace operations using regular expressions.

Usage

clean_source(input, outputFile = NULL,
  replacementsPre = list(c("([^\\.])(\\.\\.)([^\\.])",
  "\\1.\\3"), c("([^\\.])(\\.\\.\\.\\.+)([^\\.])",
  "\\1...\\3"), c("(\\s*\\r?\\n){3,}", "\n")),
  extraReplacementsPre = NULL,
  utteranceSplits = c("([\\?\\!]+\\s?|…\\s?|[[:alnum:]\\s?]\\.(?!\\.\\.)\\s?)"),
  utteranceMarker = "\n",
  replacementsPost = list(c("([^\\,]),([^\\s])", "\\1, \\2")),
  extraReplacementsPost = NULL, removeNewlines = FALSE,
  encoding = "UTF-8")

Arguments

input

For clean_source, either a character vector containing the text of the relevant source or a path to a file that contains the source text; for clean_sources, a path to a directory that contains the sources to clean.

outputFile

If not NULL, this is the name (and path) of the file in which to save the cleaned source.

replacementsPre, replacementsPost

Each is a list of two-element character vectors: the first element of each vector is a regular expression to search for in the source(s), and the second element is its replacement. These are passed as Perl regular expressions; see regex for more information. Plain words or phrases can also be used, since those are valid regular expressions as long as they contain no regex metacharacters. replacementsPre are applied before the utteranceSplits; replacementsPost are applied afterwards.

extraReplacementsPre, extraReplacementsPost

To perform replacements in addition to the default set, specify them in extraReplacementsPre and extraReplacementsPost. This saves you from having to copy and paste the list of defaults in order to retain it.
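The replacement arguments described above can be illustrated with a minimal base-R sketch. It is not the package's internal code: apply_replacements is a hypothetical helper, the extra "icecream" replacement is invented for illustration, and the sketch assumes that extra replacements are simply appended to the defaults and that each pair is applied in order with gsub(perl = TRUE).

```r
# Hypothetical helper (not part of rock): apply each c(pattern, replacement)
# pair in order, using Perl-compatible regular expressions.
apply_replacements <- function(x, replacements) {
  for (pair in replacements) {
    x <- gsub(pair[1], pair[2], x, perl = TRUE)
  }
  x
}

# The default replacementsPre, copied from the Usage section.
defaultsPre <- list(
  c("([^\\.])(\\.\\.)([^\\.])", "\\1.\\3"),
  c("([^\\.])(\\.\\.\\.\\.+)([^\\.])", "\\1...\\3"),
  c("(\\s*\\r?\\n){3,}", "\n")
)

# A purely illustrative extra replacement.
extraPre <- list(c("icecream", "ice cream"))

# Assumption: extras are appended after the defaults.
allPre <- c(defaultsPre, extraPre)

cleaned <- apply_replacements(
  "Do you like icecream?\n\n\nWell.. when it's..... nice.",
  allPre
)
cat(cleaned)
```

In this sketch the double period becomes a single period, the run of five periods becomes three, and the three newlines collapse into one before the extra replacement is applied.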

utteranceSplits

This is a vector of regular expressions that specify where to insert breaks between utterances in the source(s). Such breaks are marked using utteranceMarker.

utteranceMarker

How to specify breaks between utterances in the source(s). The ROCK convention is to use a newline (\n).
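How utteranceSplits and utteranceMarker interact can be sketched in base R. This is an assumption about the mechanism rather than the package's confirmed implementation: the sketch keeps each matched split (capture group 1) and appends the marker directly after it. The pattern is the default from the Usage section, with the ellipsis character written as \u2026.

```r
# Default utteranceSplits pattern from the Usage section
# (the ellipsis character is written as \u2026 for portability).
utteranceSplits <- c(
  "([\\?\\!]+\\s?|\u2026\\s?|[[:alnum:]\\s?]\\.(?!\\.\\.)\\s?)"
)
utteranceMarker <- "\n"

txt <- "Wait... what? Yes! I think so. Fine."

# Assumption: the matched text is kept and the marker is appended to it.
marked <- txt
for (pattern in utteranceSplits) {
  marked <- gsub(pattern, paste0("\\1", utteranceMarker), marked, perl = TRUE)
}

# One utterance per element.
utterances <- strsplit(marked, utteranceMarker, fixed = TRUE)[[1]]
print(utterances)
```

Because of the negative lookahead (?!\.\.), the three-period pause in "Wait..." is not treated as the end of an utterance, while "?", "!" and a sentence-final period are. Whitespace captured by the pattern stays in front of the marker in this sketch.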

removeNewlines

Whether to remove all newline characters from the source before cleaning it.

encoding

The encoding of the source(s).

Details

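When called with its default arguments, the following will happen, as can be read from the defaults in the Usage section. First, replacementsPre replaces two consecutive periods with a single period, replaces four or more consecutive periods with three periods, and collapses three or more consecutive newlines into one newline. Next, utteranceSplits marks a break, using utteranceMarker (a newline), at each run of question or exclamation marks, at each ellipsis character (…), and at each period that follows a letter, digit, whitespace character or question mark and is not itself followed by two more periods. Finally, replacementsPost inserts a space after any comma that is followed by a non-whitespace character.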

Value

A character vector for clean_source, or a list of character vectors for clean_sources.

Examples

exampleSource <-
"Do you like icecream?


Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."

### Default settings:
cat(clean_source(exampleSource));

### First remove existing newlines:
cat(clean_source(exampleSource,
                 removeNewlines=TRUE));

Matherion/rock documentation built on May 19, 2019, 6:20 p.m.