Pipe: Pipe for corpus preparation.

Description

The Pipe class offers a framework for corpus preparation and auxiliary tools. The methods of the class (wrappers for standard tools or helpers) use subdirectories of a pipe directory to take files and their content through the different stages of corpus preparation. To use Stanford CoreNLP, the class is extended by the PipeCoreNLP class.

Format

An object of class R6ClassGenerator of length 24.

Fields

dir

the pipe directory; the different processing stages of the corpus are kept in subdirectories of this directory

time

a data.frame with information on the time the different processing stages have consumed

threads

an integer, the number of cores to use

Usage

For usage details, see the Methods, Arguments and Fields sections.

Methods

$new(dir, threads = 1L)

Initialize new Pipe object.

$summary()

Return a data.frame with the number of files in the subdirectories of the pipe directory.

$preparePipeDir(subdirs = character(), delete = FALSE, verbose = TRUE)

Create the subdirectories provided by subdirs in the pipe directory, and delete existing files if delete is TRUE.
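Taken together, initialization and directory preparation might look as follows. This is a minimal sketch, assuming the ctk package providing the Pipe class is installed; the subdirectory names are illustrative:

```r
library(ctk)  # assumption: the package providing the Pipe class

# Create a Pipe rooted in a temporary directory, using one core
P <- Pipe$new(dir = tempdir(), threads = 1L)

# Create subdirectories for the processing stages (names are illustrative),
# wiping any files they may already contain
P$preparePipeDir(subdirs = c("xml", "tok", "tsv"), delete = TRUE, verbose = FALSE)

P$summary()  # data.frame with file counts per subdirectory
```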

$getFiles(sourceDir, targetDir, ...)

Copy files from the directory indicated by sourceDir to the subdirectory of the pipe directory defined by targetDir. See the documentation for the helper function getFiles for options available through ....

$getMissingFiles(sourceDir, targetDir, ignore.extensions = TRUE)

Identify files that are present in sourceDir, but not in targetDir. If ignore.extensions is TRUE, file extensions are removed before comparing filenames.
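With ignore.extensions = TRUE, the comparison boils down to stripping file extensions before matching filenames. A base R sketch of that logic, independent of the Pipe class (the filenames are made up):

```r
# Filenames as they might appear in sourceDir and targetDir (made up)
sourceFiles <- c("doc1.xml", "doc2.xml", "doc3.xml")
targetFiles <- c("doc1.tsv", "doc3.tsv")

# Strip extensions, then compare: doc2 has not been processed yet
missing <- setdiff(
  tools::file_path_sans_ext(sourceFiles),
  tools::file_path_sans_ext(targetFiles)
)
missing  # "doc2"
```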

$rsync()

Prepare an rsync command that can be used to synchronize the pipe directory with a remote storage location.

$mergeXMLFiles(sourceDir, targetDir, regex, rootElement, rootAttrs, mc = FALSE, verbose = TRUE, ...)

Merge files into single XML documents for faster processing during later stages of the pipe.

$validate(sourceDir, targetDir = NULL, dtd = NULL, ...)

Validate that all XML files in sourceDir are well-formed. If dtd is provided, check against the DTD.

$getAttributeValues(sourceDir, pattern, element, attrs, unique = TRUE, mc = FALSE, progress = TRUE)

Get values of XML attributes defined by attrs for the element defined by element.

$consolidate(sourceDir, targetDir, consolidation, element, attribute, ...)

Perform replacements for XML attributes as provided by character vector consolidation. (Further documentation is needed!)

$xmlToDT(sourceDir = "xml", targetDir = "tsv", metadata)

Extract text and metadata from XML documents, and write resulting 'basetable' as tsv file to subdirectory specified by targetDir. The basetable is returned invisibly.

$addTreetaggerLemmatization(sourceDir = "tsv", targetDir = "tsv", lang = "de", verbose = TRUE)

The method looks for a file 'tokenstream.tsv' in the subdirectory of the pipe directory specified by sourceDir. To use the TreeTagger, a temporary file (token stream only) is created and annotated. The result is read in again, added to the original table, and saved to an updated file tokenstream.tsv in targetDir. If sourceDir and targetDir are identical, the original file is overwritten.

$makeMetadataTable(sourceDir = "tsv", targetDir = "tsv", verbose = TRUE)

Dissect file basetable.tsv in sourceDir into 'texttable' and 'metadata' as more memory efficient ways for keeping the data. If targetDir is not NULL, the resulting tables will be stored as tsv files in the respective subdirectory of the pipe directory.

$makePlaintextTable(sourceDir = "tsv", targetDir = "tsv", verbose = TRUE)

Dissect basetable into 'texttable' and 'metadata' as more memory efficient ways for keeping the data. If targetDir is not NULL, the resulting tables will be stored as tsv files in the respective subdirectory of the pipe directory.

$xslt(sourceDir, targetDir, xslFile, ...)

Perform XSL transformation.

$subset(sourceDir, targetDir, sample = NULL, files = NULL)

Generate a subset of files in the sourceDir, copying a choice of files to targetDir. If files is a character vector with filenames, it will be these files that are copied. If sample is provided (a number), a random sample is drawn.
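When sample is a number, the choice of files amounts to a random draw from the filenames in sourceDir. Roughly, in base R terms (a sketch with made-up filenames, not the method's actual implementation):

```r
files <- sprintf("doc%02d.xml", 1:10)  # made-up filenames

set.seed(42)                      # for reproducibility
chosen <- sample(files, size = 3) # draw a random sample of three files
chosen
```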

$recode(sourceDir, targetDir, from = "UTF-8", to = "ISO-8859-1", xml = FALSE, log = FALSE, ...)

Recode files in sourceDir, writing results to targetDir. See documentation for worker function recode for options that are available.
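In base R terms, the per-file conversion corresponds to what iconv does for strings (a sketch of the encoding change, not the worker function's actual implementation):

```r
# Convert a UTF-8 string to ISO-8859-1 (Latin-1) and back
x <- "Stra\u00dfe"                          # "Strasse" with sharp s, in UTF-8
y <- iconv(x, from = "UTF-8", to = "ISO-8859-1")
Encoding(y)                                 # "latin1"
iconv(y, from = "ISO-8859-1", to = "UTF-8") # round-trips to the original
```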

$replaceInvalidCharacters(sourceDir, targetDir, xml = FALSE, ...)

Replace characters that are known to cause problems.

$findAndReplace(sourceDir, targetDir, replacements, encoding, ...)

Find matches for a regular expression and perform replacements; replacements is a list of length-2 character vectors, each providing a regex and its replacement.
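The shape of the replacements argument can be illustrated in plain R; applying each pair with gsub mirrors the intended semantics (a sketch; the patterns are made up):

```r
# A list of length-2 character vectors: c(regex, replacement)
replacements <- list(
  c("&(?!amp;)", "&amp;"),  # escape bare ampersands (illustrative)
  c("\\s+$", "")            # strip trailing whitespace
)

txt <- "Tom & Jerry   "
for (r in replacements) txt <- gsub(r[1], r[2], txt, perl = TRUE)
txt  # "Tom &amp; Jerry"
```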

$tokenize(sourceDir, targetDir, with = "stanfordNLP", lang = "de", ...)

Tokenize files in sourceDir, and save results to targetDir. The result will be a verticalized format that can be used for the TreeTagger.

$tokenizeSentences(sourceDir = "xml", targetDir = "xmlAnno", targetElement = "p", para = FALSE, ...)

Use the NLTK sentence tokenizer.

$treetagger(sourceDir = "tok", targetDir = "vrt", lang = "de", ...)

Annotate all files in sourceDir using the TreeTagger, and save results to targetDir.

$fix(sourceDir, targetDir, encoding = "UTF-8", replacements = list(), ...)

Check files in sourceDir for potential hiccups they may cause, and save output with error corrections to targetDir.

$sAttributeList(sourceDir, sample = 100, ...)

Analyse the structure of the XML files in sourceDir and return a list describing this structure.

$getNestedElements(sourceDir, corpus, element, max.embedding = NULL)

Helper method to detect nested elements in XML documents that would cause cwb-encode to throw an error.

Arguments

dir

the pipe directory

threads

an integer, the number of threads to use

sourceDir

a subdirectory of the pipeDir where files to be processed reside

targetDir

a subdirectory of the pipeDir where processed output is stored

ignore.extensions

logical, whether to remove file extensions before checking whether files in sourceDir are present in targetDir

corpus

name of the CWB corpus to create


PolMine/ctk documentation built on May 8, 2019, 3:20 a.m.