The Pipe class offers a framework for corpus preparation and auxiliary tools. The methods of the class (wrappers for standard tools or helpers) use subdirectories of a pipe directory to take files and their content through the different stages of corpus preparation. To use Stanford CoreNLP, the class is extended by the PipeCoreNLP class.
An object of class R6ClassGenerator of length 24.
dir
a pipe directory; the different processing stages of the corpus are kept in subdirectories of this directory
time
a data.frame with information on the time that the different processing stages have consumed
threads
an integer, the number of cores to use
For usage details see the Methods, Arguments and Fields sections.
$new(dir, threads = 1L)
Initialize a new Pipe object.
$summary()
Return a data.frame with the number of files in the subdirectories of the pipe directory.
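The kind of table that $summary() describes can be sketched in base R as follows. This is only an illustration, not the class's implementation: the helper name count_files and the column names subdir and files are assumptions made for this example.

```r
# Base-R sketch (hypothetical helper, not part of the Pipe class):
# count the files in each direct subdirectory of a pipe directory.
count_files <- function(pipe_dir) {
  subdirs <- list.dirs(pipe_dir, full.names = FALSE, recursive = FALSE)
  data.frame(
    subdir = subdirs,
    files = vapply(
      file.path(pipe_dir, subdirs),
      function(d) length(list.files(d)),
      integer(1),
      USE.NAMES = FALSE
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage with a throwaway directory layout
summary_dir <- tempfile("pipe")
dir.create(file.path(summary_dir, "xml"), recursive = TRUE)
dir.create(file.path(summary_dir, "tok"))
file.create(file.path(summary_dir, "xml", c("a.xml", "b.xml")))
file.create(file.path(summary_dir, "tok", "a.tok"))
file_counts <- count_files(summary_dir)
print(file_counts)
```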
$preparePipeDir(subdirs = character(), delete = FALSE, verbose = TRUE)
Create the subdirectories provided by subdirs in the pipe directory, and delete existing files if delete is TRUE.
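As a rough illustration of the behaviour described for $preparePipeDir(), the following base-R sketch creates subdirectories and optionally empties them. The helper name prepare_dirs is hypothetical, and the sketch makes no claim about how the class itself handles edge cases such as nested directories.

```r
# Base-R sketch (hypothetical helper, not part of the Pipe class):
# create subdirectories and, if requested, remove files already in them.
prepare_dirs <- function(pipe_dir, subdirs = character(), delete = FALSE) {
  for (subdir in subdirs) {
    path <- file.path(pipe_dir, subdir)
    if (!dir.exists(path)) dir.create(path, recursive = TRUE)
    # unlink() without recursive = TRUE removes files but leaves directories
    if (delete) unlink(list.files(path, full.names = TRUE))
  }
  invisible(file.path(pipe_dir, subdirs))
}

# Usage: create two stages, add a file, then reset one stage
prep_dir <- tempfile("pipe")
prepare_dirs(prep_dir, subdirs = c("xml", "tok"))
file.create(file.path(prep_dir, "xml", "a.xml"))
prepare_dirs(prep_dir, subdirs = "xml", delete = TRUE)
remaining <- list.files(file.path(prep_dir, "xml"))
```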
$getFiles(sourceDir, targetDir, ...)
Copy files from the directories indicated by sourceDir to the subdirectory of the pipe directory defined by targetDir. See the documentation for the helper function getFiles for options available through the ... argument.
$getMissingFiles(sourceDir, targetDir, ignore.extensions = TRUE)
Identify files that are present in sourceDir, but not in targetDir. If ignore.extensions is TRUE, file extensions are removed before comparing filenames.
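The comparison that $getMissingFiles() describes can be sketched in base R. This is an illustration under assumptions, not the method's actual code: the helper name missing_files is hypothetical, and it works on full directory paths rather than on subdirectories of a pipe directory.

```r
# Base-R sketch (hypothetical helper, not part of the Pipe class):
# report files present in a source directory but absent from a target one.
missing_files <- function(source_dir, target_dir, ignore.extensions = TRUE) {
  src <- list.files(source_dir)
  tgt <- list.files(target_dir)
  if (ignore.extensions) {
    # Compare names without extensions, so "a.xml" matches "a.tok"
    src_key <- tools::file_path_sans_ext(src)
    tgt_key <- tools::file_path_sans_ext(tgt)
  } else {
    src_key <- src
    tgt_key <- tgt
  }
  src[!src_key %in% tgt_key]
}

# Usage with throwaway directories
cmp_dir <- tempfile("pipe")
dir.create(file.path(cmp_dir, "xml"), recursive = TRUE)
dir.create(file.path(cmp_dir, "tok"))
file.create(file.path(cmp_dir, "xml", c("a.xml", "b.xml")))
file.create(file.path(cmp_dir, "tok", "a.tok"))
without_ext <- missing_files(file.path(cmp_dir, "xml"), file.path(cmp_dir, "tok"))
with_ext <- missing_files(
  file.path(cmp_dir, "xml"), file.path(cmp_dir, "tok"),
  ignore.extensions = FALSE
)
```

With extensions ignored, only the file that has no counterpart in the target directory is reported; with extensions kept, every source file counts as missing because no name matches exactly.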
$rsync()
Prepare an rsync command that can be used to synchronize the pipe directory with a remote storage.
$mergeXMLFiles(sourceDir, targetDir, regex, rootElement, rootAttrs, mc = FALSE, verbose = TRUE, ...)
Merge files into single XML documents for faster processing during later stages of the pipe.
$validate(sourceDir, targetDir = NULL, dtd = NULL, ...)
Validate that all XML files in sourceDir are valid XML files. If dtd is provided, check against a DTD.
$getAttributeValues(sourceDir, pattern, element, attrs, unique = TRUE, mc = FALSE, progress = TRUE)
Get values of the XML attributes defined by attrs for the element defined by element.
$consolidate(sourceDir, targetDir, consolidation, element, attribute, ...)
Perform replacements for XML attributes as provided by the character vector consolidation. (Further documentation is needed!)
$xmlToDT(sourceDir = "xml", targetDir = "tsv", metadata)
Extract text and metadata from XML documents, and write the resulting 'basetable' as a tsv file to the subdirectory specified by targetDir. The basetable is returned invisibly.
$addTreetaggerLemmatization(sourceDir = "tsv", targetDir = "tsv", lang = "de", verbose = TRUE)
The method looks for a file 'tokenstream.tsv' in the subdirectory of the pipeDir specified by sourceDir. To use the treetagger, a temporary file is created (token stream only) and annotated. The result is read in again, added to the original table and saved to an updated file tokenstream.tsv in the targetDir. If sourceDir and targetDir are identical, the original file is overwritten.
$makeMetadataTable(sourceDir = "tsv", targetDir = "tsv", verbose = TRUE)
Dissect the file basetable.tsv in sourceDir into 'texttable' and 'metadata' as more memory-efficient ways of keeping the data. If targetDir is not NULL, the resulting tables are stored as tsv files in the respective subdirectory of the pipe directory.
$makePlaintextTable(sourceDir = "tsv", targetDir = "tsv", verbose = TRUE)
Dissect the basetable into 'texttable' and 'metadata' as more memory-efficient ways of keeping the data. If targetDir is not NULL, the resulting tables are stored as tsv files in the respective subdirectory of the pipe directory.
$xslt(sourceDir, targetDir, xslFile, ...)
Perform XSL transformation.
$subset(sourceDir, targetDir, sample = NULL, files = NULL)
Generate a subset of the files in sourceDir, copying a choice of files to targetDir. If files is a character vector with filenames, it is these files that are copied. If sample is provided (a number), a random sample is drawn.
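The two selection modes of $subset() can be sketched in base R. The helper name subset_files is hypothetical, and two behaviours are assumptions of this sketch rather than documented facts: files takes precedence over sample when both are given, and all files are copied when neither is given.

```r
# Base-R sketch (hypothetical helper, not part of the Pipe class):
# copy either named files or a random sample of files to a target directory.
subset_files <- function(source_dir, target_dir, sample = NULL, files = NULL) {
  available <- list.files(source_dir)
  if (!is.null(files)) {
    # Explicit selection: keep only names that actually exist
    chosen <- intersect(files, available)
  } else if (!is.null(sample)) {
    # Random selection of at most 'sample' files
    n <- min(sample, length(available))
    chosen <- available[sample.int(length(available), size = n)]
  } else {
    chosen <- available
  }
  dir.create(target_dir, showWarnings = FALSE, recursive = TRUE)
  file.copy(file.path(source_dir, chosen), target_dir)
  invisible(chosen)
}

# Usage with throwaway directories
sub_dir <- tempfile("pipe")
dir.create(file.path(sub_dir, "xml"), recursive = TRUE)
file.create(file.path(sub_dir, "xml", paste0(letters[1:5], ".xml")))
subset_files(file.path(sub_dir, "xml"), file.path(sub_dir, "sampled"), sample = 2)
subset_files(file.path(sub_dir, "xml"), file.path(sub_dir, "named"), files = "c.xml")
sampled <- list.files(file.path(sub_dir, "sampled"))
named <- list.files(file.path(sub_dir, "named"))
```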
$recode(sourceDir, targetDir, from = "UTF-8", to = "ISO-8859-1", xml = FALSE, log = FALSE, ...)
Recode files in sourceDir, writing results to targetDir. See the documentation for the worker function recode for the options that are available.
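What recoding from UTF-8 to ISO-8859-1 does to the underlying bytes can be shown with base R's iconv(). This is a sketch on an in-memory string only; the helper name recode_text is hypothetical, and whether the worker function recode uses iconv() internally is not stated in this documentation.

```r
# Base-R sketch (hypothetical helper, not the package's recode function):
# convert a character vector between encodings with iconv().
recode_text <- function(x, from = "UTF-8", to = "ISO-8859-1") {
  # iconv() returns NA for strings it cannot convert
  recoded <- iconv(x, from = from, to = to)
  if (anyNA(recoded)) warning("some strings could not be converted")
  recoded
}

# Build a UTF-8 string from raw bytes: "Gr", u-umlaut, sharp s, "e"
utf8_text <- rawToChar(as.raw(c(0x47, 0x72, 0xc3, 0xbc, 0xc3, 0x9f, 0x65)))
Encoding(utf8_text) <- "UTF-8"
latin1_text <- recode_text(utf8_text)

# The two non-ASCII characters shrink from two bytes each to one byte each
utf8_bytes <- charToRaw(utf8_text)
latin1_bytes <- charToRaw(latin1_text)
```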
$replaceInvalidCharacters(sourceDir, targetDir, xml = FALSE, ...)
Replace characters that are known to cause problems.
$findAndReplace(sourceDir, targetDir, replacements, encoding, ...)
Find matches for a regular expression and perform a replacement; replacements is a list of length-2 character vectors, each of which provides the regex and the replacement.
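The shape of the replacements argument, a list of length-2 character vectors, can be illustrated with a short base-R sketch. The helper name apply_replacements and the example patterns are made up for illustration, and the sketch operates on a character vector rather than on files; it is not the method's implementation.

```r
# Base-R sketch (hypothetical helper, not part of the Pipe class):
# apply a list of (regex, replacement) pairs to a character vector in order.
apply_replacements <- function(x, replacements) {
  for (pair in replacements) {
    stopifnot(is.character(pair), length(pair) == 2L)
    x <- gsub(pair[1], pair[2], x)
  }
  x
}

# Each element holds the regex first and the replacement second
replacements <- list(
  c("[ \\t]+", " "),     # collapse runs of spaces and tabs
  c("colour", "color")   # normalize a spelling variant
)
cleaned <- apply_replacements("A  colour   sample", replacements)
```

Because the pairs are applied in sequence, a later pair sees the output of the earlier ones, so the order of the list can matter.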
$tokenize(sourceDir, targetDir, with = "stanfordNLP", lang = "de", ...)
Tokenize files in sourceDir, and save results to targetDir. The result will be a verticalized format that can be used for the TreeTagger.
$tokenizeSentences(sourceDir = "xml", targetDir = "xmlAnno", targetElement = "p", para = FALSE, ...)
Use the NLTK sentence tokenizer.
$treetagger(sourceDir = "tok", targetDir = "vrt", lang = "de", ...)
Annotate all files in sourceDir using treetagger, and save results to targetDir.
$fix(sourceDir, targetDir, encoding = "UTF-8", replacements = list(), ...)
Check files in sourceDir for potential hiccups they may cause, and save the output with error corrections to targetDir.
$sAttributeList(sourceDir, sample = 100, ...)
Analyse the structure of the XML and return a list describing this structure.
$getNestedElements(sourceDir, corpus, element, max.embedding = NULL)
Helper method to detect errors in XML documents where cwb-encode will throw an error because elements are nested.
dir
the pipe directory
threads
an integer, the number of threads to use
sourceDir
a subdirectory of the pipeDir where files to be processed reside
targetDir
a subdirectory of the pipeDir where processed output is stored
ignore.extensions
logical, whether to remove file extensions before checking whether files in sourceDir are present in targetDir
corpus
name of the CWB corpus to create