consolidate: Consolidate vrt files for CWB import.


Description

Files resulting from tagging/annotation may violate the requirements of the Corpus Workbench (CWB). This function consolidates vrt files by correcting the known issues.

Usage

consolidate(x, sourceDir, targetDir, encoding, replacements, ...)

Arguments

x

a character vector providing a directory with vrt files

sourceDir

a character vector giving the directory with the source files

targetDir

a character vector giving the target directory

encoding

encoding of the file, used by scan

replacements

a list of character vectors (length 2 each) with regular expressions / replacements

...

further parameters passed into dirApply
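
The shape of the replacements argument can be illustrated in base R. This is only a sketch: the two pattern/replacement pairs and the helper apply_replacements are hypothetical and not part of ctk, and the way consolidate applies the pairs internally may differ.

```r
# Hypothetical replacement pairs: each element is c(pattern, replacement).
replacements <- list(
  c("&(?!amp;|lt;|gt;|quot;|apos;)", "&amp;"),  # escape bare ampersands
  c("[\\x0B\\x0C]", "")                         # drop vertical tab / form feed
)

# Illustrative helper (not part of ctk): apply each pair in order with gsub().
apply_replacements <- function(lines, replacements) {
  for (pair in replacements) {
    stopifnot(is.character(pair), length(pair) == 2L)
    lines <- gsub(pair[1], pair[2], lines, perl = TRUE)
  }
  lines
}

vrt_lines <- c("Law\tNN\tlaw", "&\tCC\t&", "R&amp;D\tNN\tR&amp;D")
cleaned <- apply_replacements(vrt_lines, replacements)
```

The negative lookahead leaves ampersands that already start an entity untouched, so running the same pairs twice does not double-escape.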

Details

Known issues resulting from annotating files (with the TreeTagger in particular) include whitespace characters that are invalid for XML, XML elements at the end of a line rather than on a separate line, and characters that are invalid for XML (such as ampersands).
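
One of these issues, an XML element stuck to the end of a token line, could be handled roughly as follows. This is a minimal base R sketch under the assumption that at most one tag trails a line; the function name split_trailing_tag is hypothetical and the code is not taken from the package.

```r
# Illustrative helper (not part of ctk): move a single XML tag that follows
# token content at the end of a line onto a line of its own.
split_trailing_tag <- function(lines) {
  # Put a newline between the token content (\\1) and the trailing tag (\\2).
  marked <- gsub("^(.*[^>\\s])\\s*(<[^<>]+>)\\s*$", "\\1\n\\2", lines, perl = TRUE)
  # Split on the inserted newline. Note that strsplit() returns character(0)
  # for empty strings, so completely empty input lines are dropped here.
  unlist(strsplit(marked, "\n", fixed = TRUE), use.names = FALSE)
}

tagged <- c("<s>", "Das\tART\tdie", "Haus\tNN\tHaus</s>")
fixed <- split_trailing_tag(tagged)
```

Lines that consist of a tag only, such as the opening `<s>`, do not match the pattern and pass through unchanged.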

Before making these corrections, the function tests whether the files contain any text at all. Empty files (files that contain nothing but XML tags) are dropped.
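
The emptiness test described above might look like the following in base R. It is a sketch only: has_text is a hypothetical name, and it assumes the file content has already been read into a character vector of lines.

```r
# Illustrative helper (not part of ctk): TRUE if at least one line contains
# something other than XML tags and whitespace.
has_text <- function(lines) {
  # A line counts as "tag or blank" if it is empty or made up solely of tags.
  tag_or_blank <- grepl("^\\s*(<[^<>]*>\\s*)*$", lines, perl = TRUE)
  any(!tag_or_blank)
}

empty_file  <- c("<text id=\"1\">", "<s>", "</s>", "</text>")
filled_file <- c("<text id=\"2\">", "Haus\tNN\tHaus", "</text>")

keep_empty  <- has_text(empty_file)
keep_filled <- has_text(filled_file)
```

A file for which has_text returns FALSE would be a candidate for dropping before any further corrections are applied.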


PolMine/ctk documentation built on May 8, 2019, 3:20 a.m.