tokenize: Tokenize files.


Description

Tokenize (XML) files with one of the standard tools (treetagger, stanfordNLP, openNLP).

Usage

tokenize(.Object, ...)

## S4 method for signature 'character'
tokenize(.Object, lang = "de", with = "stanfordNLP",
  ...)

Arguments

.Object

a ctk object

...

further parameters

lang

language of the files to be tagged

with

either "stanfordNLP", "treetagger" or "openNLP"

Details

One potential problem with the Perl tokenizer that comes with the treetagger is that its output is not valid XML. The files need to be repaired first, for example with a shell command such as for i in $(ls); do sed 's/\xC2\xA0/ /g' $i > ../tok2/$i; done, which replaces non-breaking spaces (UTF-8 bytes C2 A0) with ordinary spaces. The XML may still be invalid afterwards (unescaped "&" etc.), so the fix method is still necessary.
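The non-breaking-space replacement described above could be wrapped in a small shell function. This is only a sketch, not part of ctk: the function name fix_nbsp and its two directory arguments are hypothetical, and it covers only the non-breaking-space step, not the remaining XML problems such as unescaped "&".

```shell
#!/bin/sh
# Sketch: copy each regular file from a source directory to a target directory,
# replacing UTF-8 non-breaking spaces (bytes C2 A0) with ordinary spaces.
# The function name and its arguments are hypothetical, not part of ctk.
fix_nbsp() {
  src_dir=$1
  dst_dir=$2
  mkdir -p "$dst_dir" || return 1
  # Build the two-byte sequence portably with octal escapes.
  nbsp=$(printf '\302\240')
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue   # skip subdirectories and an unmatched glob
    # LC_ALL=C makes sed treat the input as raw bytes.
    LC_ALL=C sed "s/$nbsp/ /g" "$f" > "$dst_dir/$(basename "$f")"
  done
}
```

Called as fix_nbsp . ../tok2, this mirrors the one-liner above. Iterating over a glob instead of $(ls) and quoting the variables keeps file names containing spaces intact.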


PolMine/ctk documentation built on May 8, 2019, 3:20 a.m.