treetag-methods: A method to call TreeTagger
In unDocUMeantIt/koRpus: Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity


This method calls a local installation of TreeTagger[1] to tokenize and POS tag the given text.

treetag(
  file,
  treetagger = "kRp.env",
  rm.sgml = TRUE,
  lang = "kRp.env",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  encoding = NULL,
  TT.options = NULL,
  debug = FALSE,
  TT.tknz = TRUE,
  format = "file",
  stopwords = NULL,
  stemmer = NULL,
  doc_id = NA,
  add.desc = "kRp.env",
  ...
)

## S4 method for signature 'character'
treetag(
  file,
  treetagger = "kRp.env",
  rm.sgml = TRUE,
  lang = "kRp.env",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  encoding = NULL,
  TT.options = NULL,
  debug = FALSE,
  TT.tknz = TRUE,
  format = "file",
  stopwords = NULL,
  stemmer = NULL,
  doc_id = NA,
  add.desc = "kRp.env"
)

## S4 method for signature 'kRp.connection'
treetag(
  file,
  treetagger = "kRp.env",
  rm.sgml = TRUE,
  lang = "kRp.env",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  encoding = NULL,
  TT.options = NULL,
  debug = FALSE,
  TT.tknz = TRUE,
  format = NA,
  stopwords = NULL,
  stemmer = NULL,
  doc_id = NA,
  add.desc = "kRp.env"
)

`file`	Either a connection or a character vector: a valid path to a file containing the text to be analyzed or, with `format="obj"`, the text itself. If `file` is a connection, its contents will be written to a temporary file, since TreeTagger can't read from R connection objects.
`treetagger`	A character vector giving the TreeTagger script to be called. If set to `"kRp.env"`, it is fetched from `get.kRp.env`. If set to `"manual"`, `treetag` assumes that you do not want to call a wrapper script that processes the given text file, but would rather configure the options for tokenizing and POS tagging yourself. In that case, you need to provide a full set of options with the `TT.options` parameter.
`rm.sgml`	Logical, whether SGML tags should be ignored and removed from output
`lang`	A character string naming the language of the analyzed corpus. See `kRp.POS.tags` and `available.koRpus.lang` for all supported languages. If set to `"kRp.env"` this is fetched from `get.kRp.env`.
`apply.sentc.end`	Logical, whether the tokens defined in `sentc.end` should be searched for and set to a sentence ending tag.
`sentc.end`	A character vector with tokens indicating a sentence ending. This adds to TreeTagger's results; it does not replace them.
`encoding`	A character string defining the character encoding of the input file, like `"Latin1"` or `"UTF-8"`. If `NULL`, the encoding will either be taken from a preset (if defined in `TT.options`), or fall back to `""`. Hence you can overwrite the preset encoding with this parameter.
`TT.options`	A list of options to configure how TreeTagger is called. You have two basic choices: either choose one of the pre-defined presets, or give a full set of valid options:
  `path` Mandatory: The absolute path to the TreeTagger root directory. That is where its subfolders `bin`, `cmd` and `lib` are located.
  `preset` Optional: If you choose one of the pre-defined presets of one of the available language packages (like `"de"` for German, see `available.koRpus.lang` for details), you can omit all the following elements, because they will be filled with defaults. Of course this only makes sense if you have a working default installation. Note that since koRpus 0.07-1, UTF-8 is the global default encoding.
  `tokenizer` Mandatory: A character string, naming the tokenizer to be called. Interpreted relative to `path/cmd/`.
  `tknz.opts` Optional: A character string with the options to hand over to the tokenizer. You don't need to specify "-a" if `abbrev` is given. If `TT.tknz=FALSE`, you can pass configuration options to `tokenize` by providing them as a named list (instead of a character string) here.
  `pre.tagger` Optional: A character string with code to be run before the tagger. This code is used as-is, so you need to make sure it includes the needed pipe symbols.
  `tagger` Mandatory: A character string, naming the tagger command to be called. Interpreted relative to `path/bin/`.
  `abbrev` Optional: A character string, naming the abbreviation list to be used. Interpreted relative to `path/lib/`.
  `params` Mandatory: A character string, naming the parameter file to be used. Interpreted relative to `path/lib/`.
  `lexicon` Optional: A character string, naming the lexicon file to be used. Interpreted relative to `path/lib/`.
  `lookup` Optional: A character string, naming the lexicon lookup command. Interpreted relative to `path/cmd/`.
  `filter` Optional: A character string, naming the output filter to be used. Interpreted relative to `path/cmd/`.
  `no.unknown` Optional: Logical, can be used to toggle the `"-no-unknown"` option of TreeTagger (defaults to `FALSE`).
  `splitter` Optional: A character string, naming the splitter to be called (before the tokenizer). Interpreted relative to `path/cmd/`.
  `splitter.opts` Optional: A character string with the options to hand over to the splitter.
  You can also set these options globally using `set.kRp.env`, and then force `treetag` to use them by setting `TT.options="kRp.env"` here. Note: If you use the `treetagger` setting from kRp.env and it's set to `TT.cmd="manual"`, `treetag` will treat `TT.options=NULL` like `TT.options="kRp.env"` automatically.
`debug`	Logical. Especially in cases where the presets wouldn't work as expected, this switch can be used to examine the values `treetag` is assuming.
`TT.tknz`	Logical, if `FALSE` TreeTagger's tokenizer script will be replaced by `koRpus`' function `tokenize`. To accomplish this, its results will be written to a temporary file which is automatically deleted afterwards (if `debug=FALSE`). Note that this option only has an effect if `treetagger="manual"`.
`format`	Either "file" or "obj", depending on whether you want to scan files or analyze the text in a given object, like a character vector. If the latter, it will be written to a temporary file (see `file`).
`stopwords`	A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set `stopwords=tm::stopwords("en")` to use the English stopwords provided by the `tm` package.
`stemmer`	A function or method to perform stemming. For instance, you can set `SnowballC::wordStem` if you have the `SnowballC` package installed. As of now, you cannot provide further arguments to this function.
`doc_id`	Character string, optional identifier of the particular document. Will be added to the `desc` slot, and as a factor to the `"doc_id"` column of the `tokens` slot. If `NA`, the document name will be used (for `format="obj"` a random name).
`add.desc`	Logical. If `TRUE`, the tag description (column `"desc"` of the data.frame) will be added directly to the resulting object. If set to `"kRp.env"` this is fetched from `get.kRp.env`.
`...`	Only used for the method generic.
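
As a rough illustration of the second choice described for `TT.options` (a full manual option set instead of a preset), the following sketch builds such a list in plain R and checks that the elements marked as mandatory are present. The installation path and all file names are assumptions modelled on a typical English TreeTagger setup, and `tt_options`, `mandatory` and `missing_opts` are hypothetical names; check the `bin`, `cmd` and `lib` folders of your own installation before using any of these values.

```r
# Hypothetical installation root; adjust to your own TreeTagger location.
tt_root <- file.path("~", "bin", "treetagger")

# A full manual option set. All file names below are assumptions and
# may differ between TreeTagger versions and languages.
tt_options <- list(
  path       = tt_root,                 # mandatory: root with bin/, cmd/, lib/
  tokenizer  = "utf8-tokenize.perl",    # mandatory: relative to path/cmd/
  tknz.opts  = "-e",                    # optional: handed to the tokenizer
  tagger     = "tree-tagger",           # mandatory: relative to path/bin/
  abbrev     = "english-abbreviations", # optional: relative to path/lib/
  params     = "english.par",           # mandatory: relative to path/lib/
  no.unknown = FALSE                    # optional: toggles "-no-unknown"
)

# Check that every element documented as mandatory is present.
mandatory <- c("path", "tokenizer", "tagger", "params")
missing_opts <- setdiff(mandatory, names(tt_options))
if (length(missing_opts) > 0) {
  stop("Missing mandatory TT.options: ", paste(missing_opts, collapse = ", "))
}

# Not run: requires koRpus, a language package and a local TreeTagger.
# tagged.results <- treetag(
#   sample_file,
#   treetagger = "manual",
#   lang = "en",
#   TT.options = tt_options
# )
```

If a preset exists for your language, passing only `path` and `preset` is usually less error-prone than spelling out every element.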

Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point. E.g., treetag is being called by freq.analysis, which will by default query this language definition, unless explicitly told otherwise. The rationale is to make it easy to keep tokenized and POS-tagged objects of various languages in your workspace without having to keep track of their languages yourself.
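
To illustrate how the language is stored in the returned object, the following sketch uses koRpus' own `tokenize` as a stand-in for `treetag`, because in current koRpus versions it returns an object of the same class (`kRp.text`) without needing a TreeTagger installation. It assumes that the packages `koRpus` and `koRpus.lang.en` are installed and silently does nothing otherwise; the sample sentence, the `doc_id` value and the variable names are made up for illustration.

```r
# Only run if both packages are available (assumption: recent versions).
if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus)
  library(koRpus.lang.en)  # makes "en" available as a supported language

  # tokenize() is used here instead of treetag(), so no TreeTagger is needed.
  tokenized <- tokenize(
    "This is a short example. It has two sentences.",
    format = "obj",
    lang = "en",
    doc_id = "example"
  )

  # The language given via `lang` is kept in the object and can be queried
  # later, which is what other functions rely on by default.
  stored_lang <- language(tokenized)
  print(stored_lang)
}
```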

An object of class kRp.text. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

m.eik michalke meik.michalke@hhu.de, support for various languages was contributed by Earl Brown (Spanish), Alberto Mirisola (Italian) and Alexandre Brulet (French).

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44–49.

[1] https://www.cis.lmu.de/~schmid/tools/TreeTagger/

freq.analysis, get.kRp.env, kRp.text

  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
## Not run: 
# first way to invoke POS tagging, using a built-in preset:
tagged.results <- treetag(
  sample_file,
  treetagger="manual",
  lang="en",
  TT.options=list(
    path=file.path("~","bin","treetagger"),
    preset="en"
  )
)
# second way, use one of the batch scripts that come with TreeTagger:
tagged.results <- treetag(
  sample_file,
  treetagger=file.path("~","bin","treetagger","cmd","tree-tagger-english"),
  lang="en"
)
# third option, set the above batch script in an environment object first:
set.kRp.env(
  TT.cmd=file.path("~","bin","treetagger","cmd","tree-tagger-english"),
  lang="en"
)
tagged.results <- treetag(
  sample_file
)

# after tagging, use the resulting object with other functions in this package:
readability(tagged.results)
lex.div(tagged.results)

## enabling stopword detection and stemming
# if you also installed the packages tm and SnowballC,
# you can use some of their features with koRpus:
set.kRp.env(
  TT.cmd="manual",
  lang="en",
  TT.options=list(
    path=file.path("~","bin","treetagger"),
    preset="en"
  )
)
tagged.results <- treetag(
  sample_file,
  stopwords=tm::stopwords("en"),
  stemmer=SnowballC::wordStem
)

# removing all stopwords now is simple:
tagged.noStopWords <- filterByClass(
  tagged.results,
  "stopword"
)

## End(Not run)
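
The stopword detection enabled in the last example compares tokens in lower case, as noted for the `stopwords` argument. The following base R sketch shows what such a comparison amounts to. It is not koRpus' internal code, and the tokens, the stopword list and the variable names are made up for illustration.

```r
# Made-up tokens and stopword list, for illustration only.
tokens <- c("The", "cat", "IS", "on", "the", "mat", ".")
stopword_list <- c("the", "is", "on")

# Lower-case both sides so that "The" and "IS" are matched as well.
is_stopword <- tolower(tokens) %in% tolower(stopword_list)

# Keep only the tokens that were not flagged as stopwords.
content_tokens <- tokens[!is_stopword]
print(content_tokens)
```

Because of the case folding, capitalized stopwords at the start of a sentence are flagged just like their lower-case forms.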