as_conllu: Convert a data.frame to CONLL-U format

View source: R/udpipe_train.R

as_conllu R Documentation

Convert a data.frame to CONLL-U format

Description

If you have a data.frame of annotations with one row per token, this function converts it to CONLL-U format. The data.frame must contain the columns doc_id, sentence_id, sentence, token_id and token, and may optionally contain the columns lemma, upos, xpos, feats, head_token_id, dep_rel, deps and misc. These fields have the following meaning:

  • doc_id: the identifier of the document

  • sentence_id: the identifier of the sentence

  • sentence: the text of the sentence that this token is part of

  • token_id: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.

  • token: Word form or punctuation symbol.

  • lemma: Lemma or stem of word form.

  • upos: Universal part-of-speech tag.

  • xpos: Language-specific part-of-speech tag; underscore if not available.

  • feats: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.

  • head_token_id: Head of the current word, which is either a value of token_id or zero (0).

  • dep_rel: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.

  • deps: Enhanced dependency graph in the form of a list of head-deprel pairs.

  • misc: Any other annotation.

The tokens in the data.frame should be ordered as they appear in the sentence.
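
As a rough illustration of the column layout described above, the sketch below builds a small data.frame by hand, with one row per token in sentence order. The helper missing_required() and the toy sentence are hypothetical and not part of udpipe; the call to as_conllu() is only attempted if the udpipe package is installed.

```r
# Required and optional columns, as listed in the description above
required_cols <- c("doc_id", "sentence_id", "sentence", "token_id", "token")
optional_cols <- c("lemma", "upos", "xpos", "feats",
                   "head_token_id", "dep_rel", "deps", "misc")

# Hypothetical helper (not part of udpipe): names of required columns
# that are absent from the data.frame
missing_required <- function(x) setdiff(required_cols, colnames(x))

# One row per token, ordered as the tokens appear in the sentence.
# doc_id, sentence_id and sentence are recycled across the three rows.
tokens <- data.frame(
  doc_id      = "doc1",
  sentence_id = 1L,
  sentence    = "Dogs bark.",
  token_id    = c("1", "2", "3"),
  token       = c("Dogs", "bark", "."),
  upos        = c("NOUN", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# All required columns are present, so the conversion can be attempted
stopifnot(length(missing_required(tokens)) == 0)

if (requireNamespace("udpipe", quietly = TRUE)) {
  conllu <- udpipe::as_conllu(tokens)
  cat(conllu)
}
```

Checking for the required columns before converting is an optional convenience: it makes clear which of the five mandatory columns is missing when a data.frame was assembled by hand.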

Usage

as_conllu(x)

Arguments

x

a data.frame with columns doc_id, sentence_id, sentence, token_id, token and optionally lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc

Value

A character string of length 1 containing the data.frame in CONLL-U format (see the examples). It can be written to disk for processing in other applications.
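
A minimal sketch of saving such a string to disk as UTF-8 and reading it back. The helper write_utf8() is hypothetical (not part of udpipe), and the short conllu string is a hand-written stand-in for the value returned by as_conllu(); reading the file back with udpipe_read_conllu() is only attempted if udpipe is installed.

```r
# Hypothetical helper (not part of udpipe): write a single string to a
# file as UTF-8 bytes and close the connection afterwards
write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(text), con, sep = "", useBytes = TRUE)
  invisible(path)
}

# Stand-in for the result of as_conllu(x): one string, 10 tab-separated
# columns per token line, and a blank line closing the sentence
conllu <- paste0(
  "# sent_id = 1\n",
  "# text = Caf\u00e9\n",
  "1\tCaf\u00e9\t_\t_\t_\t_\t_\t_\t_\t_\n",
  "\n"
)

path <- tempfile(fileext = ".conllu")
write_utf8(conllu, path)

# Read the file back into a data.frame with one row per token
if (requireNamespace("udpipe", quietly = TRUE)) {
  x <- udpipe::udpipe_read_conllu(path)
  str(x)
}
```

Opening the connection in binary mode and passing useBytes = TRUE is one way to keep R from re-encoding the text to the native locale on write.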

References

https://universaldependencies.org/format.html

Examples

file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
x <- udpipe_read_conllu(file_conllu)
str(x)
conllu <- as_conllu(x)
cat(conllu)
## Not run: 
## Write it to file, making sure it is in UTF-8
cat(as_conllu(x), file = file("annotations.conllu", encoding = "UTF-8"))

## End(Not run)

## Some fields are not mandatory; if missing, they are assumed to be NA
conllu <- as_conllu(x[, c('doc_id', 'sentence_id', 'sentence', 
                          'token_id', 'token', 'upos')])
cat(conllu)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.