udpipe | R Documentation |
Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in TIF format
udpipe(x, object, parallel.cores = 1L, parallel.chunksize, ...)
x |
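either a character vector, a data.frame with the columns doc_id and text, or a list of tokens, as used in the examples.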
All text data should be in UTF-8 encoding |
object |
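either a loaded model object, as returned by udpipe_load_model, or the name of a language, as accepted by udpipe_download_model |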
parallel.cores |
integer indicating the number of parallel cores to use to speed up the annotation. Defaults to 1 (a single thread). |
parallel.chunksize |
integer with the size of the chunks of text to be annotated in parallel. If not provided, defaults to the size of |
... |
other elements to pass on to |
a data.frame with one row per doc_id and term_id, containing all the tokens in the data together with their lemma, part-of-speech tags, morphological features and the dependency relationships between the tokens. The data.frame has the following fields:
doc_id: The document identifier.
paragraph_id: The paragraph identifier which is unique within each document.
sentence_id: The sentence identifier which is unique within each document.
sentence: The text of the sentence identified by sentence_id.
start: Integer index indicating where the token starts in the original text. Missing for tokens that are part of a multi-word token and do not themselves appear in the text.
end: Integer index indicating where the token ends in the original text. Missing for tokens that are part of a multi-word token and do not themselves appear in the text.
term_id: A row identifier which is unique within each document.
token_id: Token index, integer starting at 1 for each new sentence. May be a range for multiword tokens or a decimal number for empty nodes.
token: The token.
lemma: The lemma of the token.
upos: The universal part-of-speech tag of the token. See https://universaldependencies.org/format.html
xpos: The treebank-specific part-of-speech tag of the token. See https://universaldependencies.org/format.html
feats: The morphological features of the token, separated by |. See https://universaldependencies.org/format.html
head_token_id: The token_id of the head of the token, that is, the other token in the sentence to which it is related. See https://universaldependencies.org/format.html
dep_rel: The type of relation the token has with the head_token_id. See https://universaldependencies.org/format.html
deps: Enhanced dependency graph in the form of a list of head-deprel pairs. See https://universaldependencies.org/format.html
misc: The SpacesBefore, SpacesAfter and SpacesInToken fields, which record the spaces before, after and inside the token. Used to reconstruct the original text. See https://ufal.mff.cuni.cz/udpipe/1/users-manual
The columns paragraph_id, sentence_id, term_id, start and end are integers; the other fields are character data in UTF-8 encoding.
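As a rough illustration of how these fields relate to each other, the sketch below uses base R only. The rows are written by hand to mimic the documented columns and are not real udpipe output. The column name head_token is introduced here for illustration, and the root token is assumed to carry a head_token_id of "0", following the CoNLL-U convention.

```r
# Hand-made rows mimicking the documented columns for one sentence.
# The values are illustrative assumptions, not real udpipe output.
anno <- data.frame(
  doc_id        = "doc1",
  sentence_id   = 1L,
  token_id      = c("1", "2", "3", "4"),
  token         = c("The", "cat", "sleeps", "."),
  upos          = c("DET", "NOUN", "VERB", "PUNCT"),
  feats         = c("Definite=Def|PronType=Art", "Number=Sing",
                    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", NA),
  head_token_id = c("2", "3", "0", "3"),  # "0" marks the root (CoNLL-U convention)
  dep_rel       = c("det", "nsubj", "root", "punct"),
  stringsAsFactors = FALSE
)

# token_id restarts at 1 in every sentence, so the lookup key combines
# doc_id, sentence_id and token_id.
key_token <- paste(anno$doc_id, anno$sentence_id, anno$token_id)
key_head  <- paste(anno$doc_id, anno$sentence_id, anno$head_token_id)

# head_token is a hypothetical helper column: the token each row depends on.
# The root has no matching row, so it ends up as NA.
anno$head_token <- anno$token[match(key_head, key_token)]

# Split the "|"-separated morphological features into one vector per token.
feats_list <- strsplit(anno$feats, "|", fixed = TRUE)
```

The same key-based lookup should carry over to a data.frame returned by udpipe, since token_id and head_token_id are both character columns there.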
https://ufal.mff.cuni.cz/udpipe, https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364, https://universaldependencies.org/format.html
udpipe_load_model, as.data.frame.udpipe_connlu, udpipe_download_model, udpipe_annotate
model <- udpipe_download_model(language = "dutch-lassysmall")
if(!model$download_failed){
  ud_dutch <- udpipe_load_model(model)

  ## Tokenise, Tag and Dependency Parsing Annotation. Output is in CONLL-U format.
  txt <- c("Dus. Godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg en hij had nog gelijk ook. Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont, maar dat hoefden we geenseens niet te zingen. Je kunt zeggen wat je wil van al die gesluierde poezenpas maar d'r kwam wel een vleeswarenwinkel onder te voorschijn van heb je me daar nou. En zo gaat het maar door.",
           "Wat die ransaap van een academici nou weer in z'n botte pan heb gehaald mag Joost in m'n schoen gooien, maar feit staat boven water dat het een gore vieze vuile ransaap is.")
  names(txt) <- c("document_identifier_1", "we-like-ilya-leonard-pfeiffer")

  ##
  ## TIF tagging: tag if x is a character vector, a data frame or a token sequence
  ##
  x <- udpipe(txt, object = ud_dutch)
  x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE),
              object = ud_dutch)
  x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"),
              object = ud_dutch)

  ## You can also directly pass on the language in the call to udpipe
  x <- udpipe("Dit werkt ook.", object = "dutch-lassysmall")
  x <- udpipe(txt, object = "dutch-lassysmall")
  x <- udpipe(data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE),
              object = "dutch-lassysmall")
  x <- udpipe(strsplit(txt, "[[:space:][:punct:][:digit:]]+"),
              object = "dutch-lassysmall")
}

## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
if(file.exists(model$file_model)) file.remove(model$file_model)
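The three input shapes passed as x in the examples can be built and inspected without downloading a model. The following is a minimal base R sketch. The two short texts and the names doc1, doc2, as_df and as_tokens are made up for illustration.

```r
# A named character vector: in the examples above the names act as document identifiers.
txt <- c(doc1 = "Dit is een zin.", doc2 = "Dit werkt ook.")

# The same data as a data.frame with the columns doc_id and text.
as_df <- data.frame(doc_id = names(txt), text = txt, stringsAsFactors = FALSE)

# The same data as a token sequence: a named list with one character vector
# of tokens per document, split on whitespace, punctuation and digits.
as_tokens <- strsplit(txt, "[[:space:][:punct:][:digit:]]+")

str(as_df)
str(as_tokens)
```

strsplit keeps the names of its input, so the list of tokens carries the same document identifiers as the character vector.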