udpipe_train | R Documentation |
Train a UDPipe model for tokenization, parts-of-speech tagging, lemmatization and dependency parsing,
or any combination of those.
This function builds models from data in CONLL-U format
as described at https://universaldependencies.org/format.html. At the time of writing, open data in CONLL-U
format for more than 50 languages are available at https://universaldependencies.org.
Most of these are distributed under the CC-BY-SA license or the CC-BY-NC-SA license.
Training on these data gives you your own annotation model. This is relevant if you want to tune the tagger to your needs
or if you don't want to use the ready-made models provided under the CC-BY-NC-SA license as shown at udpipe_load_model.
udpipe_train(
  file = file.path(getwd(), "my_annotator.udpipe"),
  files_conllu_training,
  files_conllu_holdout = character(),
  annotation_tokenizer = "default",
  annotation_tagger = "default",
  annotation_parser = "default"
)
file |
full path where the model will be saved. The model is stored as a binary file that can be loaded with udpipe_load_model. |
files_conllu_training |
a character vector of files in CONLL-U format used for training the model |
files_conllu_holdout |
a character vector of files in CONLL-U format used for holdout evaluation of the model. This argument is optional. |
annotation_tokenizer |
a string containing options for the tokenizer: either 'none' or 'default'. Alternatively, a list
of options as described in the UDPipe manual. See the vignette udpipe-train. |
annotation_tagger |
a string containing options for the POS tagger and lemmatizer: either 'none' or 'default'. Alternatively, a list
of options as described in the UDPipe manual. See the vignette udpipe-train. |
annotation_parser |
a string containing options for the dependency parser: either 'none' or 'default'. Alternatively, a list
of options as described in the UDPipe manual. See the vignette udpipe-train. |
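The UDPipe manual writes training options as name=value pairs separated by semicolons. As a minimal sketch of how a named R list corresponds to such an option string, the base R helper below builds that form. The name to_option_string is hypothetical: it is not part of the udpipe package and is not necessarily how the package converts lists internally.

```r
# Hypothetical helper, for illustration only: not part of the udpipe package.
# Turns a named list of options into "name=value;name=value" form.
to_option_string <- function(options) {
  # Strings such as "default" or "none" are passed through unchanged
  if (is.character(options)) {
    return(options)
  }
  stopifnot(is.list(options),
            !is.null(names(options)),
            all(nzchar(names(options))))
  values <- vapply(options, as.character, character(1))
  paste(names(options), values, sep = "=", collapse = ";")
}

# Tokenizer options taken from the toy example on this page
tokenizer_options <- list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7)
option_string <- to_option_string(tokenizer_options)
print(option_string)
```

Seeing the string form can help when comparing a list of R options against the option names in the UDPipe manual.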
To train a model, provide files in CONLL-U format in the argument files_conllu_training.
This can be a vector of files or just one file. If you do not have your own CONLL-U files, you can download files for your language of
choice at https://universaldependencies.org.
At the time of writing, open data in CONLL-U format for 50 languages are available at https://universaldependencies.org, namely: ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin, latvian, lithuanian, norwegian, old_church_slavonic, persian, polish, portuguese, romanian, russian, sanskrit, slovak, slovenian, spanish, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.
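To give an idea of what such a file looks like, the following sketch writes a one-sentence CONLL-U file to a temporary location and checks that every token line has the ten tab-separated columns the format defines. The sentence and its annotations are made up for illustration. A real model needs a far larger treebank.

```r
# The ten columns defined by the CONLL-U format
conllu_columns <- c("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# A made-up one-sentence example; "_" marks an unspecified value
token_rows <- list(
  c("1", "Dogs", "dog",  "NOUN",  "_", "Number=Plur", "2", "nsubj", "_", "_"),
  c("2", "bark", "bark", "VERB",  "_", "_",           "0", "root",  "_", "SpaceAfter=No"),
  c("3", ".",    ".",    "PUNCT", "_", "_",           "2", "punct", "_", "_")
)

# Comment lines start with "#"; a blank line ends the sentence
conllu_lines <- c("# sent_id = 1",
                  "# text = Dogs bark.",
                  vapply(token_rows, paste, character(1), collapse = "\t"),
                  "")
file_conllu <- tempfile(fileext = ".conllu")
writeLines(conllu_lines, file_conllu)

# Read the file back and count the columns on each token line
content <- readLines(file_conllu)
token_lines <- content[nzchar(content) & !startsWith(content, "#")]
columns_per_line <- lengths(strsplit(token_lines, "\t", fixed = TRUE))
print(columns_per_line)
```

A path like file_conllu is what the argument files_conllu_training expects.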
A list with elements
file: The path to the model, which can be used in udpipe_load_model
annotation_tokenizer: The input argument annotation_tokenizer
annotation_tagger: The input argument annotation_tagger
annotation_parser: The input argument annotation_parser
errors: Messages from the UDPipe process indicating possible errors, for example when wrong arguments are passed to annotation_tokenizer, annotation_tagger or annotation_parser
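As a rough sketch of how the returned list might be inspected, the snippet below uses a hand-built list with the elements described above in place of a real training result. The helper name training_succeeded is hypothetical, and treating an empty errors element as success is an assumption rather than documented behaviour.

```r
# Hypothetical helper, for illustration only: not part of the udpipe package.
# Assumption: an empty `errors` element means training reported no problems.
training_succeeded <- function(result) {
  expected <- c("file", "annotation_tokenizer", "annotation_tagger",
                "annotation_parser", "errors")
  stopifnot(is.list(result), all(expected %in% names(result)))
  no_errors <- !any(nzchar(as.character(result$errors)))
  no_errors && file.exists(result$file)
}

# Stand-in for the list returned by udpipe_train (no model is trained here)
model_path <- tempfile(fileext = ".udpipe")
file.create(model_path)
mock_result <- list(file = model_path,
                    annotation_tokenizer = "default",
                    annotation_tagger = "default",
                    annotation_parser = "default",
                    errors = "")

ok <- training_succeeded(mock_result)
failed <- training_succeeded(modifyList(mock_result, list(errors = "some message")))
print(c(ok = ok, failed = failed))
```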
https://ufal.mff.cuni.cz/udpipe/1/users-manual
udpipe_annotation_params, udpipe_annotate, udpipe_load_model, udpipe_accuracy
library(udpipe)

## You need a file on disk in CONLL-U format; here we take the toy example file shipped with the package
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
file_conllu
cat(head(readLines(file_conllu), 3), sep="\n")

## Not run: 
##
## This is a toy example showing how to build a model; it is not a good model whatsoever.
## Because model building takes more than 5 seconds, this model is also saved in
## the file at system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
##
m <- udpipe_train(file = "toymodel.udpipe", files_conllu_training = file_conllu,
  annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7),
  annotation_tagger = list(iterations = 1, models = 1,
     provide_xpostag = 1, provide_lemma = 0, provide_feats = 0,
     guesser_suffix_rules = 2, guesser_prefix_min_count = 2),
  annotation_parser = list(iterations = 2,
     embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0, embedding_form = 50,
     embedding_lemma = 0, embedding_deprel = 20, learning_rate = 0.01,
     learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200,
     batch_size = 10, transition_system = "projective", transition_oracle = "dynamic",
     structured_interval = 10))

## End(Not run)

## Load the pre-built toy model and annotate a Dutch sentence
file_model <- system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
ud_toymodel <- udpipe_load_model(file_model)
x <- udpipe_annotate(object = ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
x <- as.data.frame(x)

##
## The above was a toy example showing how to build a model. For real-life scenarios,
## look at the training parameter examples given below and train on your own CONLL-U file
##
## Example training arguments used for the models available at udpipe_download_model
data(udpipe_annotation_params)
head(udpipe_annotation_params$tokenizer)
head(udpipe_annotation_params$tagger)
head(udpipe_annotation_params$parser)

## Not run: 
## More details in the package vignette:
vignette("udpipe-train", package = "udpipe")

## End(Not run)