udpipe_download_model    R Documentation

Download a UDPipe model provided by the UDPipe community for a specific language of choice

Description

Ready-made models for 65 languages, trained on 101 treebanks from https://universaldependencies.org/, are provided to you. Some of these models were provided by the UDPipe community; others were built using this R package. You can either download these models manually and use them for annotation, or use udpipe_download_model to download the model for a specific language of choice. You have the following options:

Usage

udpipe_download_model(
  language = c("afrikaans-afribooms", "ancient_greek-perseus", "ancient_greek-proiel",
    "arabic-padt", "armenian-armtdp", "basque-bdt", "belarusian-hse", "bulgarian-btb",
    "buryat-bdt", "catalan-ancora", "chinese-gsd", "chinese-gsdsimp",
    "classical_chinese-kyoto", "coptic-scriptorium", "croatian-set", "czech-cac",
    "czech-cltt", "czech-fictree", "czech-pdt", "danish-ddt", "dutch-alpino",
    "dutch-lassysmall", "english-ewt", "english-gum", "english-lines", "english-partut",
    "estonian-edt", "estonian-ewt", "finnish-ftb", "finnish-tdt", "french-gsd",
    "french-partut", "french-sequoia", "french-spoken", "galician-ctg",
    "galician-treegal", "german-gsd", "german-hdt", "gothic-proiel", "greek-gdt",
    "hebrew-htb", "hindi-hdtb", "hungarian-szeged", "indonesian-gsd", "irish-idt",
    "italian-isdt", "italian-partut", "italian-postwita", "italian-twittiro",
    "italian-vit", "japanese-gsd", "kazakh-ktb", "korean-gsd", "korean-kaist",
    "kurmanji-mg", "latin-ittb", "latin-perseus", "latin-proiel", "latvian-lvtb",
    "lithuanian-alksnis", "lithuanian-hse", "maltese-mudt", "marathi-ufal",
    "north_sami-giella", "norwegian-bokmaal", "norwegian-nynorsk",
    "norwegian-nynorsklia", "old_church_slavonic-proiel", "old_french-srcmf",
    "old_russian-torot", "persian-seraji", "polish-lfg", "polish-pdb", "polish-sz",
    "portuguese-bosque", "portuguese-br", "portuguese-gsd", "romanian-nonstandard",
    "romanian-rrt", "russian-gsd", "russian-syntagrus", "russian-taiga", "sanskrit-ufal",
    "scottish_gaelic-arcosg", "serbian-set", "slovak-snk", "slovenian-ssj",
    "slovenian-sst", "spanish-ancora", "spanish-gsd", "swedish-lines",
    "swedish-talbanken", "tamil-ttb", "telugu-mtg", "turkish-imst", "ukrainian-iu",
    "upper_sorbian-ufal", "urdu-udtb", "uyghur-udt", "vietnamese-vtb", "wolof-wtb"),
  model_dir = getwd(),
  udpipe_model_repo = c("jwijffels/udpipe.models.ud.2.5",
    "jwijffels/udpipe.models.ud.2.4", "jwijffels/udpipe.models.ud.2.3",
    "jwijffels/udpipe.models.ud.2.0", "jwijffels/udpipe.models.conll18.baseline",
    "bnosac/udpipe.models.ud"),
  overwrite = TRUE,
  ...
)

Arguments

language

a character string naming the Universal Dependencies treebank which was used to build the model. The full set of accepted values is shown in the Usage section; possible values include:
afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb

Each language should have a treebank extension (e.g. english-ewt, russian-syntagrus, dutch-alpino, ...). If you do not provide a treebank extension (e.g. only english, russian, dutch), the function falls back to the default treebank of that language, as it was used in Universal Dependencies up to version 2.1.

model_dir

the path of the directory the model will be downloaded to. Defaults to the current working directory (getwd()).

udpipe_model_repo

location where the models will be downloaded from. Either 'jwijffels/udpipe.models.ud.2.5', 'jwijffels/udpipe.models.ud.2.4', 'jwijffels/udpipe.models.ud.2.3', 'jwijffels/udpipe.models.ud.2.0', 'jwijffels/udpipe.models.conll18.baseline' or 'bnosac/udpipe.models.ud'.
Defaults to 'jwijffels/udpipe.models.ud.2.5'.

  • 'bnosac/udpipe.models.ud' contains models mainly released under the CC-BY-SA license constructed on Universal Dependencies 2.1 data, and some models released under the GPL-3 and LGPL-LR license

  • 'jwijffels/udpipe.models.ud.2.5' contains models released under the CC-BY-NC-SA license constructed on Universal Dependencies 2.5 data

  • 'jwijffels/udpipe.models.ud.2.4' contains models released under the CC-BY-NC-SA license constructed on Universal Dependencies 2.4 data

  • 'jwijffels/udpipe.models.ud.2.3' contains models released under the CC-BY-NC-SA license constructed on Universal Dependencies 2.3 data

  • 'jwijffels/udpipe.models.ud.2.0' contains models released under the CC-BY-NC-SA license constructed on Universal Dependencies 2.0 data

  • 'jwijffels/udpipe.models.conll18.baseline' contains models released under the CC-BY-NC-SA license constructed on Universal Dependencies 2.2 data for the CoNLL 2018 shared task

See the Details section for further information on which languages are available in each of these repositories.
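The repositories above differ mainly in the Universal Dependencies version and in the license of the models. As a rough sketch, the table below restates these properties in a data.frame so they can be filtered programmatically. The object names repos and without_nc are illustrative only and not part of the package, and the license column is a simplification: licenses in 'bnosac/udpipe.models.ud' are treebank-specific, so check the repository before relying on this.

```r
# Summary of the repositories described above; values are copied from this page.
repos <- data.frame(
  repo = c("jwijffels/udpipe.models.ud.2.5",
           "jwijffels/udpipe.models.ud.2.4",
           "jwijffels/udpipe.models.ud.2.3",
           "jwijffels/udpipe.models.ud.2.0",
           "jwijffels/udpipe.models.conll18.baseline",
           "bnosac/udpipe.models.ud"),
  ud_version = c("2.5", "2.4", "2.3", "2.0", "2.2", "2.1"),
  license = c(rep("CC-BY-NC-SA", 5), "mainly CC-BY-SA"),
  stringsAsFactors = FALSE
)

# Repositories whose listed license has no non-commercial (NC) clause.
without_nc <- repos$repo[!grepl("NC", repos$license, fixed = TRUE)]
print(without_nc)
```

The selected repository name can then be passed as the udpipe_model_repo argument.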

overwrite

logical indicating whether to overwrite the file if it was already downloaded. Defaults to TRUE, meaning the model is downloaded and any existing file is overwritten. If set to FALSE, the model is only downloaded if it does not yet exist on disk in the model_dir folder.
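The rule described for overwrite can be written as a small decision function. The helper needs_download below is a hypothetical illustration of the documented behaviour, not the package's internal code, and the temporary files merely stand in for model files.

```r
# Hypothetical helper illustrating the documented rule; not udpipe's internal code.
needs_download <- function(file_model, overwrite = TRUE) {
  # Download when overwriting is requested or when no file is on disk yet.
  overwrite || !file.exists(file_model)
}

existing <- tempfile(fileext = ".udpipe")
file.create(existing)                          # stand-in for an already downloaded model
missing_file <- tempfile(fileext = ".udpipe")  # a path where nothing exists yet

results <- c(
  existing_overwrite    = needs_download(existing, overwrite = TRUE),
  existing_no_overwrite = needs_download(existing, overwrite = FALSE),
  missing_no_overwrite  = needs_download(missing_file, overwrite = FALSE)
)
print(results)

unlink(existing)  # clean up the stand-in file
```

Only the middle case skips the download, which is why overwrite = FALSE is convenient in scripts that are run repeatedly.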

...

currently not used

Details

The function allows you to download the following language models based on your setting of argument udpipe_model_repo:

  • 'jwijffels/udpipe.models.ud.2.5': https://github.com/jwijffels/udpipe.models.ud.2.5

    • UDPipe models constructed on data from Universal Dependencies 2.5

    • languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb

    • license: CC-BY-NC-SA

  • 'jwijffels/udpipe.models.ud.2.4': https://github.com/jwijffels/udpipe.models.ud.2.4

    • UDPipe models constructed on data from Universal Dependencies 2.4

    • languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-vit, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb

    • license: CC-BY-NC-SA

  • 'jwijffels/udpipe.models.ud.2.3': https://github.com/jwijffels/udpipe.models.ud.2.3

    • UDPipe models constructed on data from Universal Dependencies 2.3

    • languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, catalan-ancora, chinese-gsd, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, japanese-gsd, korean-gsd, korean-kaist, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, urdu-udtb, uyghur-udt, vietnamese-vtb

    • license: CC-BY-NC-SA

  • 'jwijffels/udpipe.models.ud.2.0': https://github.com/jwijffels/udpipe.models.ud.2.0

    • UDPipe models constructed on data from Universal Dependencies 2.0

    • languages-treebanks: ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese

    • license: CC-BY-NC-SA

  • 'jwijffels/udpipe.models.conll18.baseline': https://github.com/jwijffels/udpipe.models.conll18.baseline

    • UDPipe models constructed on data from Universal Dependencies 2.2

    • languages-treebanks: afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, croatian-set, czech-cac, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-postwita, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, mixed, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, romanian-rrt, russian-syntagrus, russian-taiga, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, swedish-lines, swedish-talbanken, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb

    • license: CC-BY-NC-SA

  • 'bnosac/udpipe.models.ud': https://github.com/bnosac/udpipe.models.ud

    • UDPipe models constructed on data from Universal Dependencies 2.1

    • This repository contains models built with this R package on open data from Universal Dependencies 2.1, which allows for commercial usage. The license of these models is mostly CC-BY-SA. Visit that GitHub repository for details on the license of the language of your choice. Contact www.bnosac.be if you need support on these models or require models tuned to your needs.

    • languages-treebanks: afrikaans, croatian, czech-cac, dutch, english, finnish, french-sequoia, irish, norwegian-bokmaal, persian, polish, portuguese, romanian, serbian, slovak, spanish-ancora, swedish

    • license: license is treebank-specific but mainly CC-BY-SA and GPL-3 and LGPL-LR

  • If you need to train models yourself for commercial purposes, or if you want to improve models, you can do this with udpipe_train, which is explained in detail in the package vignette.

Note that when you download these models, you agree to comply with the license of your specific language model.

Value

A data.frame with 1 row and the following columns:

  • language: The language as provided by the input parameter language

  • file_model: The path to the file on disk where the model was downloaded to

  • url: The URL where the model was downloaded from

  • download_failed: A logical indicating whether the download failed, for example due to internet connectivity issues

  • download_message: A character string with the error message in case the downloading of the model failed
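Because a failed download is reported in the returned data.frame, it can be useful to check download_failed before using file_model. The wrapper below is a minimal sketch of such a check. The names download_or_stop, downloader and fake_failure are hypothetical and not part of the package. The downloader argument exists only so the check can be demonstrated offline with a stand-in that mimics the documented columns.

```r
# Hypothetical wrapper (not part of udpipe): stop with a clear message on failure,
# otherwise return the path of the downloaded model file.
download_or_stop <- function(language, ...,
                             downloader = udpipe::udpipe_download_model) {
  res <- downloader(language = language, ...)
  if (isTRUE(res$download_failed)) {
    stop("Download of '", language, "' failed: ", res$download_message,
         call. = FALSE)
  }
  res$file_model
}

# Offline demonstration: a stand-in downloader returning the documented columns.
fake_failure <- function(language, ...) {
  data.frame(language = language, file_model = NA_character_,
             url = NA_character_, download_failed = TRUE,
             download_message = "no connection", stringsAsFactors = FALSE)
}

outcome <- tryCatch(
  download_or_stop("dutch-alpino", downloader = fake_failure),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

With the default downloader, a call such as download_or_stop("dutch-alpino", model_dir = tempdir()) would return the model path, which can then be passed to udpipe_load_model.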

References

https://ufal.mff.cuni.cz/udpipe, https://github.com/jwijffels/udpipe.models.ud.2.5, https://github.com/jwijffels/udpipe.models.ud.2.4, https://github.com/jwijffels/udpipe.models.ud.2.3, https://github.com/jwijffels/udpipe.models.conll18.baseline, https://github.com/jwijffels/udpipe.models.ud.2.0, https://github.com/bnosac/udpipe.models.ud

See Also

udpipe_load_model

Examples

## Not run: 
x <- udpipe_download_model(language = "dutch-alpino")
x <- udpipe_download_model(language = "dutch-lassysmall")
x <- udpipe_download_model(language = "russian")
x <- udpipe_download_model(language = "french")
x <- udpipe_download_model(language = "english-partut")
x <- udpipe_download_model(language = "english-ewt")
x <- udpipe_download_model(language = "german-gsd")
x <- udpipe_download_model(language = "spanish-gsd")
x <- udpipe_download_model(language = "spanish-gsd", overwrite = FALSE)

x <- udpipe_download_model(language = "dutch-alpino", 
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.5")
x <- udpipe_download_model(language = "dutch-alpino", 
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.4")
x <- udpipe_download_model(language = "dutch-alpino", 
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.3")
x <- udpipe_download_model(language = "dutch-alpino", 
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0")
x <- udpipe_download_model(language = "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "dutch", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "afrikaans", udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "spanish-ancora", 
                           udpipe_model_repo = "bnosac/udpipe.models.ud")
x <- udpipe_download_model(language = "dutch-ud-2.1-20180111.udpipe", 
                           udpipe_model_repo = "bnosac/udpipe.models.ud")                           
x <- udpipe_download_model(language = "english", 
                           udpipe_model_repo = "jwijffels/udpipe.models.conll18.baseline")

## End(Not run)

x <- udpipe_download_model(language = "sanskrit", 
                           udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0", 
                           model_dir = tempdir())
x
## cleanup for CRAN
if(file.exists(x$file_model)) file.remove(x$file_model)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.