R/udpipe_train.R


#' @title Train a UDPipe model
#' @description Train a UDPipe model which allows you to do 
#' Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing or a combination of those. \cr
#' 
#' This function allows you to build models based on data in CONLL-U format
#' as described at \url{https://universaldependencies.org/format.html}. At the time of writing, open data in CONLL-U
#' format is available for more than 50 languages at \url{https://universaldependencies.org}. 
#' Most of these data are distributed under the CC-BY-SA or the CC-BY-NC-SA license. \cr
#' 
#' This function allows you to build annotation tagger models based on these data in CONLL-U format, giving you 
#' your own tagger model. This is relevant if you want to tune the tagger to your needs
#' or if you don't want to use ready-made models provided under the CC-BY-NC-SA license as shown at \code{\link{udpipe_load_model}}.
#' @param file full path where the model will be saved. The model will be stored as a binary file which \code{\link{udpipe_load_model}}
#' can handle. Defaults to 'my_annotator.udpipe' in the current working directory. 
#' @param files_conllu_training a character vector of files in CONLL-U format used for training the model
#' @param files_conllu_holdout a character vector of files in CONLL-U format used for holdout evaluation of the model. This argument is optional.
#' @param annotation_tokenizer a string containing options for the tokenizer. This can be either 'none' or 'default' or a list 
#' of options as mentioned in the UDPipe manual. See the vignette \code{vignette("udpipe-train", package = "udpipe")} or
#' go directly to \url{https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_tokenizer} for a full description of the options
#' or see the examples below.
#' Defaults to 'default'. If you specify 'none', the model will not be able to perform tokenization.
#' @param annotation_tagger a string containing options for the POS tagger and lemmatizer. This can be either 'none' or 'default' or a list 
#' of options as mentioned in the UDPipe manual. See the vignette \code{vignette("udpipe-train", package = "udpipe")} or
#' go directly to \url{https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_tagger} for a full description of the options
#' or see the examples below.
#' Defaults to 'default'. If you specify 'none', the model will not be able to perform POS tagging or lemmatization.
#' @param annotation_parser a string containing options for the dependency parser. This can be either 'none' or 'default' or a list 
#' of options as mentioned in the UDPipe manual. See the vignette \code{vignette("udpipe-train", package = "udpipe")} or
#' go directly to \url{https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_parser} for a full description of the options
#' or see the examples below.
#' Defaults to 'default'. If you specify 'none', the model will not be able to perform dependency parsing.
#' @return A list with elements 
#' \itemize{
#'  \item{file_model: }{The path to the model, which can be used in \code{udpipe_load_model}}
#'  \item{annotation_tokenizer: }{The input argument \code{annotation_tokenizer}}
#'  \item{annotation_tagger: }{The input argument \code{annotation_tagger}}
#'  \item{annotation_parser: }{The input argument \code{annotation_parser}}
#'  \item{errors: }{Messages from the UDPipe process indicating possible errors, for example when passing wrong arguments to 
#'  annotation_tokenizer, annotation_tagger or annotation_parser}
#' }
#' @seealso \code{\link{udpipe_annotation_params}}, \code{\link{udpipe_annotate}}, \code{\link{udpipe_load_model}},
#' \code{\link{udpipe_accuracy}}
#' @references \url{https://ufal.mff.cuni.cz/udpipe/1/users-manual}
#' @details 
#' In order to train a model, you need to provide files which are in CONLL-U format in argument \code{files_conllu_training}. 
#' This can be a vector of files or just one file. If you do not have your own CONLL-U files, you can download files for your language of 
#' choice at \url{https://universaldependencies.org}. \cr
#' 
#' At the time of writing, open data in CONLL-U
#' format is available for 50 languages at \url{https://universaldependencies.org}, namely for: 
#' ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, 
#' czech, danish, dutch, english, estonian, finnish, french, galician, german, gothic, greek, hebrew, hindi, hungarian, 
#' indonesian, irish, italian, japanese, kazakh, korean, latin, latvian, lithuanian, norwegian, 
#' old_church_slavonic, persian, polish, portuguese, romanian, russian, sanskrit, slovak, 
#' slovenian, spanish, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese. 
#' @export
#' @examples 
#' ## You need a file on disk in CONLL-U format; here we use the toy example file shipped with the package
#' file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
#' file_conllu
#' cat(head(readLines(file_conllu), 3), sep="\n")
#' 
#' \dontrun{
#' ##
#' ## This is a toy example showing how to build a model; it is not a good model whatsoever. 
#' ##   Because model building takes more than 5 seconds, this model is also saved in 
#' ##   the file at system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
#' ##
#' m <- udpipe_train(file = "toymodel.udpipe", files_conllu_training = file_conllu, 
#'   annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7), 
#'   annotation_tagger = list(iterations = 1, models = 1, 
#'      provide_xpostag = 1, provide_lemma = 0, provide_feats = 0, 
#'      guesser_suffix_rules = 2, guesser_prefix_min_count = 2), 
#'   annotation_parser = list(iterations = 2, 
#'      embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0, embedding_form = 50, 
#'      embedding_lemma = 0, embedding_deprel = 20, learning_rate = 0.01, 
#'      learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200, 
#'      batch_size = 10, transition_system = "projective", transition_oracle = "dynamic", 
#'      structured_interval = 10))
#' }
#' 
#' file_model <- system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
#' ud_toymodel <- udpipe_load_model(file_model)
#' x <- udpipe_annotate(object = ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
#' x <- as.data.frame(x)
#' 
#' ##
#' ## The above was a toy example showing how to build a model. For real-life scenarios,
#' ## look at the training parameter examples given below and train on your own CONLL-U file
#' ##
#' ## Example training arguments used for the models available at udpipe_download_model
#' data(udpipe_annotation_params)
#' head(udpipe_annotation_params$tokenizer)
#' head(udpipe_annotation_params$tagger)
#' head(udpipe_annotation_params$parser)
#' \dontrun{
#' ## More details in the package vignette:
#' vignette("udpipe-train", package = "udpipe")
#' }
udpipe_train <- function(file = file.path(getwd(), "my_annotator.udpipe"), 
                         files_conllu_training, files_conllu_holdout = character(), 
                         annotation_tokenizer = "default", 
                         annotation_tagger = "default", 
                         annotation_parser = "default") {
  
  collapse_list <- function(x){
    if(is.list(x)){
      x <- mapply(names(x), x, FUN=function(key, value) sprintf("%s=%s", key, value), SIMPLIFY = TRUE, USE.NAMES = FALSE)
      x <- paste(x, collapse = ";")  
    }
    x
  }
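  ## e.g. collapse_list(list(dimension = 16, epochs = 1)) gives "dimension=16;epochs=1",
  ## the semicolon-separated option string that UDPipe expects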
  annotation_tokenizer <- collapse_list(annotation_tokenizer)
  annotation_tagger <- collapse_list(annotation_tagger)
  annotation_parser <- collapse_list(annotation_parser)
  
  file <- path.expand(file)
  files_conllu_training <- path.expand(files_conllu_training)
  files_conllu_holdout <- path.expand(files_conllu_holdout)
  if(!dir.exists(dirname(file))){
    dir.create(dirname(file), recursive = TRUE)
  }
  stopifnot(all(file.exists(files_conllu_training)))
  stopifnot(all(file.exists(files_conllu_holdout)))
  result <- udp_train(file, files_conllu_training, files_conllu_holdout, 
            annotation_tokenizer, annotation_tagger, annotation_parser)

  structure(
    list(file_model = result$file_model, 
         annotation_tokenizer = annotation_tokenizer,
         annotation_tagger = annotation_tagger,
         annotation_parser = annotation_parser,
         errors = result$errors),
    class = "udpipe_trained_model")
}
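
## A minimal usage sketch: after training, check the 'errors' element before
## loading the model (file names below are illustrative only)
## m <- udpipe_train(file = "my_annotator.udpipe", files_conllu_training = "train.conllu")
## if(any(nchar(m$errors) > 0)) warning(paste(m$errors, collapse = "\n"))
## ud_model <- udpipe_load_model(m$file_model)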



#' @title Evaluate the accuracy of your UDPipe model on holdout data
#' @description Get precision, recall and F1 measures on finding words / sentences / upos / xpos / features
#' annotations, as well as UAS and LAS dependency scores, on holdout data in CONLL-U format.
#' @param object an object of class \code{udpipe_model} as returned by \code{\link{udpipe_load_model}}
#' @param file_conllu the full path to a file on disk containing holdout data in CONLL-U format
#' @param tokenizer a character string of length 1, which is either 'default' or 'none'
#' @param tagger a character string of length 1, which is either 'default' or 'none'
#' @param parser a character string of length 1, which is either 'default' or 'none'
#' @return a list with 2 elements
#' \itemize{
#'  \item{accuracy: }{A character vector with accuracy metrics.}
#'  \item{error: }{A character string with possible errors when calculating the accuracy metrics}
#' }
#' @seealso \code{\link{udpipe_load_model}}
#' @references \url{https://ufal.mff.cuni.cz/udpipe}, 
#' \url{https://universaldependencies.org/format.html}
#' @export
#' @examples 
#' model <- udpipe_download_model(language = "dutch-lassysmall")
#' if(!model$download_failed){
#'   ud_dutch <- udpipe_load_model(model$file_model)
#' 
#'   file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
#'   metrics <- udpipe_accuracy(ud_dutch, file_conllu)
#'   metrics$accuracy
#'   metrics <- udpipe_accuracy(ud_dutch, file_conllu, 
#'                              tokenizer = "none", tagger = "default", parser = "default")
#'   metrics$accuracy
#'   metrics <- udpipe_accuracy(ud_dutch, file_conllu, 
#'                              tokenizer = "none", tagger = "none", parser = "default")
#'   metrics$accuracy
#'   metrics <- udpipe_accuracy(ud_dutch, file_conllu, 
#'                              tokenizer = "default", tagger = "none", parser = "none")
#'   metrics$accuracy
#' }
#' 
#' 
#' ## cleanup for CRAN only - you probably want to keep your model if you have downloaded it
#' if(file.exists(model$file_model)) file.remove(model$file_model)
udpipe_accuracy <- function(object, 
                            file_conllu, 
                            tokenizer = c("default", "none"), 
                            tagger = c("default", "none"), 
                            parser = c("default", "none")) {
  if(!inherits(object, "udpipe_model")){
    stop("object should be of class udpipe_model as returned by the function ?udpipe_load_model")
  }
  stopifnot(file.exists(file_conllu))
  tokenizer <- match.arg(tokenizer)
  tagger <- match.arg(tagger)
  parser <- match.arg(parser)
  
  f <- tempfile()
  out <- udp_evaluate(object$model, file_conllu, f, tokenizer, tagger, parser)
  out$accuracy <- readLines(f)
  class(out) <- "udpipe_accuracy"
  out
}
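
## A small helper sketch for pulling the f1 percentages out of the 'accuracy'
## lines returned above. It assumes the UDPipe evaluator prints lines ending in
## e.g. 'precision: 99.50%, recall: 99.39%, f1: 99.45%'; the helper name and
## regex are illustrative, not part of the package API.
accuracy_f1 <- function(accuracy){
  lines <- grep("f1:", accuracy, value = TRUE, fixed = TRUE)
  f1 <- as.numeric(sub(".*f1: ([0-9.]+)%.*", "\\1", lines))
  names(f1) <- sub(" - .*", "", lines)
  f1
}
## e.g. accuracy_f1(metrics$accuracy)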



#' @title Read in a CONLL-U file as a data.frame
#' @description Read in a CONLL-U file as a data.frame
#' @param file a connection object or a character string with the location of the file
#' @return a data.frame with columns doc_id, paragraph_id, sentence_id, sentence, 
#' token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc
#' @export
#' @examples 
#' file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
#' x <- udpipe_read_conllu(file_conllu)
#' head(x)
udpipe_read_conllu <- function(file){
  x <- list(conllu = paste(readLines(file, encoding = "UTF-8"), collapse = "\n"),
            error = "")
  read_connlu(x, is_udpipe_annotation = FALSE)
}



#' @title Convert a data.frame to CONLL-U format 
#' @description If you have a data.frame with annotations containing 1 row per token, you can convert it to CONLL-U format with this function.
#' The data frame is required to have the following columns: doc_id, sentence_id, sentence, token_id, token and
#' optionally has the following columns: lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc. These fields
#' have the following meaning:
#' \itemize{
#' \item doc_id: the identifier of the document
#' \item sentence_id: the identifier of the sentence
#' \item sentence: the text of the sentence this token is part of
#' \item token_id: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
#' \item token: Word form or punctuation symbol.
#' \item lemma: Lemma or stem of word form.
#' \item upos: Universal part-of-speech tag.
#' \item xpos: Language-specific part-of-speech tag; underscore if not available.
#' \item feats: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
#' \item head_token_id: Head of the current word, which is either a value of token_id or zero (0).
#' \item dep_rel: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
#' \item deps: Enhanced dependency graph in the form of a list of head-deprel pairs.
#' \item misc: Any other annotation.
#' }
#' The tokens in the data.frame should be ordered as they appear in the sentence.
#' @param x a data.frame with columns doc_id, sentence_id, sentence, 
#' token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc
#' @return a character string of length 1 containing the data.frame in CONLL-U format. See the example. You can easily save this to disk for processing in other applications.
#' @export
#' @references \url{https://universaldependencies.org/format.html}
#' @examples 
#' file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
#' x <- udpipe_read_conllu(file_conllu)
#' str(x)
#' conllu <- as_conllu(x)
#' cat(conllu)
#' \dontrun{
#' ## Write it to file, making sure it is in UTF-8
#' cat(as_conllu(x), file = file("annotations.conllu", encoding = "UTF-8"))
#' }
#' 
#' ## Some fields are not mandatory; they will be assumed to be NA
#' conllu <- as_conllu(x[, c('doc_id', 'sentence_id', 'sentence', 
#'                           'token_id', 'token', 'upos')])
#' cat(conllu)
as_conllu <- function(x){
  stopifnot(is.data.frame(x))
  stopifnot(all(c("doc_id", "sentence_id", "sentence", "token_id", "token") %in% colnames(x)))
  replaceNA <- function(x){
    x[is.na(x)] <- "_"
    x
  }
  required_fields <- c("token_id", "token", "lemma", "upos", "xpos", "feats", "head_token_id", "dep_rel", "deps", "misc")
  missing_fields <- setdiff(required_fields, colnames(x))
  if(length(missing_fields) > 0){
    message(sprintf("Missing columns %s => will set them to NA", paste(missing_fields, collapse = ", ")))
  }
  for(field in required_fields){
    if(!field %in% colnames(x)){
      x[[field]] <- NA
    }
    x[[field]] <- replaceNA(x[[field]])
  }
  ##
  ## Conllu files basically look like this:
  ##
  ## # newdoc id = document_identification
  ## # sent_id = idofthesentence
  ## # text = From the BlogFeed comes this story :
  ## 1	From	from	ADP	IN	_	3	case	3:case	_
  ## ...
  ##
  x$.newdoc <- !duplicated(x$doc_id)
  x$.first <- !duplicated(x[, c("doc_id", "sentence_id")])
  ## Build the sentence header: an optional '# newdoc id' line plus the '# sent_id' and '# text' comment lines
  x$.header <- ifelse(x$.first, sprintf("%s%s# sent_id = %s\n# text = %s\n", 
                                        ifelse(seq_len(nrow(x)) == 1, "", "\n"), 
                                        ifelse(x$.newdoc, sprintf("# newdoc id = %s\n", x$doc_id), ""),
                                        x$sentence_id, x$sentence), "")
  x$.core <- sprintf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", 
                     x$token_id, 
                     x$token, 
                     x$lemma, 
                     x$upos, 
                     x$xpos, 
                     x$feats, 
                     x$head_token_id, 
                     x$dep_rel, 
                     x$deps, 
                     x$misc)
  txt <- paste(x$.header, x$.core, sep = "")
  txt <- paste(txt, collapse = "")
  sprintf("%s\n", txt)
}
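
## For illustration: a minimal one-token data.frame such as
##   data.frame(doc_id = "d1", sentence_id = 1, sentence = "Hi", token_id = 1, token = "Hi")
## is rendered by as_conllu (missing fields filled in as '_') as:
## # newdoc id = d1
## # sent_id = 1
## # text = Hi
## 1  Hi  _  _  _  _  _  _  _  _   (tab-separated)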


#' @title Convert a matrix of word vectors to word2vec format 
#' @description The word2vec format provides on its first line the number of words and the dimension of the word vectors;
#' each of the following lines contains a word or token followed by the elements of its word vector.\cr
#' 
#' The function is basically a utility function which allows one to write word vectors created with other R packages in 
#' the well-known word2vec format, which is used by \code{udpipe_train} to train the dependency parser.
#' @param x a matrix of word vectors where the rownames indicate the word or token and the number of columns
#' of the matrix indicates the size of the word vectors
#' @return a character string of length 1 containing the word vectors in word2vec format which can be written to a file on disk
#' @export
#' @examples
#' wordvectors <- matrix(rnorm(1000), nrow = 100, ncol = 10)
#' rownames(wordvectors) <- sprintf("word%s", seq_len(nrow(wordvectors)))
#' wv <- as_word2vec(wordvectors)
#' cat(wv)
#' 
#' f <- file(tempfile(fileext = ".txt"), encoding = "UTF-8")
#' cat(wv, file = f)
#' close(f)
as_word2vec <- function(x){
  stopifnot(is.matrix(x))
  line1 <- sprintf("%s %s", nrow(x), ncol(x))
  lines_rest <- mapply(token = rownames(x), i = seq_len(nrow(x)), FUN=function(token, i){
    line <- sprintf("%s %s", token, paste(as.character(x[i, ]), collapse = " "))
    line
  })
  x <- c(line1, lines_rest)
  paste(x, collapse = "\n")
}
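
## A reverse sketch for reading such a plain-text word2vec file back into a
## matrix. The function name and the file layout assumption (space-separated,
## one header line, as written by as_word2vec above) are illustrative only.
word2vec_as_matrix <- function(file){
  lines <- readLines(file, encoding = "UTF-8")
  parts <- strsplit(lines[-1], " ", fixed = TRUE)   # drop the 'words dimension' header line
  m <- do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
  rownames(m) <- vapply(parts, FUN = `[[`, FUN.VALUE = character(1), 1)
  m
}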




#' @title Combine labels and text as used in fastText
#' @description fastText prepends one or more labels to text using a special string (__label__).
#' This function takes a character vector of text and prepends the labels together with that special string.
#' @param x a character vector
#' @param y a character vector of labels or a list of labels. \code{y} should be of the same length as \code{x} 
#' @param label the string to prepend to each label. Defaults to __label__
#' @return a character vector of text where \code{x} and \code{y} are combined
#' @export 
#' @examples 
#' as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), 
#'             y = c("pos", "neg", "neg", NA))
#' as_fasttext(x = c("just a bit of txt", "example2", "more txt please", "more"), 
#'             y = list(c("ok", "pos"), c("neg", "topic2"), "", NA))
as_fasttext <- function(x, y, label = "__label__"){
  if(is.list(y)){
    targets <- sapply(y, FUN=function(x){
      if(length(x) == 0 || all(is.na(x))){
        return(NA_character_)
      }
      paste(paste(label, x, sep = ""), collapse = " ") 
    })
  }else{
    targets <- ifelse(is.na(y), NA_character_, paste(label, y, sep = ""))
  }
  x <- ifelse(is.na(targets), x, paste(targets, x, sep = " "))
  x
}
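
## e.g. as_fasttext("just a bit of txt", "pos") gives "__label__pos just a bit of txt";
## an NA label leaves the text unchanged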
