options(width = 1000)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA)
print.data.frame <- function(x, ...){
  # Print data frames without row names to keep the output compact
  base::print.data.frame(x, row.names = FALSE, ...)
}

General

Part-of-speech (POS) tagging is the process of assigning a category (such as verb, noun or adverb) to each word in a given text. It is a crucial step in many statistical text-processing pipelines. This R package provides out-of-the-box POS tagging for 45 languages. It does this by wrapping the Ripple Down Rules-based Part-Of-Speech Tagger (RDRPOSTagger) available at https://github.com/datquocnguyen/RDRPOSTagger.

Types of tagging/languages

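The R package allows you to perform 3 types of tagging, selected with the annotation argument of rdr_model: UniversalPOS, POS and MORPH. The languages available for each type are listed by rdr_available_models().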

library(RDRPOSTagger)
rdr_available_models()

Examples

To tag text with one of these taggers, proceed as follows. First create an object of class RDRPOSTagger by specifying the language and the type of tagging, using one of the combinations listed by rdr_available_models(). This extracts the rules and the dictionary from the corresponding file in the Models folder of this package.

library(RDRPOSTagger)
tagger <- rdr_model(language = "Dutch", annotation = "UniversalPOS")
tagger

The model contains the rules which were learned during training on a corpus of that language. For details on how the corpus was collected and on the specific treebank, see http://universaldependencies.org.
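To make the idea of "a dictionary plus rules" more concrete, here is a deliberately tiny sketch in base R. It assigns an initial tag from a lexicon and then lets one contextual rule correct that tag. The lexicon, the rule and the function toy_tag are hypothetical and only illustrate the general idea; the rules in the real models are learned from a treebank and are stored in a different format.

```r
# Toy illustration only: this is NOT the RDRPOSTagger rule or dictionary format.
# Hypothetical lexicon mapping lower-cased words to their most frequent tag.
lexicon <- c(the = "DET", dog = "NOUN", walks = "VERB", walk = "VERB", a = "DET")

toy_tag <- function(tokens, lexicon, default = "NOUN") {
  # Step 1: initial tags from the dictionary, with a default for unknown words
  tags <- unname(lexicon[tolower(tokens)])
  tags[is.na(tags)] <- default
  # Step 2: one hand-written contextual rule that corrects the initial tags:
  # a word tagged VERB directly after a determiner is retagged as NOUN
  previous <- c("", head(tags, -1))
  tags[tags == "VERB" & previous == "DET"] <- "NOUN"
  data.frame(token = tokens, pos = tags, stringsAsFactors = FALSE)
}

toy_tag(c("The", "dog", "walks"), lexicon)
toy_tag(c("A", "walk"), lexicon)
```

In the second call the lexicon alone would tag "walk" as a verb; the contextual rule changes it to a noun because it follows a determiner.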

If you just want to use the models to tag text, provide a character vector of text and call rdr_pos. The output is always a data.frame with one row per token. It contains the fields doc_id, token_id, token and pos, where pos holds the POS tag assigned to that word or token.

If you want to find out the meaning of the different POS tags, visit http://universaldependencies.org.

x <- c("Dus godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg.", 
 "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
 "  ", "", NA)
rdr_pos(tagger, x = x)
## Another example using a MORPH tagger 
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)
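Because the result is a plain data.frame, it can be summarised with base R. The sketch below does not call the package: it builds a small data.frame by hand with the same four columns (doc_id, token_id, token, pos) and made-up values, then counts tags per document and filters on one tag. The document ids and tags shown are assumptions for illustration, not real model output.

```r
# Hypothetical rdr_pos-style output, built by hand for illustration;
# the ids and tags are made up and not produced by a real model.
tagged <- data.frame(
  doc_id   = c("d1", "d1", "d1", "d2", "d2"),
  token_id = c(1L, 2L, 3L, 1L, 2L),
  token    = c("Dogs", "bark", ".", "Cats", "sleep"),
  pos      = c("NOUN", "VERB", "PUNCT", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# Frequency of each tag per document
tag_counts <- table(tagged$doc_id, tagged$pos)
tag_counts

# Keep only the nouns, for example as candidate keywords
nouns <- tagged[tagged$pos == "NOUN", c("doc_id", "token")]
nouns
```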

Note that rdr_pos requires spaces around punctuation symbols. These are added by default when you run the function. If you switch this off and your text has no such spaces, a punctuation symbol and the adjacent word are treated as a single token, which is probably not what you want.

## Another example using a POS tagger 
tagger <- rdr_model(language = "French", annotation = "POS")
rdr_pos(tagger, 
        x = c("Il pleure dans mon coeur comme il pleut sur la ville."), 
        add_space_around_punctuations = FALSE)
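The effect of padding punctuation with spaces can be seen with a few lines of base R. This is only an illustration of the idea using gsub and strsplit; it is an assumption that the package's own preprocessing behaves similarly, and its exact implementation may differ.

```r
# Illustration in base R only; the package's own preprocessing may differ.
sentence <- "Il pleut sur la ville."

# Without padding, splitting on whitespace keeps "ville." as one token
raw_tokens <- strsplit(sentence, "[[:space:]]+")[[1]]
raw_tokens

# With a space inserted around each punctuation symbol, "." becomes its own token
padded <- gsub("([[:punct:]])", " \\1 ", sentence)
padded_tokens <- strsplit(trimws(padded), "[[:space:]]+")[[1]]
padded_tokens
```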

You can also provide a vector of document ids, which you can use later to link the output back to your database of documents.

tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, 
        x = c("We do not have health care, that is our idea of a state. I love that!", 
              "We also call that freedom."),
        doc_id = c("identifier_abc", "id_123"))
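Linking the tagged tokens back to document-level data can then be done with a regular join on doc_id. The sketch below uses hand-built data.frames rather than real tagger output; the documents table, its source column and the tags shown are hypothetical.

```r
# Hypothetical tagged output and document table, built by hand for illustration
tagged <- data.frame(
  doc_id = c("identifier_abc", "identifier_abc", "id_123"),
  token  = c("We", "do", "We"),
  pos    = c("PRP", "VBP", "PRP"),
  stringsAsFactors = FALSE
)
documents <- data.frame(
  doc_id = c("identifier_abc", "id_123"),
  source = c("speech", "interview"),
  stringsAsFactors = FALSE
)

# Join the token-level annotations back to the document-level metadata
linked <- merge(tagged, documents, by = "doc_id", all.x = TRUE)
linked
```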

Details

Background

More information about the model and the tagging can be found at https://github.com/datquocnguyen/RDRPOSTagger

The general architecture and experimental results of RDRPOSTagger are described in the papers listed in that repository.

License

The package is licensed under the GPL-3 license as described at http://www.gnu.org/licenses/gpl-3.0.html.

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be



bnosac/RDRPOSTagger documentation built on May 8, 2019, 3:43 p.m.