library(knitr)
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
````

```r
knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="")
    }
    })
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

tagger wraps the NLP and openNLP packages for easier part of speech tagging. tagger uses the openNLP annotator to compute "Penn Treebank parse annotations using the Apache OpenNLP chunking parser for English."

The main functions and descriptions are listed in the table below.

| Function    | Description                                       |
|-------------|---------------------------------------------------|
| tag_pos     | Tag parts of speech                               |
| select_tags | Select specific part of speech tags from tag_pos  |
| count_tags  | Cross tabs of tags by grouping variable           |

Installation

To download the development version of tagger:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(c(
    "trinker/termco", 
    "trinker/coreNLPsetup",
    "trinker/tagger"
))

Contact

You are welcome to:

* submit suggestions and bug-reports at: https://github.com/trinker/tagger/issues
* send a pull request on: https://github.com/trinker/tagger/
* compose a friendly e-mail to: tyler.rinker@gmail.com

Examples

The following examples demonstrate some of the functionality of tagger.

Load the Tools/Data

library(dplyr); library(tagger)
data(presidential_debates_2012)
mwe <- tibble(  # tibble() replaces the deprecated data_frame(); dplyr re-exports it
    person = c("Tyler", "Norah", "Tyler"),
    talk = c(
        "I need $54 to go to the movies.",
        "They refuse to permit us to obtain the refuse permit",
        "This is the tagger package; like it?"
    )
)

Tagging

Let's begin with a minimal example.

tag_pos(mwe$talk)

Note that the output pretty prints, but the underlying structure is simply a list of named vectors, where the elements in the vectors are the tokens and the names are the part of speech tags. We can use c on the object to see its true structure.

tag_pos(mwe$talk) %>%
    c()

Let's try it on a larger example, the built-in presidential_debates_2012 data set. It takes 30 seconds or so to run, depending on the machine.

tag_pos(presidential_debates_2012$dialogue)

This output is built into tagger as the presidential_debates_2012_pos data set, which we'll use from this point on in the demo.

Note that the user may choose to use CoreNLP as a backend by setting engine = "coreNLP". To ensure that coreNLP is set up properly, use check_setup.

Plotting

The user can generate a horizontal barplot of the used tags.

presidential_debates_2012_pos %>%
    plot()

Interpreting Tags

The tags generated by openNLP are from Penn Treebank. As such there are many tags, more than the few parts of speech we learned in grade school. Remembering the meaning of each tag may be difficult, so the penn_tags function creates a left aligned data frame of the possible tags and their meanings.

penn_tags()

Counts

The user can generate a count of the tags by grouping variable as well. The number of columns explodes quickly, even with this minimal example.

tag_pos(mwe$talk) %>%
    count_tags(mwe$person) 

The default is pretty printing (counts + proportions), which can be turned off to print raw counts only.

tag_pos(mwe$talk) %>%
    count_tags(mwe$person) %>%
    print(pretty = FALSE)

Select Tags

The user may wish to select specific tags. The select_tags function enables selection of specific tags via element matching (which can be negated) or regular expression.

Here we select only the nouns.

presidential_debates_2012_pos %>%
    select_tags(c("NN", "NNP", "NNPS", "NNS"))

This could also have been accomplished with a simpler regex call by setting regex = TRUE.

presidential_debates_2012_pos %>%
    select_tags("NN", regex=TRUE)

In this way we could quickly select the nouns and verbs with the following call.

presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE)

Note that the output is a tag_pos object, so the plot, count_tags, and as_word_tag functions can be used on the result.

presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE) %>%
    plot()
presidential_debates_2012_pos %>%
    select_tags("^(VB|NN)", regex=TRUE) %>%
    count_tags()

Altering Tag Display

As Word Tags

The traditional way to display tags is to incorporate them into the sentence, placing them after/before their respective token, separated by a forward slash (e.g., talk/VB). This is the default printing style of tag_pos though not truly the structure of the output. The user can coerce the underlying structure with the as_word_tag function, converting the named list of vectors into a list of part of speech incorporated, unnamed vectors. Below I only print the first 6 elements of as_word_tag.

presidential_debates_2012_pos %>%
    as_word_tag() %>%
    head()
````
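
The word/tag display described above can be sketched in base R. This is only an illustration of the idea, not tagger's own implementation: `tagged` is a hypothetical, hand-built stand-in for `tag_pos` output (a list of named character vectors), and the real `as_word_tag` may differ in its details.

```r
# Hypothetical stand-in for tag_pos() output: tokens are the elements,
# Penn Treebank tags are the names.
tagged <- list(
    c(PRP = "I", VBP = "need", NN = "money"),
    c(DT = "This", VBZ = "works")
)

# Paste each token to its tag with a forward slash, giving unnamed vectors.
word_tags <- lapply(tagged, function(x) paste(unname(x), names(x), sep = "/"))

word_tags[[1]]
```

Each element of `word_tags` is an unnamed character vector of `word/TAG` strings, which mirrors the conversion from a named list of vectors to tag-incorporated vectors.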

### As Tuples

**Python** uses a tuple construction of parts of speech to display tags.  This can be a useful structure.  Essentially the structure is a list of lists of two element vectors.  Each vector contains a word and a part of speech tag.  `as_tuple` uses the following **R** structuring:

list(list(c("word", "tag"), c("word", "tag")), list(c("word", "tag")))

but prints to the console in the **Python** way.  Using `print(as_tuple(x), truncate=Inf, file="out.txt")` allows the user to print to an external file.
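
To make the nesting concrete, here is a minimal base R sketch that builds the same list-of-lists-of-pairs shape by hand. It assumes a hypothetical `tagged` object standing in for `tag_pos` output, and `to_pairs` is an illustrative helper, not a tagger function; the package's `as_tuple` may be implemented differently.

```r
# Hypothetical stand-in for tag_pos() output: tokens are the elements,
# Penn Treebank tags are the names.
tagged <- list(
    c(PRP = "I", VBP = "need", NN = "money"),
    c(DT = "This", VBZ = "works")
)

# Illustrative helper: turn one named vector into a list of c(word, tag) pairs.
# unname() drops the names that Map() would otherwise copy from the tokens.
to_pairs <- function(x) {
    unname(Map(function(word, tag) c(word, tag), unname(x), names(x)))
}

tuples <- lapply(tagged, to_pairs)
str(tuples)
```

Here `tuples[[1]][[2]]` is the two element vector holding the second word of the first sentence and its tag.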

```r
tag_pos(mwe$talk) %>%
    as_tuple() %>%
    print(truncate=Inf)

As Universal Tags

Petrov, Das, & McDonald (2011) provide a mapping to convert Penn Treebank tags into universal part of speech tags. The as_universal function harnesses this mapping.

tag_pos(mwe$talk) %>%
    as_universal()

The output is a tag_pos object and thus has a generic plot method.

tag_pos(mwe$talk) %>%
    as_universal() %>%
    plot()
tag_pos(mwe$talk) %>%
    as_universal() %>%
    count_tags()

As Basic Tags

as_basic provides an even coarser tagset than as_universal. Basic tags include: (a) nouns, (b) adjectives, (c) prepositions, (d) articles, (e) verbs, (f) pronouns, (g) adverbs, (h) interjections, & (i) conjunctions. The X and . tags are retained for unclassified parts of speech and punctuation, respectively.

tag_pos(mwe$talk) %>%
    as_basic()

This tagset can be useful for more coarse purposes, including formality (Heylighen & Dewaele, 2002) scoring.

The output is a tag_pos object and thus has a generic plot method.

tag_pos(mwe$talk) %>%
    as_basic() %>%
    plot()
tag_pos(mwe$talk) %>%
    as_basic() %>%
    count_tags()


trinker/tagger documentation built on May 31, 2019, 10:42 p.m.