```r
library(knitr)
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
```

```r
knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="")
  }
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```
tagger wraps the NLP and openNLP packages for easier part of speech tagging. tagger uses the openNLP annotator to compute "Penn Treebank parse annotations using the Apache OpenNLP chunking parser for English."
The main functions and descriptions are listed in the table below.
| Function | Description |
|--------------------|-----------------------------------------------------|
| `tag_pos` | Tag parts of speech |
| `select_tags` | Select specific part of speech tags from `tag_pos` |
| `count_tags` | Cross tabs of tags by grouping variable |
To download the development version of tagger:
Download the zip ball or tar ball, decompress and run `R CMD INSTALL` on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(c(
    "trinker/termco",
    "trinker/coreNLPsetup",
    "trinker/tagger"
))
You are welcome to:

* submit suggestions and bug-reports at: https://github.com/trinker/tagger/issues
* send a pull request on: https://github.com/trinker/tagger/
* compose a friendly e-mail to: tyler.rinker@gmail.com
The following examples demonstrate some of the functionality of tagger.
library(dplyr); library(tagger)
data(presidential_debates_2012)
mwe <- data_frame(
    person = c("Tyler", "Norah", "Tyler"),
    talk = c(
        "I need $54 to go to the movies.",
        "They refuse to permit us to obtain the refuse permit",
        "This is the tagger package; like it?"
    )
)
Let's begin with a minimal example.
tag_pos(mwe$talk)
Note that the output pretty prints, but the underlying structure is simply a list of named vectors, where the elements in the vectors are the tokens and the names are the part of speech tags. We can use `c` on the object to see its true structure.
tag_pos(mwe$talk) %>% c()
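To make that structure concrete, here is a base R sketch that builds a comparable object by hand. The tokens and tags are typed in for illustration and are not actual `tag_pos` output; only the shape (a list of named character vectors) is taken from the description above.

```r
# Hand-built stand-in for the structure described above: a list with one
# named character vector per input string. Values are tokens, names are tags.
# The tags below are illustrative assumptions, not real tagger output.
tagged <- list(
  c(DT = "This", VBZ = "is", DT = "the", NN = "package"),
  c(PRP = "They", VBP = "refuse")
)

# Tokens are the vector elements...
unname(tagged[[1]])

# ...and the part of speech tags are the names.
names(tagged[[1]])

# Tags repeat within a vector, so subset with a logical test on names()
# rather than by a single name.
tagged[[1]][names(tagged[[1]]) == "DT"]
```

Because the same tag can appear more than once in a sentence, the names are not unique keys. That is why the last line filters on `names()` instead of using `tagged[[1]]["DT"]`, which would return only the first match.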
Let's try it on a larger example, the built-in `presidential_debates_2012` data set. It'll take 30 seconds or so to run, depending on the machine.
tag_pos(presidential_debates_2012$dialogue)
This output is built into tagger as the `presidential_debates_2012_pos` data set, which we'll use from this point on in the demo.
Note that the user may choose to use CoreNLP as a backend by setting `engine = "coreNLP"`. To ensure that coreNLP is set up properly, use `check_setup`.
The user can generate a horizontal barplot of the used tags.
presidential_debates_2012_pos %>% plot()
The tags generated by openNLP are from Penn Treebank. As such there are many tags, more than the few parts of speech we learned in grade school. Remembering the meaning of each tag may be difficult, so the `penn_tags` function creates a left-aligned data frame of the possible tags and their meanings.
penn_tags()
The user can generate a count of the tags by grouping variable as well. The number of columns explodes quickly, even with this minimal example.
tag_pos(mwe$talk) %>% count_tags(mwe$person)
By default the result is pretty printed (counts + proportions); this can be turned off to print raw counts only.
tag_pos(mwe$talk) %>% count_tags(mwe$person) %>% print(pretty = FALSE)
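For readers who want to see what a cross tab of tags by group amounts to, here is a base R sketch. It is not how `count_tags` is implemented; the `tagged` and `person` objects are hypothetical, hand-typed stand-ins for tagged output and a grouping variable.

```r
# Hypothetical tagged output: one named vector per utterance.
tagged <- list(
  c(PRP = "I", VBP = "need", NN = "help"),
  c(PRP = "They", VBP = "refuse"),
  c(DT = "This", VBZ = "is", NN = "tagger")
)
person <- c("Tyler", "Norah", "Tyler")

# Flatten the tags and repeat each speaker once per token so that
# tags and groups line up element by element.
tags  <- unlist(lapply(tagged, names))
group <- rep(person, lengths(tagged))

# Raw counts: one row per group, one column per tag.
counts <- table(group, tags)
counts

# Row-wise proportions: each group's tag counts divided by its token total.
round(prop.table(counts, margin = 1), 2)
```

Every distinct tag becomes its own column, which is why the column count grows so quickly with real text.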
The user may wish to select specific tags. The `select_tags` function enables selection of specific tags via element matching (which can be negated) or regular expression.
Here we select only the nouns.
presidential_debates_2012_pos %>% select_tags(c("NN", "NNP", "NNPS", "NNS"))
This could also have been accomplished with a simpler regex call by setting `regex = TRUE`.
presidential_debates_2012_pos %>% select_tags("NN", regex=TRUE)
In this way we could quickly select the nouns and verbs with the following call.
presidential_debates_2012_pos %>% select_tags("^(VB|NN)", regex=TRUE)
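The difference between element matching and regex matching can be sketched in base R on a single named vector. This is only an illustration of the two matching styles, not the internals of `select_tags`; the vector `x` is a hypothetical, hand-tagged sentence.

```r
# Hypothetical tagged utterance (names are Penn Treebank tags).
x <- c(PRP = "They", VBP = "refuse", TO = "to", VB = "permit",
       PRP = "us", DT = "the", NN = "refuse", NN = "permit")

# Exact element matching against a set of tags.
x[names(x) %in% c("NN", "NNP", "NNPS", "NNS")]

# Regular expression matching: "^(VB|NN)" keeps every tag that starts
# with VB or NN, so VB, VBP, and NN all pass.
x[grepl("^(VB|NN)", names(x))]

# Negating the match drops those tags instead.
x[!grepl("^(VB|NN)", names(x))]
```

The `^` anchor matters: without it, a pattern matches any tag that merely contains the text somewhere in its name.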
Note that the output is of class `tag_pos`, so the plotting, `count_tags`, and `as_word_tag` functions can be used on the result.
presidential_debates_2012_pos %>% select_tags("^(VB|NN)", regex=TRUE) %>% plot()
presidential_debates_2012_pos %>% select_tags("^(VB|NN)", regex=TRUE) %>% count_tags()
The traditional way to display tags is to incorporate them into the sentence, placing them after (or before) their respective tokens, separated by a forward slash (e.g., talk/VB). This is the default printing style of `tag_pos`, though not truly the structure of the output. The user can coerce the underlying structure with the `as_word_tag` function, converting the list of named vectors into a list of unnamed vectors in which each element combines a word with its part of speech tag. Below I print only the first 6 elements of the `as_word_tag` output.
```r
presidential_debates_2012_pos %>% as_word_tag() %>% head()
```

### As Tuples

**Python** uses a tuple construction of parts of speech to display tags. This can be a useful structure. Essentially the structure is a list of lists of two-element vectors. Each vector contains a word and a part of speech tag. `as_tuple` uses the following **R** structuring:

```r
list(list(c("word", "tag"), c("word", "tag")), list(c("word", "tag")))
```

but prints to the console in the **Python** way. Using `print(as_tuple(x), truncate=Inf, file="out.txt")` allows the user to print to an external file.

```r
tag_pos(mwe$talk) %>% as_tuple() %>% print(truncate=Inf)
```
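The word/tag and tuple shapes can both be derived from a list of named vectors with a few lines of base R. The sketch below is only meant to show the two target structures; it is not the code behind `as_word_tag` or `as_tuple`, and `tagged` is a hypothetical, hand-typed input.

```r
# Hypothetical tagged utterances: values are tokens, names are tags.
tagged <- list(
  c(PRP = "I", VBP = "need", NN = "help"),
  c(DT = "This", VBZ = "is", NN = "tagger")
)

# word/tag form: paste each token to its tag, yielding unnamed vectors.
word_tag <- lapply(tagged, function(x) paste(x, names(x), sep = "/"))
word_tag[[1]]

# Tuple form: a list of lists of two-element c(word, tag) vectors.
# Map() would name the result after the words, so unname() strips that.
tuples <- lapply(tagged, function(x) {
  unname(Map(function(word, tag) c(word, tag), unname(x), names(x)))
})
str(tuples[[1]])
```

The tuple form keeps the word and the tag as separate elements, which avoids having to split on `/` later (awkward when a token itself contains a slash).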
Petrov, Das, & McDonald (2011) provide a mapping to convert Penn Treebank tags into universal part of speech tags. The `as_universal` function harnesses this mapping.
tag_pos(mwe$talk) %>% as_universal()
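Conceptually, this kind of conversion is a lookup from fine-grained tags to coarse ones. The base R sketch below illustrates the idea with a small, partial lookup table. The table covers only a handful of tags, the fallback to `X` for unlisted tags is my assumption, and the labels `as_universal` actually emits may differ; `x` is a hypothetical, hand-tagged sentence.

```r
# Partial, illustrative Penn Treebank -> universal tag lookup.
# Only a handful of tags are listed; the full mapping is larger.
penn_to_universal <- c(
  NN = "NOUN", NNS = "NOUN", NNP = "NOUN",
  VB = "VERB", VBZ = "VERB", VBP = "VERB",
  JJ = "ADJ", RB = "ADV", DT = "DET",
  IN = "ADP", PRP = "PRON", CD = "NUM"
)

# Hypothetical tagged utterance; FW is deliberately absent from the lookup.
x <- c(DT = "This", VBZ = "is", DT = "the", NN = "package", FW = "voila")

# Look up each tag; tags missing from the table come back as NA
# and are replaced with "X" (an assumption made for this sketch).
universal <- unname(penn_to_universal[names(x)])
universal[is.na(universal)] <- "X"

# Swap the fine-grained names for the coarse ones; tokens are untouched.
names(x) <- universal
x
```

Since only the names change, the result has the same list-of-named-vectors shape as before.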
The output is a `tag_pos` object and thus has a generic plot method.
tag_pos(mwe$talk) %>% as_universal() %>% plot()
tag_pos(mwe$talk) %>% as_universal() %>% count_tags()
`as_basic` provides an even coarser tagset than `as_universal`. Basic tags include: (a) nouns, (b) adjectives, (c) prepositions, (d) articles, (e) verbs, (f) pronouns, (g) adverbs, (h) interjections, & (i) conjunctions. The `X` and `.` tags are retained for unclassified parts of speech and punctuation, respectively.
tag_pos(mwe$talk) %>% as_basic()
This tagset can be useful for coarser purposes, including formality scoring (Heylighen & Dewaele, 2002).
The output is a `tag_pos` object and thus has a generic plot method.
tag_pos(mwe$talk) %>% as_basic() %>% plot()
tag_pos(mwe$talk) %>% as_basic() %>% count_tags()