
compromiser

The goal of compromiser is to make the elegant compromise JavaScript library available to R. So, what is compromise? Self-described as "modest natural language processing", compromise is an extremely lightweight, easy-to-use, rule-based NLP library for English. At the time of writing, the entire library is 200 kB (smaller than a typical animated GIF). The library is optimized for speed and for being "accurate enough".

Like the original compromise, compromiser is optimized for speed and ease of use. If you know spacyr, the interface will look familiar: compromiser is modeled after spacyr, but prioritizes speed, an out-of-the-box experience, and "good enough" accuracy. As with spacyr, all output formats are tif-compatible.

Caveats:

Installation

devtools::install_github("chainsawriot/compromiser")

Example

Unlike spacyr, udpipe, or openNLP, compromiser requires no post-installation setup. You can start POS tagging a small corpus right away.

library(compromiser)
textdata <- c("The dog has been selectively bred over millennia for various
 behaviors, sensory capabilities, and physical attributes. Dog breeds vary
 widely in shape, size, and color. They perform many roles for humans, such
 as hunting, herding, pulling loads, protection, assisting police and the
 military, companionship, therapy, and aiding disabled people. This influence
 on human society has given them the sobriquet of \"man's best friend.\"",
 "A methane gas explosion and fire in a Siberian coal mine left more than
 50 miners and rescuers dead. Another 239 people were rescued.")

x <- tag(textdata)
x

It gets the job done. Some tricky words are tagged incorrectly, though (e.g., "breeds" and "mine").

as.data.frame(x)

Integration with quanteda

library(quanteda)

Convert the result to a quanteda tokens object.

x_toks <- as.tokens(x)
x_toks

Suppose you are only interested in nouns and adjectives.

tokens_select(x_toks, pattern = c("*/N*", "*/JJ"))
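The patterns above are glob-style wildcards matched against tokens written as word/TAG. The following is a minimal base-R sketch of that matching idea only; it is not how quanteda implements tokens_select, and the toy tokens and their tags are made up for illustration. Note that it matches case-sensitively, whereas tokens_select is case-insensitive by default.

```r
# Hypothetical tagged tokens in word/TAG form (illustrative, not real output)
toks <- c("dog/NN", "breeds/NNS", "vary/VBP", "widely/RB", "physical/JJ")

# The same glob patterns as in the tokens_select() call
patterns <- c("*/N*", "*/JJ")

# Translate each glob into a regular expression
regexes <- utils::glob2rx(patterns)

# Keep a token if it matches at least one of the patterns
keep <- Reduce(`|`, lapply(regexes, grepl, x = toks))
kept <- toks[keep]
kept
```

Here "*/N*" keeps any token whose tag starts with N (all noun tags), while "*/JJ" keeps only tokens tagged exactly JJ.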

Integration with tidytext

There is no dedicated tidytext integration, but the as.data.frame method is probably sufficient.

A more serious example

data_corpus_inaugural, which ships with quanteda, is a corpus of US presidential inaugural speeches.

system.time(inaug <- tag(data_corpus_inaugural))
inaug

POS tagging 138,054 tokens took less than 15 seconds (more than 9,000 tokens per second). With the tagged output and quanteda, we can find the most frequent nouns.

inaug %>%
  as.tokens() %>%
  tokens_select(pattern = "*/NN*") %>%
  dfm() %>%
  topfeatures(n = 30)


chainsawriot/compromiser documentation built on Dec. 19, 2021, 2:59 p.m.