The goal of compromiser is to make the elegant compromise javascript library available in R. So, what is compromise? Self-described as "modest natural language processing", compromise is an extremely lightweight, easy-to-use, rule-based NLP library for English. As of writing, the entire library is 200kB (smaller than a typical animated GIF). The library is optimized for speed and for being "accurate enough".

Like the original compromise, compromiser is optimized for speed and ease of use. If you know spacyr: this package is modeled after spacyr, but trades some accuracy for speed and an out-of-the-box experience. Like spacyr, all output formats are tif-compatible (Text Interchange Format).
Caveats:

- compromiser does POS tagging only

You can install the development version from GitHub:

```r
devtools::install_github("chainsawriot/compromiser")
```
Unlike spacyr, udpipe, or openNLP, compromiser requires no post-installation setup. You can start POS tagging a small corpus right away.
```r
library(compromiser)

textdata <- c("The dog has been selectively bred over millennia for various behaviors, sensory capabilities, and physical attributes. Dog breeds vary widely in shape, size, and color. They perform many roles for humans, such as hunting, herding, pulling loads, protection, assisting police and the military, companionship, therapy, and aiding disabled people. This influence on human society has given them the sobriquet of \"man's best friend.\"",
              "A methane gas explosion and fire in a Siberian coal mine left more than 50 miners and rescuers dead. Another 239 people were rescued.")
x <- tag(textdata)
x
```
It gets the job done, although some tricky words (e.g. *breeds*, *mine*) are tagged incorrectly.
```r
as.data.frame(x)
```
```r
library(quanteda)
```
Convert the tagged object to quanteda tokens:
```r
x_toks <- as.tokens(x)
x_toks
```
Suppose you are only interested in nouns and adjectives.
```r
tokens_select(x_toks, pattern = c("*/N*", "*/JJ"))
```
Other output formats? I don't care much; the `as.data.frame` method is probably sufficient.
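To illustrate what you can do with the data frame output using nothing but base R, here is a minimal sketch. The column names (`doc_id`, `token`, `pos`) are an assumption based on the tif token format; check `names(as.data.frame(x))` on your actual output, and note that `tagged` below is a hypothetical stand-in, not real compromiser output.

```r
## Toy stand-in for as.data.frame(x); real column names may differ.
tagged <- data.frame(
  doc_id = c("text1", "text1", "text1"),
  token  = c("The", "dog", "barks"),
  pos    = c("DT", "NN", "VBZ")
)

## Keep only the nouns (tags beginning with "NN") via base-R subsetting
nouns <- tagged[startsWith(tagged$pos, "NN"), "token"]
nouns
```

The same filter-by-tag logic works for any Penn Treebank tag prefix, so no extra dependency is needed for simple selections.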
`data_corpus_inaugural` is a quanteda corpus of all US inaugural speeches.
```r
system.time(inaug <- tag(data_corpus_inaugural))
inaug
```
It took under 15 seconds to POS-tag 138,054 tokens. With that, and using quanteda, we can study the most frequent nouns.
```r
inaug %>%
  as.tokens() %>%
  tokens_select(pattern = c("*/NN*")) %>%
  dfm() %>%
  topfeatures(n = 30)
```