The goal of compromiser is to make the elegant compromise JavaScript library available to R.
So, what is compromise? Self-described as “modest natural language processing”, compromise is an extremely lightweight, easy-to-use, rule-based NLP library for English. As of writing, the entire library is 200kB (smaller than a typical animated gif file). The library is optimized for speed and being “accurate enough”.
Similar to the original compromise, compromiser is optimized for speed and ease of use. You may know spacyr. The current package is modeled after spacyr, but prioritizes speed, an out-of-the-box experience, and “enough” accuracy. Similar to spacyr, all output formats are tif-compatible.
Caveats:
compromiser does POS tagging only.
Install from GitHub:
devtools::install_github("chainsawriot/compromiser")
Unlike spacyr, udpipe, or openNLP, compromiser requires no post-installation setup. You can start POS tagging a small corpus right away.
library(compromiser)
textdata <- c("The dog has been selectively bred over millennia for various
behaviors, sensory capabilities, and physical attributes. Dog breeds vary
widely in shape, size, and color. They perform many roles for humans, such
as hunting, herding, pulling loads, protection, assisting police and the
military, companionship, therapy, and aiding disabled people. This influence
on human society has given them the sobriquet of \"man's best friend.\"",
"A methane gas explosion and fire in a Siberian coal mine left more than
50 miners and rescuers dead. Another 239 people were rescued.")
x <- tag(textdata)
x
#> Parsed by compromiser
#> n. docs: 2
#> n. tokens: 87
It gets the job done, although some tricky words (e.g. breeds, mine) are tagged incorrectly.
as.data.frame(x)
#> doc_id sent_id token penn tags
#> 1 text1 1 The WDT Determiner
#> 2 text1 1 dog NN Noun, Singular
#> 3 text1 1 has VB Verb, Auxiliary
#> 4 text1 1 been VBD PastTense, Verb
#> 5 text1 1 selectively RB Adverb
#> 6 text1 1 bred VBD PastTense, Verb
#> 7 text1 1 over JJ Adjective
#> 8 text1 1 millennia NN Noun, Singular
#> 9 text1 1 for IN Preposition
#> 10 text1 1 various WDT Determiner
#> 11 text1 2 behaviors NNS Noun, Plural
#> 12 text1 2 sensory NN Singular, Noun
#> 13 text1 2 capabilities NNS Plural, Noun
#> 14 text1 2 and CC Conjunction
#> 15 text1 2 physical JJ Adjective
#> 16 text1 2 attributes NNS Noun, Plural
#> 17 text1 3 Dog NN Noun, Singular
#> 18 text1 3 breeds VBZ PresentTense, Verb
#> 19 text1 3 vary VBP Infinitive, PresentTense, Verb
#> 20 text1 4 widely RB Adverb
#> 21 text1 4 in IN Preposition
#> 22 text1 4 shape NN Noun, Singular
#> 23 text1 4 size NN Singular, Noun
#> 24 text1 4 and CC Conjunction
#> 25 text1 4 color NN Noun, Singular
#> 26 text1 5 They PRP Pronoun, Noun
#> 27 text1 5 perform VBP Infinitive, PresentTense, Verb
#> 28 text1 5 many JJ Adjective
#> 29 text1 5 roles NNS Noun, Plural
#> 30 text1 5 for IN Preposition
#> 31 text1 5 humans NNS Noun, Plural
#> 32 text1 5 such RB Adverb
#> 33 text1 6 as IN Preposition
#> 34 text1 6 hunting VBG Gerund, PresentTense, Verb
#> 35 text1 6 herding VBG Gerund, PresentTense, Verb
#> 36 text1 6 pulling VBG Gerund, PresentTense, Verb
#> 37 text1 6 loads VBZ PresentTense, Verb
#> 38 text1 6 protection NN Singular, Noun
#> 39 text1 6 assisting VBG Gerund, PresentTense, Verb
#> 40 text1 6 police NNS Noun, Plural
#> 41 text1 6 and CC Conjunction
#> 42 text1 6 the WDT Determiner
#> 43 text1 7 military JJ Adjective
#> 44 text1 7 companionship NN Noun, Singular
#> 45 text1 7 therapy NN Noun, Singular
#> 46 text1 7 and CC Conjunction
#> 47 text1 7 aiding NN Noun
#> 48 text1 7 disabled VBD PastTense, Verb
#> 49 text1 7 people NNS Plural, Noun
#> 50 text1 8 This WDT Determiner
#> 51 text1 8 influence VBP Infinitive, PresentTense, Verb
#> 52 text1 9 on IN Preposition
#> 53 text1 9 human NN Noun, Singular
#> 54 text1 9 society NN Noun, Singular
#> 55 text1 9 has VB Verb
#> 56 text1 9 given VB Verb
#> 57 text1 9 them PRP Pronoun, Noun
#> 58 text1 9 the WDT Determiner
#> 59 text1 9 sobriquet NN Noun, Singular
#> 60 text1 9 of IN Preposition
#> 61 text1 9 man's NN Singular, Noun, Possessive
#> 62 text1 9 best JJ Adjective
#> 63 text1 9 friend NN Singular, Noun
#> 64 text2 1 A WDT Determiner
#> 65 text2 1 methane NN Noun, Singular
#> 66 text2 1 gas NN Noun, Singular
#> 67 text2 1 explosion NN Noun, Singular
#> 68 text2 1 and CC Conjunction
#> 69 text2 1 fire NN Noun, Singular
#> 70 text2 1 in IN Preposition
#> 71 text2 1 a WDT Determiner
#> 72 text2 1 Siberian NNP ProperNoun, Noun, Singular
#> 73 text2 1 coal NN Uncountable, Noun
#> 74 text2 1 mine POS Possessive, Noun, Singular
#> 75 text2 1 left VBD PastTense, Verb
#> 76 text2 1 more RB Adverb
#> 77 text2 1 than IN Preposition
#> 78 text2 2 50 CD Cardinal, Value, NumericValue
#> 79 text2 2 miners NNS Noun, Plural
#> 80 text2 2 and CC Conjunction
#> 81 text2 2 rescuers NNS Noun, Plural
#> 82 text2 2 dead JJ Comparable, Adjective
#> 83 text2 3 Another WDT Determiner
#> 84 text2 3 239 CD Cardinal, Value, NumericValue
#> 85 text2 3 people NNS Plural, Noun
#> 86 text2 3 were VB Copula, Verb, PastTense, Auxiliary
#> 87 text2 3 rescued VBD PastTense, Verb
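Because the result is an ordinary data frame, base R subsetting is enough to pull out particular tags. The sketch below does not call compromiser. It rebuilds a few rows of the output above by hand, so `tagged` is only a stand-in for `as.data.frame(x)`, with the `sent_id` and `tags` columns left out for brevity.

```r
# Stand-in for as.data.frame(x): a few rows copied by hand from the output above.
tagged <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2"),
  token  = c("The", "dog", "selectively", "Siberian", "miners"),
  penn   = c("WDT", "NN", "RB", "NNP", "NNS"),
  stringsAsFactors = FALSE
)

# Keep only rows whose Penn tag starts with "NN" (NN, NNS, NNP).
nouns <- tagged[startsWith(tagged$penn, "NN"), ]
nouns$token

# Count the selected tokens per document.
noun_counts <- table(nouns$doc_id)
noun_counts
```

The same subsetting should work on the real data frame as long as it has a `penn` column, as in the output shown above.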
The result can be converted to a quanteda tokens object.
library(quanteda)
x_toks <- as.tokens(x)
x_toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "The/WDT" "dog/NN" "has/VB" "been/VBD"
#> [5] "selectively/RB" "bred/VBD" "over/JJ" "millennia/NN"
#> [9] "for/IN" "various/WDT" "behaviors/NNS" "sensory/NN"
#> [ ... and 51 more ]
#>
#> text2 :
#> [1] "A/WDT" "methane/NN" "gas/NN" "explosion/NN" "and/CC"
#> [6] "fire/NN" "in/IN" "a/WDT" "Siberian/NNP" "coal/NN"
#> [11] "mine/POS" "left/VBD"
#> [ ... and 12 more ]
Suppose you are only interested in nouns and adjectives.
tokens_select(x_toks, pattern = c("*/N*", "*/JJ"))
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "dog/NN" "over/JJ" "millennia/NN" "behaviors/NNS"
#> [5] "sensory/NN" "capabilities/NNS" "physical/JJ" "attributes/NNS"
#> [9] "Dog/NN" "shape/NN" "size/NN" "color/NN"
#> [ ... and 16 more ]
#>
#> text2 :
#> [1] "methane/NN" "gas/NN" "explosion/NN" "fire/NN" "Siberian/NNP"
#> [6] "coal/NN" "miners/NNS" "rescuers/NNS" "dead/JJ" "people/NNS"
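The patterns used here are glob-style wildcards matched against the combined `token/TAG` strings. As a rough illustration of what such patterns select, the sketch below uses base R only and does not call quanteda. The `toks` vector is copied by hand from the output above, and the matching here is case-sensitive, which may differ from the defaults of `tokens_select`.

```r
# A few "token/TAG" strings copied by hand from the output above.
toks <- c("The/WDT", "dog/NN", "over/JJ", "behaviors/NNS", "for/IN", "Siberian/NNP")

# Convert each glob pattern into a regular expression.
patterns <- c("*/N*", "*/JJ")
regexes <- vapply(patterns, utils::glob2rx, character(1))

# Keep a string if it matches at least one of the patterns.
keep <- Reduce(`|`, lapply(regexes, grepl, x = toks))
selected <- toks[keep]
selected
```

Here `"*/N*"` keeps every string whose tag starts with N, while `"*/JJ"` keeps only strings whose tag is exactly JJ.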
I don’t care. But probably the as.data.frame method is sufficient.
data_corpus_inaugural, which ships with quanteda, is a corpus of all US inaugural speeches.
system.time(inaug <- tag(data_corpus_inaugural))
#> user system elapsed
#> 13.923 0.254 11.545
inaug
#> Parsed by compromiser
#> n. docs: 59
#> n. tokens: 138054
It took less than 15 seconds to POS-tag 138054 tokens. Combined with quanteda, this lets us find the most frequent nouns.
inaug %>% as.tokens %>% tokens_select(pattern = c("*/NN*")) %>% dfm %>% topfeatures(n = 30)
#> people/nns government/nnp country/nn world/nn
#> 579 324 300 300
#> government/nn there/nn peace/nn citizens/nns
#> 262 260 259 244
#> power/nn nation/nn time/nn america/nnp
#> 240 239 219 202
#> constitution/nnp nations/nns freedom/nn states/nns
#> 200 190 179 175
#> american/nn united/nnp war/nn states/nnp
#> 169 168 165 159
#> now/nn fellow/nn years/nns justice/nn
#> 158 149 142 142
#> men/nns union/nnp spirit/nn life/nn
#> 139 139 137 136
#> law/nn rights/nns
#> 135 134