The tidystopwords package provides candidate stopwords in more than 100
languages. Its main function is generate_stoplist. Its language argument
accepts a single string or a character vector of language names or language
abbreviations corresponding to those listed by the helper function
list_supported_languages.
The generate_stoplist function comes with three numbered output options:
1 outputs a character vector of unique word forms.
2 outputs a named character vector of word forms. The names denote stop
classes roughly corresponding to parts of speech. Note that, in this output,
the word forms are not unique. For instance, in English stopwords, the form
"that" would occur as a subordinating conjunction as well as a pronoun.
3 (the default) outputs a data frame, where each row represents a combination
of language (columns lang_name and lang_id), word form and word lemma
(columns form and lemma), and several other columns explained below.
The generate_stoplist
output is based on multilingual_stoplist, a data frame that was automatically
extracted from the Universal Dependencies treebanks (henceforth UD).
Universal Dependencies is a framework for cross-linguistically consistent
grammatical annotation. The tidystopwords package uses the UD lemmatization,
universal parts of speech, and universal features to derive an inventory of
stop classes:
abbreviation (e.g. e.g., cf., etc.);
adposition (preposition or postposition, e.g. in and ago);
auxiliary verb (e.g. been, have, must);
conjunction_subordinator (e.g. and, because);
contraction (e.g. n't);
determiner_quantifier (e.g. third, which, both);
interjection (e.g. yes);
particle (e.g. off in take off);
pronominal (function words that act as nouns, e.g. him, it. Pronouns acting
as adjectives (your) and pronominal adverbs (where) are covered by the
determiner_quantifier stop class).
In terms of Universal Dependencies, the stop classes are defined as follows:
abbreviation: ufeat contains Abbr=Yes and upos does not equal NOUN or ADJ;
adposition: upos equals ADP;
auxiliary verb: upos equals AUX;
conjunction_subordinator: upos equals CONJ or SCONJ;
contraction: neither form nor lemma equals _, upos equals _, and the form
has occurred more than twice in the corpus;
determiner_quantifier: either upos equals DET, or ufeat contains PronType
and at least one of the following strings: NumType, Ind, Dem, Int, Rel, Tot,
Neg;
interjection: upos equals INTJ;
particle: upos equals PART;
pronominal: either upos equals PRON, with no restrictions on ufeat, or ufeat
contains PronType and upos does not equal DET.
Each version of this package uses the latest UD release available to generate
the multilingual_stoplist data frame. Therefore, multilingual_stoplist can
differ from version to version. Typically, a new UD release brings larger
annotated corpora and new corpora for additional languages.
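The stop-class definitions above can be read as a set of tests on upos and ufeat. The following base R sketch only illustrates that reading and is not the package's own code. The function name classify_stop_classes is hypothetical, the contraction class is left out because it also needs form, lemma, and corpus frequency, and auxiliary_verb is spelled with an underscore to make a valid R column name. Because one row may satisfy several definitions, the sketch returns one logical column per stop class.

```r
# Hypothetical helper: flags which stop-class definitions each row satisfies.
# upos and ufeat are character vectors of equal length (UD annotation).
classify_stop_classes <- function(upos, ufeat) {
  # Literal substring test on the feature string, as in the definitions.
  has <- function(pattern) grepl(pattern, ufeat, fixed = TRUE)

  det_markers <- c("NumType", "Ind", "Dem", "Int", "Rel", "Tot", "Neg")
  has_marker <- Reduce(`|`, lapply(det_markers, has))

  data.frame(
    abbreviation             = has("Abbr=Yes") & !(upos %in% c("NOUN", "ADJ")),
    adposition               = upos == "ADP",  # ADP is the UD tag for adpositions
    auxiliary_verb           = upos == "AUX",
    conjunction_subordinator = upos %in% c("CONJ", "SCONJ"),
    determiner_quantifier    = upos == "DET" | (has("PronType") & has_marker),
    interjection             = upos == "INTJ",
    particle                 = upos == "PART",
    pronominal               = upos == "PRON" | (has("PronType") & upos != "DET")
  )
}

# Invented example rows: a preposition, "him", "the", an abbreviated noun,
# and a demonstrative pronoun.
ex <- classify_stop_classes(
  upos  = c("ADP", "PRON", "DET", "NOUN", "PRON"),
  ufeat = c("_", "Case=Acc|PronType=Prs", "Definite=Def|PronType=Art",
            "Abbr=Yes", "PronType=Dem")
)
print(ex)
```

In this sketch the last row is flagged as both determiner_quantifier and pronominal, which matches the observation that a word form can belong to more than one stop class.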
All stopword lists in tidystopwords have been generated automatically from the
data available at the time. Hence their quality depends on the size of the
underlying corpora as well as the morphological richness of the given language.
To allow the user to assess the reliability of the stopword list for the given
language, multilingual_stoplist contains relevant frequency information for
each word form in three columns: n_formlemma, n_uposlemma, and n_stopclasses.
The n_formlemma column gives the absolute frequency of the given word form
with the given lemma. The n_uposlemma column gives the absolute frequency of
the given word form with the given lemma and upos.
The n_stopclasses column says in how many stop classes the given word form
with the given lemma occurs. For instance, "that" occurs as
determiner_quantifier (that pie tastes good), pronominal (don't mention that),
and conjunction_subordinator (say that you will do it).
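How the three frequency columns relate to each other can be illustrated on a handful of invented annotated tokens. The sketch below uses base R only. The tokens data frame, its stop_class column, and all counts are made up for the example and do not come from multilingual_stoplist.

```r
# Toy token-level annotations: one row per corpus occurrence (invented data).
tokens <- data.frame(
  form  = c("that", "that", "that", "that", "and", "and"),
  lemma = c("that", "that", "that", "that", "and", "and"),
  upos  = c("DET", "PRON", "SCONJ", "SCONJ", "CONJ", "CONJ"),
  stop_class = c("determiner_quantifier", "pronominal",
                 "conjunction_subordinator", "conjunction_subordinator",
                 "conjunction_subordinator", "conjunction_subordinator"),
  stringsAsFactors = FALSE
)

# n_uposlemma: frequency of each form + lemma + upos combination.
counts <- aggregate(n_uposlemma ~ form + lemma + upos + stop_class,
                    data = transform(tokens, n_uposlemma = 1), FUN = sum)

# n_formlemma: frequency of each form + lemma pair, summed over all upos tags.
counts$n_formlemma <- ave(counts$n_uposlemma, counts$form, counts$lemma,
                          FUN = sum)

# n_stopclasses: number of distinct stop classes per form + lemma pair.
counts$n_stopclasses <- ave(as.integer(factor(counts$stop_class)),
                            counts$form, counts$lemma,
                            FUN = function(x) length(unique(x)))
print(counts)
```

With these toy tokens, "that" gets n_formlemma of 4 and n_stopclasses of 3, while its SCONJ row gets n_uposlemma of 2.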
Even high-quality reference corpora such as the UD treebanks contain tagging
errors and typos. A two-step frequency filter reduces the noise:
1) a word form must occur more than three times with a given lemma;
2) if a word form with a given lemma (counted by n_formlemma) occurs with
several different upos tags (counted by n_uposlemma), only combinations that
represent more than 20% of n_formlemma remain listed.
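The two filtering steps can be sketched on a toy table in base R. The counts below are invented, and the code illustrates the thresholds described above rather than the package's actual implementation.

```r
# Toy frequency table; all counts are invented for illustration only.
toy <- data.frame(
  form        = c("that", "that", "that", "teh"),
  lemma       = c("that", "that", "that", "the"),
  upos        = c("SCONJ", "PRON", "ADV", "DET"),
  n_uposlemma = c(60, 35, 5, 2),
  n_formlemma = c(100, 100, 100, 2),
  stringsAsFactors = FALSE
)

# Step 1: the form must occur more than three times with the given lemma.
step1 <- toy[toy$n_formlemma > 3, ]

# Step 2: keep only upos combinations above 20% of n_formlemma.
kept <- step1[step1$n_uposlemma / step1$n_formlemma > 0.20, ]
print(kept)
```

Here the typo "teh" is dropped by the first step, and the rare ADV reading of "that" (5% of its occurrences) is dropped by the second, leaving the SCONJ and PRON rows.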