generate_stoplist: Listing of stop words in different languages.


generate_stoplist    R Documentation

Listing of stop words in different languages.

Description

Generate a vector of stop words in one or several languages.

Usage

generate_stoplist(language = NULL, output_form = 1)

Arguments

language

A single string or a character vector; NULL by default. The strings can be language names or ISO-639 language codes, as listed by list_supported_languages(). Names and codes can be freely combined and are case-sensitive. When no language is recognized, the function stops with the following error message: "The language name or language id you have selected is not supported. (Or you didn't specify a language at all). Check out the supported languages by calling 'list_supported_languages'.".
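The following sketch illustrates how the language argument might be used in practice. It assumes that the tidystopwords package is installed, that "English" and "de" are spellings accepted by list_supported_languages(), and that the failure is raised as a regular R error; none of these is confirmed on this page.

```r
# Sketch only: runs the calls only if the tidystopwords package is available.
if (requireNamespace("tidystopwords", quietly = TRUE)) {
  library(tidystopwords)

  # Mix a language name with a language code. Both values are assumptions;
  # check list_supported_languages() for the exact, case-sensitive spellings.
  mixed <- generate_stoplist(language = c("English", "de"))
  print(head(mixed))

  # An unrecognized selection is expected to stop the function.
  # tryCatch() intercepts the error and keeps its message instead.
  result <- tryCatch(
    generate_stoplist(language = "english"),  # assumed unsupported: wrong case
    error = function(e) conditionMessage(e)
  )
  print(result)
}
```

Wrapping the call in tryCatch() is one way to keep a batch script running when a language selection turns out to be misspelled.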

output_form

Default 1, alternatively 2 or 3. Option 1 returns a character vector of unique stop word forms. Option 2 returns a named vector whose elements are the stop word forms and whose names are the associated stop classes. One word form can occur with several stop classes; hence the word forms in this vector are not unique, unlike in Option 1. Option 3 returns a data frame filtered according to the language selection.

Value

The function comes with three output options.

  • Option '1' (the default) outputs a character vector of unique word forms.

  • Option '2' outputs a named character vector of word forms. The names denote 'stop classes' roughly corresponding to parts of speech. Note that, in this output, the word forms are not unique. For instance, among the English stopwords, *that* occurs both as a subordinating conjunction and as a pronoun.

  • Option '3' outputs a data frame, where each row represents a combination of language (columns 'lang_name' and 'lang_id'), word form and word lemma (columns 'form' and 'lemma'), and several other columns explained below.
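The relationship between the three output options can be sketched on a toy data frame in base R. This is not the package's implementation: the rows, the column name 'stop_class', and the class labels are illustrative assumptions. Only 'lang_name', 'lang_id', 'form' and 'lemma' are column names taken from this page.

```r
# Toy stand-in for the underlying data frame. The rows and the column
# name 'stop_class' are illustrative assumptions, not package data.
toy <- data.frame(
  lang_name  = c("English", "English", "English", "German"),
  lang_id    = c("en", "en", "en", "de"),
  form       = c("that", "that", "and", "und"),
  lemma      = c("that", "that", "and", "und"),
  stop_class = c("subordinating_conjunction", "pronoun",
                 "coordinating_conjunction", "coordinating_conjunction"),
  stringsAsFactors = FALSE
)

selected <- c("English", "en")

# Option 3 analogue: rows filtered by language name or language code.
form3 <- toy[toy$lang_name %in% selected | toy$lang_id %in% selected, ]

# Option 2 analogue: word forms named by stop class (forms may repeat).
form2 <- setNames(form3$form, form3$stop_class)

# Option 1 analogue: unique word forms only.
form1 <- unique(form3$form)

print(form1)
print(form2)
```

In this toy data, *that* appears twice in the Option 2 analogue (once per stop class) but only once in the Option 1 analogue.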

All outputs are encoded in UTF-8.

Warning

  • The function stops when no language is selected.

  • The stop classes (pre-defined linguistic filters) are not mutually exclusive. Their overlap varies among languages.

  • The stoplists are fully data-driven. We have set a threshold of 3 occurrences of a combination of language, form, lemma, and upos to remove obvious noise, but some noise is bound to have come through anyway. It consists mainly of foreign words that were given a regular upos tag (e.g. the English "and" has slipped in among the German coordinating conjunctions). Another known case is the contraction stop class in English, which, alongside well-suited instances such as *ain't*, includes uses of the so-called Saxon genitive (e.g. *world's*). Many languages are represented by large, balanced corpora of standard written texts, but some are not: their data come mainly from, e.g., a Bible translation or Wikipedia. Hence their stopwords can also be biased.
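Because some noise is expected, users may want to post-filter the result. The sketch below works on a hand-made named vector in the shape of output_form = 2; the class labels "contraction" and "pronoun" are assumed spellings, and the filter is a rough heuristic rather than anything the package provides.

```r
# Toy named vector shaped like output_form = 2. The class names
# "contraction" and "pronoun" are assumed spellings for illustration.
stops <- c(contraction = "ain't", contraction = "world's",
           contraction = "don't", pronoun = "that")

# Flag possessive-looking forms (ending in 's) inside the contraction class.
is_genitive <- names(stops) == "contraction" & grepl("'s$", stops)

# Keep everything that was not flagged.
cleaned <- stops[!is_genitive]
print(cleaned)
```

Note that this heuristic would also drop genuine contractions ending in 's, such as *it's*, so a manually curated exception list may be the safer choice for real data.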

Author(s)

Silvie Cinková, Maciej Eder

References

The underlying data frame 'multilingual_stoplist' is based on the official release of Version 2.8 of Universal Dependencies.

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.

See Also

list_supported_languages, multilingual_stoplist

Examples

generate_stoplist(language = "English", output_form = 1) 

generate_stoplist(language = "English", output_form = 2) 
  
generate_stoplist(language = "English", output_form = 3) 


computationalstylistics/tidystopwords documentation built on April 6, 2024, 10:47 p.m.