multilingual_stoplist: Multilingual Stop-Word List


Description

This dataset is a data frame with one word form per row. Its columns let you select entries by part of speech and by various frequency counts, so that you can build the stop-word list you need.
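As a rough illustration of that kind of selection, the sketch below filters a small toy data frame in base R. The column names (`language`, `form`, `pos`, `frequency`), the values, and the cutoff are all hypothetical stand-ins, not the actual columns of `multilingual_stoplist`; adapt them to the columns listed under Format.

```r
# Toy stand-in for the data set; every column name and value is hypothetical.
stoplist <- data.frame(
  language  = c("English", "English", "English", "Czech"),
  form      = c("the", "of", "dog", "a"),
  pos       = c("DET", "ADP", "NOUN", "CCONJ"),
  frequency = c(5000L, 3000L, 40L, 2500L),
  stringsAsFactors = FALSE
)

# Keep English word forms with the chosen parts of speech
# whose frequency reaches an (arbitrary) cutoff.
function_pos <- c("DET", "ADP", "CCONJ")
english_stop <- subset(
  stoplist,
  language == "English" & pos %in% function_pos & frequency >= 1000L
)

print(english_stop$form)
```

Choosing the parts of speech and the cutoff is a judgment call that depends on the language and the task; the values above are arbitrary.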

Format

A data frame encoded in UTF-8, with the following columns:

Details

This data frame is derived from an official release of the Universal Dependencies (UD) treebanks. Treebanks are text corpora with linguistic annotation, and the UD syntactic annotation follows the principles of dependency syntax. For each text token, the annotation encompasses:
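As background, UD treebanks are distributed as CoNLL-U files, in which each token line carries ten tab-separated fields. The following is a minimal base R sketch of reading one such line; the sample token line is made up for illustration and is not taken from this data set.

```r
# One CoNLL-U token line (made-up example): ten tab-separated fields.
line <- "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"

# Field names as defined by the CoNLL-U format.
conllu_fields <- c("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                   "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# Split the line on tabs and label each piece with its field name.
token <- setNames(strsplit(line, "\t", fixed = TRUE)[[1]], conllu_fields)

# Word form, lemma, and universal part-of-speech tag of the token.
print(token[c("FORM", "LEMMA", "UPOS")])
```

An underscore in a field marks an unspecified value in CoNLL-U.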

Source

The data set is based on the official release of Universal Dependencies version 2.8.1, stored in the LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czech Republic, http://hdl.handle.net/11234/1-3687.

References

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.


tidystopwords documentation built on Oct. 27, 2021, 5:07 p.m.