stopwords: Stopwordlists in German, English, Dutch, French, Polish, and...

stopwords    R Documentation

Stopword Lists in German, English, Dutch, French, Polish, and Arabic

Description

These data sets contain lists of very common words that should be ignored when building a document-term matrix. The stop word lists can be loaded by calling data(stopwords_en), data(stopwords_de), data(stopwords_nl), data(stopwords_ar), etc. The objects stopwords_de, stopwords_en, stopwords_nl, stopwords_ar, etc. must already exist (that is, they must have been loaded) before being passed to textmatrix().
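
As a minimal sketch of the loading step described above, the following builds a small document-term matrix with the English list. The temporary directory, the file names, and the two toy documents are made up for illustration; only data(stopwords_en) and the stopwords argument of textmatrix() come from the lsa package.

```r
library(lsa)

# Illustrative corpus: two tiny documents in a temporary directory
# (the directory, file names, and texts are assumptions for this sketch).
td <- tempfile()
dir.create(td)
writeLines("the cat sat on the mat", file.path(td, "doc1.txt"))
writeLines("a dog and a cat", file.path(td, "doc2.txt"))

# Load the stop word list first, so that the object stopwords_en exists.
data(stopwords_en)

# Pass the loaded vector to textmatrix(); terms found in the list are dropped.
tm <- textmatrix(td, stopwords = stopwords_en)

# The remaining terms are the row names of the matrix.
print(rownames(tm))

# Clean up the temporary files.
unlink(td, recursive = TRUE)
```

Calling textmatrix() without first loading the list would fail, because the object stopwords_en would not yet exist in the workspace.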

The French stopword list was compiled by Haykel Demnati, who merged the list from rank.nl (www.rank.nl/stopwors/french.html), the one from the CLEF team at the University of Neuchatel (http://members.unine.ch/jacques.savoy/clef/frenchST.txt), and the one prepared by Jean Véronis (http://sites.univ-provence.fr/veronis/data/antidico.txt).

The Polish stopword list was contributed by Grazyna Paliwoda-Pekosz, Cracow University of Economics, and is taken from the Polish Wikipedia.

The Arabic stopword list was contributed by Marwa Naili, Tunisia. The list is based on the stopword lists by Shereen Khoja and by Siham Boulaknadel.

Usage

   data(stopwords_de)
   stopwords_de

   data(stopwords_en)
   stopwords_en

   data(stopwords_nl)
   stopwords_nl

   data(stopwords_fr)
   stopwords_fr

   data(stopwords_ar)
   stopwords_ar

Format

A vector containing 424 English, 370 German, 260 Dutch, 890 French, or 434 Arabic stop words (e.g. 'he', 'she', 'a').

Author(s)

Fridolin Wild fridolin.wild@wu-wien.ac.at, Marco Kalz marco.kalz@ou.nl (for Dutch), Haykel Demnati Haykel.Demnati@isg.rnu.tn (for French), Marwa Naili naili.maroua@gmail.com (for Arabic)


lsa documentation built on May 9, 2022, 9:10 a.m.
