stemArabic: Arabic Stemmer for Text Analysis

stemArabic    R Documentation

Arabic Stemmer for Text Analysis

Description

Allows users to stem Arabic texts for text analysis.

Usage

stemArabic(dat, cleanChars = TRUE, cleanLatinChars = TRUE,
    transliteration = TRUE, returnStemList = FALSE,
    defaultStopwordList = TRUE, customStopwordList = NULL,
    dontStemTheseWords = c("allh", "llh"))

Arguments

dat

The original data, as a vector of texts.

cleanChars

If TRUE, removes all Unicode characters except Latin characters and the Arabic alphabet. Default is TRUE.

cleanLatinChars

If TRUE, removes Latin characters. Default is TRUE.

transliteration

If TRUE, transliterates the text into Latin characters following the scheme of transliterate(). Default is TRUE.

returnStemList

If TRUE, also returns a list of the stemmed words (the stemlist element described under Value). Default is FALSE.

defaultStopwordList

If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE.

customStopwordList

Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL.

dontStemTheseWords

Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in Arabic UTF-8. If a term matches an element of this argument at any intermediate point in stemming, that term is not stemmed further. The default is c("allh", "llh") because, in most applications, stemming these common words for "God" causes confusion by reducing them to the string "lh".
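
The "intermediate point" rule for dontStemTheseWords can be illustrated with a small base R sketch. This is not the package's implementation: stem_with_protection(), strip_w() and strip_al() are hypothetical helpers, and the two stripping rules are toy stand-ins for the real prefix and suffix handling. The sketch only shows how checking a protected list before every step stops "allh" from being reduced to "lh".

```r
# Hypothetical helper, NOT part of arabicStemR: apply stemming steps in
# order, but stop as soon as the current form matches a protected word.
stem_with_protection <- function(word, steps, protected) {
  for (step in steps) {
    if (word %in% protected) break  # protected form reached: stem no further
    word <- step(word)
  }
  word
}

# Toy rules on transliterated text (assumptions for illustration only).
strip_w  <- function(w) sub("^w", "", w)   # drop a leading "w"
strip_al <- function(w) sub("^al", "", w)  # drop a leading "al"
steps <- list(strip_w, strip_al)

# With the default protected words, stemming halts at "allh".
stem_with_protection("wallh", steps, protected = c("allh", "llh"))  # "allh"

# With no protected words, the same input is reduced to "lh".
stem_with_protection("wallh", steps, protected = character(0))      # "lh"
```

The check runs before each step rather than only once at the start, which is what lets a word become protected partway through stemming.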

Details

stemArabic prepares texts in Arabic for text analysis by stemming.

Value

stemArabic returns a named list with the following elements:

text

The stemmed text

stemlist

A list of the stemmed words.

Author(s)

Rich Nielsen

Examples

## generate some text in Arabic
x <- "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647
     \u0627\u0644\u0631\u062D\u0645\u0646 
     \u0627\u0644\u0631\u062D\u064A\u0645"

## inspect
print(x)

## stem and transliterate
stemArabic(x)

## stem while not stemming certain words
stemArabic(x, dontStemTheseWords = c("alr7mn"))

## stem and return the stemlist
out <- stemArabic(x, returnStemList = TRUE)
out$text
out$stemlist
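
The examples above do not exercise the stopword or transliteration arguments. The following sketch shows how they might be combined, using only arguments listed under Usage. It assumes the arabicStemR package is installed. The choice of stopword is arbitrary, and no output is shown because the result depends on the package's stopword and stemming rules.

```r
library(arabicStemR)

## the same Arabic text as above, written on a single line
x <- "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062D\u0645\u0646 \u0627\u0644\u0631\u062D\u064A\u0645"

## stem but keep the result in Arabic script
out_ar <- stemArabic(x, transliteration = FALSE)

## skip the default stopword list and remove only a user-supplied word,
## given here in Arabic UTF-8 (an arbitrary choice for illustration)
out_custom <- stemArabic(x, defaultStopwordList = FALSE,
                         customStopwordList = c("\u0628\u0633\u0645"))

## inspect both results
print(out_ar)
print(out_custom)
```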

arabicStemR documentation built on July 18, 2022, 9:06 a.m.