removeSuffixes: Remove Arabic suffixes

View source: R/stemmer.R

removeSuffixes    R Documentation

Remove Arabic suffixes

Description

Removes a fixed set of Arabic suffixes from a Unicode string. The suffixes, in order of removal, are: "ha-alif", "alif-nun", "alif-ta", "waw-nun", "yah-nun", "yah-heh", "yah-ta marbutta", "heh", "ta marbutta", and "yah". Words are delimited by spaces. A suffix is removed from a word only if the word has the number of letters required by the corresponding argument (x1 through x10), so that the remaining stem is not too short. At most one suffix is removed from each word.
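The rule described above can be sketched in base R. This is only an illustration, not the package's implementation: the names sketch_suffixes, sketch_min_len, strip_one_suffix and sketch_remove_suffixes are hypothetical, the Unicode spellings of the suffixes are inferred from their names, and the length check is assumed to be "word has at least this many letters".

```r
# Hypothetical sketch of the documented rule; NOT the arabicStemR source.
# Suffixes in the documented order of removal (code points inferred from the names).
sketch_suffixes <- c(
  "\u0647\u0627",  # ha-alif
  "\u0627\u0646",  # alif-nun
  "\u0627\u062a",  # alif-ta
  "\u0648\u0646",  # waw-nun
  "\u064a\u0646",  # yah-nun
  "\u064a\u0647",  # yah-heh
  "\u064a\u0629",  # yah-ta marbutta
  "\u0647",        # heh
  "\u0629",        # ta marbutta
  "\u064a"         # yah
)

# Documented defaults for x1 ... x10.
sketch_min_len <- c(4, 4, 4, 4, 4, 4, 4, 3, 3, 3)

# Strip at most one suffix from a single word.
strip_one_suffix <- function(word, suffixes = sketch_suffixes,
                             min_len = sketch_min_len,
                             dontstem = character(0)) {
  # Protected words are returned unchanged.
  if (word %in% dontstem) return(word)
  for (i in seq_along(suffixes)) {
    suf <- suffixes[i]
    # Assumption: the threshold applies to the length of the whole word.
    if (nchar(word) >= min_len[i] && endsWith(word, suf)) {
      # Return immediately so that only one suffix is ever removed.
      return(substr(word, 1, nchar(word) - nchar(suf)))
    }
  }
  word
}

# Apply the rule to every space-delimited word in a string.
sketch_remove_suffixes <- function(text, dontstem = character(0)) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  stems <- vapply(words, strip_one_suffix, character(1),
                  dontstem = dontstem, USE.NAMES = FALSE)
  paste(stems, collapse = " ")
}

# A seven-letter word ending in "yah-ta marbutta" loses that suffix.
sketch_remove_suffixes("\u0627\u0644\u0639\u0631\u0628\u064a\u0629")
```

Returning as soon as the first matching suffix is found is what keeps the sketch to one removal per word; the order of sketch_suffixes therefore matters, with the two-letter suffixes tried before the one-letter ones.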

Usage

removeSuffixes(texts, x1 = 4, x2 = 4, x3 = 4, x4 = 4,
  x5 = 4, x6 = 4, x7 = 4, x8 = 3, x9 = 3, x10 = 3,
  dontstem = c('\u0627\u0644\u0644\u0647', '\u0644\u0644\u0647'))

Arguments

texts

An Arabic-language string in Unicode.

x1

The number of letters that must be in a word for the function to remove the suffix "ha-alif".

x2

The number of letters that must be in a word for the function to remove the suffix "alif-nun".

x3

The number of letters that must be in a word for the function to remove the suffix "alif-ta".

x4

The number of letters that must be in a word for the function to remove the suffix "waw-nun".

x5

The number of letters that must be in a word for the function to remove the suffix "yah-nun".

x6

The number of letters that must be in a word for the function to remove the suffix "yah-heh".

x7

The number of letters that must be in a word for the function to remove the suffix "yah-ta marbutta".

x8

The number of letters that must be in a word for the function to remove the suffix "heh".

x9

The number of letters that must be in a word for the function to remove the suffix "ta marbutta".

x10

The number of letters that must be in a word for the function to remove the suffix "yah".

dontstem

Words that should not be stemmed (entered as Unicode escapes).

Value

Returns a string with Arabic suffixes removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters

x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

# Remove Suffixes

removeSuffixes(x)
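
The length thresholds and the protected-word list can also be set explicitly. The calls below are a sketch that uses only the arguments listed under Usage; the specific values (5 letters, and the extra protected word) are arbitrary choices for illustration, and the code is skipped if arabicStemR is not installed.

```r
# Same example string as above.
x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

if (requireNamespace("arabicStemR", quietly = TRUE)) {
  # Require 5 letters (illustrative value) before removing the
  # single-letter suffixes "heh", "ta marbutta", and "yah".
  print(arabicStemR::removeSuffixes(x, x8 = 5, x9 = 5, x10 = 5))

  # Protect one more word from stemming, in addition to the two defaults.
  keep <- c('\u0627\u0644\u0644\u0647', '\u0644\u0644\u0647',
            '\u062c\u0645\u064a\u0644\u0629')
  print(arabicStemR::removeSuffixes(x, dontstem = keep))
}
```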


arabicStemR documentation built on July 18, 2022, 9:06 a.m.