removePrefixes: Remove Arabic prefixes

removePrefixes    R Documentation

Remove Arabic prefixes

Description

Removes a fixed set of Arabic prefixes from a Unicode string. The prefixes are "waw", "alif-lam", "waw-alif-lam", "ba-alif-lam", "kaf-alif-lam", "fa-alif-lam", and "lam-lam". A prefix is removed from a word (words are delimited by spaces) only if the word is long enough, as set by the arguments x1 through x7, that the remaining stem would not be too short.
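The length rule can be illustrated with a minimal base-R sketch. It is not the package's implementation: `strip_prefix_sketch` and its arguments are hypothetical names, it handles a single prefix, and it assumes the threshold is a minimum on the letter count of the whole word, prefix included.

```r
# Hypothetical helper, not part of arabicStemR.
# Strips `prefix` from each space-delimited word, but only from words that
# have at least `min_len` characters and are not listed in `keep`.
strip_prefix_sketch <- function(text, prefix, min_len, keep = character(0)) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  has_prefix <- substring(words, 1, nchar(prefix)) == prefix
  eligible <- has_prefix & nchar(words) >= min_len & !(words %in% keep)
  # Drop the prefix characters from eligible words only
  words[eligible] <- substring(words[eligible], nchar(prefix) + 1)
  paste(words, collapse = " ")
}

alif_lam <- "\u0627\u0644"                         # prefix "alif-lam"
x <- "\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0644\u0647"

# The first word (5 letters) loses its prefix; the second is protected by `keep`
print(strip_prefix_sketch(x, alif_lam, min_len = 4,
                          keep = "\u0627\u0644\u0644\u0647"))
```

Raising `min_len` in this sketch makes stripping more conservative, which is the role the threshold arguments are described as playing in `removePrefixes`.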

Usage

removePrefixes(texts, x1 = 4, x2 = 4, x3 = 5, x4 = 5, x5 = 5, x6 = 5, x7 = 4,
  dontstem = c('\u0627\u0644\u0644\u0647', '\u0644\u0644\u0647'))

Arguments

texts

An Arabic-language string in Unicode.

x1

The number of letters that must be in a word for the function to remove the prefix "waw".

x2

The number of letters that must be in a word for the function to remove the prefix "alif-lam".

x3

The number of letters that must be in a word for the function to remove the prefix "waw-alif-lam".

x4

The number of letters that must be in a word for the function to remove the prefix "ba-alif-lam".

x5

The number of letters that must be in a word for the function to remove the prefix "kaf-alif-lam".

x6

The number of letters that must be in a word for the function to remove the prefix "fa-alif-lam".

x7

The number of letters that must be in a word for the function to remove the prefix "lam-lam".

dontstem

Words that should not be stemmed (entered in Unicode).
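As a quick reference, the threshold arguments can be laid out next to their prefixes. The table below is only a sketch: `prefix_thresholds` is a hypothetical name, the code points are the standard Unicode letters for each prefix, and the `min_stem` column assumes each threshold counts the letters of the whole word, prefix included.

```r
# Hypothetical lookup table, not an object exported by arabicStemR.
prefix_thresholds <- data.frame(
  argument = paste0("x", 1:7),
  name = c("waw", "alif-lam", "waw-alif-lam", "ba-alif-lam",
           "kaf-alif-lam", "fa-alif-lam", "lam-lam"),
  prefix = c("\u0648", "\u0627\u0644", "\u0648\u0627\u0644", "\u0628\u0627\u0644",
             "\u0643\u0627\u0644", "\u0641\u0627\u0644", "\u0644\u0644"),
  default = c(4, 4, 5, 5, 5, 5, 4),   # defaults from the Usage section
  stringsAsFactors = FALSE
)

# Letters left in the stem when a word is exactly at its threshold
# (valid only under the whole-word assumption stated above)
prefix_thresholds$min_stem <-
  prefix_thresholds$default - nchar(prefix_thresholds$prefix)

print(prefix_thresholds[, c("argument", "name", "default", "min_stem")])
```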

Value

Returns a string with Arabic prefixes removed.

Author(s)

Rich Nielsen

Examples

## Create string with Arabic characters

x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'

## Remove prefixes

removePrefixes(x)


arabicStemR documentation built on July 18, 2022, 9:06 a.m.