RmTokenizer: Remove Punctuation Characters and Tokenization

View source: R/RmTokenizer.R

Description

Removes all punctuation characters from a given text and, optionally, splits it into separate words.

Usage

RmTokenizer(text, intrnt = TRUE, split = FALSE, rmBlank = FALSE)

Arguments

text

an object containing the text, such as a character string, a matrix, or a list (see Examples).

intrnt

logical. If TRUE, one of the two languages is a right-to-left language, so an internet connection is necessary.

split

logical. If TRUE, the text is split into separate words.

rmBlank

logical. If TRUE, blank words in the text are removed, except for blanks located at the beginning or end of a sentence.

Details

This function also treats numbers as separate words.

Suppose a text contains several sentences. Applying this function to the whole text returns a character string; with split = TRUE, the result is a list with one component. To tokenize by sentence, first split the text into sentences and then apply RmTokenizer to each sentence. (For more details, see the examples below.)
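
The sentence-by-sentence workflow described above can be sketched in base R without the package. This is only an approximation for illustration: `tokenize_sketch` is a hypothetical stand-in written for this example, not RmTokenizer itself, and it assumes that punctuation is replaced by blanks before the text is split on whitespace.

```r
# Hypothetical stand-in for the tokenizer; NOT the package's implementation.
tokenize_sketch <- function(s) {
  s <- gsub("[[:punct:]]", " ", s)   # replace each punctuation character with a blank
  strsplit(trimws(s), " +")[[1]]     # split on runs of blanks
}

m2 <- "It is based on IBM models. The problem is that we do not have word-aligned data"

# Step 1: split the text into sentences at each full stop.
sentences <- trimws(strsplit(m2, "[.]")[[1]])

# Step 2: tokenize each sentence separately, giving one character vector per sentence.
tokens_by_sentence <- lapply(sentences, tokenize_sketch)
print(tokens_by_sentence)
```

Because each sentence is processed on its own, the result is a list with one component per sentence rather than a single component for the whole text.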

Value

A character string, a character vector, or a list of character vectors. If the input is a matrix and split = TRUE, the output is a list of character vectors. (For more details, see the examples below.)

Note

When RmTokenizer removes punctuation, it leaves blanks in place of the removed characters. To remove these blanks, set rmBlank = TRUE. Note, however, that this function cannot remove blanks at the beginning or end of each object.
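
The effect described in this note can be reproduced with a few base R calls. The sketch below is an approximation, not the package's code: it assumes each punctuation character is replaced by a single blank, and it uses `trimws` as one possible way for the user to drop the leading and trailing blanks afterwards.

```r
text <- "(For more details, see examples below.)"

# Replacing punctuation with blanks leaves extra blanks inside and at both ends.
no_punct <- gsub("[[:punct:]]", " ", text)

# Collapsing runs of blanks cleans up the interior,
# but one blank remains at the beginning and one at the end.
collapsed <- gsub(" +", " ", no_punct)

# Post-processing step (an assumption, not part of RmTokenizer): trim both ends.
trimmed <- trimws(collapsed)

print(collapsed)
print(trimmed)
```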

Author(s)

Neda Daneshgar and Majid Sarmad.

See Also

MC_tokenizer (tm), whitespace_tokenizer (NLP)

Examples

# Three sentences stored as the rows of a one-column matrix.
b1 = 'Word-alignment is used by phrase-based systems to extract phrase pairs[2].'
b2 = 'There are several methods for word alignment: generative and discriminative.'
b3 = 'For the first time, Brown et. al [1] introduced IBM models.'

m1 = matrix(c(b1, b2, b3))
RmTokenizer(m1, intrnt = FALSE)
RmTokenizer(m1, split = TRUE, intrnt = FALSE)
RmTokenizer(m1, rmBlank = TRUE, intrnt = FALSE)
RmTokenizer(m1, rmBlank = TRUE, split = TRUE, intrnt = FALSE)

# m2 is a text with multiple sentences.
m2 = 'It is based on IBM models. The problem is that we do not have word-aligned data'
RmTokenizer(m2, intrnt = FALSE)               # It is one vector.
RmTokenizer(m2, split = TRUE, intrnt = FALSE) # It is a list.

# Split m2 into sentences first, then tokenize.
s1 = strsplit(m2, '[.]')
RmTokenizer(s1, intrnt = FALSE) # It is a list with 3 objects

# l1 is a list.
l1 = list('A vector or a list of characters.', '(For more details, see examples below.)')

RmTokenizer(l1, intrnt = FALSE)               # It is not a list.
RmTokenizer(l1, split = TRUE, intrnt = FALSE) # It is a list.
RmTokenizer(l1, split = TRUE, rmBlank = TRUE, intrnt = FALSE) # The blank at the beginning is not removed.

word.alignment documentation built on May 2, 2019, 4:58 p.m.