Initial Preparations of Bitext before Word Alignment and Evaluation

Description

For a given sentence-aligned parallel corpus, it prepares sentence pairs as input for the word_alignIBM1 and Evaluation1 functions in this package.

Usage

prepareData(file1, file2, 
           nrec = -1, minlen = 5, maxlen = 40, 
           ul_s = FALSE, ul_t = TRUE, all = FALSE, 
           removePt = TRUE, word_align = TRUE)

Arguments

file1

the name of the source language file.

file2

the name of the target language file.

nrec

the number of sentences to be read. If -1, all sentences are read.

minlen

the minimum sentence length; shorter sentences are removed.

maxlen

the maximum sentence length; longer sentences are removed.

ul_s

logical. If TRUE, it converts the first character of each source language sentence to lowercase. When the source language uses the Arabic script, which has no letter case, it can be FALSE.

ul_t

logical. If TRUE, it converts the first character of each target language sentence to lowercase. When the target language is written right-to-left, it can be FALSE.

all

logical. If TRUE, it sets the third argument of the culf function (lower = TRUE), so that the whole text is converted to lowercase.

removePt

logical. If TRUE, it removes all punctuation marks.

word_align

logical. If FALSE, it divides each sentence into its words. The results can be used in the Symmetrization, fix.gold, consExcel and Evaluation1 functions.

Details

It balances the source and target languages as much as possible. For example, it removes extra blank sentences and equalizes the sentence pairs. Using the culf function, it also converts the first letter of each sentence to lowercase. Finally, it removes sentences that are shorter than minlen or longer than maxlen.
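
The length filtering described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names count_words and filter_pairs and the toy sentences are hypothetical, and it assumes that length is counted in whitespace-separated words, that the limits are inclusive, and that a pair is kept only when both sides pass.

```r
# Hypothetical toy corpus (not data from the package).
src <- c("Das ist ein kleines Haus .", "", "Ja .",
         "Ich habe heute ein sehr gutes Buch gelesen .")
tgt <- c("This is a small house .", "", "Yes .",
         "I read a very good book today .")

# Count whitespace-separated words in each sentence (assumed length measure).
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Keep only pairs where both sides have between minlen and maxlen words.
# Blank sentences have zero words, so they are dropped as well.
filter_pairs <- function(src, tgt, minlen = 5, maxlen = 40) {
  stopifnot(length(src) == length(tgt))
  n_src <- count_words(src)
  n_tgt <- count_words(tgt)
  keep <- n_src >= minlen & n_src <= maxlen &
          n_tgt >= minlen & n_tgt <= maxlen
  cbind(source = src[keep], target = tgt[keep])
}

pairs <- filter_pairs(src, tgt)
```

Here the blank pair and the two-word pair are dropped, leaving a two-column character matrix with one row per remaining sentence pair.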

Value

A list.

if word_align = TRUE

len1

An integer.

aa

A matrix (n*2), where n is the number of sentence pairs remaining after preprocessing.

if word_align = FALSE

initial

An integer.

used

An integer.

source.tok

A list of words for each source sentence.

target.tok

A list of words for each target sentence.
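
To illustrate the shape of source.tok and target.tok, the following sketch splits sentences into words with base R's strsplit. This is an assumption for illustration only; the package may tokenize differently (see RmTokenizer), and the sentences are made up.

```r
# Hypothetical preprocessed sentences (not data from the package).
sentences <- c("this is a small house", "i read a good book")

# One character vector of words per sentence, collected in a list.
tok <- strsplit(sentences, " ", fixed = TRUE)

length(tok)   # number of sentences
lengths(tok)  # number of words in each sentence
tok[[1]]      # words of the first sentence
```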

Note

If the parallel corpus contains only a few proper nouns, we suggest setting all = TRUE to convert all text to lowercase.
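
The difference between the two lowercasing modes can be sketched in base R. The helper lower_first below is hypothetical and only mimics the described behaviour; it is not the package's culf function.

```r
# Hypothetical helper: lowercase only the first character of each sentence.
lower_first <- function(x) {
  paste0(tolower(substr(x, 1, 1)), substring(x, 2))
}

s <- "The committee met in Brussels"

first_only <- lower_first(s)  # proper nouns inside the sentence keep their case
everything <- tolower(s)      # every character is lowercased
```

With only the first letter converted, "Brussels" stays capitalized and remains distinct from any lowercase word form; converting everything removes that distinction, which is why it is suggested mainly for corpora with few proper nouns.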

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Koehn P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.

See Also

Evaluation1, culf, RmTokenizer, word_alignIBM1

Examples

## Not run: 

# Read the first 20 sentence pairs; word_align = TRUE by default
aa1 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, ul_s = TRUE)

# word_align = FALSE: each sentence is divided into its words
aa2 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, ul_s = TRUE, word_align = FALSE)

# removePt = FALSE: punctuation marks are kept
aa3 = prepareData('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                  nrec = 20, ul_s = TRUE, removePt = FALSE)

## End(Not run)