prepare.data: Initial Preparations of Bitext before the Word Alignment and...
In word.alignment: Computing Word Alignment Using IBM Model 1 (and Symmetrization) for a Given Parallel Corpus and Its Evaluation


For a given sentence-aligned parallel corpus, it prepares sentence pairs as input for align.ibm1 and the evaluation functions in this package.

prepare.data(file.sorc, file.trgt, n = -1L, 
             encode.sorc = 'unknown' , encode.trgt = 'unknown', 
             min.len = 5, max.len = 40, remove.pt = TRUE, word.align = TRUE)

`file.sorc`	the name of source language file.
`file.trgt`	the name of target language file.
`n`	the number of sentences to be read. If -1, all sentences are read.
`encode.sorc`	encoding to be assumed for the source language. If the value is "latin1" or "UTF-8" it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details please see `scan` function.
`encode.trgt`	encoding to be assumed for the target language. If the value is "latin1" or "UTF-8" it is used to mark character strings as known to be in Latin-1 or UTF-8. For more details please see `scan` function.
`min.len`	a minimum length of sentences.
`max.len`	a maximum length of sentences.
`remove.pt`	logical. If TRUE, it removes all punctuation marks.
`word.align`	logical. If FALSE, it divides each sentence into its words. Results can be used in `align.symmet`, `cross.table`, `align.test` and `evaluation` functions.
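
The effect of `remove.pt` and of dividing a sentence into its words can be illustrated with base R alone. This is only a sketch under assumptions: the helper `tokenize_sentence` is hypothetical and not part of the package, and the package's own punctuation handling and word splitting may differ in detail.

```r
# Hypothetical helper (not part of word.alignment): optionally strip
# punctuation, then split a sentence into words on whitespace.
tokenize_sentence <- function(sentence, remove.pt = TRUE) {
  if (remove.pt) {
    # [[:punct:]] matches punctuation characters
    sentence <- gsub("[[:punct:]]", "", sentence)
  }
  # Trim leading/trailing blanks, then split on runs of whitespace
  strsplit(trimws(sentence), "[[:space:]]+")[[1]]
}

tokenize_sentence("resumption of the session .")
tokenize_sentence("resumption of the session .", remove.pt = FALSE)
```

With `remove.pt = FALSE` in this sketch, the final full stop survives as a token of its own.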

It balances the source and target languages as much as possible. For example, it removes extra blank sentences and equalizes the sentence pairs. Using the nfirst2lower function, it also converts the first letter of each sentence to lowercase. Finally, it removes sentences that are shorter than min.len or longer than max.len.
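
The kind of filtering described above can be sketched in base R. This is an illustration under assumptions, not the package's code: `filter_pairs` is a hypothetical helper, sentence length is assumed to be counted in whitespace-separated words, the length bounds are assumed to be inclusive, and a pair is assumed to be dropped when either side falls outside the bounds.

```r
# Hypothetical sketch (not part of word.alignment) of length filtering
# and first-letter lowercasing for sentence pairs.
filter_pairs <- function(sorc, trgt, min.len = 5, max.len = 40) {
  stopifnot(length(sorc) == length(trgt))
  # Number of whitespace-separated words per sentence (0 for blank lines)
  n_words <- function(x) lengths(strsplit(trimws(x), "[[:space:]]+"))
  len_s <- n_words(sorc)
  len_t <- n_words(trgt)
  # Assumption: keep a pair only if both sides lie within [min.len, max.len]
  keep <- len_s >= min.len & len_s <= max.len &
          len_t >= min.len & len_t <= max.len
  # Lowercase only the first character of each sentence
  lower_first <- function(x) paste0(tolower(substring(x, 1, 1)), substring(x, 2))
  cbind(source = lower_first(sorc[keep]), target = lower_first(trgt[keep]))
}

sorc <- c("Das ist ein kurzer Beispielsatz", "", "Ja")
trgt <- c("This is a short example sentence", "", "Yes")
filter_pairs(sorc, trgt)
```

In this toy input, the blank pair and the one-word pair are both dropped by the length bounds, so a single row remains.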

A list.

if word.align = TRUE

`len1`	An integer.
`aa`	A matrix (n*2), where n is the number of sentence pairs remaining after preprocessing.

otherwise,

`initial`	An integer.
`used`	An integer.
`source.tok`	A list of words for each source sentence.
`target.tok`	A list of words for each target sentence.

Note that if there are few proper nouns in the parallel corpus, we suggest setting all=TRUE to convert all text to lowercase.

Neda Daneshgar and Majid Sarmad.

Koehn P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.

evaluation, nfirst2lower, align.ibm1, scan

# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily made available at
# http://www.um.ac.ir/~sarmad/... .
## Not run: 

# Default settings (word.align = TRUE, punctuation removed)
aa1 = prepare.data('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                   'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                   n = 20, encode.sorc = 'UTF-8')

# word.align = FALSE: each sentence is divided into its words
aa2 = prepare.data('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                   'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                   n = 20, encode.sorc = 'UTF-8', word.align = FALSE)

# remove.pt = FALSE: punctuation marks are kept
aa3 = prepare.data('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                   'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                   n = 20, encode.sorc = 'UTF-8', remove.pt = FALSE)

## End(Not run)