preparData: Initial Preparations of Bitext before Word Alignment and...
In word.alignment: Finding Word Alignment Using IBM model 1 for a Given Parallel Corpus and Its Evaluation

For a given sentence-aligned parallel corpus, it prepares sentence pairs as input for the word_alignIBM1 and Evaluation1 functions in this package.

preparData (file1, file2, 
            nrec = -1, minlen = 5, maxlen = 40, 
            ul_s = FALSE, ul_t = TRUE, all = FALSE, intrnt = TRUE)

`file1`	the name of the source language file.
`file2`	the name of the target language file.
`nrec`	the number of sentences to be read. If -1, all sentences are read.
`minlen`	the minimum sentence length.
`maxlen`	the maximum sentence length.
`ul_s`	logical. If TRUE, the first character of each source-language sentence is converted to lowercase. When the source language is written right-to-left, it can be FALSE.
`ul_t`	logical. If TRUE, the first character of each target-language sentence is converted to lowercase. When the target language is written right-to-left, it can be FALSE.
`all`	logical. If TRUE, the third argument of the culf function is set (lower = TRUE), so that the whole text is converted to lowercase.
`intrnt`	logical. TRUE means that one of the two languages is written right-to-left, so an internet connection is necessary.

It balances the source and target sides as much as possible. For example, it removes extra blank sentences and equalizes the sentence pairs. It also removes long sentences to save time, converts the first letter of each sentence to lowercase using the culf function, and removes all punctuation characters using the RmTokenizer function. Moreover, if word_align = FALSE, this function divides each sentence into its words.
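
The steps described above can be approximated in base R. The following is only a rough sketch and not the package's implementation: `sketch_prepare`, `n_read` and `n_kept` are hypothetical names, sentence length is assumed to be counted in words, and plain regular expressions stand in for culf and RmTokenizer.

```r
# Hypothetical helper (not part of word.alignment): imitates the
# preprocessing steps in base R under the assumptions stated above.
sketch_prepare <- function(src, tgt, minlen = 5, maxlen = 40) {
  stopifnot(length(src) == length(tgt))

  tokenize <- function(x) {
    # Lowercase only the first character of each sentence.
    x <- paste0(tolower(substr(x, 1, 1)), substring(x, 2))
    # Replace punctuation with spaces, then collapse repeated whitespace.
    x <- gsub("[[:punct:]]+", " ", x)
    x <- trimws(gsub("[[:space:]]+", " ", x))
    # Split each sentence into its words; "" yields character(0).
    strsplit(x, " ", fixed = TRUE)
  }

  s  <- tokenize(src)
  tg <- tokenize(tgt)

  # Assumption: length is measured in words; blank sentences have length 0.
  n_s <- lengths(s)
  n_t <- lengths(tg)
  keep <- n_s >= minlen & n_s <= maxlen & n_t >= minlen & n_t <= maxlen

  list(n_read = length(src), n_kept = sum(keep),
       source.tok = s[keep], target.tok = tg[keep])
}

src <- c("Hello, World!", "", "Resumption of the session is declared today.")
tgt <- c("Bonjour, le monde!", "Vide",
         "La reprise de la session est declaree aujourd'hui.")

res <- sketch_prepare(src, tgt, minlen = 2)
str(res)
```

A pair is dropped when either side falls outside the length limits, which keeps the two token lists the same length.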

A list.

`initial`	An integer.
`used`	An integer.
`source.tok`	A list of words for each source sentence.
`target.tok`	A list of words for each target sentence.
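
As an illustration of how the token lists might be inspected, the sketch below builds a small hand-made list with the same `source.tok` and `target.tok` element names. The data are invented for the example and are not output of preparData.

```r
# Hand-made stand-in for a result; the sentences are invented.
res <- list(
  source.tok = list(c("das", "ist", "ein", "haus"), c("danke")),
  target.tok = list(c("this", "is", "a", "house"), c("thank", "you"))
)

# Both sides should hold the same number of sentences.
stopifnot(length(res$source.tok) == length(res$target.tok))

# Words per sentence on each side.
src_len <- lengths(res$source.tok)
tgt_len <- lengths(res$target.tok)

# Vocabulary of the source side.
src_vocab <- sort(unique(unlist(res$source.tok)))

data.frame(src_len, tgt_len)
```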

Note that if your text does not contain many proper nouns, we suggest setting all = TRUE to convert all text to lowercase.
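
The difference between the two lowercasing modes can be seen on a single sentence. This sketch uses base R string functions as a stand-in and does not call culf itself; the sentence is invented.

```r
s <- "Yesterday Mr Smith visited Paris"

# Lowercase only the first character: proper nouns keep their capitals.
first_only <- paste0(tolower(substr(s, 1, 1)), substring(s, 2))

# Lowercase everything: proper nouns are lowercased too.
everything <- tolower(s)

first_only  # "yesterday Mr Smith visited Paris"
everything  # "yesterday mr smith visited paris"
```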

Neda Daneshgar and Majid Sarmad.

Koehn, P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.

Evaluation1, culf, RmTokenizer, word_alignIBM1

## Not run: 

aa1 = preparData ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                  'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                   nrec = 20, intrnt = FALSE)

## End(Not run)