word_alignIBM1: Finding One-to-Many Word Alignment Using IBM Model 1 for a...

View source: R/word_alignIBM1.R

Description

For a given sentence-aligned parallel corpus, this function aligns the words in each sentence pair. It also calculates the expected sentence length and the vocabulary size of each language (source and target) and returns the word translation probabilities as a data.table.

Usage

word_alignIBM1(file_train1, file_train2, 
               nrec = -1, iter = 4, minlen = 5, 
               maxlen = 40, ul_s = FALSE, ul_t = TRUE, 
               intrnt = TRUE, display = c("word1","number"), 
               dtfile = NULL, f1 = 'fa', e1 = 'en', sym = FALSE, input = FALSE)



## S3 method for class 'alignment'
print(x, ...)

Arguments

file_train1

the name of the source-language file in the training set.

file_train2

the name of the target-language file in the training set.

nrec

number of sentences to be read. If -1, it considers all sentences.

iter

number of iterations for IBM Model 1.

minlen

a minimum length of sentences.

maxlen

a maximum length of sentences.

ul_s

logical. If TRUE, the first character of each source-language sentence is converted to lowercase. When the source language is written right to left, it should be FALSE.

ul_t

logical. If TRUE, the first character of each target-language sentence is converted to lowercase. When the target language is written right to left, it should be FALSE.

intrnt

logical. TRUE means that one of the two languages is written right to left, in which case an internet connection is necessary.

display

one of two values. If 'word1', alignments are displayed as words; if 'number', alignments are displayed as numbers.

dtfile

on the first run of this function, it must be NULL. The function then automatically saves the data.table it needs (required for obtaining the MLE of the IBM Model 1 parameters) under a name built from f1, e1, nrec and iter, in the form "f1.e1.nrec.iter.RData". This file name must not be changed. On later runs, it is sufficient to set dtfile to any character string, e.g. "a", "textfile" or "myproject".

f1

an abbreviation of the source language (default = 'fa').

e1

an abbreviation of the target language (default = 'en').

sym

logical. If TRUE, the output can be used by the Symmetrization function.

input

logical. If TRUE, the output can be used by the mydictionary and Evaluation1 functions.

x

an object of class "alignment".

...

further arguments passed to or from other methods.
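The naming rule described under dtfile can be illustrated with a small sketch. The helper saved_dt_name below is hypothetical and not part of word.alignment; it only assumes that the four values are joined with dots in the order given above. How the package itself formats nrec = -1 is not stated on this page.

```r
# Hypothetical helper (not part of word.alignment): builds a file name of the
# form "f1.e1.nrec.iter.RData" from the four arguments named under dtfile.
saved_dt_name <- function(f1 = "fa", e1 = "en", nrec = -1, iter = 4) {
  paste(f1, e1, nrec, iter, "RData", sep = ".")
}

# Name that would correspond to a run with f1 = "bg", nrec = 3000, iter = 4,
# under the joining assumption stated above.
saved_dt_name(f1 = "bg", e1 = "en", nrec = 3000, iter = 4)
```

If this assumption matches the package, a later call with the same f1, e1, nrec and iter and any non-NULL dtfile would look for a file with that name.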

Details

Here, a word alignment is a mapping from the words of the target language to the words of the source language.
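To make the idea concrete, the following is a minimal base R sketch of the textbook IBM Model 1 EM procedure on a toy corpus. It is not the implementation used by word_alignIBM1 (which works with data.table); the function name ibm1_sketch, the toy sentences, and the omission of the NULL source word are all simplifying assumptions made for illustration.

```r
# Toy sentence-aligned corpus (hypothetical data, not shipped with the package).
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
tgt <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical sketch of IBM Model 1 EM; the NULL source word is omitted.
ibm1_sketch <- function(src, tgt, iter = 4) {
  vs <- unique(unlist(src))
  vt <- unique(unlist(tgt))
  # t_prob[e, f] estimates t(e | f); start uniform over the target vocabulary.
  t_prob <- matrix(1 / length(vt), nrow = length(vt), ncol = length(vs),
                   dimnames = list(vt, vs))
  for (it in seq_len(iter)) {
    counts <- t_prob * 0                      # zero matrix, same dimnames
    for (k in seq_along(src)) {
      f <- src[[k]]
      e <- tgt[[k]]
      for (ej in e) {
        # E-step: posterior over the source words of this sentence pair.
        p <- t_prob[ej, f]
        p <- p / sum(p)
        for (i in seq_along(f)) {
          counts[ej, f[i]] <- counts[ej, f[i]] + p[i]
        }
      }
    }
    # M-step: renormalise so that each column (source word) sums to 1.
    t_prob <- sweep(counts, 2, colSums(counts), "/")
  }
  # Map each target word to the position of its most probable source word.
  align <- lapply(seq_along(src), function(k) {
    f <- src[[k]]
    vapply(tgt[[k]], function(ej) unname(which.max(t_prob[ej, f])), integer(1))
  })
  list(t_prob = t_prob, align = align)
}

res <- ibm1_sketch(src, tgt, iter = 4)
print(round(res$t_prob, 2))
print(res$align)
```

Because every target word picks exactly one source word while a source word may be picked by several target words, the resulting alignment is one-to-many in the sense used on this page.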

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and is about 170 MB in size. When all sentences are used, the run takes about 1.105531 hours on a computer with CPU hpcompack-i73930 and 8*8 GB = 64 GB of RAM, and the word alignment is good. With only the first 10,000 sentences, however, the word alignment might not be good. In fact, the alignment quality is sensitive to the type of the original translation (lexical or conceptual). The results are reported at

http://www.um.ac.ir/~sarmad/word.a/example_wordalignIBM1.pdf

Value

word_alignIBM1 returns an object of class "alignment".

An object of class "alignment" is a list containing the following components:

if sym = TRUE

ef

A list of integer vectors.

if input = TRUE

dd1

A data.table

if sym = FALSE and input = FALSE

n1

An integer.

n2

An integer.

time

A number (in seconds, minutes or hours).

iterIBM1

An integer.

expended_l_source

A non-negative real number.

expended_l_target

A non-negative real number.

VocabularySize_source

An integer.

VocabularySize_target

An integer.

word_translation_prob

A data.table.

word_align

A list containing the one-to-many word alignment of each sentence pair.
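The dependence of the returned components on sym and input can be summarised in a short sketch. The helper alignment_components is hypothetical and not part of word.alignment; it only restates the three cases listed above, and it stops when both flags are TRUE because this page does not describe that case.

```r
# Hypothetical helper (not part of word.alignment): lists the component names
# documented in the Value section for a given combination of sym and input.
alignment_components <- function(sym = FALSE, input = FALSE) {
  if (sym && input) {
    stop("this page does not say what is returned when both are TRUE")
  }
  if (sym) return("ef")
  if (input) return("dd1")
  c("n1", "n2", "time", "iterIBM1",
    "expended_l_source", "expended_l_target",
    "VocabularySize_source", "VocabularySize_target",
    "word_translation_prob", "word_align")
}

alignment_components(sym = TRUE)
alignment_components()
```

With an actual result object, the same names would be used for access with `$`, for example the translation table as the word_translation_prob component.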

Note

Note that memory is a limiting factor: only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. The exact requirement depends on the corpus size.

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Koehn P. (2010), "Statistical Machine Translation.", Cambridge University Press, New York.

Lopez A. (2008), "Statistical Machine Translation.", ACM Computing Surveys, 40(3).

Brown P. F., et al. (1990), "A Statistical Approach to Machine Translation.", Computational Linguistics, 16(2), 79-85.

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

See Also

Evaluation1, Symmetrization, mydictionary

Examples

# Extracting bg-en.tgz from the Europarl corpus is time consuming,
# so the extracted files have been made available at http://www.um.ac.ir/~sarmad/... .

## Not run: 
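library(word.alignment)
# Neither Bulgarian nor English is written right to left, so intrnt = FALSE.
word_alignIBM1('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
               'http://www.um.ac.ir/~sarmad/word.a/euro.en',
               nrec = 3000, ul_s = TRUE, intrnt = FALSE)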

## End(Not run)

word.alignment documentation built on May 2, 2019, 4:58 p.m.