Building an Automatic Bilingual Dictionary


Description

Builds a bilingual dictionary for two languages automatically from a given sentence-aligned parallel corpus.

Usage

mydictionary(file_train1, file_train2,
             nrec = -1, iter = 15, prob = 0.8,
             minlen = 5, maxlen = 40, ul_s = FALSE, ul_t = TRUE,
             lang1 = 'Farsi', lang2 = 'English', removePt = TRUE,
             dtfile = NULL, f1 = 'fa', e1 = 'en')

Arguments

file_train1

the name of the source-language file in the training set.

file_train2

the name of the target-language file in the training set.

nrec

the number of sentences to be read. If -1, all sentences are used.

iter

number of iterations for IBM Model 1.

prob

the minimum word translation probability.

minlen

the minimum sentence length.

maxlen

the maximum sentence length.

ul_s

logical. If TRUE, it converts the first character of each source-language sentence to lowercase. When the source language uses the Arabic script, it can be FALSE.

ul_t

logical. If TRUE, it converts the first character of each target-language sentence to lowercase. When the target language uses the Arabic script, it can be FALSE.

lang1

source language's name in mydictionary.

lang2

target language's name in mydictionary.

removePt

logical. If TRUE, it removes all punctuation marks.

dtfile

if NULL, the data.table (dd1) has not been saved before and has to be built. If a file path is given, the data.table (dd1) has already been saved there and that saved copy is used.

f1

it is a notation for the source language (default = 'fa').

e1

it is a notation for the target language (default = 'en').
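The cleaning options above (minlen, maxlen, removePt, ul_s, ul_t) can be pictured with a small base R sketch. This is only an illustration under stated assumptions, not the package's code: the helpers clean_side and filter_pairs are hypothetical, sentence length is assumed to be counted in words, a pair is assumed to be dropped when either side falls outside the length limits, and the case conversion is assumed to be lowercasing.

```r
# Hypothetical helpers, not part of the package.

# Clean one side of the corpus: optionally strip punctuation, squeeze
# whitespace, and optionally lowercase the first character of each sentence.
clean_side <- function(x, lower_first, remove_punct) {
  if (remove_punct) x <- gsub("[[:punct:]]", "", x)
  x <- trimws(gsub("\\s+", " ", x))
  if (lower_first) substr(x, 1, 1) <- tolower(substr(x, 1, 1))
  x
}

# Keep only sentence pairs whose two sides both have between minlen and
# maxlen words (assumption: length is measured in words).
filter_pairs <- function(src, tgt, minlen = 5, maxlen = 40,
                         ul_s = FALSE, ul_t = TRUE, removePt = TRUE) {
  src <- clean_side(src, ul_s, removePt)
  tgt <- clean_side(tgt, ul_t, removePt)
  n_s <- lengths(strsplit(src, " ", fixed = TRUE))
  n_t <- lengths(strsplit(tgt, " ", fixed = TRUE))
  keep <- n_s >= minlen & n_s <= maxlen & n_t >= minlen & n_t <= maxlen
  data.frame(source = src[keep], target = tgt[keep],
             stringsAsFactors = FALSE)
}

# Toy corpus (made up for illustration).
src <- c("Das ist ein kleines Haus .", "Ja .")
tgt <- c("This is a small house .", "Yes .")
pairs <- filter_pairs(src, tgt, minlen = 3, maxlen = 40,
                      ul_s = TRUE, ul_t = TRUE)
print(pairs)
```

In this toy run the second pair is discarded because it has fewer than three words once punctuation is removed.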

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of 170 Mb. Using only the first 10,000 sentences already gives a good dictionary, and takes just 1.356784 minutes on an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/mydictionary.pdf

Value

A list.

time

A number (in seconds/minutes/hours).

number_input

An integer.

Value_prob

A decimal number between 0 and 1.

iterIBM1

An integer.

dictionary

A matrix.
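To make the roles of iterIBM1, Value_prob and dictionary more concrete, here is a minimal base R sketch of IBM Model 1 trained with EM on a toy corpus, followed by a probability cut-off. It is a simplified illustration, not the package's implementation: the function ibm1_sketch is hypothetical, no NULL word is used, the model is a dense matrix of t(target | source), and building the dictionary by keeping pairs with probability at least prob is an assumption about how the threshold is applied.

```r
# Hypothetical function, not part of the package.
# Returns a matrix tp with tp[s, t] = t(target word t | source word s).
ibm1_sketch <- function(src, tgt, iter = 15) {
  s_tok <- strsplit(src, " ", fixed = TRUE)
  t_tok <- strsplit(tgt, " ", fixed = TRUE)
  s_voc <- unique(unlist(s_tok))
  t_voc <- unique(unlist(t_tok))
  # Uniform initialisation.
  tp <- matrix(1 / length(t_voc), nrow = length(s_voc), ncol = length(t_voc),
               dimnames = list(s_voc, t_voc))
  for (it in seq_len(iter)) {
    count <- tp * 0
    for (k in seq_along(s_tok)) {
      s <- s_tok[[k]]
      t <- t_tok[[k]]
      sub <- tp[s, t, drop = FALSE]
      # E-step: for each target word, share one unit of count among the
      # source words of the sentence in proportion to the current model.
      post <- sweep(sub, 2, colSums(sub), "/")
      for (i in seq_along(s)) {
        for (j in seq_along(t)) {
          count[s[i], t[j]] <- count[s[i], t[j]] + post[i, j]
        }
      }
    }
    # M-step: renormalise the expected counts per source word.
    tp <- count / rowSums(count)
  }
  tp
}

# Toy corpus (made up for illustration).
src <- c("das Haus", "das Buch", "ein Buch")
tgt <- c("the house", "the book", "a book")
tp <- ibm1_sketch(src, tgt, iter = 15)

# Keep word pairs whose translation probability reaches the threshold.
prob <- 0.6
hit <- which(tp >= prob, arr.ind = TRUE)
dict <- cbind(source = rownames(tp)[hit[, "row"]],
              target = colnames(tp)[hit[, "col"]])
print(round(tp, 3))
print(dict)
```

A higher prob keeps fewer but more reliable pairs, while more iterations sharpen the probabilities at the cost of run time.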

Note

Note that memory is a limiting factor: depending on the corpus size, only computers with a powerful CPU and a large amount of RAM may be able to allocate the vectors this function needs.

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been made available at http://www.um.ac.ir/~sarmad/... .

## Not run: 

dic1 = mydictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN')
              
dic2 = mydictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
                      nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN',
                      removePt = FALSE)

## End(Not run)