mydictionary: Building a Suggested Dictionary


Description

Builds a suggested dictionary for two languages from a given sentence-aligned parallel corpus.

Usage

mydictionary(file_train1, file_train2,
             nrec = -1, iter = 10, prob = 0.9,
             minlen = 5, maxlen = 40, ul_s = FALSE, ul_t = TRUE,
             lang1 = 'Farsi', lang2 = 'English', intrnt = TRUE,
             dtfile = NULL, f1 = 'fa', e1 = 'en')

Arguments

file_train1

the name of the source-language file in the training set.

file_train2

the name of the target-language file in the training set.

nrec

the number of sentences to be read. If -1, all sentences are considered.

iter

the number of iterations for IBM Model 1. More iterations build a more precise dictionary than fewer.

prob

the probability used when building the suggested dictionary. A higher probability builds a more precise dictionary than a lower one.

minlen

the minimum length of sentences.

maxlen

the maximum length of sentences.

ul_s

logical. If TRUE, the first character of each source-language sentence is converted. When the source language is right-to-left, it can be FALSE.

ul_t

logical. If TRUE, the first character of each target-language sentence is converted. When the target language is right-to-left, it can be FALSE.

lang1

the name of the source language in the suggested dictionary.

lang2

the name of the target language in the suggested dictionary.

intrnt

logical. TRUE means that one of the two languages is right-to-left, so an internet connection is necessary.

dtfile

if NULL, the data.table (dd1) has not been saved before and has to be computed. If a file address is given, the data.table (dd1) was saved there already; the saved data.table is used and is not computed again.

f1

the abbreviation of the source language (default = 'fa').

e1

the abbreviation of the target language (default = 'en').
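
To make the roles of iter and prob more concrete, here is a minimal base-R sketch of IBM Model 1 on a toy corpus, followed by a probability cutoff. It is not the package's implementation: the function name ibm1_sketch, the toy sentences, the omission of a NULL word, and the reading of prob as a simple "keep pairs above this value" cutoff are all assumptions made for illustration.

```r
# Toy sentence-aligned corpus (made-up data, for illustration only).
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
tgt <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: plain IBM Model 1 EM without a NULL word.
ibm1_sketch <- function(src, tgt, iter = 10) {
  s_vocab <- unique(unlist(src))
  t_vocab <- unique(unlist(tgt))
  # t_prob[s, t] estimates P(source word s | target word t); start uniform.
  t_prob <- matrix(1 / length(s_vocab), length(s_vocab), length(t_vocab),
                   dimnames = list(s_vocab, t_vocab))
  for (i in seq_len(iter)) {
    count <- t_prob * 0  # expected co-occurrence counts for this iteration
    for (k in seq_along(src)) {
      s_sent <- src[[k]]
      t_sent <- tgt[[k]]
      for (sw in s_sent) {
        # E-step: share one unit of count for sw among the target words.
        p <- t_prob[sw, t_sent]
        p <- p / sum(p)
        for (j in seq_along(t_sent)) {
          count[sw, t_sent[j]] <- count[sw, t_sent[j]] + p[j]
        }
      }
    }
    # M-step: renormalise so each target word's column sums to 1.
    t_prob <- sweep(count, 2, colSums(count), "/")
  }
  t_prob
}

t_prob <- ibm1_sketch(src, tgt, iter = 10)

# Assumed cutoff step: keep only word pairs above a chosen probability.
idx <- which(t_prob > 0.5, arr.ind = TRUE)
suggested <- data.frame(source = rownames(t_prob)[idx[, "row"]],
                        target = colnames(t_prob)[idx[, "col"]],
                        prob = t_prob[idx],
                        stringsAsFactors = FALSE)
print(suggested)
```

In this sketch, more iterations concentrate each column's probability on fewer word pairs, and a higher cutoff keeps fewer but more reliable pairs, which is consistent with the descriptions of iter and prob above.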

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of about 170 MB. If all sentences are considered, the run takes 1.391593 hours on a computer with CPU hpcompack-i73930 and 64 GB of RAM (8 x 8 GB), and the suggested dictionary is not very good. For the first 10,000 sentences, however, the suggested dictionary is very good, and the run takes just 1.356784 minutes on an ordinary computer. The results are reported in

http://www.um.ac.ir/~sarmad/word.a/mydictionary.pdf

Value

A list.

time

A number (in seconds, minutes, or hours).

number_input

An integer.

iterIBM1

An integer.

dictionary

A matrix.
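
As a sketch of how the components listed above might be inspected, the following builds a stand-in list with the same element names. All values, the column names of the matrix, and the assumption that time is a plain number are made up for illustration; a real call to mydictionary may return differently shaped contents.

```r
# Hypothetical stand-in for the list returned by mydictionary();
# every value here is invented for illustration.
res <- list(time = 1.5,
            number_input = 2000L,
            iterIBM1 = 10L,
            dictionary = matrix(c("buch", "haus", "book", "house"), ncol = 2,
                                dimnames = list(NULL, c("source", "target"))))

str(res, max.level = 1)   # overview of the four components
head(res$dictionary)      # first rows of the suggested dictionary

# A matrix can be turned into a data frame for filtering or export.
dict_df <- as.data.frame(res$dictionary, stringsAsFactors = FALSE)
dict_df[dict_df$target == "book", ]
```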

Note

Note that memory is a restriction: only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. Of course, this depends on the corpus size.
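
A back-of-the-envelope estimate shows why memory grows quickly with corpus size. The sketch below assumes a dense table of 8-byte doubles with one cell per source-word/target-word pair; the helper name dense_table_gib and the vocabulary sizes are hypothetical, and the package's actual storage may well be more compact, so treat this as a rough upper bound rather than a measurement.

```r
# Rough size, in GiB, of a dense double-precision table with one cell
# per (source word, target word) pair. Assumes 8 bytes per cell.
dense_table_gib <- function(n_source_words, n_target_words) {
  n_source_words * n_target_words * 8 / 1024^3
}

# Hypothetical vocabulary sizes of 50,000 words on each side.
dense_table_gib(50000, 50000)
```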

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# Extracting bg-en.tgz from the Europarl corpus is time consuming,
# so the unzipped files have been exported to http://www.um.ac.ir/~sarmad/... .

## Not run: 

mydictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
              'http://www.um.ac.ir/~sarmad/word.a/euro.en', 
              nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN', 
              intrnt = FALSE)

## End(Not run)

word.alignment documentation built on May 2, 2019, 4:58 p.m.