mydictionary: Building an Automatic Bilingual Dictionary

Description

Builds an automatic bilingual dictionary for two languages from a given sentence-aligned parallel corpus.

Usage

mydictionary(file_train1, file_train2, 
             nrec = -1, iter = 15, prob = 0.8, 
             minlen = 5, maxlen = 40, ul_s = FALSE, ul_t = TRUE, 
             lang1 = "Farsi", lang2 = "English", removePt = TRUE, 
             dtfile_path = NULL, f1 = "fa", e1 = "en", 
             result_file = "mydictionaryResults")

Arguments

file_train1

the name of source language file in training set.

file_train2

the name of target language file in training set.

nrec

the number of sentences to be read. If -1, all sentences are read.

iter

the number of iterations for IBM Model 1.

prob

the minimum word translation probability.

minlen

the minimum sentence length.

maxlen

the maximum sentence length.

ul_s

logical. If TRUE, the first character of each source-language sentence is converted to lowercase. It can be FALSE when the source language uses the Arabic script.

ul_t

logical. If TRUE, the first character of each target-language sentence is converted to lowercase. It can be FALSE when the target language uses the Arabic script.

lang1

the source language's name in mydictionary.

lang2

the target language's name in mydictionary.

removePt

logical. If TRUE, it removes all punctuation marks.

dtfile_path

if NULL (usually on the first run), a data.table is created containing the cross words of all sentences with their matched probabilities. It is saved to a file whose name combines f1, e1, nrec and iter, as "f1.e1.nrec.iter.RData".

If a specific file name is given, that file is read and the function continues with the remaining step, i.e. finding the dictionary of the two given languages.

f1

a short label for the source language, used in the saved file name (default = 'fa').

e1

a short label for the target language, used in the saved file name (default = 'en').

result_file

the name of the output results file.
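To make the roles of iter and prob more concrete, here is a minimal base-R sketch of IBM Model 1 training followed by a probability cut-off. It is not the package's implementation: the function name ibm1_sketch, the toy German-English corpus, and the omission of the NULL source word are all assumptions made for illustration.

```r
# Toy sentence-aligned corpus, invented for illustration only.
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
tgt <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: a simplified IBM Model 1 (no NULL word).
ibm1_sketch <- function(src, tgt, iter = 15) {
  s_vocab <- unique(unlist(src))
  t_vocab <- unique(unlist(tgt))
  # t_prob[s, t] approximates P(t | s); start from a uniform table.
  t_prob <- matrix(1 / length(t_vocab),
                   nrow = length(s_vocab), ncol = length(t_vocab),
                   dimnames = list(s_vocab, t_vocab))
  for (k in seq_len(iter)) {
    counts <- t_prob * 0  # expected co-occurrence counts
    for (i in seq_along(src)) {
      for (tw in tgt[[i]]) {
        # E-step: share each target word among the source words
        # of the same sentence pair, in proportion to t_prob.
        z <- sum(t_prob[src[[i]], tw])
        for (sw in src[[i]]) {
          counts[sw, tw] <- counts[sw, tw] + t_prob[sw, tw] / z
        }
      }
    }
    # M-step: renormalise so that each source word's row sums to 1.
    t_prob <- counts / rowSums(counts)
  }
  t_prob
}

t_prob <- ibm1_sketch(src, tgt, iter = 15)

# Keep the most probable target word per source word, then apply
# a threshold in the spirit of the prob argument.
best <- apply(t_prob, 1, which.max)
dict <- data.frame(source = rownames(t_prob),
                   target = colnames(t_prob)[best],
                   prob = t_prob[cbind(seq_len(nrow(t_prob)), best)])
dict[dict$prob >= 0.8, ]
```

More iterations sharpen the translation table, so a higher iter tends to let more word pairs pass a given prob threshold, at the cost of longer run time.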

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of 170 MB. The first 10,000 sentences already give a good dictionary, and processing them takes only about 1.36 minutes on an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/mydictionary.pdf

Value

A list.

time

A number (in seconds, minutes, or hours).

number_input

An integer.

Value_prob

A decimal number between 0 and 1.

iterIBM1

An integer.

dictionary

A matrix.
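As a rough illustration of how the returned list might be inspected, the sketch below builds a mock object with the element names listed above. All values are invented, and the two-column layout of the dictionary matrix is an assumption; the actual layout is not documented here.

```r
# Mock result with the documented element names; every value is invented.
res <- list(
  time = 1.36,            # assumed here to be a plain number
  number_input = 10000L,
  Value_prob = 0.8,
  iterIBM1 = 15L,
  # Assumed layout: one row per word pair (source, target).
  dictionary = matrix(c("ketab", "book",
                        "khaneh", "house"),
                      ncol = 2, byrow = TRUE)
)

# Elements are accessed by name, as with any R list.
res$Value_prob
res$iterIBM1
head(res$dictionary)
```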

Note

Note that memory is a limitation: depending on the corpus size, only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs.

In addition, if dtfile_path = NULL, the following question will be asked:

"Are you sure that you want to run the word_alignIBM1 function (It takes time)? (Yes/ No: if you want to specify word alignment path, please press 'No'.)"
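A saved table can be passed back through dtfile_path on a later run. The sketch below builds a file name following the "f1.e1.nrec.iter.RData" pattern described under dtfile_path. How the package formats each part (for example when nrec = -1) and where it writes the file are not stated here, so the exact name is an assumption, and the corpus file names are placeholders.

```r
# Values matching the Usage defaults; nrec is an illustrative choice.
f1 <- "fa"
e1 <- "en"
nrec <- 10000L
iter <- 15L

# Pattern from the dtfile_path description: "f1.e1.nrec.iter.RData".
dt_file <- paste(f1, e1, nrec, iter, "RData", sep = ".")
dt_file

# Hypothetical follow-up call, not run here: reuse the saved table
# instead of recomputing the word alignment.
if (FALSE) {
  dic <- mydictionary("corpus.fa", "corpus.en",
                      nrec = nrec, iter = iter,
                      dtfile_path = dt_file)
}
```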

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily made available at
# http://www.um.ac.ir/~sarmad/... .

## Not run: 

dic1 <- mydictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN')

dic2 <- mydictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN',
                     removePt = FALSE)

## End(Not run)              

word.alignment documentation built on May 19, 2017, 7:24 p.m.