Automatically builds a bilingual dictionary for two languages from a given sentence-aligned parallel corpus.
... |
Further arguments to be passed to |
n |
Number of sentences to be read. |
iter |
the number of iterations for IBM Model 1. |
prob |
the minimum word translation probability. |
dtfile.path |
If a specific file name is set, that file is read and the function continues with the remaining step, i.e. finding the dictionary of the two given languages. |
name.sorc |
source language's name in mydictionary. |
name.trgt |
target language's name in mydictionary. |
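The roles of iter and prob can be illustrated with a minimal sketch of IBM Model 1 on a toy corpus. This is not the package's implementation: the function name ibm1_sketch, the toy sentences, and the final filtering step are assumptions made for illustration, and the NULL word used in the full model is omitted for brevity.

```r
# Toy sentence-aligned corpus (made-up data, not taken from Mizan or Europarl)
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
trg <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: simplified IBM Model 1 trained with EM
ibm1_sketch <- function(src, trg, iter = 5) {
  s_vocab <- unique(unlist(src))
  t_vocab <- unique(unlist(trg))
  # t_prob[s, t] estimates P(target word t | source word s); start uniform
  t_prob <- matrix(1 / length(t_vocab),
                   nrow = length(s_vocab), ncol = length(t_vocab),
                   dimnames = list(s_vocab, t_vocab))
  for (k in seq_len(iter)) {
    counts <- t_prob * 0
    for (i in seq_along(src)) {
      s <- src[[i]]
      for (tw in trg[[i]]) {
        # E-step: split one count for tw among the source words of the sentence
        denom <- sum(t_prob[s, tw])
        for (sw in s) {
          counts[sw, tw] <- counts[sw, tw] + t_prob[sw, tw] / denom
        }
      }
    }
    # M-step: renormalise so that each source word's row sums to 1
    t_prob <- counts / rowSums(counts)
  }
  t_prob
}

tp <- ibm1_sketch(src, trg, iter = 10)

# Keep, for each source word, its most probable translation if it reaches 'prob'
prob <- 0.5
best <- apply(tp, 1, which.max)
dict <- data.frame(source = rownames(tp),
                   target = colnames(tp)[best],
                   p = apply(tp, 1, max),
                   row.names = NULL, stringsAsFactors = FALSE)
dict <- dict[dict$p >= prob, ]
print(dict)
```

In this sketch, more iterations sharpen the probability table, and a higher prob keeps fewer but more reliable word pairs.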
The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of 170 Mb. The first 10,000 sentences already give a good dictionary, and processing them takes only 1.356784 minutes on an ordinary computer. The results can be found at
http://www.um.ac.ir/~sarmad/word.a/bidictionary.pdf
A list.
time |
A number (in seconds, minutes, or hours). |
number_input |
An integer. |
Value_prob |
A decimal number between 0 and 1. |
iterIBM1 |
An integer. |
dictionary |
A matrix. |
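As a sketch of how the components of the returned list might be accessed, the snippet below builds a hand-made stand-in with the element names listed above. All values are invented, and the two-column layout of the dictionary matrix (columns named after name.sorc and name.trgt) is an assumption, not something the package documentation confirms.

```r
# Hypothetical stand-in for a bidictionary() result; every value is made up
res <- list(
  time = 1.36,
  number_input = 2000L,
  Value_prob = 0.8,
  iterIBM1 = 5L,
  # Assumed layout: one row per word pair, one column per language
  dictionary = matrix(c("src_word_1", "trg_word_1",
                        "src_word_2", "trg_word_2"),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("BULGARIAN", "ENGLISH")))
)

# Settings that were used for the run
str(res[c("number_input", "Value_prob", "iterIBM1")])

# First rows of the dictionary, and a lookup of one source word
head(res$dictionary)
hit <- res$dictionary[res$dictionary[, "BULGARIAN"] == "src_word_2", "ENGLISH"]
print(hit)
```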
Note that memory is a restriction: only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. Of course, this depends on the corpus size.
In addition, if dtfile.path = NULL, the following question will be asked:
"Are you sure that you want to run the align.ibm1 function (It takes time)? (Yes/ No: if you want to specify word alignment path, please press 'No'.)"
Neda Daneshgar and Majid Sarmad.
Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.
http://statmt.org/europarl/v7/bg-en.tgz
# Extracting bg-en.tgz from the Europarl corpus is time consuming,
# so the extracted files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .
## Not run:
dic1 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
'http://www.um.ac.ir/~sarmad/word.a/euro.en',
n = 2000, encode.sorc = 'UTF-8',
name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH')
dic2 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
'http://www.um.ac.ir/~sarmad/word.a/euro.en',
n = 2000, encode.sorc = 'UTF-8',
name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH',
remove.pt = FALSE)
## End(Not run)