Builds a suggested bilingual dictionary from a given sentence-aligned parallel corpus.
file_train1 |
the name of the source-language file in the training set. |
file_train2 |
the name of the target-language file in the training set. |
nrec |
number of sentences to be read. If -1, all sentences are considered. |
iter |
number of iterations for IBM Model 1. More iterations build a more precise dictionary than fewer. |
prob |
the probability used to build the dictionary. A higher probability builds a more precise dictionary than a lower one. |
minlen |
the minimum sentence length. |
maxlen |
the maximum sentence length. |
ul_s |
logical. If TRUE, the first character of the target language's sentences is converted. When the source language is a right-to-left language, it can be FALSE. |
ul_t |
logical. If TRUE, the first character of the source language's sentences is converted. When the target language is a right-to-left language, it can be FALSE. |
lang1 |
source language's name in mydictionary. |
lang2 |
target language's name in mydictionary. |
intrnt |
logical. TRUE means that one of the two languages is a right-to-left language, so an internet connection is necessary. |
dtfile |
if NULL, the data.table (dd1) has not been saved yet and has to be computed. If a file address is given, the data.table (dd1) was saved there earlier; the saved copy is used and is not calculated again. |
f1 |
the abbreviation of the source language (default = 'fa'). |
e1 |
the abbreviation of the target language (default = 'en'). |
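To make the roles of iter and prob more concrete, here is a minimal sketch of IBM Model 1 estimation on a toy corpus. It is not the package's implementation: the function name ibm1_sketch, the toy sentences, and the 0.5 cut-off are all hypothetical, the NULL word is omitted, and treating prob as a simple threshold on the translation probability is an assumption made for illustration only.

```r
# Toy sentence-aligned corpus (hypothetical data, not shipped with the package).
src <- c("das haus", "das buch", "ein buch")
tgt <- c("the house", "the book", "a book")

# Hypothetical helper: a bare-bones IBM Model 1 EM estimate of
# t(target word | source word). The NULL word is left out for brevity.
ibm1_sketch <- function(src, tgt, iter = 5) {
  s_tok <- strsplit(src, " ", fixed = TRUE)
  t_tok <- strsplit(tgt, " ", fixed = TRUE)
  s_voc <- unique(unlist(s_tok))
  t_voc <- unique(unlist(t_tok))

  # Uniform start: rows are source words, columns are target words.
  tt <- matrix(1 / length(t_voc), nrow = length(s_voc), ncol = length(t_voc),
               dimnames = list(s_voc, t_voc))

  for (k in seq_len(iter)) {
    counts <- tt * 0
    for (i in seq_along(s_tok)) {
      s <- s_tok[[i]]
      for (tw in t_tok[[i]]) {
        # E-step: split one count for `tw` among the source words of the pair.
        z <- sum(tt[s, tw])
        for (sw in s) counts[sw, tw] <- counts[sw, tw] + tt[sw, tw] / z
      }
    }
    # M-step: renormalise the expected counts for each source word.
    tt <- counts / rowSums(counts)
  }
  tt
}

tt <- ibm1_sketch(src, tgt, iter = 10)

# Most probable target word for every source word.
best <- apply(tt, 1, function(r) names(r)[which.max(r)])
best_p <- apply(tt, 1, max)

# Assumed role of a probability cut-off: keep only confident pairs.
dict <- data.frame(source = names(best), target = unname(best), prob = best_p)
dict[dict$prob > 0.5, ]
```

On this toy corpus, running more iterations concentrates the probability mass on fewer word pairs, which is the intuition behind the remark that a larger iter gives a more precise dictionary.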
The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and is about 170 MB in size. If all sentences are considered, the function takes 1.391593 hours on a computer with CPU: hpcompack-i73930 and RAM: 8*8 GB = 64 GB, and the suggested dictionary is not very good. For the first 10,000 sentences, however, the suggested dictionary is very good and the run takes just 1.356784 minutes on an ordinary computer. The results have been reported in
http://www.um.ac.ir/~sarmad/word.a/mydictionary.pdf
A list.
time |
A number (in seconds, minutes, or hours). |
number_input |
An integer. |
iterIBM1 |
An integer. |
dictionary |
A matrix. |
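As a rough illustration of how such a result might be inspected, the sketch below builds a stand-in list by hand using the documented component names. All values are made up, and the two-column layout and column names of the dictionary matrix are assumptions; the object returned by the real function may be organised differently.

```r
# Hypothetical stand-in for a returned list; only the component names
# (time, number_input, iterIBM1, dictionary) come from the documentation.
res <- list(
  time = 1.36,            # made-up value
  number_input = 2000L,   # made-up value
  iterIBM1 = 5L,          # made-up value
  dictionary = matrix(c("das", "the",
                        "buch", "book"),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("lang1", "lang2")))  # assumed layout
)

str(res, max.level = 1)   # overview of the components
head(res$dictionary)      # first rows of the dictionary matrix
```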
Note that memory is a limiting factor: only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. How much is required depends on the corpus size.
Neda Daneshgar and Majid Sarmad.
Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.
http://statmt.org/europarl/v7/bg-en.tgz
# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been exported to http://www.um.ac.ir/~sarmad/... .
## Not run:
mydictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
             'http://www.um.ac.ir/~sarmad/word.a/euro.en',
             nrec = 2000, ul_s = TRUE, lang1 = 'BULGARIAN',
             intrnt = FALSE)
## End(Not run)