View source: R/word_alignIBM1.R
For a given sentence-aligned parallel corpus, it aligns the words in each sentence pair. It also calculates the expected length and vocabulary size of each language (source and target) and returns the word translation probabilities as a data.table.
file_train1 |
the name of the source language file in the training set. |
file_train2 |
the name of the target language file in the training set. |
nrec |
the number of sentences to be read. If -1, all sentences are considered. |
iter |
the number of iterations for IBM Model 1. |
minlen |
a minimum length of sentences. |
maxlen |
a maximum length of sentences. |
ul_s |
logical. If TRUE, it will convert the first character of each source language sentence. When the source language is right-to-left, it should be FALSE. |
ul_t |
logical. If TRUE, it will convert the first character of each target language sentence. When the target language is right-to-left, it should be FALSE. |
intrnt |
logical. TRUE means that one of the two languages is right-to-left, so an internet connection is necessary. |
display |
one of two values. If 'word1', alignments are displayed as words; if 'number', alignments are displayed as numbers. |
dtfile |
when the function is run for the first time, it must be set to NULL. In this case, the function automatically saves the required data.table (needed to obtain the MLE of the IBM Model 1 parameters) under a name that combines f1, e1, nrec and iter as "f1.e1.nrec.iter.RData". Note that this name must not be changed. On subsequent runs, it is sufficient to set dtfile to any character string, e.g. "a", "textfile" or "myproject". |
f1 |
an abbreviation of the source language (default = 'fa'). |
e1 |
an abbreviation of the target language (default = 'en'). |
sym |
logical. If TRUE, the output can be used by Symmetrization function. |
input |
logical. If TRUE, the output can be used by mydictionary and Evaluation1 functions. |
x |
an object of class "alignment". |
... |
further arguments passed to or from other methods. |
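The file name convention described for dtfile can be sketched as follows. This is only an illustration of the documented pattern "f1.e1.nrec.iter.RData"; the values of nrec and iter are hypothetical, and how the function builds the name internally is not shown on this page.

```r
# Hypothetical argument values; f1 and e1 use the documented defaults.
f1 <- "fa"
e1 <- "en"
nrec <- 3000
iter <- 4

# Name of the saved file, following the documented pattern.
rdata_name <- paste(f1, e1, nrec, iter, "RData", sep = ".")

# First run: no saved file yet, so dtfile must be NULL.
# Later runs: any character string is sufficient.
dtfile <- if (file.exists(rdata_name)) "a" else NULL
```

Checking for the saved file before choosing dtfile is one possible convention, not something the function requires.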
Here, word alignment is a mapping from the target language to the source language.
The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and is about 170 MB in size. When all sentences are considered, the run takes about 1.105531 hours on a computer with cpu: hpcompack-i73930 and RAM: 8*8 GB = 64 GB, and the word alignment is good. For the first 10,000 sentences only, however, the word alignment might not be good. In fact, the alignment is sensitive to the type of the original translation (lexical or conceptual). The results are reported in
http://www.um.ac.ir/~sarmad/word.a/example_wordalignIBM1.pdf
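To make the idea of IBM Model 1 alignment concrete, here is a minimal sketch of the EM estimation on a toy corpus. It is not the implementation used by word_alignIBM1: the function names ibm1_sketch and align_sketch, the toy sentences, and the omission of the NULL source word are all assumptions made for illustration.

```r
# Toy sentence-aligned corpus (hypothetical data, already tokenized).
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
tgt <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Simplified IBM Model 1 EM (no NULL word); illustrative only.
ibm1_sketch <- function(src, tgt, iter = 10) {
  s_vocab <- unique(unlist(src))
  t_vocab <- unique(unlist(tgt))
  # t_prob[e, f] estimates P(target word e | source word f); start uniform.
  t_prob <- matrix(1 / length(t_vocab),
                   nrow = length(t_vocab), ncol = length(s_vocab),
                   dimnames = list(t_vocab, s_vocab))
  for (it in seq_len(iter)) {
    counts <- t_prob * 0                     # E-step: expected counts
    for (k in seq_along(src)) {
      f <- src[[k]]
      e <- tgt[[k]]
      for (ew in e) {
        p <- t_prob[ew, f]
        post <- p / sum(p)                   # posterior over source positions
        for (j in seq_along(f)) {
          counts[ew, f[j]] <- counts[ew, f[j]] + post[j]
        }
      }
    }
    # M-step: renormalize each source-word column so it sums to 1.
    t_prob <- sweep(counts, 2, colSums(counts), "/")
  }
  t_prob
}

# Map each target word to the position of its most probable source word.
align_sketch <- function(f, e, t_prob) {
  vapply(e, function(ew) unname(which.max(t_prob[ew, f])), integer(1))
}

t_prob <- ibm1_sketch(src, tgt, iter = 10)
a1 <- align_sketch(src[[1]], tgt[[1]], t_prob)
```

In this sketch, a1 holds, for each target word of the first sentence pair, the index of the source word it is aligned to, which mirrors the target-to-source direction described above.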
word_alignIBM1 returns an object of class "alignment".
An object of class "alignment" is a list containing the following components:
if sym = TRUE
ef |
A list of integer vectors. |
if input = TRUE
dd1 |
A data.table |
if sym = FALSE and input = FALSE
n1 |
An integer. |
n2 |
An integer. |
time |
A number (in seconds, minutes or hours). |
iterIBM1 |
An integer. |
expended_l_source |
A non-negative real number. |
expended_l_target |
A non-negative real number. |
VocabularySize_source |
An integer. |
VocabularySize_target |
An integer. |
word_translation_prob |
A data.table. |
word_align |
A list of one-to-many word alignment for each sentence pair. |
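As a rough illustration of the length and vocabulary components, the sketch below computes a mean sentence length and a vocabulary size for a toy source side. It assumes that the expected length is the average number of whitespace-separated tokens per sentence; the page does not state how the function computes it, and the variable names here are hypothetical.

```r
# Hypothetical source-side sentences.
src_sentences <- c("das haus", "das buch ist gut", "ein buch")

# Whitespace tokenization (an assumption for this sketch).
tokens <- strsplit(src_sentences, " ", fixed = TRUE)

# Average sentence length in tokens.
mean_len <- mean(lengths(tokens))

# Number of distinct word types.
vocab_size <- length(unique(unlist(tokens)))
```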
Note that there is a memory restriction: only computers with a powerful CPU and a large amount of RAM can allocate the vectors used by this function. Of course, this depends on the corpus size.
Neda Daneshgar and Majid Sarmad.
Koehn P. (2010), "Statistical Machine Translation", Cambridge University Press, New York.
Lopez A. (2008), "Statistical Machine Translation", ACM Computing Surveys, 40(3).
Brown P. F. et al. (1990), "A Statistical Approach to Machine Translation", Computational Linguistics, 16(2), 79-85.
Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.
http://statmt.org/europarl/v7/bg-en.tgz
Evaluation1, Symmetrization, mydictionary
# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been exported to http://www.um.ac.ir/~sarmad/... .
## Not run:
word_alignIBM1 ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
'http://www.um.ac.ir/~sarmad/word.a/euro.en',
nrec = 3000, ul_s = TRUE, intrnt = FALSE)
## End(Not run)