Evaluation1: Evaluation of Word Alignment Quality

Description

Computes Precision, Recall, AER (alignment error rate) and F-measure to evaluate word alignment quality.

Usage

Evaluation1(file_train1, file_train2, nrec = -1, 
            tst.set_sorc, tst.set_trgt, nlen = -1, 
            minlen = 5, maxlen = 40, ul_s = FALSE, 
            ul_t = TRUE, intrnt = TRUE, iter = 3, 
            method = c("fix","Excel"), 
            agn = c("an.agn","my.agn"), 
            guideline = 'null', 
            excel1 = 'gold.xlsx',excel2 = 'align.xlsx', 
            fixfile_gld = NULL, fixfile_agn = NULL, 
            dtfile = NULL, f1 = 'fa', e1 = 'en', alpha = 0.5)

Arguments

file_train1

the name of the source language file in the training set.

file_train2

the name of the target language file in the training set.

nrec

the number of sentences to read from the training set. If -1, all sentences are read.

tst.set_sorc

the name of the source language file in the test set.

tst.set_trgt

the name of the target language file in the test set.

nlen

the number of sentences to read from the test set. If -1, all sentences are read.

minlen

the minimum length of sentences.

maxlen

the maximum length of sentences.

ul_s

logical. If TRUE, the first character of each source-language sentence is converted to lowercase. When the source language is written right-to-left, it can be set to FALSE.

ul_t

logical. If TRUE, the first character of each target-language sentence is converted to lowercase. When the target language is written right-to-left, it can be set to FALSE.

intrnt

logical. TRUE means that one of the two languages is written right-to-left, in which case an internet connection is necessary.

iter

the number of iterations for IBM Model 1.

method

it takes one of two values. If "fix", the fix.gold function is used to build the gold standard. If "Excel", the consExcel function is used to build the gold standard.

agn

it takes one of two values. If "my.agn", the user wants to evaluate the one-to-many word alignment produced by the word_alignIBM1 function in this package. If "an.agn", the user wants to evaluate an alignment produced by other software or by another method.

guideline

if the gold standard alignment is constructed based on "null tokens", it is set to "null"; otherwise it can be set to any character string, e.g. "a", "textfile" or "myproject".

excel1

the name of the Excel file for the gold standard.

excel2

the name of the Excel file for the alignment.

fixfile_gld

it relates to creating a gold standard with the fix.gold function. On the first run, it must be set to NULL. In this case, the function automatically saves the created gold standard matrices under names that combine the sentence number and 'RData', for example "1.RData", "2.RData", and so on. Note that these file names must not be changed. On later runs, it is sufficient to set fixfile_gld to any character string, e.g. "a", "textfile" or "myproject".

fixfile_agn

it is similar to fixfile_gld, but applies to the alignment produced by other software or by another method rather than to the gold standard.

dtfile

on the first run of this function, it must be set to NULL. In this case, the function automatically saves the required output of the word_alignIBM1 function under a name that combines f1, e1, nrec and iter, as "f1.e1.nrec.iter.RData". Note that this file name must not be changed. On later runs, it is sufficient to set dtfile to any character string, e.g. "a", "textfile" or "myproject".

f1

an abbreviation of the source language (default = 'fa').

e1

an abbreviation of the target language (default = 'en').

alpha

a parameter that sets the trade-off between Precision and Recall in the F-measure.
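
As an illustration of how alpha trades Precision against Recall, the following is a minimal sketch that assumes the weighted F-measure of Fraser and Marcu (2007), F = 1 / (alpha / Precision + (1 - alpha) / Recall). The helper name f_measure and the numeric inputs are hypothetical and are not part of this package; the package's own computation may differ in detail.

```r
# Hypothetical helper, not part of word.alignment.
# Assumes the weighted F-measure of Fraser and Marcu (2007).
f_measure <- function(precision, recall, alpha = 0.5) {
  1 / (alpha / precision + (1 - alpha) / recall)
}

# Illustrative values only
f_balanced <- f_measure(0.75, 0.60)               # alpha = 0.5: harmonic mean
f_recall   <- f_measure(0.75, 0.60, alpha = 0.2)  # smaller alpha weights Recall more
```

Under this definition, alpha = 1 reduces the measure to Precision and alpha = 0 reduces it to Recall.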

Details

To evaluate word alignment quality, we need a reference alignment (a gold standard for the word alignment) of a test set. Two methods of entering this gold standard are provided. When method = "fix", the fix.gold function is called; the user presses 'Enter' to continue and then edits the matrix to enter Sure/Possible alignments (Sure = 1, Possible = 2). Furthermore, when the alignment comes from other software or from another method, the user should set agn = "an.agn", press 'Enter' to continue, and edit the matrix to enter 3 for each alignment, based on that alignment's results. (Note that one matrix is created for each sentence pair.)

If method = "Excel", the Excel file created by consExcel is used. In this method, an expert should first complete that Excel file with code 1 for Sure and code 2 for Possible alignments; the file is then passed as the excel1 argument (default: "gold.xlsx"). Moreover, to evaluate an alignment produced by other software or by another method, the user can supply the excel2 file (default: "align.xlsx"), completed with 3 for each alignment.
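
The coding described above can be illustrated for a single sentence pair with a minimal sketch. It assumes the standard definitions of Och and Ney (2003), in which the Possible set includes the Sure links: Precision = |A & P| / |A|, Recall = |A & S| / |S| and AER = 1 - (|A & S| + |A & P|) / (|A| + |S|). The helper name alignment_metrics and the toy matrices are hypothetical and are not part of this package, whose internal computation may differ.

```r
# Hypothetical helper, not part of word.alignment.
# gold:  matrix with 1 = Sure, 2 = Possible, 0 = no link
# align: matrix with 3 = proposed link, 0 = no link
alignment_metrics <- function(gold, align) {
  S <- gold == 1                # Sure links
  P <- gold == 1 | gold == 2    # Possible links include the Sure ones
  A <- align == 3               # links proposed by the aligner

  precision <- sum(A & P) / sum(A)
  recall    <- sum(A & S) / sum(S)
  aer       <- 1 - (sum(A & S) + sum(A & P)) / (sum(A) + sum(S))

  list(Recall = recall, Precision = precision, AER = aer)
}

# Toy 3 x 3 example (rows: source words, columns: target words)
gold  <- matrix(c(1, 0, 0,
                  0, 1, 2,
                  0, 0, 1), nrow = 3, byrow = TRUE)
align <- matrix(c(3, 0, 0,
                  0, 3, 3,
                  0, 3, 0), nrow = 3, byrow = TRUE)

m <- alignment_metrics(gold, align)
```

In this toy case the aligner proposes four links, three of which are Possible and two of which are Sure, so Precision is 3/4, Recall is 2/3 and AER is 2/7.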

Value

A list.

Recall

A decimal number.

Precision

A decimal number.

AER

A decimal number.

F_measure

A decimal number.

Note

Note that memory is a limiting factor: only computers with a powerful CPU and a large amount of RAM can allocate the vectors used by this function. Of course, this depends on the corpus size.

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Fraser A., Marcu D. (2007), "Measuring Word Alignment Quality for Statistical Machine Translation.", Computational Linguistics, 33(3), 293-303.

Koehn P. (2010), "Statistical Machine Translation.", Cambridge University Press, New York.

Och F., Ney H. (2003), "A Systematic Comparison of Various Statistical Alignment Models.", Computational Linguistics, 29(1), 19-51.

Wang X. "Evaluation of Two Word Alignment Systems.", Final Thesis, Department of Computer and Information Science.

See Also

word_alignIBM1, fix.gold, consExcel


word.alignment documentation built on May 2, 2019, 4:58 p.m.