Constructing a Cross-tabulation Matrix of Source Language Words vs Target Language Words of a Given Sentence Pair

Description

Constructs a cross-tabulation matrix of source language words vs target language words for a given sentence pair. The matrix is then filled in either by an expert (Sure = 1, Possible = 2) or according to the output of an external word alignment software (3).

Usage

fix.gold(tst.set_sorc, tst.set_trgt, nrec = -1, 
        method = c("gold", "aligns"), minlen = 5, 
        maxlen = 40, ul_s = FALSE, ul_t = TRUE, 
        removePt = TRUE, all = FALSE, num)

Arguments

tst.set_sorc

the name of the source language file in the test set.

tst.set_trgt

the name of the target language file in the test set.

nrec

the number of sentences to be read. If -1, all sentences are read.

method

a character string, either "gold" or "aligns", that selects the message shown to the user. If "gold", the message for a gold standard is shown (i.e. "Now, press 'Enter' to continue and edit the matrix to enter Sure/Possible alignments (Sure=1,Possible=2)"). If "aligns", the message for another alignment is shown (i.e. "Now, press 'Enter' to continue and edit the matrix to enter '3' for alignments").

minlen

the minimum length of sentences.

maxlen

the maximum length of sentences.

ul_s

logical. If TRUE, the case of the first character of each source language sentence is converted. It can be FALSE when the source language is written in an Arabic script.

ul_t

logical. If TRUE, the case of the first character of each target language sentence is converted. It can be FALSE when the target language is written in an Arabic script.

removePt

logical. If TRUE, it removes all punctuation marks.

all

logical. If TRUE, the third argument of the culf function is used (lower = TRUE).

num

an integer. The index of the sentence pair whose cross-tabulation matrix is to be constructed.
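
The effect of removePt, minlen and maxlen can be pictured with the base R sketch below. It is not the package's code: prep_sentences is a hypothetical helper, and it assumes that sentence length is counted in words and that punctuation is stripped before counting, neither of which is stated above.

```r
# Hypothetical helper, NOT part of the package: an illustration of what
# removePt, minlen and maxlen might do to one side of a test set.
# Assumption: sentence length is measured in words.
prep_sentences <- function(sentences, minlen = 5, maxlen = 40, removePt = TRUE) {
  if (removePt) {
    # Drop every punctuation character.
    sentences <- gsub("[[:punct:]]", "", sentences)
  }
  # Collapse runs of whitespace and trim both ends.
  sentences <- trimws(gsub("[[:space:]]+", " ", sentences))
  # Split each sentence into words and keep those within the length bounds.
  words <- strsplit(sentences, " ", fixed = TRUE)
  n_words <- lengths(words)
  words[n_words >= minlen & n_words <= maxlen]
}

sents <- c("Hello, world!",
           "This is a slightly longer sentence, with punctuation.")
res <- prep_sentences(sents)
print(res)
```

With the defaults shown, the two-word sentence is dropped and the second one is kept as a vector of words. A real implementation would have to filter source and target sentences jointly so that the pairs stay in step.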

Details

If we want to evaluate our own word alignment results, the matrix constructed by this function is filled by an expert with code 1 for Sure alignments or code 2 for Possible alignments. If we want to evaluate an alignment produced by an external word alignment software, or by another method, the matrix is instead filled by an expert with code 3.
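
As a rough picture of the object being edited, the base R sketch below builds such a matrix by hand and fills it with the codes described above. It does not call fix.gold. The sentence pair, the make_crosstab helper, and the choice of source words as rows and target words as columns are assumptions made for illustration only.

```r
# Hypothetical sentence pair (not taken from the package's data).
src <- c("the", "house", "is", "small")
trg <- c("das", "haus", "ist", "klein")

# Hypothetical helper: an empty cross-tabulation with one row per source
# word and one column per target word (this orientation is an assumption).
make_crosstab <- function(source_words, target_words) {
  matrix(0L,
         nrow = length(source_words),
         ncol = length(target_words),
         dimnames = list(source_words, target_words))
}

# Gold standard: an expert enters 1 for Sure and 2 for Possible links.
gold <- make_crosstab(src, trg)
gold["the", "das"]     <- 1L
gold["house", "haus"]  <- 1L
gold["is", "ist"]      <- 1L
gold["small", "klein"] <- 1L
gold["the", "haus"]    <- 2L

# Another alignment (e.g. from external software): links are coded 3.
aligns <- make_crosstab(src, trg)
aligns[cbind(1:4, 1:4)] <- 3L

print(gold)
print(aligns)
```

Indexing by word, as done here, only works while no word is repeated within a sentence. With repeated words, positions should be used instead.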

Note

If non-ASCII characters cause problems, you can use the consExcel function instead.

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Holmqvist M., Ahrenberg L. (2011), "A Gold Standard for English-Swedish Word Alignment.", NODALIDA 2011 Conference Proceedings, 106-113.

Och F., Ney H. (2003), "A Systematic Comparison of Various Statistical Alignment Models.", 2003 Association for Computational Linguistics, J03-1002, 29(1).

See Also

consExcel

Examples

## Not run: 

# Read the first 5 sentence pairs and build the matrix for pair number 3.
fix.gold('http://www.um.ac.ir/~sarmad/word.a/source1.txt',
         'http://www.um.ac.ir/~sarmad/word.a/target1.txt',
         nrec = 5, num = 3)

## End(Not run)