This script implements the BLEU machine translation evaluation metric.
This script has only been tested with English-language references and
candidates. Unpredictable results may occur with other single-byte
target languages, and they will occur with multi-byte target languages.
This script has only been tested with Perl 5.6. Although there is no
intentional incompatibility with generic Perl 5 stable releases, we test
for version 5.6.0 to forcibly call your attention to the lack of testing
on such releases.
Installation Instructions
Simply unpack the tar file.
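For example, assuming the distribution arrived as bleu-1.tar (the
archive name here is hypothetical; use the file you downloaded):

    tar xvf bleu-1.tar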
Test and Reference Material Preparation
bleu-1.pl assumes that the test and reference translations are marked
up in lightweight SGML. All references should be arranged in a single
file, as shown in the example file validate.ref included in this
distribution. Similarly, all test documents should be in a single file,
as shown in validate.tst. The markup itself should be obvious from the
example files. bleu-1.pl matches n-grams at the segment level if the
parallel documents have the same number of segments; otherwise, it
matches n-grams at the document level.
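Purely as an illustration, lightweight SGML markup of this kind
typically looks like the sketch below. The authoritative tag names and
attributes are those in validate.ref and validate.tst; the identifiers
and segment text here are hypothetical:

    <doc docid="1" sysid="test1">
    <seg id="1"> we propose a method of automatic evaluation . </seg>
    <seg id="2"> human evaluations can take months to finish . </seg>
    </doc>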
Usage Instructions
Running bleu-1.pl with no options prints the usage message below.
USAGE: perl bleu-1.pl -t <test_file> -r <reference_file>
       [-d dbglevel] [-n ngram_size] [-s system2test]
By default, ngram_size is 4, so that 1-, 2-, 3-, and 4-grams are matched.
Description of flags:
-t  test-file containing all documents to be scored.
-r  reference-file containing all the reference documents.
-n  optional n-gram size, which defaults to 4 so that all n-grams of
    size 4 or less are matched.
-s  optional argument specifying the system to be evaluated, in case
    there are many systems in the test-file.
-d  optional debug-level, which defaults to 0:
    1 shows document-by-document scores,
    2 shows the highest-level n-gram matches,
    3 shows the next-highest n-gram matches,
    4 and so on.
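For instance, the following combines the -n and -s flags to match
n-grams only up to trigrams and score only the system test1 (an
illustrative invocation, not one taken from the distribution):

    perl bleu-1.pl -r validate.ref -t validate.tst -n 3 -s test1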
Running

    perl bleu-1.pl -r validate.ref -t validate.tst -d 2

prints the following:

bleu-1.pl -t validate.tst -r validate.ref -d 2
test1: Number of docs = 2
Systems seen: test1. Evaluating test1
orig: Number of docs = 2
ref2: Number of docs = 2
2 reference translations found for doc 1
Matched 4-gr: translation are extensive but
Matched 4-gr: human evaluations can take
Matched 4-gr: we propose a method
Matched 4-gr: for them when there
doc_Id,1
SegsScored,4
SysWords,99
Ref2SysLen,0.8889
1-gPrec,0.7071
2-gPrec,0.4947
3-gPrec,0.3846
4-gPrec,0.2759
PrecScore,0.4389
BrevityPenalty,1
BLEU,0.4389
2 reference translations found for doc 2
Matched 4-gr: ) weigh many aspects
Matched 4-gr: such evaluations are extensive
Matched 4-gr: would benefit from a
doc_Id,2
SegsScored,3
SysWords,159
Ref2SysLen,0.8679
1-gPrec,0.6792
2-gPrec,0.4679
3-gPrec,0.3203
4-gPrec,0.2200
PrecScore,0.3868
BrevityPenalty,1
BLEU,0.3868
System,test1
SegsScored,7
SysWords,258
Ref2SysLen,0.8760
1-gPrec,0.6899
2-gPrec,0.4781
3-gPrec,0.3443
4-gPrec,0.2405
PrecScore,0.4065
BrevityPenalty,1
BLEU,0.4065
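The summary lines above can be reproduced by hand: PrecScore is the
geometric mean of the n-gram precisions, and BLEU multiplies it by the
brevity penalty. The sketch below recomputes the doc 1 numbers. It is
not code from bleu-1.pl itself; it assumes the standard BLEU brevity
penalty and takes Ref2SysLen to be the reference-to-system length ratio:

    use strict;
    use warnings;
    use List::Util qw(sum);

    # 1- to 4-gram precisions reported for doc 1
    my @prec = (0.7071, 0.4947, 0.3846, 0.2759);

    # PrecScore: geometric mean of the n-gram precisions
    my $prec_score = exp( sum(map { log $_ } @prec) / @prec );   # 0.4389

    # Brevity penalty (standard BLEU, assumed): 1 when the system output
    # is longer than the reference (Ref2SysLen < 1), exp(1 - r/c) otherwise.
    my $ref2sys = 0.8889;
    my $bp = $ref2sys < 1 ? 1 : exp(1 - $ref2sys);

    printf "PrecScore,%.4f\nBrevityPenalty,%s\nBLEU,%.4f\n",
           $prec_score, $bp, $bp * $prec_score;

Run on the doc 1 values, this prints PrecScore,0.4389 and BLEU,0.4389,
matching the output above.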