This script implements the BLEU machine translation evaluation metric.
This script has only been tested with English-language references and
candidates. Unpredictable results may occur with other single-byte
target languages, and they will occur with multi-byte target languages.
This script has only been tested with Perl 5.6. Although there is no
intentional incompatibility with generic Perl 5 stable releases, we test
for version 5.6.0 to forcibly call your attention to the lack of testing
on such releases.
Installation Instructions
Simply unpack the tar file.
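For example, assuming the distribution arrived as bleu-1.tar (the
archive name here is hypothetical; use the file you downloaded):

    tar xvf bleu-1.tar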
Test and Reference Material Preparation
bleu-1.pl assumes that the test and reference translations are marked
up in lightweight SGML. All references should be arranged in a single
file, as shown in the example file validate.ref included in this
distribution. Similarly, all test documents should be in a single file,
as shown in validate.tst. The markup itself should be obvious from the
example files. bleu-1.pl matches n-grams at the segment level if the
parallel documents have the same number of segments; otherwise, it
matches n-grams at the document level.
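Purely as an illustration, lightweight SGML markup of this kind
typically looks like the sketch below. The authoritative tag names and
attributes are those in validate.ref and validate.tst; the identifiers
and segment text here are hypothetical:

    <doc docid="1" sysid="test1">
    <seg id="1"> we propose a method of automatic evaluation . </seg>
    <seg id="2"> human evaluations can take months to finish . </seg>
    </doc>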
Usage Instructions
Running bleu-1.pl with no options prints the usage message below.
USAGE: perl bleu-1.pl -t <test_file> -r <reference_file>
       [-d dbglevel] [-n ngram_size] [-s system2test]
By default, ngram_size is 4, so that 1-, 2-, 3-, and 4-grams are matched.
Description of flags:
-t  test-file containing all documents to be scored.
-r  reference-file containing all the reference documents.
-n  optional n-gram size, which defaults to 4 so that all n-grams of
    size 4 or less are matched.
-s  optional argument specifying the system to be evaluated, in case
    there are many systems in the test-file.
-d  optional debug-level, which defaults to 0:
    1 shows document-by-document scores,
    2 shows the highest-level n-gram matches,
    3 shows the next-highest n-gram matches,
    4 and so on.
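For instance, the following combines the -n and -s flags to match
n-grams only up to trigrams and score only the system test1 (an
illustrative invocation, not one taken from the distribution):

    perl bleu-1.pl -r validate.ref -t validate.tst -n 3 -s test1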
Running

    perl bleu-1.pl -r validate.ref -t validate.tst -d 2

prints the following:

bleu-1.pl -t validate.tst -r validate.ref -d 2
test1: Number of docs = 2
Systems seen: test1. Evaluating test1
orig: Number of docs = 2
ref2: Number of docs = 2
2 reference translations found for doc 1
Matched 4-gr: translation are extensive but
Matched 4-gr: human evaluations can take
Matched 4-gr: we propose a method
Matched 4-gr: for them when there
doc_Id,1
SegsScored,4
SysWords,99
Ref2SysLen,0.8889
1-gPrec,0.7071
2-gPrec,0.4947
3-gPrec,0.3846
4-gPrec,0.2759
PrecScore,0.4389
BrevityPenalty,1
BLEU,0.4389
2 reference translations found for doc 2
Matched 4-gr: ) weigh many aspects
Matched 4-gr: such evaluations are extensive
Matched 4-gr: would benefit from a
doc_Id,2
SegsScored,3
SysWords,159
Ref2SysLen,0.8679
1-gPrec,0.6792
2-gPrec,0.4679
3-gPrec,0.3203
4-gPrec,0.2200
PrecScore,0.3868
BrevityPenalty,1
BLEU,0.3868
System,test1
SegsScored,7
SysWords,258
Ref2SysLen,0.8760
1-gPrec,0.6899
2-gPrec,0.4781
3-gPrec,0.3443
4-gPrec,0.2405
PrecScore,0.4065
BrevityPenalty,1
BLEU,0.4065
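The summary lines above can be reproduced by hand: PrecScore is the
geometric mean of the n-gram precisions, and BLEU multiplies it by the
brevity penalty. The sketch below recomputes the doc 1 numbers. It is
not code from bleu-1.pl itself; it assumes the standard BLEU brevity
penalty and takes Ref2SysLen to be the reference-to-system length ratio:

    use strict;
    use warnings;
    use List::Util qw(sum);

    # 1- to 4-gram precisions reported for doc 1
    my @prec = (0.7071, 0.4947, 0.3846, 0.2759);

    # PrecScore: geometric mean of the n-gram precisions
    my $prec_score = exp( sum(map { log $_ } @prec) / @prec );   # 0.4389

    # Brevity penalty (standard BLEU, assumed): 1 when the system output
    # is longer than the reference (Ref2SysLen < 1), exp(1 - r/c) otherwise.
    my $ref2sys = 0.8889;
    my $bp = $ref2sys < 1 ? 1 : exp(1 - $ref2sys);

    printf "PrecScore,%.4f\nBrevityPenalty,%s\nBLEU,%.4f\n",
           $prec_score, $bp, $bp * $prec_score;

Run on the doc 1 values, this prints PrecScore,0.4389 and BLEU,0.4389,
matching the output above.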