

This directory includes the files needed to train the parser
by reading in treebank data and collecting the needed probabilities.

Note that many of the files have the same names as files in the
parser directory but differ slightly from them, so this directory
must be kept separate from that of the parser.

To run the training regime for the standard English parser,
execute

allScript En  -directory-for-data- -training-files- -development-files-

e.g.,

allScript En /pro/dpg/hpp/dist6b/DATA/  "/pro/dpg/ecdata/cp/train.txt"  "/pro/dpg/ecdata/cp/test.main"

The directory where the data goes must contain the files listed
below.  Most of them (featInfo.*) tell the data collection programs
exactly which features to attend to.  terms.txt defines the
non-terminal and pre-terminal symbols of the language.  headInfo.txt
states which child categories prefer to be the heads of which parent
categories.  You will also need training data, e.g., the Penn Treebank
or Switchboard, which is not provided.  bugFix.txt includes sentence
fragments that are needed to cover very unlikely combinations which
the training data do not cover, e.g.,
( (FRAG (NP (NN Task) (# #)) (. .)))


-rw-r--r--  1 ec fac  32 May 26 14:17 bugFix.txt
-rw-r--r--  1 ec fac 258 May 23 16:54 featInfo.h
-rw-r--r--  1 ec fac 411 May 23 16:54 featInfo.l
-rw-r--r--  1 ec fac  94 May 23 16:54 featInfo.lm
-rw-r--r--  1 ec fac 298 May 23 16:54 featInfo.m
-rw-r--r--  1 ec fac 405 May 23 16:54 featInfo.r
-rw-r--r--  1 ec fac  91 May 23 16:54 featInfo.rm
-rw-r--r--  1 ec fac  65 May 23 16:54 featInfo.ru
-rw-r--r--  1 ec fac 112 May 23 16:54 featInfo.s
-rw-r--r--  1 ec fac 138 May 23 16:54 featInfo.t
-rw-r--r--  1 ec fac  58 May 23 16:54 featInfo.tt
-rw-r--r--  1 ec fac 181 May 23 16:54 featInfo.u
-rw-r--r--  1 ec fac 553 May 23 16:54 headInfo.txt
-rw-r--r--  1 ec fac 609 May 23 16:54 terms.txt

All of the other files needed by the parser will be created
by running allScript.
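
Before running allScript, it can help to confirm that the data
directory holds the fourteen files shown in the listing above.  The
following is only a minimal sketch of such a check, assuming a POSIX
shell.  The function name check_data_dir is hypothetical and is not
part of this distribution; the file names are taken from the listing.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with the trainer).
# Reports every file from the listing above that is absent from the
# given data directory; returns 0 if all are present, 1 otherwise.
check_data_dir() {
    dir=$1
    missing=0
    for f in bugFix.txt headInfo.txt terms.txt \
             featInfo.h featInfo.l featInfo.lm featInfo.m \
             featInfo.r featInfo.rm featInfo.ru featInfo.s \
             featInfo.t featInfo.tt featInfo.u
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is to source the function and run it against the
directory that will be passed as the second argument to allScript,
e.g., check_data_dir /pro/dpg/hpp/dist6b/DATA/ and start training
only if it returns 0.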


LANGUAGE MODELING

To train the language model version, use the featInfo files and
rules.txt found in ../DATA/LM/.

Run allScript with the first argument lm rather than En.  You will
also need the Penn Treebank.

CHINESE

To train the Chinese parser, execute allScript with first argument Ch.
You will need the Chinese Treebank from the LDC.
