/*
 * Copyright 1999, 2000, 2001, 2005 Brown University, Providence, RI.
 * 
 *                         All Rights Reserved
 * 
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose other than its incorporation into a
 * commercial product is hereby granted without fee, provided that the
 * above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation, and that the name of Brown University not be used in
 * advertising or publicity pertaining to distribution of the software
 * without specific, written prior permission.
 * 
 * BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
 * INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
 * PARTICULAR PURPOSE.  IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
 * ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */

New:  The parser will parse chinese.  See below.

----------------------------------------------------------------------


I  Basic Usage

The parser (which is to be found in the sub-directory PARSE)expects
sentences delimited by <s> ...</s>, and outputs the parsed versions in
Penn tree-bank style.  (The <s> and </s> must be separated by spaces
from all other text.) So if the input is

<s> (``He'll work at the factory.'') </s>

the output will be (to cout):

(S1 (PRN (-LRB- -LRB-)
     (S (`` ``)
      (NP (PRP He))
      (VP (MD 'll)
       (VP (VB work) (PP (IN at) (NP (DT the) (NN factory)))))
      (. .)
      ('' ''))
     (-RRB- -RRB-)))

The parser will take input from either cin, or, if given the name
of a file, from that file.  So in the latter case the call to
the parser would be:

parseIt <path to directory with parsing statistics>  <file of sentences>
E.g.,
parseIt ../DATA/EN/ testSentence
(Note that as the parser now has three separate DATA files, one each
for English, Chinese, and English Language Modeling, the DATA directories
have been made separate subdirectories under the directory DATA).

As indicated above, the parser will first tokenize the input.
If you do NOT want to to tokenize (for some reason you are
handing it pretokenized input, as you would do if you were
testing it's performance on the tree-bank), give it a -K option.

II To compile

The program was created from this file by make parseIt

III N-best Parsing

This version of the parser can produce multiple-best parses.  So if
you what 50 alternative parses rather than just one, change the line
#define NTH 1 in AnswerTree.h to #define NTH 50

In multiparse mode the output format is slightly different;
Sent# #ofParses
logProb(parse1)
parse1

logProb(parse2)
parse2

etc.

IV Other options

The parser can now parse CHINESE.  It requires that the Chinese
characters already be grouped into words.  Assuming you have 
trained on the Chinese Tree-bank from LDC (see the README for
the TRAIN programs), you tell the parser to be expecting Chinese
by giving it the command-line option -LCh.  (The default is
English, which is also be specified by -LEn .)  The files you
need to train Chinese are in DATA/CH/.

The parser will ignore any sentence consisting of > 100 words +
punctuation.  To change this to, say 200 you give it the on-line
argument -l200.

The parser is set to be case sensitive.  To make it case insensitive
add the command-line flat -C .

Currently there are various array sizes that make 400 the absolute
maximum sentence length.  To allow for longer sentences change (in
Feature.h)
#define MAXSENTLEN 400
Similarly to allow for a larger dictionary of words from training increase
#define MAXNUMWORDS 50000

To see debugging information give it the on-line argument -d#
where # is a number > 10.  As the numbers get larger, the verbosity of
the information increases.

V Training

There is a subdirectory TRAIN which contains the programs used to
collect the statistics the parser requires from tree-bank data.  As
the parser comes with the statistics it needs you will only need this
if you want to try experiments with the parser on more (or less, or
different) tree-bank data.  For more information see the README file
in TRAIN.

VII Language Modeling

To use the parser as the language model described in Charniak 2001
(Proceedings of ACL) you must first retrain the data using the
settings found in DATA/LM/. 

Then give parseIt a -M command-line argument.  If the data is from
speech, and thus all one case, also use the -C flag.

The output in -L  mode is of the form:

log-grammar-probability	  log-trigram-probability   log-mixed-probability
parse

Again, if the data is from speech and has a limited vocabulary, it
will often be the case that the parser will have a very difficult time
finding a parse because of incorrect words (or, in simulated speech
output, the presence of "unk" the unknown word replacement), and there
will be many parses with equally bad probabilities.  In such cases the
pruning that keeps memory in bounds for 50-best parsing fails.  So
just use 1-best, or maybe 10 best.

