Python: module MontyLingua

MontyLingua (version 2.1)

c:\work\montylingua-2.0\python\montylingua.py

Module MontyLingua

MONTY LINGUA - An end-to-end natural language processor
               for English, for the Python/Java platform

Author: Hugo Liu <hugo@media.mit.edu>
Project Page: <http://web.media.mit.edu/~hugo/montylingua>

Copyright (c) 2002-2004 by Hugo Liu, MIT Media Lab
All rights reserved.

Non-commercial use is free, as provided in the GNU GPL.
By downloading and using MontyLingua, you agree to abide
by the additional copyright and licensing information in
"license.txt", included in this distribution.

If you use this software in your research, please
acknowledge MontyLingua and its author, and link back
to the project page http://web.media.mit.edu/~hugo/montylingua.
Please cite montylingua in academic publications as:

Liu, Hugo (2004). MontyLingua: An end-to-end natural
language processor with common sense. Available
at: web.media.mit.edu/~hugo/montylingua.

************************************************
DOCUMENTATION OVERVIEW

About MontyLingua:
- MontyTokenizer
  - normalizes punctuation, spacing and
    contractions, with sensitivity to abbrevs.
- MontyTagger
  - Part-of-speech tagging using PENN TREEBANK tagset
  - enriched with "Common Sense" from the Open Mind
    Common Sense project
  - exceeds accuracy of Brill94 tbl tagger
    using default training files
- MontyREChunker
  - chunks tagged text into verb, noun, and adjective
    chunks (VX, NX, and AX respectively)
  - incredible speed and accuracy improvement over
    previous MontyChunker
- MontyExtractor
  - extracts verb-argument structures, phrases, and
    other semantically valuable information
    from sentences and returns sentences as "digests"
- MontyLemmatiser
  - part-of-speech sensitive lemmatisation
  - strips plurals (geese-->goose) and
    tense (were-->be, had-->have)
  - includes regexps from Humphreys and Carroll's
    morph.lex, and UPENN's XTAG corpus
- MontyNLGenerator
  - generates summaries
  - generates surface form sentences
  - determines and numbers NPs and tenses verbs
  - accounts for sentence_type

WHERE MUST THE DATAFILES BE?
- the "datafiles" include all files ending in *.MDF
- the best solution is to create an environment variable called
  "MONTYLINGUA" and put the path to the datafiles there
- alternatively, MontyLingua can find the datafiles if they are
  in the operating system "PATH" variable, or in the current
  working directory

API:
The MontyLingua Python API is MontyLingua.html
The MontyLingua Java API is JMontyLingua.html

RUNNING:
MontyLingua can be called from Python, Java,
or run at the command line.

A. From Python, import the MontyLingua.py file

B. From your Java code:
   1. make sure "montylingua.jar" is
      in your class path, in addition to
      associated subdirectories and data files
   2. in your code, you need something like:

      import montylingua.JMontyLingua; // loads namespace
      public class YourClassHere {
        public static JMontyLingua j = new JMontyLingua();
        public yourFunction(String raw, String toked) {
           jisted = j.jist_predicates(raw); // an example function

   3. For a good use case example, see Sample.java.

C. From the command line:
   1. if you have python installed and in your path:
      type "run.bat"
   2. if you have java installed and in your path:
      type "runJavaCommandline.bat"

VERSION HISTORY:

New in version 2.1 (6 Aug 2004)
- new MontyNLGenerator component (in Beta phase)
- includes version 2.0.1 bugfix for problem where java api
  wasn't being exposed

New in version 2.0 (29 Jul 2004)
- 2.5X speed enhancement for whole system
  2X speed enhancement for tagger component
- rule-based chunker replaced with much faster
  and more accurate regular expression chunker
- common sense added to MontyTagger component improves
  word-level tagger accuracy to 97%
- updated and expanded lexicon for English
- added a user-customizable lexicon CUSTOMLEXICON.MDF
- improvements to MontyLemmatiser incorporating
  exception cases
- html documentation added
- speed optimizations to all code
- improvements made to semantic extraction
- added a morphological analyzer component, MontyMorph
- expanded Java API

New in version 1.3.1 (11 Nov 2003)
- mainly bugfixes
- datafiles can now sit in the current working
  directory (".") or in the path of either of the
  two environment variables "MONTYLINGUA" or "PATH"
- presence of the '/' token in input won't crash system

New in Version 1.3 (5 Nov 2003)
- lisp-style predicate output added
- Sample.java example file added to illustrate API

New in Version 1.2 (12 Sep 2003)
- MontyChunker rules expanded
- MontyLingua JAVA API added
- MontyLingua documentation added

New in Version 1.1 (1 Sep 2003)
- MontyTagger optimized, 2X loading and 2.5X tagging speed
- MontyLemmatiser added to MontyLingua suite
- MontyChunker added
- MontyLingua command-line capability added

New in Version 1.0 (3 Aug 2003)
- First release
- MontyTagger (since 15 Jan 2001) added to MontyLingua

--please send bugs & suggestions to hugo@media.mit.edu--
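
The datafile locations described under "WHERE MUST THE DATAFILES BE?" can be checked before loading MontyLingua. The following is a minimal sketch, not MontyLingua's own lookup code: the helper name find_datafile_dirs is hypothetical, and the search order (MONTYLINGUA, then PATH, then the current working directory) is an assumption, since the documentation lists the locations without stating a precedence.

```python
import os


def _has_mdf(directory):
    """True if the directory is readable and holds at least one *.MDF file."""
    try:
        names = os.listdir(directory)
    except OSError:
        return False
    return any(name.upper().endswith(".MDF") for name in names)


def find_datafile_dirs(environ=None, cwd=None):
    """Return the candidate directories that contain *.MDF datafiles.

    Hypothetical helper. Candidates are taken, in an assumed order, from
    the MONTYLINGUA variable, the PATH variable, and the working directory.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd

    candidates = []
    for var in ("MONTYLINGUA", "PATH"):
        value = environ.get(var, "")
        candidates.extend(p for p in value.split(os.pathsep) if p)
    candidates.append(cwd)

    found = []
    for directory in candidates:
        if directory not in found and _has_mdf(directory):
            found.append(directory)
    return found
```

An empty result suggests that the MONTYLINGUA environment variable still needs to be set to the directory holding the *.MDF files.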

Modules

MontyExtractor
MontyLemmatiser
MontyNLGenerator
MontyREChunker
MontyTagger
MontyTokenizer

Classes



MontyLingua

class MontyLingua

    Methods defined here:

__init__(self, trace_p=0)

chunk_lemmatised(self, lemmatised_text)
inputs lemmatised text of the form:
    "He/PRP/he ran/VBD/run"
and outputs the form:
    "(NX He/PRP/he NX) (VX is/VB/be VX) (NX the/DT/the mailman/NN/mailman NX)"

chunk_tagged(self, tagged_text)
chunks tagged text and outputs the form: "(NX He/PRP NX) (VX is/VB VX) (NX the/DT mailman/NN NX)"
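
The chunked output is a flat string, so downstream code usually has to split it back into chunks. A minimal sketch of such a parser follows; parse_chunks is a hypothetical helper, not part of MontyLingua, and it assumes every chunk is written as "(LABEL token ... LABEL)" with the NX, VX and AX labels named in the module overview. Tokens outside a chunk are ignored.

```python
import re

# A chunk opens with "(NX", "(VX" or "(AX" and closes with the same label.
CHUNK_RE = re.compile(r"\((NX|VX|AX) (.*?) \1\)")


def parse_chunks(chunked_text):
    """Split chunked text into (label, [(word, tag), ...]) pairs."""
    chunks = []
    for label, body in CHUNK_RE.findall(chunked_text):
        # rsplit keeps a word that itself contains "/" in one piece.
        tokens = [tuple(tok.rsplit("/", 1)) for tok in body.split()]
        chunks.append((label, tokens))
    return chunks


chunks = parse_chunks("(NX He/PRP NX) (VX is/VB VX) (NX the/DT mailman/NN NX)")
noun_chunks = [tokens for label, tokens in chunks if label == "NX"]
```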

extract_info(self, chunked_text)
extracts many useful things from chunked_text and returns them
in a dictionary, which can be printed using pp_info().
Its keys and sample values:
    noun_phrases: ['the dog','the cat']
    noun_phrases_tagged: ['the/DT dog/NN','the/DT cat/NN']
    verb_phrases: ['will go quickly','go slowly']
    verb_phrases_tagged: ['will/MD go/VB quickly/RB','go/VB slowly/RB']
    prep_phrases: ['by the road','by chance']
    prep_phrases_tagged: ['by/IN the/DT road/NN','by/IN chance/NN']
    modifiers: ['red','best','quickly']
    modifiers_tagged: ['red/JJ','best/JJS','quickly/RB']
    verb_arg_structures: ['will/MD go/VB quickly/RB','the/DT dog/NN','to/IN the/DT cats/NNS']
    verb_arg_structures_concise: ['("go" "dog" "to cat")']
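
Because each digest is an ordinary dictionary of lists, digests from several sentences can be aggregated with standard tools. The sketch below counts phrases across digests; count_phrases is a hypothetical helper, and the two sample digests are hand-written from the sample values above rather than real MontyLingua output.

```python
from collections import Counter


def count_phrases(digests, key="noun_phrases"):
    """Count how often each phrase under `key` occurs across digests."""
    counts = Counter()
    for info in digests:
        # A digest without the key contributes nothing.
        counts.update(info.get(key, []))
    return counts


# Hand-written stand-ins for the dictionaries returned per sentence.
digests = [
    {"noun_phrases": ["the dog", "the cat"], "modifiers": ["red"]},
    {"noun_phrases": ["the dog"], "modifiers": ["best", "quickly"]},
]
top = count_phrases(digests).most_common(1)
```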

generate_sentence(self, vsoo, sentence_type='declaration', tense='past', s_dtnum=('', 1), o1_dtnum=('', 1), o2_dtnum=('', 1), o3_dtnum=('', 1))
inputs verb-subject-object-object tuple
outputs a generated sentence
valid sentence types: declarative, imperative,
    (can|may|would|should|could),
    (who|what|when|where|why|how), question
valid tenses: past, present, progressive, past_progressive, future, infinitive
dtnum is a pair of determiner, number e.g. ('the',1),('some',2)
valid determiners = 'a','the','some','',etc
valid numbers = 1,2,3
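
The documented value ranges for tense and dtnum can be checked before calling generate_sentence(). This is a sketch only: check_generation_args is a hypothetical helper, it covers just the tenses and numbers listed above, and it does not validate sentence_type or the determiner vocabulary, which the documentation leaves open ("etc").

```python
VALID_TENSES = {
    "past", "present", "progressive",
    "past_progressive", "future", "infinitive",
}
VALID_NUMBERS = {1, 2, 3}


def check_generation_args(tense="past", dtnums=(("", 1),)):
    """Raise ValueError if an argument falls outside the documented values."""
    if tense not in VALID_TENSES:
        raise ValueError("unknown tense: %r" % (tense,))
    for dtnum in dtnums:
        # Each dtnum is documented as a (determiner, number) pair.
        if len(dtnum) != 2:
            raise ValueError("dtnum must be a pair: %r" % (dtnum,))
        determiner, number = dtnum
        if not isinstance(determiner, str) or number not in VALID_NUMBERS:
            raise ValueError("bad dtnum: %r" % (dtnum,))
    return True


ok = check_generation_args("future", [("the", 1), ("some", 2)])
```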

generate_summary(self, vsoos)
uses MontyNLGenerator to generate a paragraph text summary in the past tense
inputs a flat list of verb-subject-object-object tuples

jist(self, text)
inputs raw text, outputs a list of dictionaries with information digests of each sentence
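
The parameter names of the individual methods suggest how the stages chain together to produce such digests. The sketch below composes them by hand; the order (split_sentences, tokenize, tag_tokenized, chunk_tagged, extract_info) is an assumption inferred from the method signatures on this page, not a statement of how jist() is implemented. The functions digest_sentence and digest_text are hypothetical, and m stands for any object exposing the documented methods.

```python
def digest_sentence(m, sentence):
    """Run one raw sentence through the documented stages, in assumed order."""
    tokenized = m.tokenize(sentence)
    tagged = m.tag_tokenized(tokenized)
    chunked = m.chunk_tagged(tagged)
    return m.extract_info(chunked)


def digest_text(m, text):
    """Return one digest per sentence segment of the raw text."""
    return [digest_sentence(m, s) for s in m.split_sentences(text)]
```

With MontyLingua importable and its datafiles in place, m would be an instance of the MontyLingua class documented here. Calling the stages separately makes it possible to inspect or cache the tagged or chunked text between steps.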

jist_predicates(self, text)
similar to jist() except output is simpler
returns a list (document-level) of lists (sentence-level) of
lisp-style predicate argument structures
- each structure should look something like this:
    ("verb" "subject" "obj1" "obj2" ... )
- words are all lemmatised, and determiners and
  modals are stripped out
- obj's can be direct or indirect, but not
  subordinate clauses for now.

lemmatise_tagged(self, tagged_text)
lemmatises tagged text and outputs the form: 'These/DT/These sentences/NNS/sentence were/VBZ/be false/JJ/false' (lemma follows the pos tag)
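
Each token in the lemmatised form carries three slash-separated fields, which can be split back apart for further processing. A minimal sketch, assuming plain lemmatised text without chunk markers; parse_lemmatised is a hypothetical helper, not part of MontyLingua.

```python
def parse_lemmatised(lemmatised_text):
    """Split 'word/TAG/lemma' tokens into (word, tag, lemma) triples."""
    triples = []
    for token in lemmatised_text.split():
        # Split from the right so a word containing "/" stays intact.
        word, tag, lemma = token.rsplit("/", 2)
        triples.append((word, tag, lemma))
    return triples


triples = parse_lemmatised(
    "These/DT/These sentences/NNS/sentence were/VBZ/be false/JJ/false"
)
# Keep only the tokens whose lemma differs from the surface word.
changed = [(word, lemma) for word, tag, lemma in triples if word != lemma]
```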

parse_pred_arg(self, pp)
parses the predicate-argument string returned by jist_predicates(), of the form: '("pred name" "arg 1" "arg 2" etc)' and returns them as a list
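
The predicate-argument string format is simple enough to illustrate with a regular expression. The function below is a stand-in written for this example, not the library's parse_pred_arg(); it assumes that arguments are double-quoted and contain no embedded double quotes.

```python
import re


def parse_predicate(pp):
    """Return the quoted fields of a lisp-style predicate string as a list."""
    return re.findall(r'"([^"]*)"', pp)


fields = parse_predicate('("go" "dog" "to cat")')
verb, args = fields[0], fields[1:]
```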

pp_info(self, extracted_infos)
pretty prints sentence information digests returned by jist()

split_paragraphs(self, text)
inputs a raw text and outputs a list of paragraph segments

split_sentences(self, text)
inputs a raw text and outputs a list of sentence segments

strip_tags(self, tagged_or_chunked_text)
strips part-of-speech and chunk tags from text and returns plaintext
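
To make "stripping" concrete, the sketch below removes chunk markers and slash-separated tags from the text forms shown on this page. It is an approximation written for illustration, not the library's strip_tags(): the name strip_tags_sketch is hypothetical, only the NX, VX and AX markers are recognised, and a token whose word is itself "/" is not handled.

```python
import re

# Matches a lone opening marker such as "(NX" or a closing one such as "NX)".
CHUNK_MARKER_RE = re.compile(r"^(\((NX|VX|AX)|(NX|VX|AX)\))$")


def strip_tags_sketch(text):
    """Drop chunk markers, then keep the part of each token before the first '/'."""
    words = []
    for token in text.split():
        if CHUNK_MARKER_RE.match(token):
            continue
        words.append(token.split("/", 1)[0])
    return " ".join(words)


plain = strip_tags_sketch("(NX He/PRP NX) (VX is/VB VX) (NX the/DT mailman/NN NX)")
```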

tag_tokenized(self, tokenized_text)
takes tokenized text and returns text tagged with the Penn Treebank tagset, e.g.:
    "This/DT is/VB a/DT sentence/NN"
more information on the tagset can be found at: http://www.cis.upenn.edu/~treebank/

tokenize(self, sentence, expand_contractions_p=1)
inputs a raw text sentence and outputs that sentence with punctuation tokenized, except in the case of abbreviations
if expand_contractions_p == 1, then contractions will be resolved (e.g. "can't"-->"can not")

Data

__author__ = 'Hugo Liu <hugo@media.mit.edu>'
__version__ = '2.1'

Author

Hugo Liu <hugo@media.mit.edu>
