edu.stanford.nlp.tagger.maxent (Stanford JavaNLP API)

Package edu.stanford.nlp.tagger.maxent

A Maximum Entropy Part-of-Speech Tagger.

Class Summary
AmbiguityClass	An ambiguity class for a word is the word by itself or its set of observed tags.
AmbiguityClasses	A collection of AmbiguityClass objects.
ASBCDict
ASBCunkDict
CollectionTaggerOutputs	This class will just hold an array of the outputs of all available taggers.
CountWrapper	A simple data structure for some tag counts.
CtbDict
CTBunkDict
DataWordTag
Dictionary	Maintains a map from words to tags and their counts.
Extractor	This class serves as the base class for classes which extract relevant information from a history to give it to the features.
ExtractorAllTaggerOutputs
ExtractorDistsim	Extractor for adding distsim information
ExtractorFollowing2WClass
ExtractorFollowingWClass
ExtractorFrames	The static eFrames contains an array of all Extractors that are used to define features.
ExtractorFramesRare	Provides arrays of ExtractorFrames for rare words.
ExtractorLastVerb
ExtractorLVdist
ExtractorOutputTag
ExtractorParticlesChris	A class that detects and provides features for common verb particle pairs.
Extractors	Maintains a set of feature extractors and applies them.
FeatureKey	Stores a triple of an extractor ID, a feature value (derived from history) and a y (tag) value.
GlobalHolder	This class holds many global variables and other things that are used by the Stanford MaxEnt Part-of-speech Tagger package.
History
HistoryTable	Maintains a two-way lookup between a History and an Integer index.
LambdaSolveTagger	This module does the working out of lambda parameters for binary tagger features.
MaxentTagger	The main class for users to run, train, and test the part of speech tagger.
MaxentTaggerGUI	A very simple GUI for displaying the POS tagger tagging text.
OutputTags	An array of at most three tags, with their probabilities, that the taggers have output.
PairsHolder	A simple class that maintains a list of WordTag pairs which are interned as they are added.
ReadDataTagged	Reads tagged data from a file and creates a dictionary.
TaggerConfig	Reads and stores configuration information for a POS tagger.
TaggerExperiments	This class represents the training samples.
TaggerFeature	Holds a Tagger Feature for the loglinear model.
TaggerFeatures	This class contains POS-tagger-specific features.
TaggerOutputHolder
TaggerSaxInterface	Based on SAXInterface in TransformXML, but discards internal tags in fields to be tagged, tagging all content in the fields.
TemplateHash
TestClassifier	Tags data and can handle either data with gold-standard tags (computing performance statistics) or unlabeled data.
TestSentence
TTags	This class holds the POS tags, assigns them unique ids, and knows which tags are open versus closed class.

Enum Summary
TaggerConfig.Mode

Package edu.stanford.nlp.tagger.maxent Description

A Maximum Entropy Part-of-Speech Tagger. It can run either a Conditional Markov Model (CMM) tagger, also known as a Maximum Entropy Markov Model (MEMM) tagger, or a cyclic dependency network tagger.

If you are only interested in using one of the trained taggers included in the distribution, either from the command line or via the Java API, look at the documentation of the class MaxentTagger.

If you are interested in training a basic tagger from data using one of the two built-in architectures (CMM or bi-directional dependency network), also look at the documentation for MaxentTagger.

The rest of this document is for more complex situations where you want to define the features/architecture of your own tagger, which requires delving into the code.

The predefined features are for CMMs and bi-directional dependency networks. The local models are log-linear models using features specified via feature templates.

Kinds of Templates

There are two kinds of templates: ones for rare words, and ones for common words. For a context centered at a common word, only common word features are active. For a context centered at a rare word, both common and rare word features are active. Which words are considered common and which rare is determined by the parameter GlobalHolder.rareThreshold. For example, a threshold of five means that words occurring five times or fewer are considered rare.
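The rare/common split can be illustrated with a small standalone sketch. It is not code from the tagger package: the class RareWordSketch and its methods are hypothetical names, and the only thing taken from the text is the rule that a word occurring rareThreshold times or fewer counts as rare.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not part of edu.stanford.nlp.tagger.maxent. */
class RareWordSketch {

    /** Counts how often each word occurs in the training tokens. */
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** A word is rare if it occurs rareThreshold times or fewer (unseen words count as 0). */
    static boolean isRare(Map<String, Integer> counts, String word, int rareThreshold) {
        return counts.getOrDefault(word, 0) <= rareThreshold;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("the", "the", "the", "cat", "sat", "the", "the", "the");
        Map<String, Integer> counts = countWords(tokens);
        System.out.println(isRare(counts, "the", 5)); // false: "the" occurs 6 times
        System.out.println(isRare(counts, "cat", 5)); // true: "cat" occurs once
    }
}
```

With a threshold of five, a context centered at "cat" would activate both rare and common word features, while a context centered at "the" would activate only the common ones.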

The feature templates represent conditions on the words and tags surrounding the current position and also on the target tag. For example, <t_{-1},t_0> is a common word feature template. It is instantiated for various values of the previous and current tag. A feature formed by instantiating this template looks, for example, like <t_{-1}=DT,t_0=NN>, and has value 1 for a history h and tag t iff the condition is satisfied. Every feature template includes a specification of the current tag. It is currently not possible to include features that are true if the current tag is one of several possible tags, e.g. NN or NNS or NNP.
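To make the notion of an instantiated feature concrete, here is a minimal sketch of a binary indicator for the previous-tag/current-tag template. The class PrevTagFeatureSketch is a hypothetical stand-in written for this illustration; the package's own feature classes are organized differently.

```java
/** Illustrative stand-in for one instantiated feature of the previous-tag/current-tag template. */
class PrevTagFeatureSketch {
    private final String prevTag;     // required value of the previous tag
    private final String currentTag;  // required value of the current (target) tag

    PrevTagFeatureSketch(String prevTag, String currentTag) {
        this.prevTag = prevTag;
        this.currentTag = currentTag;
    }

    /** Returns 1 iff the history's previous tag and the candidate tag both match, else 0. */
    int value(String historyPrevTag, String candidateTag) {
        return prevTag.equals(historyPrevTag) && currentTag.equals(candidateTag) ? 1 : 0;
    }

    public static void main(String[] args) {
        PrevTagFeatureSketch feature = new PrevTagFeatureSketch("DT", "NN");
        System.out.println(feature.value("DT", "NN")); // 1
        System.out.println(feature.value("DT", "VB")); // 0
    }
}
```

Because each feature fixes exactly one current tag, a condition such as "NN or NNS or NNP" would need three separate features in this scheme.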

To reduce the number of features, the tagger applies cutoffs on the number of times a feature is active. The cutoff for common word features is GlobalHolder.threshold, and the cutoff for rare word features is GlobalHolder.thresholdRare. The thresholds work like this: the part of the feature that does not include the tag has to be active in the training set at least cutoff+1 times, and the complete feature has to be active at least once in order for the feature to be included in the model. (Note: the cutoff for the current word feature is set to 2 independent of the threshold settings; cf. method TaggerExperiments.populated.)
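The cutoff rule can be sketched as a pair of counters. This is an illustration of the rule as stated in the text, not the logic of TaggerExperiments; the class FeatureCutoffSketch, its methods, and the "|" key separator are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: applies the cutoff rule to (extracted value, tag) observations. */
class FeatureCutoffSketch {
    // How often each extracted value was seen, regardless of tag.
    private final Map<String, Integer> valueCounts = new HashMap<>();
    // How often each (extracted value, tag) combination was seen.
    private final Map<String, Integer> valueTagCounts = new HashMap<>();

    /** Records one training event: the extractor produced this value and the gold tag was tag. */
    void observe(String value, String tag) {
        valueCounts.merge(value, 1, Integer::sum);
        valueTagCounts.merge(value + "|" + tag, 1, Integer::sum);
    }

    /** Keeps a feature if its tag-free part was active at least cutoff+1 times and the full feature at least once. */
    boolean keep(String value, String tag, int cutoff) {
        int tagFreeCount = valueCounts.getOrDefault(value, 0);
        int fullCount = valueTagCounts.getOrDefault(value + "|" + tag, 0);
        return tagFreeCount >= cutoff + 1 && fullCount >= 1;
    }

    public static void main(String[] args) {
        FeatureCutoffSketch sketch = new FeatureCutoffSketch();
        sketch.observe("DT", "NN");
        sketch.observe("DT", "NN");
        sketch.observe("DT", "JJ");
        System.out.println(sketch.keep("DT", "NN", 2)); // true: value seen 3 times, pair seen twice
        System.out.println(sketch.keep("DT", "VB", 2)); // false: pair never seen
        System.out.println(sketch.keep("DT", "NN", 3)); // false: value seen only 3 times, 4 needed
    }
}
```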

Training a Tagger

To train a tagger, we need to specify the feature templates to be used, optionally change the count cutoffs and the default parameter estimation method, perhaps hand-specify closed class POS tags, and then train on tagged text.

Specifying Feature Templates

Feature templates inherit from the class Extractor. The main job of an Extractor is to extract the value it is interested in from a history. Each feature instantiated from a given template is true for a specific value extracted from a history and a specific target tag.

For example, this is a common word extractor that extracts the current and next word.

/**
 * This extractor extracts the current and the next word in conjunction.
 */
class ExtractorCWordNextWord extends Extractor {

    // Separator placed between the two extracted words.
    private static final String excl = "!";

    public ExtractorCWordNextWord() {}

    String extract(History h, PairsHolder pH) {
        // Offset 0 is the current word, offset 1 the next word;
        // false selects words rather than tags.
        return pH.get(h, 0, false) + excl + pH.get(h, 1, false);
    }

}

The method extract(History h) is defined in the base class Extractor as:

    String extract(History h) {
        return extract(h, GlobalHolder.pairs);
    }

A PairsHolder contains an array of words and their tags. It has a get method that can be used to extract things from the history. GlobalHolder.pairs stores the whole training data. The call String GlobalHolder.pairs.get(History h, int position, boolean isTag) returns the tag or the word (depending on isTag) at offset position relative to the history h.

Using this PairsHolder, we can extract features from the whole sentence including the current word. The History object is basically a specification of the start of the sentence, the current word, and the end of the sentence.
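The relative lookup described above can be pictured with the following standalone sketch. It is not the PairsHolder implementation: the class RelativeLookupSketch is hypothetical, the history is reduced to three plain indices (sentence start, sentence end, current position), and returning the string "NA" for positions outside the sentence is an assumption made for this example.

```java
/** Illustrative only: word/tag lookup at an offset relative to the current position. */
class RelativeLookupSketch {
    // Assumed placeholder for positions that fall outside the current sentence.
    static final String OUT_OF_SENTENCE = "NA";

    private final String[] words; // all training words, sentences concatenated
    private final String[] tags;  // the tag of each word, same indexing

    RelativeLookupSketch(String[] words, String[] tags) {
        this.words = words;
        this.tags = tags;
    }

    /**
     * Returns the word or tag at current + position, where start and end are the
     * inclusive bounds of the sentence and current is the index of the current word.
     */
    String get(int start, int end, int current, int position, boolean isTag) {
        int index = current + position;
        if (index < start || index > end) {
            return OUT_OF_SENTENCE;
        }
        return isTag ? tags[index] : words[index];
    }

    public static void main(String[] args) {
        String[] words = {"The", "cat", "sat", "Dogs", "bark"};
        String[] tags = {"DT", "NN", "VBD", "NNS", "VBP"};
        RelativeLookupSketch pairs = new RelativeLookupSketch(words, tags);
        // First sentence spans indices 0..2; the current word is "cat" (index 1).
        System.out.println(pairs.get(0, 2, 1, 1, false)); // sat
        System.out.println(pairs.get(0, 2, 1, -1, true)); // DT
        // From "sat" (index 2), the next position belongs to the following sentence.
        System.out.println(pairs.get(0, 2, 2, 1, false)); // NA
    }
}
```

Keeping the sentence bounds in the history is what lets an extractor look anywhere in the sentence without leaking context from neighboring sentences.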

In an extractor we can also specify for which tags to instantiate the template. The method boolean precondition(String tag) returns true by default, meaning that a feature can be created for every tag. Sometimes we would like to restrict that and say, for example, that features should be created only for the VB and VBP tags. In this case, the method precondition has to be overridden to return false for all other tags.
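A restricted precondition might look like the sketch below. To keep the example compilable on its own, it uses ExtractorStandIn, a hypothetical stand-in reduced to the one method discussed here; in the real package the subclass would extend Extractor instead, whose exact modifiers may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the package's Extractor, reduced to precondition. */
class ExtractorStandIn {
    /** By default a feature may be created for every tag. */
    boolean precondition(String tag) {
        return true;
    }
}

/** Illustrative extractor whose features are instantiated only for VB and VBP. */
class VerbOnlyExtractorSketch extends ExtractorStandIn {
    private static final Set<String> ALLOWED_TAGS =
        new HashSet<>(Arrays.asList("VB", "VBP"));

    @Override
    boolean precondition(String tag) {
        return ALLOWED_TAGS.contains(tag);
    }

    public static void main(String[] args) {
        ExtractorStandIn extractor = new VerbOnlyExtractorSketch();
        System.out.println(extractor.precondition("VB"));  // true
        System.out.println(extractor.precondition("NN"));  // false
    }
}
```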

The extractors for common word features have to be placed in the static array ExtractorFrames.eFrames. The present state of this array for the best tagger is:

public static Extractor[] eFrames = {cWord, prevWord, nextWord, prevTag, nextTag,
  prevTwoTags, nextTwoTags, prevNextTag, prevTagWord, nextTagWord, cWordPrevWord,
  cWordNextWord};

The extractors for rare word features commonly inherit from RareExtractor, which inherits from Extractor. RareExtractor provides static helper methods for manipulating strings, such as checking whether they contain numbers. The rare word extractors have to be placed in the static array ExtractorFramesRare.eFrames. For example, this is the current state of this array for the best tagger:

public static Extractor[] eFrames = {cWordUppCase, cWordNumber,
  cWordDash, cWordSuff1, cWordSuff2, cWordSuff3, cWordSuff4,
  cAllCap, cMidSentence, cWordStartUCase, cWordMidUCase,
  cWordPref1, cWordPref2, cWordPref3, cWordPref4,
  new ExtractorCWordPref(5), new ExtractorCWordPref(6),
  new ExtractorCWordPref(7), new ExtractorCWordPref(8),
  new ExtractorCWordPref(9), new ExtractorCWordPref(10),
  new ExtractorCWordSuff(5), new ExtractorCWordSuff(6),
  new ExtractorCWordSuff(7), new ExtractorCWordSuff(8),
  new ExtractorCWordSuff(9), new ExtractorCWordSuff(10),
  cLetterDigitDash, cCompany, cAllCapitalized, cUpperDigitDash};

Specifying closed-class POS tags

By default, all POS tags are assumed to be open classes. In many cases, it is useful to specify POS tags which are closed class, and can only be applied to words seen in the training data (rather than being possible tags for new words seen at runtime). These closed class tags are specified for a language (where a "language" is really a (language, tag set) pair: a different system of tagging is a new language). To do this, specify the language in the properties file, and specify the closed class tags for that language in TTags.java.
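The effect of closed class tags on unseen words can be sketched as a simple set difference. This is an illustration of the idea only and does not reproduce TTags; the class ClosedClassSketch, its method, and the sample tag sets are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: closed class tags are excluded as candidates for unseen words. */
class ClosedClassSketch {

    /** Returns the tags that may be assigned to a word never seen in training. */
    static Set<String> candidateTagsForUnknownWord(Set<String> allTags, Set<String> closedTags) {
        Set<String> open = new TreeSet<>(allTags); // sorted copy, leaves the input untouched
        open.removeAll(closedTags);
        return open;
    }

    public static void main(String[] args) {
        Set<String> allTags = new HashSet<>(Arrays.asList("NN", "VB", "DT", "IN"));
        Set<String> closedTags = new HashSet<>(Arrays.asList("DT", "IN"));
        System.out.println(candidateTagsForUnknownWord(allTags, closedTags)); // [NN, VB]
    }
}
```

With no closed class tags specified, every tag remains a candidate for a new word, which matches the default described above.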