AbstractTreebankLanguagePack (Stanford JavaNLP API)

edu.stanford.nlp.trees
Class AbstractTreebankLanguagePack

java.lang.Object
  edu.stanford.nlp.trees.AbstractTreebankLanguagePack

All Implemented Interfaces:: TreebankLanguagePack, Serializable

Direct Known Subclasses:: ArabicTreebankLanguagePack, ChineseTreebankLanguagePack, NegraPennLanguagePack, PennTreebankLanguagePack, TueBaDZLanguagePack

public abstract class AbstractTreebankLanguagePack
extends Object
implements TreebankLanguagePack

This provides an implementation of parts of the TreebankLanguagePack API to reduce the load on fresh implementations. Only the abstract methods below need to be implemented to give a reasonable solution for a new language.

Author:: Christopher Manning
See Also:: Serialized Form
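
The division of labour described above can be sketched roughly as follows. This is a self-contained stand-in rather than the Stanford classes: `MiniLanguagePack` and `ToyEnglishPack` are hypothetical names, the tag and word lists are illustrative only, and the lookups are assumed to be plain array scans.

```java
import java.util.Arrays;

// Hypothetical stand-in for the pattern: a subclass supplies the data,
// and the abstract base class derives the predicates from it.
abstract class MiniLanguagePack {
    // The only methods a new language has to provide in this sketch.
    abstract String[] punctuationTags();
    abstract String[] punctuationWords();
    abstract String[] sentenceFinalPunctuationTags();
    abstract String[] startSymbols();

    boolean isPunctuationTag(String s) { return contains(punctuationTags(), s); }
    boolean isPunctuationWord(String s) { return contains(punctuationWords(), s); }
    boolean isSentenceFinalPunctuationTag(String s) { return contains(sentenceFinalPunctuationTags(), s); }
    boolean isStartSymbol(String s) { return contains(startSymbols(), s); }

    // First start symbol, or null if none is defined.
    String startSymbol() {
        String[] symbols = startSymbols();
        return (symbols == null || symbols.length == 0) ? null : symbols[0];
    }

    private static boolean contains(String[] items, String s) {
        return items != null && Arrays.asList(items).contains(s);
    }
}

// Illustrative values only; not the contents of any real language pack.
class ToyEnglishPack extends MiniLanguagePack {
    String[] punctuationTags() { return new String[] {".", ",", ":", "``", "''"}; }
    String[] punctuationWords() { return new String[] {".", ",", ";", "?", "!"}; }
    String[] sentenceFinalPunctuationTags() { return new String[] {"."}; }
    String[] startSymbols() { return new String[] {"ROOT"}; }
}

class MiniLanguagePackDemo {
    public static void main(String[] args) {
        MiniLanguagePack tlp = new ToyEnglishPack();
        System.out.println(tlp.isPunctuationTag(","));  // true
        System.out.println(tlp.isPunctuationTag("NN")); // false
        System.out.println(tlp.startSymbol());          // ROOT
    }
}
```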

Field Summary
`static String`	`DEFAULT_ENCODING` Use this as the default encoding for Readers and Writers of Treebank data.
`protected static char`	`DEFAULT_GF_CHAR`
`protected char`	`gfCharacter` Default character for indicating that something is a grammatical function; language-specific packs should probably override it.

Constructor Summary
`AbstractTreebankLanguagePack()` Gives a handle to the TreebankLanguagePack.
`AbstractTreebankLanguagePack(char gfChar)` Gives a handle to the TreebankLanguagePack.

Method Summary
`String`	`basicCategory(String category)` Returns the basic syntactic category of a String.
`String`	`categoryAndFunction(String category)` Returns the syntactic category and 'function' of a String.
`Filter<String>`	`evalBIgnoredPunctuationTagAcceptFilter()` Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
`Filter<String>`	`evalBIgnoredPunctuationTagRejectFilter()` Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation.
`String[]`	`evalBIgnoredPunctuationTags()` Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language.
`Function<String,String>`	`getBasicCategoryFunction()` Returns a `Function` object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory() method.
`Function<String,String>`	`getCategoryAndFunctionFunction()` Returns a `Function` object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction() method.
`String`	`getEncoding()` Return the input Charset encoding for the Treebank.
`char`	`getGfCharacter()`
`TokenizerFactory<? extends HasWord>`	`getTokenizerFactory()` Return a tokenizer which might be suitable for tokenizing text that will be used with this Treebank/Language pair, without tokenizing carriage returns (i.e., treating them as white space).
`GrammaticalStructureFactory`	`grammaticalStructureFactory()` Return a GrammaticalStructureFactory suitable for this language/treebank.
`GrammaticalStructureFactory`	`grammaticalStructureFactory(Filter<String> puncFilt)` Return a GrammaticalStructureFactory suitable for this language/treebank.
`boolean`	`isEvalBIgnoredPunctuationTag(String str)` Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
`boolean`	`isLabelAnnotationIntroducingCharacter(char ch)` Say whether this character is an annotation introducing character.
`boolean`	`isPunctuationTag(String str)` Accepts a String that is a punctuation tag name, and rejects everything else.
`boolean`	`isPunctuationWord(String str)` Accepts a String that is a punctuation word, and rejects everything else.
`boolean`	`isSentenceFinalPunctuationTag(String str)` Accepts a String that is a sentence end punctuation tag, and rejects everything else.
`boolean`	`isStartSymbol(String str)` Accepts a String that is a start symbol of the treebank.
`char[]`	`labelAnnotationIntroducingCharacters()` Return an array of characters at which a String should be truncated to give the basic syntactic category of a label.
`Filter<String>`	`punctuationTagAcceptFilter()` Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.
`Filter<String>`	`punctuationTagRejectFilter()` Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.
`abstract String[]`	`punctuationTags()` Returns a String array of punctuation tags for this treebank/language.
`Filter<String>`	`punctuationWordAcceptFilter()` Returns a filter that accepts a String that is a punctuation word, and rejects everything else.
`Filter<String>`	`punctuationWordRejectFilter()` Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation.
`abstract String[]`	`punctuationWords()` Returns a String array of punctuation words for this treebank/language.
`Filter<String>`	`sentenceFinalPunctuationTagAcceptFilter()` Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.
`abstract String[]`	`sentenceFinalPunctuationTags()` Returns a String array of sentence final punctuation tags for this treebank/language.
`void`	`setGfCharacter(char gfCharacter)` Sets the grammatical function indicating character to gfCharacter.
`String`	`startSymbol()` Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.
`Filter<String>`	`startSymbolAcceptFilter()` Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.
`abstract String[]`	`startSymbols()` Returns a String array of treebank start symbols.
`String`	`stripGF(String category)` Returns the category for a String with everything following the gf character (which may be language specific) stripped.
`TreeReaderFactory`	`treeReaderFactory()` Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.
`TokenizerFactory<Tree>`	`treeTokenizerFactory()` Return a TokenizerFactory for Trees of this language/treebank.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface edu.stanford.nlp.trees.TreebankLanguagePack
`headFinder, sentenceFinalPunctuationWords, treebankFileExtension`

Field Detail

gfCharacter

protected char gfCharacter

Default character for indicating that something is a grammatical function; language-specific packs should probably override it.

DEFAULT_GF_CHAR

protected static final char DEFAULT_GF_CHAR

See Also:: Constant Field Values

DEFAULT_ENCODING

public static final String DEFAULT_ENCODING

Use this as the default encoding for Readers and Writers of Treebank data.

See Also:: Constant Field Values

Constructor Detail

AbstractTreebankLanguagePack

public AbstractTreebankLanguagePack()

Gives a handle to the TreebankLanguagePack.

AbstractTreebankLanguagePack

public AbstractTreebankLanguagePack(char gfChar)

Gives a handle to the TreebankLanguagePack.

Parameters:: gfChar - The character that sets off grammatical functions in node labels.

Method Detail

punctuationTags

public abstract String[] punctuationTags()

Returns a String array of punctuation tags for this treebank/language.

Specified by:: punctuationTags in interface TreebankLanguagePack

Returns:: The punctuation tags

punctuationWords

public abstract String[] punctuationWords()

Returns a String array of punctuation words for this treebank/language.

Specified by:: punctuationWords in interface TreebankLanguagePack

Returns:: The punctuation words

sentenceFinalPunctuationTags

public abstract String[] sentenceFinalPunctuationTags()

Returns a String array of sentence final punctuation tags for this treebank/language.

Specified by:: sentenceFinalPunctuationTags in interface TreebankLanguagePack

Returns:: The sentence final punctuation tags

evalBIgnoredPunctuationTags

public String[] evalBIgnoredPunctuationTags()

Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:: evalBIgnoredPunctuationTags in interface TreebankLanguagePack

Returns:: The EVALB-ignored punctuation tags

isPunctuationTag

public boolean isPunctuationTag(String str)

Accepts a String that is a punctuation tag name, and rejects everything else.

Specified by:: isPunctuationTag in interface TreebankLanguagePack

Parameters:: str - The string to check
Returns:: Whether this is a punctuation tag

isPunctuationWord

public boolean isPunctuationWord(String str)

Accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:: isPunctuationWord in interface TreebankLanguagePack

Parameters:: str - The string to check
Returns:: Whether this is a punctuation word

isSentenceFinalPunctuationTag

public boolean isSentenceFinalPunctuationTag(String str)

Accepts a String that is a sentence end punctuation tag, and rejects everything else.

Specified by:: isSentenceFinalPunctuationTag in interface TreebankLanguagePack

Parameters:: str - The string to check
Returns:: Whether this is a sentence final punctuation tag

isEvalBIgnoredPunctuationTag

public boolean isEvalBIgnoredPunctuationTag(String str)

Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:: isEvalBIgnoredPunctuationTag in interface TreebankLanguagePack

Parameters:: str - The string to check
Returns:: Whether this is an EVALB-ignored punctuation tag
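
As a rough illustration of how such a check might be used before scoring, the sketch below drops ignored tags from a tag sequence. The class name `EvalbIgnoreSketch` and the helper `scoredTags` are hypothetical, and the tag set is only an illustrative Penn-style guess based on the description above (quotes, period, comma, colon, but no brackets).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EvalbIgnoreSketch {
    // Illustrative tags: quotes, period, comma, colon. Bracket tags are deliberately absent.
    static final Set<String> IGNORED = new HashSet<>(Arrays.asList("``", "''", ".", ",", ":"));

    static boolean isEvalBIgnoredPunctuationTag(String tag) {
        return IGNORED.contains(tag);
    }

    // Keep every tag that is not ignored, so bracket tags still count.
    static List<String> scoredTags(List<String> tags) {
        List<String> kept = new ArrayList<>();
        for (String tag : tags) {
            if (!isEvalBIgnoredPunctuationTag(tag)) {
                kept.add(tag);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("DT", "NN", "-LRB-", "NN", "-RRB-", "VBZ", ".");
        System.out.println(scoredTags(tags)); // [DT, NN, -LRB-, NN, -RRB-, VBZ]
    }
}
```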

punctuationTagAcceptFilter

public Filter<String> punctuationTagAcceptFilter()

Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.

Specified by:: punctuationTagAcceptFilter in interface TreebankLanguagePack

Returns:: The filter

punctuationTagRejectFilter

public Filter<String> punctuationTagRejectFilter()

Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.

Specified by:: punctuationTagRejectFilter in interface TreebankLanguagePack

Returns:: The filter
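
The accept and reject filters are complements of each other. A minimal sketch of that relationship, using `java.util.function.Predicate` as a stand-in for `Filter<String>` and an illustrative tag set (the class name `TagFilterSketch` is hypothetical):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

class TagFilterSketch {
    // Illustrative punctuation tags, not taken from any particular treebank.
    static final Set<String> PUNCT_TAGS = new HashSet<>(Arrays.asList(".", ",", ":"));

    // Accepts punctuation tags, rejects everything else.
    static Predicate<String> punctuationTagAcceptFilter() {
        return PUNCT_TAGS::contains;
    }

    // Rejects punctuation tags, accepts everything else.
    static Predicate<String> punctuationTagRejectFilter() {
        return punctuationTagAcceptFilter().negate();
    }

    public static void main(String[] args) {
        System.out.println(punctuationTagAcceptFilter().test(",")); // true
        System.out.println(punctuationTagRejectFilter().test(",")); // false
        System.out.println(punctuationTagRejectFilter().test("NN")); // true
    }
}
```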

punctuationWordAcceptFilter

public Filter<String> punctuationWordAcceptFilter()

Returns a filter that accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:: punctuationWordAcceptFilter in interface TreebankLanguagePack

Returns:: The Filter

punctuationWordRejectFilter

public Filter<String> punctuationWordRejectFilter()

Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:: punctuationWordRejectFilter in interface TreebankLanguagePack

Returns:: The Filter

sentenceFinalPunctuationTagAcceptFilter

public Filter<String> sentenceFinalPunctuationTagAcceptFilter()

Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.

Specified by:: sentenceFinalPunctuationTagAcceptFilter in interface TreebankLanguagePack

Returns:: The Filter

evalBIgnoredPunctuationTagAcceptFilter

public Filter<String> evalBIgnoredPunctuationTagAcceptFilter()

Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:: evalBIgnoredPunctuationTagAcceptFilter in interface TreebankLanguagePack

Returns:: The Filter

evalBIgnoredPunctuationTagRejectFilter

public Filter<String> evalBIgnoredPunctuationTagRejectFilter()

Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:: evalBIgnoredPunctuationTagRejectFilter in interface TreebankLanguagePack

Returns:: The Filter

getEncoding

public String getEncoding()

Return the input Charset encoding for the Treebank. See documentation for the Charset class.

Specified by:: getEncoding in interface TreebankLanguagePack

Returns:: Name of Charset
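
A caller would typically turn the returned name into a `Charset` when opening a Reader. The sketch below shows one way to do that with standard JDK classes; the class name `EncodingSketch` is hypothetical, and "UTF-8" is only an assumed example value, not necessarily what `getEncoding()` returns for a given pack.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Assumed encoding name for illustration only.
    static final String ENCODING_NAME = "UTF-8";

    // Read the first line of treebank data using the named Charset.
    static String firstLine(InputStream in, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "(ROOT (NP (NN caf\u00e9)))\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(new ByteArrayInputStream(bytes), ENCODING_NAME));
    }
}
```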

labelAnnotationIntroducingCharacters

public char[] labelAnnotationIntroducingCharacters()

Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. The idea here is that Penn Treebank style labels follow a syntactic category with various functional and cross-referencing information introduced by special characters (such as "NP-SBJ=1"). This would be truncated to "NP" by the array containing '-' and '='.

Specified by:: labelAnnotationIntroducingCharacters in interface TreebankLanguagePack

Returns:: An array of characters that set off label name suffixes

basicCategory

public String basicCategory(String category)

Returns the basic syntactic category of a String. This implementation truncates the label at the first occurrence of one of the labelAnnotationIntroducingCharacters(). Two special cases handle labelAnnotationIntroducingCharacters that appear in category labels themselves: (i) if the first char is in this set, it is never truncated (e.g., '-' or '=' as a token), and (ii) if the label starts with one of this set, a second instance of the same character is also kept rather than treated as a truncation point (to deal with '-LCB-', '-RCB-', etc.).

Specified by:: basicCategory in interface TreebankLanguagePack

Parameters:: category - The whole String name of the label
Returns:: The basic category of the String
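
The truncation rule and its two special cases can be sketched as below. This is one reading of the description, not the library's source: the class name `BasicCategorySketch` is hypothetical, and the introducing characters '-' and '=' are assumed from the "NP-SBJ=1" example.

```java
class BasicCategorySketch {
    // Assumed annotation-introducing characters.
    static final char[] INTRODUCERS = {'-', '='};

    static boolean isIntroducer(char ch) {
        for (char c : INTRODUCERS) {
            if (c == ch) return true;
        }
        return false;
    }

    static String basicCategory(String category) {
        if (category == null) return null;
        boolean openAtZero = false; // label started with an introducing character
        char first = '\0';
        int i = 0;
        for (; i < category.length(); i++) {
            char ch = category.charAt(i);
            if (!isIntroducer(ch)) continue;
            if (i == 0) {
                // Case (i): never truncate at the first character.
                openAtZero = true;
                first = ch;
            } else if (openAtZero && ch == first) {
                // Case (ii): the matching second instance, as in -RCB-, is kept.
                openAtZero = false;
            } else {
                // Ordinary case: truncate here.
                break;
            }
        }
        return category.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(basicCategory("NP-SBJ=1")); // NP
        System.out.println(basicCategory("-RCB-"));    // -RCB-
        System.out.println(basicCategory("-"));        // -
    }
}
```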

stripGF

public String stripGF(String category)

Description copied from interface: TreebankLanguagePack

Returns the category for a String with everything following the gf character (which may be language specific) stripped.

Specified by:: stripGF in interface TreebankLanguagePack

Parameters:: category - The String name of the label (may previously have had basic category called on it)
Returns:: The String stripped of grammatical functions
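
A minimal sketch of the stripping step, under stated assumptions: the class name `StripGfSketch` is hypothetical, '-' is used as an example gf character, and the cut is made at the first occurrence after position 0. Which occurrence the library cuts at is not stated here, so labels with several gf characters may behave differently in practice.

```java
class StripGfSketch {
    // Remove the gf character and everything after it.
    static String stripGF(String category, char gfCharacter) {
        if (category == null) return null;
        // Search from index 1 so a label that starts with the character is not emptied.
        int index = category.indexOf(gfCharacter, 1);
        return index > 0 ? category.substring(0, index) : category;
    }

    public static void main(String[] args) {
        System.out.println(stripGF("NP-SBJ", '-')); // NP
        System.out.println(stripGF("VP", '-'));     // VP
    }
}
```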

getBasicCategoryFunction

public Function<String,String> getBasicCategoryFunction()

Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory() method.

Specified by:: getBasicCategoryFunction in interface TreebankLanguagePack

Returns:: The String->String Function object
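
Wrapping the method as a function object makes it easy to apply to a whole collection of labels. The sketch below uses `java.util.function.Function` as a stand-in for the library's `Function` type and a deliberately simplified `basicCategory` (cut at the first '-' or '=' after position 0); the class name `CategoryFunctionSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class CategoryFunctionSketch {
    // Simplified stand-in for basicCategory(): cut at the first '-' or '=' after position 0.
    static String basicCategory(String label) {
        for (int i = 1; i < label.length(); i++) {
            char ch = label.charAt(i);
            if (ch == '-' || ch == '=') {
                return label.substring(0, i);
            }
        }
        return label;
    }

    // Expose the method as a String -> String function object.
    static Function<String, String> getBasicCategoryFunction() {
        return CategoryFunctionSketch::basicCategory;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("NP-SBJ", "VP", "PP-LOC=2");
        List<String> basic = labels.stream()
                .map(getBasicCategoryFunction())
                .collect(Collectors.toList());
        System.out.println(basic); // [NP, VP, PP]
    }
}
```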

categoryAndFunction

public String categoryAndFunction(String category)

Returns the syntactic category and 'function' of a String. This normally involves truncating numerical coindexation showing coreference, etc. By 'function', this means keeping, say, Penn Treebank functional tags or ICE phrasal functions, perhaps returning them as category-function.

This implementation strips numeric tags after label introducing characters (assuming that non-numeric things are functional tags).

Specified by:: categoryAndFunction in interface TreebankLanguagePack

Parameters:: category - The whole String name of the label
Returns:: A String giving the category and function
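
One way to read "strips numeric tags after label introducing characters" is sketched below. It is an approximation under stated assumptions: the class name `CategoryAndFunctionSketch` is hypothetical, the introducing characters are assumed to be '-' and '=', and only trailing numeric tags are removed (a numeric tag in the middle of a label is left alone).

```java
class CategoryAndFunctionSketch {
    // Assumed annotation-introducing characters.
    static final char[] INTRODUCERS = {'-', '='};

    static boolean isIntroducer(char ch) {
        for (char c : INTRODUCERS) {
            if (c == ch) return true;
        }
        return false;
    }

    // Repeatedly strip a trailing "<introducer><digits>" suffix; keep non-numeric tags.
    static String categoryAndFunction(String category) {
        if (category == null) return null;
        String result = category;
        while (true) {
            int end = result.length();
            int i = end;
            while (i > 0 && Character.isDigit(result.charAt(i - 1))) {
                i--;
            }
            // Stop if there are no trailing digits, no room for a category before
            // the tag, or the digits are not preceded by an introducing character.
            if (i == end || i <= 1 || !isIntroducer(result.charAt(i - 1))) {
                return result;
            }
            result = result.substring(0, i - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(categoryAndFunction("NP-SBJ=1")); // NP-SBJ
        System.out.println(categoryAndFunction("NP-1=2"));   // NP
        System.out.println(categoryAndFunction("NP-SBJ"));   // NP-SBJ
    }
}
```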

getCategoryAndFunctionFunction

public Function<String,String> getCategoryAndFunctionFunction()

Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction() method.

Specified by:: getCategoryAndFunctionFunction in interface TreebankLanguagePack

Returns:: The String->String Function object

isLabelAnnotationIntroducingCharacter

public boolean isLabelAnnotationIntroducingCharacter(char ch)

Say whether this character is an annotation introducing character.

Specified by:: isLabelAnnotationIntroducingCharacter in interface TreebankLanguagePack

Parameters:: ch - The character to check
Returns:: Whether it is an annotation introducing character

isStartSymbol

public boolean isStartSymbol(String str)

Accepts a String that is a start symbol of the treebank.

Specified by:: isStartSymbol in interface TreebankLanguagePack

Parameters:: str - The str to test
Returns:: Whether this is a start symbol

startSymbolAcceptFilter

public Filter<String> startSymbolAcceptFilter()

Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.

Specified by:: startSymbolAcceptFilter in interface TreebankLanguagePack

Returns:: The filter

startSymbols

public abstract String[] startSymbols()

Returns a String array of treebank start symbols.

Specified by:: startSymbols in interface TreebankLanguagePack

Returns:: The start symbols

startSymbol

public String startSymbol()

Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.

Specified by:: startSymbol in interface TreebankLanguagePack

Returns:: The start symbol

getTokenizerFactory

public TokenizerFactory<? extends HasWord> getTokenizerFactory()

Return a tokenizer which might be suitable for tokenizing text that will be used with this Treebank/Language pair, without tokenizing carriage returns (i.e., treating them as white space). The implementation in AbstractTreebankLanguagePack returns a factory for WhitespaceTokenizer.

Specified by:: getTokenizerFactory in interface TreebankLanguagePack

Returns:: A tokenizer factory
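
The effect of treating carriage returns and newlines as ordinary white space can be approximated with a plain whitespace split. This sketch does not use WhitespaceTokenizer itself; the class name `WhitespaceSplitSketch` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceSplitSketch {
    // Split on runs of white space, so line breaks never become tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The cat\r\nsat .")); // [The, cat, sat, .]
    }
}
```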

grammaticalStructureFactory

public GrammaticalStructureFactory grammaticalStructureFactory()

Return a GrammaticalStructureFactory suitable for this language/treebank. (To be overridden in subclasses.)

Specified by:: grammaticalStructureFactory in interface TreebankLanguagePack

Returns:: A GrammaticalStructureFactory suitable for this language/treebank

grammaticalStructureFactory

public GrammaticalStructureFactory grammaticalStructureFactory(Filter<String> puncFilt)

Return a GrammaticalStructureFactory suitable for this language/treebank. (To be overridden in subclasses.)

Specified by:: grammaticalStructureFactory in interface TreebankLanguagePack

Parameters:: puncFilt - A filter which should reject punctuation words (as Strings)
Returns:: A GrammaticalStructureFactory suitable for this language/treebank

getGfCharacter

public char getGfCharacter()

Returns the grammatical function indicating character.

setGfCharacter

public void setGfCharacter(char gfCharacter)

Description copied from interface: TreebankLanguagePack

Sets the grammatical function indicating character to gfCharacter.

Specified by:: setGfCharacter in interface TreebankLanguagePack

Parameters:: gfCharacter - Sets the character in label names that sets off grammatical function marking (from the phrase label).

treeReaderFactory

public TreeReaderFactory treeReaderFactory()

Returns a TreeReaderFactory suitable for general purpose use with this language/treebank.

Specified by:: treeReaderFactory in interface TreebankLanguagePack

Returns:: A TreeReaderFactory suitable for general purpose use with this language/treebank.

treeTokenizerFactory

public TokenizerFactory<Tree> treeTokenizerFactory()

Return a TokenizerFactory for Trees of this language/treebank.

Specified by:: treeTokenizerFactory in interface TreebankLanguagePack

Returns:: A TokenizerFactory for Trees of this language/treebank.

Stanford NLP Group