POS tags
From BioIE Wiki
NOTE: The following examples have been excerpted directly from Santorini and MacIntyre Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing) or adapted from the Santorini guide in order to accurately reflect biomedical text used in the BioIE data extraction project.
$
Dollar sign
In the Penn Treebank, this tag is used for currency symbols. It doesn't occur in the Bio-Med corpus.
. and ,
Period and comma
Use these tags for periods and commas used as ordinary English punctuation. Also, the "." tag is used for question marks and exclamation points, in their normal prose use as sentence-ending punctuation.
A period can also appear as a SYM (usually a multiplication sign), as in the following example:
The latter portion of the above phrase would be tagged as follows:
The following uses of periods and commas should not be split out or tagged separately:
- A period used as a decimal point in a number
- A period used as part of a list item
- A comma appearing as part of a complex chemical word:
Leave them as part of the larger token and tag appropriately.
:
Colon
Use this tag for colons (:) and semicolons (;), which are used to separate one statement from another (the colon is also used to introduce lists), and for em-dashes (—, usually realized in ASCII as one or two hyphens), which are used to set off parenthetical comments or asides.
- Ratios are signified by a colon:
Separate the numbers and the : symbol as distinct tokens.
There may be uses of the colon or semicolon in biomedical or chemical notations which should not be given this tag — if we find any we'll mention them here.
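The ratio rule above can be sketched in a few lines of Python. This is only an illustration, not part of the BioIE tooling: the function name split_ratio is hypothetical, and the sketch assumes a ratio is written as whole numbers on both sides of the colon.

```python
import re

# Assumed ratio shape: digits, a colon, digits (for example "1:2" or "10:1").
RATIO = re.compile(r"(\d+):(\d+)")


def split_ratio(token):
    """Split a ratio token into number, colon, number; leave other tokens whole."""
    match = RATIO.fullmatch(token)
    if match is None:
        return [token]
    return [match.group(1), ":", match.group(2)]
```

Tokens that do not look like a ratio are returned unchanged, so other uses of the colon are left for the annotator to judge.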
``
See below.
''
Quotation marks
Note that the opening quotation mark tag consists of two grave accent ("backtick") characters, and the closing quotation mark tag consists of two single quote characters, not a double quote:
Use these tags for quotation marks, whether single or double.
Do not use these tags for English apostrophes. The correct handling of an apostrophe depends on its use:
- The possessive particle ('s or ') should be tagged as POS.
- Contractions such as can't and it's should be split and tagged as if they were spelled out.
- An apostrophe following a number or mathematical expression (as in e.g. 3' or 5') is a prime symbol and should be split off and tagged as a SYM.
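As a rough illustration of the prime-symbol rule in the list above, the following Python sketch splits an apostrophe off a plain number and tags it SYM. The function name split_prime is hypothetical, the sketch covers only digits followed by a straight apostrophe (not general mathematical expressions), and a tag of None means the decision is left to the annotator or to another rule.

```python
import re

# Assumed shape of a primed number: digits followed by one apostrophe (3', 5').
PRIME = re.compile(r"(\d+)'")


def split_prime(token):
    """Split a primed number into the number and a prime symbol tagged SYM."""
    match = PRIME.fullmatch(token)
    if match is None:
        # Not a primed number: return the token whole and untagged.
        return [(token, None)]
    return [(match.group(1), None), ("'", "SYM")]
```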
-LRB-
See below.
-RRB-
Left and right round brackets
Notwithstanding that the tags stand for "left round bracket" and "right round bracket", tag as -LRB- and -RRB- any paired symbols used as brackets: ( ) [ ] { } < >, regardless of shape, except (see below) where they are part of a construction that should be tagged as a whole rather than split.
When used as normal English parentheses, split them and tag them accordingly. For example:
- Cases where brackets should NOT be split out:
K(i)
EC(50)
IC(50)
Curly braces { } or angle brackets < > are seen infrequently in our abstracts. When not being used as angle brackets, the greater-than > and less-than < symbols should be tagged as a SYM.
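The bracket rule can be pictured as a simple lookup, sketched here in Python. The names BRACKET_TAGS and tag_bracket are hypothetical. The sketch assumes brackets have already been split into stand-alone tokens, and it cannot tell whether < and > are really being used as angle brackets, so that decision still needs context.

```python
# Every bracket shape maps to the same pair of tags, regardless of shape.
BRACKET_TAGS = {
    "(": "-LRB-", "[": "-LRB-", "{": "-LRB-", "<": "-LRB-",
    ")": "-RRB-", "]": "-RRB-", "}": "-RRB-", ">": "-RRB-",
}


def tag_bracket(token):
    """Return the bracket tag for a stand-alone bracket token, else None.

    Caveat: < and > are tagged as brackets here, but when they mean
    "less than" or "greater than" the guidelines call for SYM instead.
    """
    return BRACKET_TAGS.get(token)
```

Because only stand-alone tokens are looked up, a construction such as K(i) that is kept whole receives no bracket tag.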
All of these mean "gly->val", as do other forms with more or less spacing:
gly - > val
gly - > val
gly -> val
gly - gt val
gly - > val
In each case the entire string between "gly" and "val" (the hyphen and the misrepresentation of ">") should be made into a single token and tagged as a SYM.
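One possible way to find such arrows automatically is sketched below in Python. It is an assumption-laden illustration: the pattern covers only a hyphen followed by ">", "gt", or "&gt;" with optional spacing, and the function name tokenize_arrow is hypothetical.

```python
import re

# A hyphen, optional spaces, then ">" or a damaged form of it ("gt", "&gt", "&gt;").
ARROW = re.compile(r"-\s*(?:>|&gt;?|gt)")


def tokenize_arrow(text):
    """Split 'X <arrow> Y' into [X, arrow, Y], keeping the arrow as one token.

    Returns None when no arrow-like string is found.
    """
    match = ARROW.search(text)
    if match is None:
        return None
    left = text[:match.start()].strip()
    right = text[match.end():].strip()
    # The arrow token keeps its internal spacing; it would be tagged SYM.
    return [left, match.group(0), right]
```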
AFX
Affix
AFX is for English affixes, such as non-, anti-, and pro-, as well as components of (bio)chemical words, like azo-, azoxy-, and hydro-, and components of medical words. Do not split off a prefix when it is connected directly, without a hyphen (or, conceivably, other punctuation). Examples: nonfunctioning, antimalarial, precancerous, etc.
Here's a relevant question that illustrates this point (see e-mail dated 2004-04-10):
I'm not entirely sure how to tag spermato-. It isn't, strictly speaking, an affix, but means nothing when detached from -genesis.
AFX is the correct tag in this case. Spermato- is as much an affix as "hydro-". It means (originally) "seed" or (scientifically) "sperm" just as "hydro-" means water or hydrogen.
Also note that affixes don't necessarily have to be attached to a root or stem, as in this example:
In this case, "micro" should be tagged as an affix (AFX).
CC
Coordinating conjunction
This category includes the following:
- the mathematical operators plus, minus, less, times (in the sense of "multiplied by") and over (in the sense of "divided by"), when they are spelled out
- and, but, nor, or, yet (as in Yet it's cheap, cheap yet good)
- both, either and neither when they are the first members of the double conjunctions both . . . and, either . . . or and neither . . . nor. See CC or DT Guidelines.
Be aware that either or neither can sometimes function as determiners (DT) even in the presence of or or nor. See CC or DT Guidelines.
For in the sense of because is a coordinating conjunction (CC) rather than a subordinating conjunction (IN).
So in the sense of "so that," on the other hand, is a subordinating conjunction (IN).
CD
Cardinal number
Use this tag for expressions that represent specific numbers, including numerals, Roman numerals, and spelled-out number words.
Ordinal numbers, such as first, fifth, and 23rd, should be tagged JJ, not CD, when they refer to order ("the first, the second, the third"), but not when they refer to fractions ("one-third", "one-fifth").
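The CD/JJ split for numerals and ordinals can be sketched as follows in Python. This is a simplified illustration with hypothetical names (number_tag, ORDINAL_WORDS). The word list is deliberately short, Roman numerals and spelled-out cardinals are not covered, and the sketch cannot see context, so a word like "fifth" inside a fraction ("one fifth") still needs a human decision. A result of None means "undecided".

```python
import re

# Plain numerals, optionally with decimal points or commas: 23, 1.6, 1,000.
NUMERAL = re.compile(r"\d+(?:[.,]\d+)*")
# Numerals with an ordinal suffix: 1st, 2nd, 23rd, 4th.
ORDINAL_NUMERAL = re.compile(r"\d+(?:st|nd|rd|th)")
# Illustrative, incomplete list of spelled-out ordinals.
ORDINAL_WORDS = {"first", "second", "third", "fourth", "fifth"}


def number_tag(token):
    """Return JJ for ordinals, CD for plain numerals, None when undecided."""
    lowered = token.lower()
    if ORDINAL_NUMERAL.fullmatch(lowered) or lowered in ORDINAL_WORDS:
        return "JJ"
    if NUMERAL.fullmatch(token):
        return "CD"
    return None
```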
For more details, see here:
- Punctuation in Numbers
Scientific notation: 10(-9)M
Numerical range: 200-230
Number plus -fold: 1.6-fold
Plus and minus: +.05, -3, 48+, 61-
Numbers in fruit salad: 2,3,4- 8,9- and 14,15-hydroxylation
- Numbers Written with Letters
Spelled-Out Numbers: Fifteen samples, forty-five, one hundred twenty
Number Words: a dozen, a hundred, two hundred
"Half" and other fraction words
DT
Determiner
This category includes the articles a(n), every, no and the, the indefinite determiners another, any and some, each, either (as in either way), neither (as in neither decision), that, these, this and those, and instances of all and both when they do not precede a determiner or possessive pronoun (as in all roads or both times). Instances of all or both that do precede a determiner or possessive pronoun are tagged as predeterminers (PDT).
Formalin-fixed paraffin-embedded tissue blocks were selected from five cases each_DT of uterine and extrauterine leiomyomas (LM).
Since any noun phrase can contain at most one determiner, the fact that such can occur together with a determiner (as in the only such case) means that it should be tagged as an adjective (JJ), unless it precedes a determiner, as in such a problem, in which case it is a predeterminer (PDT).
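The DT/PDT decision for all and both depends only on the following word, which makes it easy to sketch in Python. The function name tag_all_or_both and the two word lists are illustrative assumptions. The sketch also ignores the CC reading of both in "both ... and", which has to be ruled out first.

```python
# Illustrative lists of words that mark a following determiner or possessive.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
POSSESSIVE_PRONOUNS = {"my", "your", "his", "her", "its", "our", "their"}


def tag_all_or_both(tokens, index):
    """Tag 'all' or 'both' at tokens[index] as PDT or DT from the next word."""
    if tokens[index].lower() not in {"all", "both"}:
        raise ValueError("expected 'all' or 'both' at the given index")
    following = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    if following in DETERMINERS or following in POSSESSIVE_PRONOUNS:
        return "PDT"  # precedes a determiner or possessive pronoun
    return "DT"
```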
EX
Existential there
Existential there is the unstressed there that triggers inversion of the inflected verb and the logical subject of a sentence.
For help distinguishing this use of there from its use as an adverb, see EX or RB Guidelines.
FW
Foreign word
The FW tag is used for foreign (non-English) words and phrases. In biomedical text, these are almost all Latin.
Abbreviations of Latin phrases, when printed without a space, will be considered a single FW, following Santorini p. 32: "Abbreviations and initials should be tagged as if they were spelled out." (See also here.) When printed with a space, split and tag both parts as FWs.
e.g._FW
NOTE: but e._FW g._FW (if with a space)
i.e._FW
NOTE: but i._FW e._FW (if with a space)
HYPH
Hyphen
Use this tag for the hyphen (-) character when it is used as a normal piece of English punctuation. In general, hyphenated compound words should be split into their components, but there are exceptions. In addition, this character has many other uses, and appears in many confusing or difficult-to-tag constructions. For full details, see here.
IN
Preposition or subordinating conjunction
We make no explicit distinction between prepositions and subordinating conjunctions. (The distinction is not lost, however: a preposition is an IN that precedes a noun phrase or a prepositional phrase, and a subordinating conjunction is an IN that precedes a clause.) Words like though, although, and whereas are subordinating conjunctions.
The preposition to has its own special tag TO and the phrase "due to" should be tagged as follows:
JJ
Adjective
The JJ tag is used in the following cases:
- To describe or modify a noun or pronoun:
- "-ic" Endings
- Ordinal Numbers
Ordinal numbers are tagged as adjectives in cases such as the following:
Note that in the case of "the fourth-largest case study..", "fourth-largest" would be tagged as RB.
JJR
Adjective, comparative
Adjectives with the comparative ending -er and a comparative meaning are tagged JJR. More and less when used as adjectives, as in more or less mail, are also tagged as JJR. More and less can also be tagged as JJR when they occur by themselves.
Adjectives with a comparative meaning but without the comparative ending -er, like superior, should simply be tagged as JJ. Adjectives with the ending -er but without a strictly comparative meaning ("more X"), like further in further details, should also simply be tagged as JJ.
Adjectives (and adverbs — see RBR) are grammatically comparative exactly when they can be followed by than without changing their meaning.
Mutation rate of APC gene in adenoma and carcinoma was higher_JJR than ACF.
JJS
Adjective, superlative
Adjectives with the superlative ending -est (as well as worst) are tagged as JJS. Most and least when used as adjectives, as in the most or the least mail, are also tagged as JJS. Most and least can also be tagged as JJS when they occur by themselves.
Adjectives with a superlative meaning but without the superlative ending -est, like first, last or unsurpassed, should simply be tagged as JJ.
LS
List item
This category includes letters and numerals when they are used to identify items in a list. Any punctuation associated with a list item marker should be included in the same token.
MD
Modal verb
This category includes all verbs that don't take an -s ending in the third person singular present: can, could, (dare), may, might, must, ought, shall, should, will, would.
NN
Noun, singular common
In regular English grammar, a noun refers to "an entity, quality, state, action, or concept" (Merriam-Webster). Since annotators rarely have the domain knowledge required to fully understand the meaning of the biomedical files, we use NN as a default tag for many unfamiliar terms.
- Hyphenated numbers (not referring to a range) should be tagged NN because they refer to a chemical name.
- Digit/letter combinations should all be tagged NN.
- Gerunds
Watch out for gerunds (-ing forms of verbs), as they can be either NN or VBG. See NN or VBG Guidelines. If a collocation "X-ing N" is not equivalent (or similar) in meaning to "N X-es", then the word is a noun (NN). In such cases, the collocation can often be paraphrased in terms of an infinitive or a more clearly nominal construction.
the mating_NN season (the season for mating, not: the season that is mating)
Sequencing_NN of the remaining_VBG GSK-3beta allele in these cases failed to identify any mutations.
NOTE: "Sequencing" is used nominally; "the allele that remains" is plausible.
NNP
Noun, singular proper
There are few proper nouns in these files, mostly names of individual persons or organizations. Proper names of persons, organizations, places, and species names should be tagged as NNP.
Conversely, trademarked drugs or other substances that are capitalized are tagged as NN. This rule also includes gene names and symbols, abbreviations for diseases, and other pieces of biochemical jargon.
- Individuals
- Organizations
- Places
- Species names
A Linnaean name follows the format Genus species. The genus part of the name is always capitalized in this format, the species part never, as in "Homo sapiens". They should both be tagged NNP, capitalized or not.
NNPS
Noun, plural proper
Use this tag for nouns that are both proper (see above) and plural (see below).
NNS
Noun, plural common
For more information, see NN or NNS Guidelines.
PDT
Predeterminer
This category includes determiner-like elements such as all, nary, both, quite, half, rather, many, and such when they precede an article or possessive pronoun. The following constructions are commonly seen in biomedical text:
both_PDT the girls
quite_PDT a mess
half_PDT the time
rather_PDT a nuisance
such_PDT an issue
POS
Possessive
The possessive ending on nouns ending in 's or ' is split off by the tagging algorithm and tagged as if it were a separate word.
the parents' distress
the parents_NNS '_POS distress
Alzheimer's disease
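The splitting step described above might look roughly like the following Python sketch. It is not the project's actual tagging algorithm: the function name split_possessive is hypothetical, and the exception list is illustrative and incomplete, since contractions (it's, and others like it) and one's must not be treated as possessives. A tag of None means the token is left for other rules.

```python
# Forms the guidelines treat differently: one's stays whole (PRP$), and
# contractions such as it's are split and tagged as if spelled out.
NOT_POSSESSIVE = {"one's", "it's"}


def split_possessive(token):
    """Split a possessive ending ('s or ') off a noun and tag it POS."""
    if token.lower() in NOT_POSSESSIVE:
        return [(token, None)]
    if len(token) > 2 and token.endswith("'s"):
        return [(token[:-2], None), ("'s", "POS")]
    if len(token) > 2 and token.endswith("s'"):
        return [(token[:-1], None), ("'", "POS")]
    return [(token, None)]
```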
PRP
Personal pronoun
This category includes the personal pronouns proper, without regard for case distinctions (I, me, you, he, him, etc.), the reflexive pronouns ending in -self or -selves, and the nominal possessive pronouns mine, yours, his, hers, ours and theirs. The adjectival possessive forms my, your, his, her, its, our and their, on the other hand, are tagged PRP$.
PRP$
Personal pronoun, possessive
This category includes the adjectival possessive forms my, your, his, her, its, one's,* our and their. The nominal possessive pronouns mine, yours, his, hers, ours and theirs are tagged as personal pronouns (PRP).
* Do not separate the 's of one's.
RB
Adverb
This category includes most words that end in -ly as well as degree words like quite, too and very, posthead modifiers like enough and indeed (as in good enough, very well indeed), and negative markers like not, n't and never.
SAs rarely_RB show MSI or any evidence of chromosomal-scale genetic instability.
The sample numbers are too_RB small for definite conclusions.
Changes in known genes do not_RB account for the growth of the majority of SAs.
RBR
Adverb, comparative
This tag refers to an adverb that expresses that one item possesses a given quality to a greater or lesser degree in comparison to another item. More and less can be used as comparative adverbs. For example,
K-ras mutations may be less_RBR common in SAs than in classical adenomas.
Adverbs with the comparative ending -er but without a strictly comparative meaning, like later in "The doctors will stop by later", should simply be tagged as an adverb (RB). This also goes for further, as there isn't anything easily defined that further is more of.
RBS
Adverb, superlative
The RBS tag refers to an adverb that expresses that an item has a given quality to the greatest or least degree among the items being compared.
The therapy we have pioneered is the most_RBS effective of its kind.
RP
Particle
This category includes a number of mostly monosyllabic words that also double as directional adverbs and prepositions. Consult the IN or RB Guidelines, IN or RP Guidelines, and RB or RP Guidelines for further information.
SYM
Symbol
This tag should be used for mathematical, scientific and technical symbols or expressions that aren't words of English. It should not be used for any and all technical expressions. For instance, the names of chemicals, units of measurement (including abbreviations thereof) and letters of our ("Roman" or "Latin") alphabet, such as A B C ..., are tagged as nouns (NN).
Names of Greek letters are symbols (SYM). Here's the complete set:
alpha iota rho beta kappa sigma gamma lambda tau delta mu upsilon epsilon nu phi zeta xi chi eta omicron psi theta pi omega
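Since the set of Greek letter names is closed, a lookup is enough to spot them. The Python sketch below uses the list given above; the names GREEK_LETTER_NAMES and greek_tag are hypothetical, and the sketch only handles a letter name standing alone as a token, not one embedded in a larger word.

```python
# The 24 Greek letter names listed in the guidelines.
GREEK_LETTER_NAMES = frozenset(
    "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu "
    "nu xi omicron pi rho sigma tau upsilon phi chi psi omega".split()
)


def greek_tag(token):
    """Return SYM if the token is the name of a Greek letter, else None."""
    if token.lower() in GREEK_LETTER_NAMES:
        return "SYM"
    return None
```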
Certain cases of punctuation are also considered to be symbols (SYM), such as:
- The plus sign "+" and minus sign "-" when indicating positive or negative numbers
- The plus-or-minus sign ±, which in our texts generally appears as +/-
- English punctuation used in the name of a mutation
- (R) standing for the registered trademark symbol ®. Similarly © and ™, if they ever appear.
- Arrows
NOTE: In this case, whatever occurs between "gly" and "val" is tagged as a symbol (SYM), including the white spaces. This rule also applies to cases that include "greater than" and "less than" symbols.
TO
To
The word to is tagged TO, regardless of whether it is a preposition or an infinitival marker.
token
Untagged token
This is the tag placed by the tokenizer when it breaks the text up into tokens. If a POS annotator sees it, it means that a token wasn't tagged by the automatic POS tagger, or was split and left untagged by another annotator earlier in the annotation process. This tag is never correct — it should not appear on text that is to be annotated.
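A quick check for leftover token tags can be sketched in Python, assuming the text is available in the word_TAG notation used for the examples on this page, with items separated by spaces. That format, and the function names parse_tagged and find_untagged, are assumptions made for illustration; the actual annotation files may be stored differently.

```python
def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    The tag is whatever follows the last underscore; items without an
    underscore get a tag of None.
    """
    pairs = []
    for item in text.split():
        word, separator, tag = item.rpartition("_")
        if separator:
            pairs.append((word, tag))
        else:
            pairs.append((item, None))
    return pairs


def find_untagged(pairs):
    """Return the words that still carry the placeholder tag 'token'."""
    return [word for word, tag in pairs if tag == "token"]
```

For example, running find_untagged over the parsed form of "The_DT doctor_token was_VBD present_JJ ._." reports "doctor" as still needing a tag.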
UH
Interjection
In the Penn Treebank, this tag is used for exclamations and interjections. This doesn't occur in the Bio-Med corpus.
VB
Verb, base form
This tag subsumes imperatives, infinitives, and subjunctives.
- Imperative
- Infinitive
We want them to do_VB it.
We made them do_VB it.
The aim of this study was to elucidate_VB pathways of carcinogenesis.
- Subjunctive
VBD
Verb, past tense
This tag includes all past tense verbs, including the conditional form of the verb to be.
The doctor was_VBD present.
VBG
Verb, present participle
VBG refers to the form of a verb that ends in -ing, also called a gerund. VBGs can be used in their strictly verbal form, as an adjective, or as a noun.
- Used verbally:
- Used as a noun:
If a collocation "X-ing N" is not equivalent (or similar) in meaning to "N X-es", then the word is a noun (NN). In such cases, the collocation can often be paraphrased in terms of an infinitive or a more clearly nominal construction.
(Meaning: reductions in spending, NOT reductions that are spending)
the mating_NN season
- Used as an adjective: See the JJ or VBG/VBN Guidelines.
NOTE: "the allele that remains" is plausible.
VBN
Verb, past participle
This tag refers to the form of a verb that ends in -ed (for regular verbs) or in alternate forms (for irregular verbs) and describes nouns that are the object of the action of the verb.
The point mutation of GGT to GAT in codon 12 was frequently observed_VBN.
There are instances of confusion between JJ and VBN (see JJ or VBG/VBN Guidelines). In the case of bio-medical text, unless it is being used explicitly as an adjective (see JJ), use VBN.
VBP
Verb, present tense, not 3rd person singular
The VBP tag refers to a present tense verb, not 3rd person singular. Take care to correct VB to VBP where appropriate.
VBZ
Verb, present tense, 3rd person singular
The VBZ tag refers to a present tense verb, 3rd person singular.
It has_VBZ been proposed that microsatellite instability (MSI) distinguishes_VBZ SAs from classical adenomas.
WDT
Wh-determiner
This tag includes which, as well as that when it is used as a relative pronoun. For example:
These data suggest that another pathway of colorectal carcinogenesis that_WDT does not involve Ki-ras point mutation might exist.
WP
Wh-pronoun
This tag includes what, who and whom.
WP$
Wh-pronoun, possessive
This category includes the wh-word whose.
WRB
Wh-adverb
This category includes how, where, why, etc.
When in a temporal sense is tagged as a Wh-adverb (WRB). In the sense of if, on the other hand, it is a subordinating conjunction (IN).