POS tags

From BioIE Wiki



NOTE: The following examples have been excerpted directly from Santorini and MacIntyre Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing) or adapted from the Santorini guide in order to accurately reflect biomedical text used in the BioIE data extraction project.


$

. and ,

:

`` / ''

-LRB- / -RRB-

AFX

CC

CD

DT

EX

FW

HYPH

IN

JJ

JJR

JJS

LS

MD

NN

NNP

NNPS

NNS

PDT

POS

PRP

PRP$

RB

RBR

RBS

RP

SYM

TO

token

UH

VB

VBD

VBG

VBN

VBP

VBZ

WDT

WP

WP$

WRB

$

Dollar sign

In the Penn Treebank, this tag is used for currency symbols. It doesn't occur in the Bio-Med corpus.

. and ,

Period and comma

Use these tags for periods and commas used as ordinary English punctuation. Also, the "." tag is used for question marks and exclamation points, in their normal prose use as sentence-ending punctuation.

A period can also appear as a SYM (usually a multiplication sign), as in the following example:

apparent V(max) = 3460 +/- 3190 pmol. mg(-1)

The latter portion of the above phrase would be tagged as follows:

pmol_NN ._SYM mg(-1)_NN
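The token_TAG notation used in the examples on this page can be read mechanically. The following is a minimal Python sketch, assuming that tokens are separated by whitespace and that the tag follows the last underscore. The function name parse_tagged is hypothetical and is not part of any BioIE tool.

```python
def parse_tagged(text):
    """Split 'token_TAG token_TAG ...' into (token, tag) pairs.

    Assumption: the tag follows the LAST underscore of each
    whitespace-separated item.
    """
    pairs = []
    for item in text.split():
        token, sep, tag = item.rpartition("_")
        if not sep:
            # No underscore at all: report the item as untagged.
            token, tag = item, None
        pairs.append((token, tag))
    return pairs


pairs = parse_tagged("pmol_NN ._SYM mg(-1)_NN")
```

Because it splits on whitespace, this sketch does not handle tokens that contain spaces, which some SYM tokens on this page do.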

The following uses of periods and commas should not be split out or tagged separately:

  • A period used as a decimal point in a number
3.14159
  • A period that is part of a list item marker
1. item 2. item
  • Commas inside a chemical name
1,2,4-trimethyl cyclohexane

Leave them as part of the larger token and tag appropriately.
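As a rough illustration, the three examples above (3.14159, the list marker 1., and 1,2,4-trimethyl) could be recognized with regular expressions. This is only a sketch: the pattern names and the function keep_whole are hypothetical, and the patterns cover just the shapes shown here.

```python
import re

# Illustrative patterns only; each mirrors one of the examples above.
DECIMAL = re.compile(r"^\d*\.\d+$")           # 3.14159
LIST_MARKER = re.compile(r"^\d+\.$")          # 1.
LOCANTS = re.compile(r"^\d+(?:,\d+)+-\S+$")   # 1,2,4-trimethyl


def keep_whole(token):
    """Return True if the period or comma should stay inside the token."""
    return any(p.match(token) for p in (DECIMAL, LIST_MARKER, LOCANTS))
```

A token such as "1." at the end of a sentence is ambiguous between a list marker and a number followed by a period, so an annotator still has to decide from context.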

 :

Colon

Use this tag for colons (:) and semicolons (;), which are used to separate one statement from another (the colon is also used to introduce lists), and for em-dashes (usually realized in ASCII as one or two hyphens), which are used to set off parenthetical comments or asides.

  • Ratios are signified by a colon:

Separate the numbers and the : symbol as distinct tokens.

1_CD :_: 200_CD (when meaning a "1 to 200" ratio)
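The ratio rule can be sketched in Python as follows. The function name tag_ratio is hypothetical, and the pattern assumes that both sides of the ratio are plain integers.

```python
import re

# Assumption: a ratio is written as digits, a colon, digits, with no spaces.
RATIO = re.compile(r"^(\d+):(\d+)$")


def tag_ratio(token):
    """Split 'N:M' into three tagged tokens, or return None if it does not fit."""
    m = RATIO.match(token)
    if m is None:
        return None
    left, right = m.groups()
    return [(left, "CD"), (":", ":"), (right, "CD")]
```

The pattern cannot tell a ratio from other uses of the same shape, so it applies only when the expression really means "1 to 200".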

There may be uses of the colon or semicolon in biomedical or chemical notations which should not be given this tag — if we find any we'll mention them here.

``

See below.

''

Quotation marks

Note that the opening quotation mark tag consists of two grave accent ("backtick") characters, and the closing quotation mark tag consists of two single quote characters, not a double quote:

"this" is `` DT '' , not `` DT ".

Use these tags for quotation marks, whether single or double.

So is 'this'.

Do not use these tags for English apostrophes. The correct handling of an apostrophe depends on its use:

  • The possessive particle ('s or ') should be tagged as POS.
  • Contractions such as can't and it's should be split and tagged as if they were spelled out.
  • An apostrophe following a number or mathematical expression (as in e.g. 3' or 5') is a prime symbol and should be split off and tagged as a SYM.

-LRB-

See below.

-RRB-

Left and right round brackets

Notwithstanding that the tags stand for "left round bracket" and "right round bracket", tag as -LRB- and -RRB- any paired symbols used as brackets: ( ) [ ] { } < >, regardless of shape, except (see below) where they are part of a construction that should be tagged as a whole rather than split.

When used as normal English parentheses, split them and tag them accordingly. For example:

cholangiocellular carcinomas (_-LRB- CCCs_NNS )_-RRB-
  • Cases where brackets should NOT be split out:
V(max)

K(i)

EC(50)

IC(50)
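The bracket rule and its exceptions could be sketched as below. The names BRACKET_TAGS, KEEP_WHOLE and split_brackets are hypothetical, and KEEP_WHOLE contains only the four examples listed above, so it is an illustration rather than a complete exception list.

```python
import re

# Every paired bracket shape maps to the same two tags.
BRACKET_TAGS = {
    "(": "-LRB-", "[": "-LRB-", "{": "-LRB-", "<": "-LRB-",
    ")": "-RRB-", "]": "-RRB-", "}": "-RRB-", ">": "-RRB-",
}

# Only the four examples given above; a real list would be longer.
KEEP_WHOLE = {"V(max)", "K(i)", "EC(50)", "IC(50)"}


def split_brackets(word):
    """Split brackets off a word and tag them; other pieces stay untagged."""
    if word in KEEP_WHOLE:
        return [(word, None)]
    pieces = [p for p in re.split(r"([()\[\]{}<>])", word) if p]
    return [(p, BRACKET_TAGS.get(p)) for p in pieces]
```

This sketch treats every < and > as a bracket. It does not detect their use as less-than and greater-than signs, which calls for SYM instead.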

Curly braces { } or angle brackets < > are seen infrequently in our abstracts. When not being used as angle brackets, the greater-than > and less-than < symbols should be tagged as a SYM.

All of these mean "gly->val", as do other forms with more or less spacing:

gly - greater than val

gly - > val

gly - > val

gly -> val

gly - gt val

gly - > val

In each case the entire string between "gly" and "val" -- the hyphen and the misrepresentation of ">" -- should be made into a single token and tagged as a SYM.

gly_NN - greater than_SYM val_NN
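The merging step can be sketched in Python. The pattern below is an assumption covering only the spellings shown above (">", "gt" and "greater than" after a hyphen), and split_arrow is a hypothetical name.

```python
import re

# A hyphen, optional whitespace, then one of the listed spellings of ">".
ARROW = re.compile(r"-\s*(?:>|gt\b|greater than)")


def split_arrow(text):
    """Return (left, arrow, right), where arrow is the single SYM token."""
    m = ARROW.search(text)
    if m is None:
        return None
    return (text[:m.start()].strip(), m.group(0), text[m.end():].strip())
```

The middle element keeps its internal spaces, matching the rule that the whole string between "gly" and "val" becomes one token.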

AFX

Affix

AFX is for English affixes, such as non-, anti-, pro-, as well as components of (bio)chemical words, like azo-, azoxy-, and hydro-, and components of medical words. Do not split off a prefix when it is connected directly, without a hyphen (or conceivable other punctuation). Examples: nonfunctioning, antimalarial, precancerous, etc.

Here's a relevant question that illustrates this point (see e-mail dated 2004-04-10):

I'm not entirely sure how to tag spermato-. It isn't, strictly speaking, an affix, but means nothing when detached from -genesis.
19OHT induces a marked impairment of spermato- and spermiogenesis.

AFX is the correct tag in this case. Spermato- is as much an affix as "hydro-". It means (originally) "seed" or (scientifically) "sperm" just as "hydro-" means water or hydrogen.

Also note that affixes don't necessarily have to be attached to a root or stem, as in this example:

When 8-hydroxyefavirenz (2.5 micro M) was used as a substrate, 8,14-dihydroxyefavirenz was formed

In this case, "micro" should be tagged as an affix (AFX).

CC

Coordinating conjunction

This category includes the following:

  • the mathematical operators plus, minus, less, times (in the sense of "multiplied by") and over (in the sense of "divided by"), when they are spelled out
  • and, but, nor, or, yet (as in Yet it's cheap, cheap yet good)
  • both, either and neither when they are the first members of the double conjunctions both . . . and, either . . . or and neither . . . nor. See CC or DT Guidelines.
We found neither_CC mutations nor_CC complete loss of expression of the BTRC gene in our melanoma series.

Be aware that either or neither can sometimes function as determiners (DT) even in the presence of or or nor. See CC or DT Guidelines.

For in the sense of because is a coordinating conjunction (CC) rather than a subordinating conjunction (IN).

He asked to be transferred, for_CC he was unhappy.

So in the sense of "so that," on the other hand, is a subordinating conjunction (IN).

CD

Cardinal number

Use this tag for expressions that represent specific numbers, including numerals, Roman numerals, and spelled-out number words.

Ordinal numbers, such as first, fifth, and 23rd, should be tagged JJ, not CD, when they refer to order ("the first, the second, the third"), but not when they refer to fractions ("one-third", "one-fifth").
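For numerals written with digits, the CD/JJ split can be approximated with two patterns. This is a sketch only: number_tag is a hypothetical name, the patterns cover digit forms such as 3.14159, 186,000 and 23rd, and spelled-out numbers, Roman numerals and fractions are deliberately left out.

```python
import re

# Plain or comma-grouped digits with an optional decimal part, or a bare decimal.
CARDINAL = re.compile(r"^(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?$|^\.\d+$")
# Digits followed by an ordinal suffix.
ORDINAL = re.compile(r"^\d+(?:st|nd|rd|th)$")


def number_tag(token):
    """Return 'CD' for digit cardinals, 'JJ' for digit ordinals, else None."""
    if CARDINAL.match(token):
        return "CD"
    if ORDINAL.match(token):
        return "JJ"
    return None
```

Returning None for everything else leaves the harder cases to the annotator instead of guessing.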

For more details, see here:

  • Punctuation in Numbers
Decimal point or comma: 3.14159, 186,000

Scientific notation: 10(-9)M

Numerical range: 200-230

Number plus -fold: 1.6-fold

Plus and minus: +.05, -3, 48+, 61-

Numbers in fruit salad: 2,3,4- 8,9- and 14,15-hydroxylation
  • Numbers Written with Letters
Roman Numerals: Hepa I, subgroup II

Spelled-Out Numbers: Fifteen samples, forty-five, one hundred twenty

Number Words: a dozen, a hundred, two hundred

"Half" and other fraction words

DT

Determiner

This category includes the articles a(n), every, no and the, the indefinite determiners another, any and some, each, either (as in either way), neither (as in neither decision), that, these, this and those, and instances of all and both when they do not precede a determiner or possessive pronoun (as in all roads or both times). Instances of all or both that do precede a determiner or possessive pronoun are tagged as predeterminers (PDT).

We assessed serial sections of each_DT tumor.

Formalin-fixed paraffin-embedded tissue blocks were selected from five cases each_DT of uterine and extrauterine leiomyomas (LM).

Since any noun phrase can contain at most one determiner, the fact that such can occur together with a determiner (as in the only such case) means that it should be tagged as an adjective (JJ), unless it precedes a determiner, as in such a problem, in which case it is a predeterminer (PDT).

EX

Existential there

Existential there is the unstressed there that triggers inversion of the inflected verb and the logical subject of a sentence.

There_EX was a medical emergency.

For help distinguishing this use of there from its use as an adverb, see EX or RB Guidelines.

FW

Foreign word

The FW tag is used for foreign (non-English) words and phrases. In biomedical text, these are almost all Latin.

in_FW vivo_FW

in_FW vitro_FW

NOTE: "in" in these phrases is considered FW, not IN: in_FW vivo_FW

corpora_FW lutea_FW

a_FW priori_FW

et_FW cetera_FW

Abbreviations of Latin phrases, when printed without a space, will be considered a single FW, following Santorini p. 32: "Abbreviations and initials should be tagged as if they were spelled out." (See also here.) When printed with a space, split and tag both parts as FWs.

etc._FW

e.g._FW

NOTE: but e._FW g._FW (if with a space)

i.e._FW

NOTE: but i._FW e._FW (if with a space)
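The with-space and without-space cases can be sketched as follows. The set LATIN_ABBREVIATIONS holds only the three abbreviations shown above, and the function name is hypothetical.

```python
# Only the abbreviations used as examples on this page.
LATIN_ABBREVIATIONS = {"etc.", "e.g.", "i.e."}


def tag_latin_abbreviation(text):
    """Tag a Latin abbreviation as one FW, or as several FWs if spaced."""
    if text in LATIN_ABBREVIATIONS:
        return [(text, "FW")]
    parts = text.split()
    if "".join(parts) in LATIN_ABBREVIATIONS:
        # Printed with a space: each part becomes its own FW token.
        return [(part, "FW") for part in parts]
    return None
```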

HYPH

Hyphen

Use this tag for the hyphen (-) character when it is used as a normal piece of English punctuation. In general, hyphenated compound words should be split into their components, but there are exceptions. In addition, this character has many other uses, and appears in many confusing or difficult-to-tag constructions. For full details, see here.

IN

Preposition or subordinating conjunction

We make no explicit distinction between prepositions and subordinating conjunctions. (The distinction is not lost, however: a preposition is an IN that precedes a noun phrase or a prepositional phrase, and a subordinating conjunction is an IN that precedes a clause.) Words like though, although, and whereas are subordinating conjunctions.

The preposition to has its own special tag TO and the phrase "due to" should be tagged as follows:

due_IN to_TO

JJ

Adjective

The JJ tag is used in the following cases:

  • To describe or modify a noun or pronoun:
Nuclear_JJ accumulation of beta-catenin was present_JJ in the stromal_JJ tumor cells in most cases but not in normal_JJ stroma or mammary_JJ epithelial_JJ cells.
  • "-ic" Endings
N-(3,5-dichlorophenyl)-2-hydroxysuccinamic_JJ acid_NN
  • Ordinal Numbers

Ordinal numbers are tagged as adjectives, as follows:

Our fourth_JJ case study refutes prior findings.

Note that in the case of "the fourth-largest case study...", "fourth-largest" would be tagged as RB.

JJR

Adjective, comparative

Adjectives with the comparative ending -er and a comparative meaning are tagged JJR. More and less when used as adjectives, as in more or less mail, are also tagged as JJR. More and less can also be tagged as JJR when they occur by themselves.

Adjectives with a comparative meaning but without the comparative ending -er, like superior, should simply be tagged as JJ. Adjectives with the ending -er but without a strictly comparative meaning ("more X"), like further in further details, should also simply be tagged as JJ.

Adjectives (and adverbs — see RBR) are grammatically comparative exactly when they can be followed by than without changing their meaning.

Prevalences of K-ras mutation in hyperplasia and carcinoma with AJPBD were greater_JJR than those without AJPBD.
Mutation rate of APC gene in adenoma and carcinoma was higher_JJR than ACF.

JJS

Adjective, superlative

Adjectives with the superlative ending -est (as well as worst) are tagged as JJS. Most and least when used as adjectives, as in the most or the least mail, are also tagged as JJS. Most and least can also be tagged as JJS when they occur by themselves.

Nuclear accumulation of beta-catenin was present in the stromal tumor cells in most_JJS cases.

Adjectives with a superlative meaning but without the superlative ending -est, like first, last or unsurpassed, should simply be tagged as JJ.

LS

List item

This category includes letters and numerals when they are used to identify items in a list. Any punctuation associated with a list item marker should be included in the same token.

1.

1._LS


(b)

(b)_LS

MD

Modal verb

This category includes all verbs that don't take an -s ending in the third person singular present: can, could, (dare), may, might, must, ought, shall, should, will, would.

NN

Noun, singular common

In regular English grammar, a noun refers to "an entity, quality, state, action, or concept" (Merriam-Webster). Since annotators rarely have the domain knowledge required to fully understand the meaning of the biomedical files, we use NN as a default tag for many unfamiliar terms.

  • Hyphenated numbers (not referring to a range) should be tagged NN because they refer to a chemical name.
Mab 1-68-11

Mab_NN 1-68-11_NN
  • Digit/letter combinations should all be tagged NN.
P-450 2D

P-450_NN 2D_NN


subclass 5a

subclass_NN 5a_NN


karyotype 45,XY,-7

karyotype_NN 45,XY,-7_NN
  • Gerunds

Watch out for gerunds (-ing forms of verbs), as they can be either NN or VBG. See NN or VBG Guidelines. If a collocation "X-ing N" is not equivalent (or similar) in meaning to "N X-es", then the word is a noun (NN). In such cases, the collocation can often be paraphrased in terms of an infinitive or a more clearly nominal construction.

spending_NN reductions (reductions in spending, not: reductions that are spending)

the mating_NN season (the season for mating, not: the season that is mating)

Sequencing_NN of the remaining_VBG GSK-3beta allele in these cases failed to identify any mutations.

NOTE: "Sequencing" is used nominally; "the allele that remains" is plausible.

NNP

Noun, singular proper

There are few proper nouns in these files, mostly names of individual persons or organizations. Proper names of persons, organizations, places, and species names should be tagged as NNP.

Conversely, trademarked drugs or other substances that are capitalized are tagged as NN. This rule also includes gene names and symbols, abbreviations for diseases, and other pieces of biochemical jargon.

  • Individuals
as Jones explained in his report
as Jones_NNP explained in his report
  • Organizations
Scheie Eye Institute
Scheie_NNP Eye_NNP Institute_NNP
  • Places
Philadelphia
Philadelphia_NNP
  • Species names

A Linnaean name follows the format Genus species. The genus part of the name is always capitalized in this format, the species part never, as in "Homo sapiens". They should both be tagged NNP, capitalized or not.

Drosophila melanogaster

Drosophila_NNP melanogaster_NNP

E. coli

E._NNP coli_NNP

NNPS

Noun, plural proper

Use this tag for nouns that are both proper (see above) and plural (see below).

NNS

Noun, plural common

For more information, see NN or NNS Guidelines.

PDT

Predeterminer

This category includes determiner-like elements such as all, nary, both, quite, half, rather, many, and such when they precede an article or possessive pronoun. The following constructions are commonly seen in biomedical text:

All_PDT the samples

both_PDT the girls

quite_PDT a mess

half_PDT the time

rather_PDT a nuisance

such_PDT an issue

POS

Possessive

The possessive ending on nouns ending in 's or ' is split off by the tagging algorithm and tagged as if it were a separate word.

John's symptoms

John_NNP 's_POS symptoms


the parents' distress

the parents_NNS '_POS distress


Alzheimer's disease

Alzheimer_NNP 's_POS disease
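The splitting step can be sketched in Python as below. The function split_possessive is hypothetical and is not the project's tagging algorithm. It tags only the possessive ending and leaves the noun untagged (None). The special case for one's follows the PRP$ rule on this page.

```python
def split_possessive(word):
    """Split a possessive ending off a noun and tag the ending as POS."""
    if word.lower() == "one's":
        # Per the PRP$ rule, the 's of one's is not separated.
        return [(word, "PRP$")]
    if word.endswith("'s"):
        return [(word[:-2], None), ("'s", "POS")]
    if word.endswith("s'"):
        return [(word[:-1], None), ("'", "POS")]
    return [(word, None)]
```

The form alone cannot distinguish a possessive from a contraction such as it's, so contractions have to be identified before a rule like this is applied.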

PRP

Personal pronoun

This category includes the personal pronouns proper, without regard for case distinctions (I, me, you, he, him, etc.), the reflexive pronouns ending in -self or -selves, and the nominal possessive pronouns mine, yours, his, hers, ours and theirs. The adjectival possessive forms my, your, his, her, its, our and their, on the other hand, are tagged PRP$.

PRP$

Personal pronoun, possessive

This category includes the adjectival possessive forms my, your, his, her, its, one's,* our and their. The nominal possessive pronouns mine, yours, his, hers, ours and theirs are tagged as personal pronouns (PRP).

* Do not separate the 's of one's.

RB

Adverb

This category includes most words that end in -ly as well as degree words like quite, too and very, posthead modifiers like enough and indeed (as in good enough, very well indeed), and negative markers like not, n't and never.

No mutation in the mildly_RB dysplastic duct epithelium.

SAs rarely_RB show MSI or any evidence of chromosomal-scale genetic instability.

The sample numbers are too_RB small for definite conclusions.

Changes in known genes do not_RB account for the growth of the majority of SAs.

RBR

Adverb, comparative

This tag refers to an adverb that expresses whether or not one item possesses a predetermined quality in a greater or lesser degree in comparison to another item. More and less can be used as comparative adverbs. For example:

P53 mutations and mdm2 amplifications appear to be more_RBR frequent in EULMS.
K-ras mutations may be less_RBR common in SAs than in classical adenomas.

Adverbs with the comparative ending -er but without a strictly comparative meaning, like later in "The doctors will stop by later", should simply be tagged as an adverb (RB). This also goes for further, as there isn't anything easily defined that further is more of.

It was further_RB discovered that the patient was infected.

RBS

Adverb, superlative

The RBS tag refers to an adverb that expresses that an item has a quality to the greatest or least degree, as most does in the following examples:

Most_RBS surprisingly, the patient made a quick recovery.
The therapy we have pioneered is the most_RBS effective of its kind.

RP

Particle

This category includes a number of mostly monosyllabic words that also double as directional adverbs and prepositions. Consult the IN or RB Guidelines, IN or RP Guidelines, and RB or RP Guidelines for further information.

SYM

Symbol

This tag should be used for mathematical, scientific and technical symbols or expressions that aren't words of English. It should not be used for any and all technical expressions. For instance, the names of chemicals, units of measurement (including abbreviations thereof) and letters of our ("Roman" or "Latin") alphabet, such as A B C ..., are tagged as nouns (NN).

Names of Greek letters are symbols (SYM). Here's the complete set:

   alpha     iota      rho
   beta      kappa     sigma
   gamma     lambda    tau
   delta     mu        upsilon
   epsilon   nu        phi
   zeta      xi        chi
   eta       omicron   psi
   theta     pi        omega
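The contrast between Greek letter names (SYM) and Roman letters (NN) could be expressed as a simple lookup. This is a sketch: letter_tag is a hypothetical name, and matching the Greek names case-insensitively is an assumption not stated in the guideline.

```python
GREEK_LETTER_NAMES = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}


def letter_tag(token):
    """Return 'SYM' for a Greek letter name, 'NN' for a single Roman letter."""
    if token.lower() in GREEK_LETTER_NAMES:  # assumption: case-insensitive
        return "SYM"
    if len(token) == 1 and token.isascii() and token.isalpha():
        return "NN"
    return None
```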

Certain cases of punctuation are also considered to be symbols (SYM), such as:

  • The plus sign "+" and minus sign "-" when indicating positive or negative numbers
+48

+_SYM 48_CD


-14

-_SYM 14_CD
  • The plus-or-minus sign ±, which in our texts generally appears as +/-
5.83 +/- 0.15
5.83_CD +/-_SYM 0.15_CD
  • English punctuation used in the name of a mutation
Wingless/Wnt
Wingless_NNP /_SYM Wnt_NNP
  • (R) standing for the registered trademark symbol ®. Similarly © and ™, if they ever appear.
Excedrin_NN (R)_SYM
  • Arrows
gly -> val

gly_NN ->_SYM val_NN

NOTE: In this case, whatever occurs between "gly" and "val" is tagged as a symbol (SYM), including the white spaces. This rule also applies to cases that include "greater than" and "less than" symbols.
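The leading-sign case from the list above (+48, -14) can be sketched with one regular expression. The function name tag_signed_number is hypothetical, and the pattern handles only a sign directly in front of a digit numeral.

```python
import re

# A single leading + or -, then an integer or decimal numeral.
SIGNED = re.compile(r"^([+-])(\d+(?:\.\d+)?|\.\d+)$")


def tag_signed_number(token):
    """Split '+48' or '-14' into a SYM sign and a CD number."""
    m = SIGNED.match(token)
    if m is None:
        return None
    sign, number = m.groups()
    return [(sign, "SYM"), (number, "CD")]
```

A hyphen between two numbers, as in a range, does not match this pattern and falls under other rules.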

TO

To

The word to is tagged TO, regardless of whether it is a preposition or an infinitival marker.

token

Untagged token

This is the tag placed by the tokenizer when it breaks the text up into tokens. If a POS annotator sees it, it means that a token wasn't tagged by the automatic POS tagger, or was split and left untagged by another annotator earlier in the annotation process. This tag is never correct — it should not appear on text that is to be annotated.
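A check for leftover placeholder tags might look like the following sketch. It assumes the annotated text is available as a list of (token, tag) pairs, which is an assumption about the data format, and find_untagged is a hypothetical name.

```python
def find_untagged(pairs):
    """Return the positions of tokens that still carry the 'token' tag."""
    return [i for i, (_, tag) in enumerate(pairs) if tag == "token"]


# Hypothetical data for illustration only.
example = [("in", "FW"), ("vivo", "token"), ("studies", "NNS")]
leftover = find_untagged(example)
```

Any position returned marks a token that still needs a real tag.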

UH

Interjection

In the Penn Treebank, this tag is used for exclamations and interjections. This doesn't occur in the Bio-Med corpus.

VB

Verb, base form

This tag subsumes imperatives, infinitives, and subjunctives.

  • Imperative
Do_VB it.
  • Infinitive
You should do_VB it.

We want them to do_VB it.

We made them do_VB it.

The aim of this study was to elucidate_VB pathways of carcinogenesis.
  • Subjunctive
We suggested that he do_VB it.

VBD

Verb, past tense

This tag includes all past tense verbs, including the conditional form of the verb to be.

If you were_VBD to undergo this therapy, there may be various side effects.

The doctor was_VBD present.

We analyzed_VBD rates of Ki-ras codon 12 mutations.

VBG

Verb, present participle

VBG refers to the form of a verb that ends in -ing, also called a gerund. VBGs can be used in their strictly verbal form, as an adjective, or as a noun.

  • Used verbally:
the N- and Ki-ras as well as the p53 involvement was investigated by exploring_VBG their structure.
  • Used as a noun:

If a collocation "X-ing N" is not equivalent (or similar) in meaning to "N X-es", then the word is a noun (NN). In such cases, the collocation can often be paraphrased in terms of an infinitive or a more clearly nominal construction.

spending_NN reductions

Meaning: reductions in spending (NOT reductions that are spending)

the mating_NN season

Meaning: the season for mating (NOT the season that is mating)
Sequencing_NN of the remaining_VBG GSK-3beta allele in these cases failed to identify any mutations.
NOTE: "the allele that remains" is plausible.

VBN

Verb, past participle

This tag refers to the form of a verb that ends in -ed (for regular verbs) or in alternate forms (for irregular verbs) and describes nouns that are the object of the action of the verb.

Direct sequencing was performed_VBN to detect mutations in codon 12 or 13 of K-ras.
The point mutation of GGT to GAT in codon 12 was frequently observed_VBN.

There are instances of confusion between JJ and VBN (see JJ or VBG/VBN Guidelines). In biomedical text, use VBN unless the word is being used explicitly as an adjective (see JJ).

Besides MPP(+), only the 2[N]-methylated_VBN compounds 2[N]-methyl-IQ(+), 2[N]-methyl-norsalsolinol and 2[N]-methyl-salsolinol showed enhanced_VBN cytotoxicity.

VBP

Verb, present tense, not 3rd person singular

The VBP tag refers to a present tense verb, not 3rd person singular. Take care to correct VB to VBP where appropriate.

You have_VBP the mumps.
They look_VBP sick.

VBZ

Verb, present tense, 3rd person singular

The VBZ tag refers to a present tense verb, 3rd person singular.

Aberrant activation of the Wnt signaling pathway has_VBZ been reported.
It has_VBZ been proposed that microsatellite instability (MSI) distinguishes_VBZ SAs from classical adenomas.

WDT

Wh-determiner

This tag includes which, as well as that when it is used as a relative pronoun. For example:

Among the carcinomas with Ki-ras point mutation in which_WDT both adenomatous and carcinomatous tissue were examined...

These data suggest that another pathway of colorectal carcinogenesis that_WDT does not involve Ki-ras point mutation might exist.

WP

Wh-pronoun

This tag includes what, who and whom.

WP$

Wh-pronoun, possessive

This category includes the wh-word whose.

WRB

Wh-adverb

This category includes how, where, why, etc.

When in a temporal sense is tagged as a Wh-adverb (WRB). In the sense of if, on the other hand, it is a subordinating conjunction (IN).

When_WRB the obstetrician arrived, an emergency c-section was performed.
The condition occurs in children, particularly when_IN there is coexistent developmental delay.

