Abstract: As we continue to be flooded with increasing amounts of text data, we have a growing need for tools that will not only allow us to retrieve documents, but also mine the structured data buried inside their natural language text. Such structured representations then enable automated methods to model outliers, predict the future, and provide decision support. For example, U.S. ports would be safer if automated processes could detect suspicious patterns in the shipping manifests of cargo container ships; a system could suggest which DNA array experiments may be most fruitful by mining facts from biology research articles; we could better predict long-term weather trends by building a weather database from large collections of thousand-year-old Chinese diary entries. Information extraction is the process of filling a structured database from unstructured text. It is a difficult statistical and computational problem, often involving hundreds of thousands of variables, complex algorithms, and noisy, sparse data. In this talk I will briefly review previous work on finite-state, conditionally-trained Markov random field models for information extraction, and then describe three pieces of recent work: (1) the application of conditional Markov random fields to the extraction of tables from government reports, (2) feature induction for these models, applied to named entity extraction, and (3) a new random field method for noun co-reference resolution that has strong ties to graph partitioning.
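To make the finite-state sequence-labeling setting concrete, the sketch below runs Viterbi decoding over a toy linear-chain model for named entity extraction. The labels, scores, and example sentence are all illustrative assumptions, not the models or data from the talk; a trained conditional random field would learn these scores from features, but the decoding step is the same.

```python
# Minimal Viterbi decoding for a linear-chain sequence model, the kind of
# finite-state inference used in named entity extraction.
# All labels, weights, and the example sentence are toy assumptions.

LABELS = ["O", "PER"]  # outside any entity / person name

def emission(token, label):
    """Hypothetical per-token score for assigning `label` to `token`."""
    person_like = token[0].isupper()
    if label == "PER":
        return 2.0 if person_like else -1.0
    return 1.0 if not person_like else 0.0

# Hypothetical transition scores: score of moving prev_label -> label.
TRANSITION = {
    ("O", "O"): 0.5, ("O", "PER"): 0.0,
    ("PER", "O"): 0.0, ("PER", "PER"): 1.0,
}

def viterbi(tokens):
    """Return the highest-scoring label sequence under the toy model."""
    # best[i][y] = score of the best labeling of tokens[:i+1] ending in y
    best = [{y: emission(tokens[0], y) for y in LABELS}]
    back = [{}]  # back[i][y] = previous label on that best path
    for i in range(1, len(tokens)):
        scores, pointers = {}, {}
        for y in LABELS:
            prev = {p: best[-1][p] + TRANSITION[(p, y)] for p in LABELS}
            p_best = max(prev, key=prev.get)
            scores[y] = prev[p_best] + emission(tokens[i], y)
            pointers[y] = p_best
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for i in range(len(tokens) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))

print(viterbi(["Smith", "visited", "the", "port"]))
# → ['PER', 'O', 'O', 'O']
```

A conditionally-trained model replaces the hand-set scores above with weighted sums of features of the input, but runs this same dynamic program at prediction time.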