9:30 AM, Friday, March 1, in CS 2310
Michael Molla
UW-Madison Department of Computer Science
Interpreting Microarray Expression Data Using Text Annotating the Genes

Microarray expression data is being generated by the gigabyte all over the
world, and exponential growth is sure to follow.  Annotated genomic data is
also rapidly pouring into public databases.  Our goal is to develop automated
ways of combining these two sources of information to produce insight into
the operation of cells under various conditions.  Our approach is to use
machine learning techniques to identify characteristics of genes that are
up-regulated or down-regulated in a particular microarray experiment. We
seek models that are (a) accurate, (b) easy to interpret, and (c) stable under
small variations in the training data.  This paper explores the effectiveness
of two standard machine learning algorithms for this task: Naive Bayes (based
on probability) and PFOIL (based on building rules).  Although we do not
anticipate using our learned models to predict expression levels of genes, we
cast the task in a predictive framework, and evaluate the accuracy of the
models by their predictive power on genes held out of training.
The paper reports on experiments using actual E. coli microarray data,
discussing the strengths and weaknesses of the two algorithms and
demonstrating the trade-offs between accuracy, comprehensibility, and
stability.
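
To make the predictive framing concrete, here is a minimal sketch of how a
Naive Bayes model over annotation words might be trained on some genes and
scored on held-out genes. It is an illustration only, not the implementation
used in the paper: the Bernoulli word model, the Laplace smoothing, the
function names, and the tiny word sets are all assumptions made up for this
example and are not real E. coli annotations or results.

```python
import math
from collections import Counter


def train_naive_bayes(examples):
    """Count classes and words. `examples` is a list of (word_set, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab |= words
    return class_counts, word_counts, vocab


def predict(model, words):
    """Return the label with the highest log-posterior for a gene's word set."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        n = class_counts[label]
        score = math.log(n / total)  # class prior
        for w in vocab:
            # Bernoulli word model with Laplace smoothing (an assumption here).
            p = (word_counts[label][w] + 1) / (n + 2)
            score += math.log(p if w in words else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: annotation words per gene, labeled by regulation.
train = [
    ({"heat", "shock", "chaperone"}, "up"),
    ({"heat", "stress", "protease"}, "up"),
    ({"ribosomal", "protein", "translation"}, "down"),
    ({"ribosomal", "subunit", "translation"}, "down"),
]
held_out = [
    ({"heat", "chaperone"}, "up"),
    ({"translation", "ribosomal"}, "down"),
]

model = train_naive_bayes(train)
# Accuracy is measured only on genes that were not used for training.
accuracy = sum(predict(model, w) == y for w, y in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Repeating the train/score step on slightly different training subsets and
comparing the learned word probabilities would be one simple way to probe the
stability criterion as well.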

This is joint work with Jude Shavlik and Peter Andreae.