University of Wisconsin-Madison


UW-Madison
Computer Sciences Dept.

EDAM Project

Exploratory Data Analysis and Monitoring

A Piece of Tasty Cheese from Wisconsin . . .

The EDAM project is a collaborative effort between computer scientists and environmental chemists at Carleton College and UW-Madison, supported by an NSF Medium ITR grant. The goal is to develop data mining techniques that advance the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach to particle measurement, the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data and have dramatically increased the amount of data to be collected, managed, and analyzed. In the EDAM project, we are investigating techniques for automatically labeling mass spectra from different kinds of aerosol mass spectrometers, and then analyzing and exploring the rich spatiotemporal information collected from multiple geographically distributed instruments.
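
As a rough illustration of what labeling a mass spectrum can involve, the sketch below assumes that a spectrum can be treated as a nonnegative combination of reference ion signatures, and recovers the quantities with a simple greedy pass. This page does not describe the project's actual labeling method, so the framing, the function name label_spectrum, the signature library, and all numbers are hypothetical and chosen only to make the example easy to follow.

```python
def label_spectrum(spectrum, signatures, min_quantity=0.05):
    """Greedy labeling sketch (hypothetical, not the EDAM algorithm).

    spectrum:   dict mapping m/z value -> observed intensity
    signatures: dict mapping ion name -> dict of m/z -> relative intensity
    Returns (labels, residual): the estimated quantity of each ion and
    the part of the spectrum left unexplained.
    """
    remaining = dict(spectrum)
    labels = {}
    for name, signature in signatures.items():
        # Largest quantity of this ion that fits under every one of its peaks.
        quantity = min(remaining.get(mz, 0.0) / rel
                       for mz, rel in signature.items())
        if quantity >= min_quantity:
            labels[name] = quantity
            # Subtract the explained intensity from the remaining spectrum.
            for mz, rel in signature.items():
                remaining[mz] = remaining.get(mz, 0.0) - quantity * rel
    return labels, remaining


# Made-up signature library and spectrum (illustrative values only).
signatures = {
    "ion_X": {23: 1.0, 46: 0.5},
    "ion_Y": {46: 1.0, 62: 0.25},
}
spectrum = {23: 2.0, 46: 5.0, 62: 1.0}

labels, residual = label_spectrum(spectrum, signatures)
print(labels)
```

A greedy pass like this depends on the order in which signatures are tried and can mislabel spectra whose signatures overlap heavily; an optimization-based formulation would be the natural next step, which is outside the scope of this sketch.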

Beyond the specific problems that arise in the context of atmospheric aerosols, we are interested in addressing some fundamental challenges in data mining and monitoring:

  • Support for multi-step analyses: Most data analysis tasks involve several steps, including several distinct analysis steps. Yet most of the literature on the topic addresses algorithms for a single step, such as constructing a decision tree or identifying clusters. We are investigating ways to compositionally specify, optimize, and provide life-cycle support for multi-step analyses. In particular, we are exploring a stylized generate-and-test paradigm for identifying interesting subsets of data, which we call subset mining. In terms of optimization, our focus is on developing cost-based strategies, analogous to database query optimization, for a broad class of compositionally specified multi-step analyses. The underlying objective is to enable an analyst to specify a large number of potential analyses (differing in carefully selected parameterized choices), and to develop a system that can automatically, perhaps over days in a Condor-style environment, evaluate these different analyses and bring "interesting" instances to the attention of the analyst. Thus, we hope to minimize the human burden in the iterative process of data exploration. We are also interested in techniques for maintaining detailed provenance information for all data, including instrument readings and other source data as well as results of intermediate analysis steps, and for reasoning with this information for data validation.
  • Adaptive monitoring: An important class of problems is to track the state of a complex real-world system in terms of analytical models based on semi-continuous observations. An obvious challenge is the development of analytical models that best describe the aspects of the real-world system that we wish to track; the models that are best suited to track the levels of mercury and ozone in the atmosphere, using data from aerosol monitoring instruments, will naturally differ from the models that are best for tracking customer behavior on an e-commerce site, using click-stream and transactional data. Another important, but often overlooked, issue is continuous validation of the data that is gathered. The challenges include calibrating instruments, accurately recording and accounting for ambient conditions at the time of each measurement, and handling noise and missing data. Finally, monitoring systems offer a unique new opportunity, in that we can direct the collection of future data. How to do this intelligently raises a number of interesting issues; the monitoring system must be aware of its current knowledge, and use this information to direct future data gathering, taking into account resource constraints and monitoring objectives.
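
The generate-and-test idea behind subset mining can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names, the record layout, the readings, and the choice of "deviation of the subset mean from the overall mean" as the interestingness test are all hypothetical, and it enumerates candidates exhaustively rather than using any cost-based optimization.

```python
from itertools import combinations
from statistics import mean


def generate_subsets(records, attributes):
    """Generate step: yield (description, rows) for every group obtained
    by fixing the values of one or more of the given attributes."""
    for k in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, k):
            groups = {}
            for row in records:
                key = tuple((a, row[a]) for a in attrs)
                groups.setdefault(key, []).append(row)
            yield from groups.items()


def subset_mining(records, attributes, measure, threshold, min_size=2):
    """Test step: keep subsets whose mean of `measure` deviates from the
    overall mean by at least `threshold` (a hypothetical test)."""
    overall = mean(row[measure] for row in records)
    hits = []
    for description, rows in generate_subsets(records, attributes):
        if len(rows) < min_size:
            continue  # too few rows to say anything about this subset
        deviation = abs(mean(r[measure] for r in rows) - overall)
        if deviation >= threshold:
            hits.append((description, len(rows), deviation))
    # Most deviant subsets first.
    return sorted(hits, key=lambda h: -h[2])


# Made-up readings, for illustration only.
readings = [
    {"site": "A", "period": "day", "ozone": 30.0},
    {"site": "A", "period": "night", "ozone": 32.0},
    {"site": "B", "period": "day", "ozone": 70.0},
    {"site": "B", "period": "night", "ozone": 28.0},
    {"site": "A", "period": "day", "ozone": 31.0},
    {"site": "B", "period": "day", "ozone": 74.0},
]

interesting = subset_mining(readings, ["site", "period"], "ozone", threshold=20.0)
for description, size, deviation in interesting:
    print(description, size, round(deviation, 2))
```

With these readings, only the daytime measurements at site B stand out. Exhaustive enumeration grows quickly with the number of attributes, which is one reason cost-based evaluation strategies matter for analyses of this kind.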

For more details on the project's background and objectives, please refer to our overview paper.

Funding

      The EDAM project is funded by an NSF Medium ITR grant, ITR IIS-0326328.
      NSF Program Announcement

News

  • A mailing list, 'edam', has been created. Please subscribe to it through CSL.
  • An old reading list has been set up for your reference.

People

Faculty

    Raghu Ramakrishnan (Computer Sciences, UW-Madison)

    James J. Schauer (Civil and Environmental Engineering, UW-Madison)

    Martin M. Shafer (Water Science and Engineering Laboratory, UW-Madison)

    Deborah S. Gross (Department of Chemistry, Carleton College)

    David R. Musicant (Computer Sciences, Carleton College)

 

Additionally, Stephen Wright and Jin-Yi Cai are involved in conjunction with their other interests.

 

Academic Staff

    Martin Shafer (Associate scientist, Environmental Chemistry and Technology, UW-Madison)

Students

 

Lei Chen, graduate student, Computer Science Department, UW-Madison

Bee-chung Chen, graduate student, Computer Science Department, UW-Madison

Pradeep Tamma, graduate student, Computer Science Department, UW-Madison

Vuk Ercegovac, graduate student, Computer Science Department, UW-Madison

Doug Burdick, graduate student, Computer Science Department, UW-Madison

Zheng Huang (in 2004), graduate student, Computer Science Department, UW-Madison

Gregory Cipriano, graduate student, Computer Science Department, UW-Madison

Rachel Duvall (in 2004-2005), graduate student, Civil and Environmental Engineering, UW-Madison

David Snyder, graduate student, Environmental Chemistry and Technology, UW-Madison

 

Andrew Ault, undergraduate student, Chemistry Department, Carleton College

Renee Frontiera, undergraduate student (graduated 6/2004), Chemistry Department, Carleton College

Margrith Mattmann (in 2003), undergraduate student, Chemistry Department, Carleton College

Alexandra Schmitt, undergraduate student, Chemistry Department, Carleton College

Melanie Yuen (graduated 6/2006), undergraduate student, Chemistry Department, Carleton College

John Choiniere, undergraduate student, Chemistry Department, Carleton College

Katherine Barton, undergraduate student, Chemistry Department, Carleton College

Catherine Nelson, undergraduate student (graduated 6/2004), Mathematics and Computer Science Department, Carleton College

Ben Anderson, undergraduate student (graduated 6/2005), Mathematics and Computer Science Department, Carleton College

Anna Ritz, undergraduate student (graduated 6/2006), Mathematics and Computer Science Department, Carleton College

Jon Sulman, undergraduate student (graduated 6/2006), Mathematics and Computer Science Department, Carleton College

Leah Steinberg, undergraduate student, Mathematics and Computer Science Department, Carleton College

Thomas Smith, undergraduate student, Mathematics and Computer Science Department, Carleton College

Jamie Olson, undergraduate student, Mathematics and Computer Science Department, Carleton College

Janara Christensen, undergraduate student, Mathematics and Computer Science Department, Carleton College

 

Ilari Shafer, high school student, entering Olin College in fall, 2006

 

Publications

  • Bee-Chung Chen, Raghu Ramakrishnan, Kristen LeFevre: Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. VLDB 2007.
  • Doug Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007.
  • Douglas Burdick, Prasad M. Deshpande, T. S. Jayram, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over uncertain and imprecise data. VLDB J. 16(1): 123-144 (2007).
  • Bee-Chung Chen, Lei Chen, Raghu Ramakrishnan, David R. Musicant: Learning from Aggregate Views. ICDE 2006: 3.
  • Bee-Chung Chen, Vinod Yegneswaran, Paul Barford, Raghu Ramakrishnan: Toward a Query Language for Network Attack Data. ICDE Workshops 2006: 28.
  • Douglas Burdick, Prasad M. Deshpande, T. S. Jayram, Raghu Ramakrishnan, Shivakumar Vaithyanathan: Efficient Allocation Algorithms for OLAP Over Imprecise Data. VLDB 2006: 391-402.
  • Lei Chen, Raghu Ramakrishnan, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran: Composite Subset Measures. VLDB 2006: 403-414.
  • Bee-Chung Chen, Raghu Ramakrishnan, Jude W. Shavlik, Pradeep Tamma: Bellwether Analysis: Predicting Global Aggregates from Local Regions. VLDB 2006: 655-666.
  • Héctor Corrada Bravo, David Page, Raghu Ramakrishnan, Jude W. Shavlik, Vítor Santos Costa: A Framework for Set-Oriented Computation in Inductive Logic Programming and Its Application in Generalizing Inverse Entailment. ILP 2005: 69-86
  • Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan: Incognito: Efficient Full-Domain K-Anonymity. SIGMOD Conference 2005: 49-60
  • Douglas Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP Over Uncertain and Imprecise Data. VLDB 2005: 970-981
  • Bee-Chung Chen, Lei Chen, Yi Lin, Raghu Ramakrishnan: Prediction Cubes. VLDB 2005: 982-993
  • Zheng Huang, Lei Chen, Jin-Yi Cai, Raghu Ramakrishnan, James J. Schauer, Stephen J. Wright: Mass Spectrum Labeling: Theory and Practice. Technical Report, University of Wisconsin-Madison, 2004
  • Raghu Ramakrishnan, James J. Schauer, Lei Chen, Zheng Huang, Martin M. Shafer, Deborah S. Gross: The EDAM project: Mining atmospheric aerosol datasets. Int. J. Intell. Syst. 20(6): 759-787 (2005)
  • Raghu Ramakrishnan: The EDAM project: mining mass spectra and more. CIKM 2004
  • Zheng Huang, Lei Chen, Jin-yi Cai, Deborah S. Gross, David R. Musicant, Raghu Ramakrishnan, James J. Schauer, Stephen J. Wright: Mass Spectrum Labeling: Theory and Practice. ICDM 2004: 122-129
  • Raghu Ramakrishnan: The EDAM Project: Exploratory Data Analysis and Monitoring at Wisconsin. SBBD 2004: 23-32
  • Lei Chen, Zheng Huang, Raghu Ramakrishnan: Cost-Based Labeling of Groups of Mass Spectra. SIGMOD Conference 2004: 167-178
  • Kristen LeFevre, Rakesh Agrawal, Vuk Ercegovac, Raghu Ramakrishnan, Yirong Xu, David J. DeWitt: Limiting Disclosure in Hippocratic Databases. VLDB 2004: 108-119
  • Navin Kabra, Raghu Ramakrishnan, Vuk Ercegovac: The QUIQ Engine: A Hybrid IR DB System. ICDE 2003
  • Raghu Ramakrishnan: Data Mining: Fast Algorithms vs. Fast Results. ISMIS 2003: 12-13

Reading List

An old reading list for your reference.

Software

We are currently developing an open-source software package called Enchilada, which supports interactive and exploratory data mining of mass spectra and time-series data.

Downloads

ILP

ILP Slides by Hector

Dzeroski, Saso. "Multi-Relational Data Mining: An Introduction" SIGKDD Explorations, 2003

Blockeel, Hendrik and Sebag, Michele. "Scalability and Efficiency in Multi-Relational Data Mining" SIGKDD Explorations, 2003

Data Mining Query Language
   
        DM Query Slides by Kristen

        Tomasz Imielinski, Aashu Virmani, "MSQL: A Query Language for Database Mining", DMKD, 1999

        Amir Netz, Surajit Chaudhuri, Usama Fayyad, Jeff Bernhardt. "Integrating Data Mining with SQL Databases: OLE DB for Data Mining"

Contacts


All text and images Copyright 2004-2005, EDAM Group, Madison, WI.
Last modified: 12/14/2005