Skip to main content.

IMW: Interactive Master-Worker Style Parallel Data Analysis Tool on the Grid

by the Department of Computer Sciences, University of Wisconsin - Madison
04/16/2002

Motivation

The task of analyzing huge amount of data residing at different locations can be greatly eased by the Grid. First, the analyzing tools can be executed on machines close to the data, without incurring the cost of moving raw data around; second, the computing and I/O resources provided by the Grid can be utilized to perform the analysis task in parallel, which leads much shorter experiment turn-around time; third, the various services provided by the Grid infrastructure can simply to the job of managing distributed resources. In order to fully exploit such opportunities, we are implementing a framework to build interactive parallel data analysis tools on the Grid - IMW.
 

Our Approach

Data analysis tasks lend themselves naturally to master-worker style parallel processing, with its special requirements on user-interaction, visualization and data locality support. Previously, we've developed a framework called MW [1] to support master-worker style parallel applications on meta-computing environments like HTCondor and Globus. It has been widely used and helped solve some really hard computational problems [2]. MW has a layered architecture to encapsulate the functionalities of underlying resource manager and communication mechanisms, it also provides portability across different underlying infrastructures. It exposes a set of simple APIs to ease the application programmers' work.

IMW is based on our experiences of developing MW, it has been rewritten (mostly in Java) to

  1. add support of user interaction and computation steering;
  2. directly support data analyzing tasks, such as, histogram and parallel execution of scripts;
  3. provide a lightweight Java implementation to ease porting and maintenance.
The current prototype divides the system into  three different components:
  1. a client that interacts with human user and controls the analysis task;
  2. a master that manages the analysis task by decomposing the whole task into smaller pieces and assign them to many workers in parallel;
  3. workers that get commands from the master and returns the result.
Different components communicate with each other using Socket. HTCondor, as the resource manager, is used to gather worker hosts, spawn worker processes, and report dynamic worker changes such as worker exit, suspension/resume, etc. Fault-tolerance can be achieved by user-provided checkpoint functions, to save partially completed result and recover after restart.
 

Status and Future Work

The bare-bone IMW prototype is working now, which can execute shell scripts in parallel, and handles the worker dynamic changes. More work needs to be done to add features and make it easy-to-use.

After providing the basic functionality of parallel data analysis and user interaction, we can also exploit two other tools developed at University of Wisconsin.

  1. DEVise: DEVise is an environment for data exploration and visualization written in, with flexible data selection and layout mechanisms. By using DEVise as an visualization frontend, IMW's GUI development can be greatly simplified, while it still provides customizable histogram support.
  2. ClassAds: Classads are used by HTCondor to describe resources and task requirements. It would be a powerful tool for IMW to describe the data category and analysis requirement, so that analyzing processes can be directed and run on machines that exploit data locality and equipped with sufficient resources.

References

[1] Jeff Linderoth, Sanjeev Kulkarni, Jean-Pierre Goux, and Michael Yoder, "An Enabling Framework for Master-Worker Applications on the
    Computational Grid", Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, Pennsylvania,
    August 2000, pp 43-50.

[2] S. J. Wright, "Solving optimization problems on computational grids," November, 2000. Optima 65, May, 2001.