Monday
September 14th
4:00 PM
1221 CS
Scalable Middleware for Large (Really Large) Scale Systems

I will discuss the problem of developing tools for large-scale parallel environments. We are especially interested in systems, both leadership-class parallel computers and clusters, with tens of thousands or even millions of processors. The infrastructure that we have developed to address this problem is called MRNet, the Multicast/Reduction Network. MRNet's approach to scale is to structure control and data flow in a tree-based overlay network (TBON) that allows for efficient request distribution and flexible data reductions.
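
To make the TBON idea concrete, here is a minimal, self-contained sketch; this is not the MRNet API, and the fan-out and reduction function are illustrative assumptions. It shows how a tree-based overlay turns many leaf values into one root result with only logarithmic combining depth:

```c
#include <stdio.h>

#define FANOUT  4
#define NLEAVES 16

/* The reduction filter applied at every internal node: here, max. */
static int reduce(int a, int b) { return a > b ? a : b; }

/* Aggregate the leaf values in [lo, hi) through a tree whose internal
 * nodes each combine at most FANOUT children. */
static int tbon_reduce(const int *leaves, int lo, int hi)
{
    if (hi - lo == 1)
        return leaves[lo];              /* back-end: report local value */

    int chunk = (hi - lo + FANOUT - 1) / FANOUT;
    int result = tbon_reduce(leaves, lo, lo + chunk);
    for (int c = lo + chunk; c < hi; c += chunk) {
        int end = (c + chunk < hi) ? c + chunk : hi;
        result = reduce(result, tbon_reduce(leaves, c, end));
    }
    return result;
}

int main(void)
{
    int load[NLEAVES] = { 3, 9, 1, 7, 4, 4, 8, 2,
                          6, 5, 9, 0, 2, 7, 3, 1 };
    printf("max across %d back-ends: %d\n", NLEAVES,
           tbon_reduce(load, 0, NLEAVES));
    return 0;
}
```

Because no node ever combines more than FANOUT inputs, per-node work stays constant as the leaf count grows, which is the property that lets a TBON distribute requests and reduce data at scale.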

The second part of this talk will present an overview of the MRNet design, architecture, and computational model, and then discuss several applications of MRNet. These include scalable automated performance analysis in Paradyn, a vision clustering application, and, most recently, an effort to develop our first petascale tool: STAT, a scalable stack trace analyzer currently running on hundreds of thousands of processors.
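
For a flavor of what a scalable stack trace analyzer does with its data, the toy sketch below merges per-task call paths into a prefix tree whose node counts expose equivalence classes of processes. This is a simplified, hypothetical illustration; STAT's actual representation and algorithms are more involved.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    const char *frame;          /* function name at this depth */
    int ntasks;                 /* tasks whose trace passes through here */
    struct node *child, *sibling;
};

/* Find or create the child of parent labeled with this frame. */
static struct node *get_child(struct node *parent, const char *frame)
{
    for (struct node *c = parent->child; c; c = c->sibling)
        if (strcmp(c->frame, frame) == 0)
            return c;
    struct node *c = calloc(1, sizeof *c);
    c->frame = frame;
    c->sibling = parent->child;
    parent->child = c;
    return c;
}

/* Fold one task's trace (outermost frame first) into the tree. */
static void merge_trace(struct node *root, const char **trace, int depth)
{
    struct node *n = root;
    n->ntasks++;
    for (int i = 0; i < depth; i++) {
        n = get_child(n, trace[i]);
        n->ntasks++;
    }
}

static void print_tree(const struct node *n, int indent)
{
    for (const struct node *c = n->child; c; c = c->sibling) {
        printf("%*s%s [%d task(s)]\n", indent, "", c->frame, c->ntasks);
        print_tree(c, indent + 2);
    }
}

int main(void)
{
    struct node root = { "root", 0, NULL, NULL };
    const char *t1[] = { "main", "solve", "MPI_Barrier" };
    const char *t2[] = { "main", "solve", "MPI_Barrier" };
    const char *t3[] = { "main", "io_write" };   /* the straggler */
    merge_trace(&root, t1, 3);
    merge_trace(&root, t2, 3);
    merge_trace(&root, t3, 2);
    print_tree(&root, 0);
    return 0;
}
```

The merged tree is both the analysis and the compression: tasks whose traces share a call path collapse into one node, so the output stays readable even when the task count does not.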

Wednesday
September 16th
4:00 PM
1221 CS
Addressing Usability Challenges in Current and Future Supercomputers
Luiz DeRose, Programming Environments Director, Cray Inc.

The scale of current and future high-end systems, as well as their increasing system software and architecture complexity, brings a new set of challenges for application developers. In order to achieve high performance on petascale systems, application developers need a programming environment that can address and hide the issues of scale and complexity of high-end HPC systems. Users must be supported by intelligent compilers, automatic performance analysis tools, adaptive libraries, and scalable software. In this talk I will present the Cray programming environment features that are being developed and deployed to improve users' productivity on Cray supercomputers. I will also discuss some of the challenges and open research problems that must be addressed to build the system software for petascale systems.

Bio: Dr. Luiz DeRose is a Senior Principal Engineer and the Programming Environments Director at Cray Inc., where he is responsible for the programming environment strategy for all Cray systems. Before joining Cray in 2004, he was a research staff member and the Tools Group Leader at the Advanced Computing Technology Center at IBM Research. Dr. DeRose holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. With more than 20 years of high-performance computing experience and a deep knowledge of its programming environments, he has published more than 40 peer-reviewed articles in scientific journals, conferences, and book chapters, primarily on the topics of compilers and tools for high performance computing.

Monday
October 5th
4:00 PM
2310 CS
Heterogeneous Multiprocessing with Satellite Kernels

In this talk I will describe Helios, a new operating system designed to simplify the task of writing, deploying, and tuning applications for heterogeneous platforms. Helios introduces satellite kernels, which export a single, uniform set of OS abstractions across CPUs of disparate architectures and performance characteristics. Access to I/O services such as file systems is made transparent via remote message passing, which extends a standard microkernel message-passing abstraction to a satellite kernel infrastructure. Helios retargets applications to available ISAs by compiling from an intermediate language. To simplify deploying and tuning application performance, Helios exposes an affinity metric to developers. Affinity provides a hint to the operating system about whether a process would benefit from executing on the same platform as a service it depends upon.
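
As a rough illustration of how an affinity hint might drive placement, consider a scheduler that scores each satellite kernel by the process's declared affinities for the services that kernel hosts. The names, weights, and scoring policy below are my assumptions for the sketch, not the Helios interface:

```c
#include <stdio.h>
#include <string.h>

/* A signed hint: positive pulls the process toward a kernel hosting
 * this service, negative pushes it away. */
struct affinity { const char *service; int weight; };

struct kernel {
    const char *name;
    const char *services[4];    /* services hosted on this kernel */
};

/* Score a kernel: sum the hints for every hosted service. */
static int score(const struct kernel *k,
                 const struct affinity *hints, int nhints)
{
    int s = 0;
    for (int i = 0; i < nhints; i++)
        for (int j = 0; j < 4 && k->services[j]; j++)
            if (strcmp(hints[i].service, k->services[j]) == 0)
                s += hints[i].weight;
    return s;
}

int main(void)
{
    struct kernel kernels[] = {
        { "cpu-numa0", { "fs", "net", NULL } },
        { "xscale-io", { "nicdriver", NULL } },
    };
    /* A network process benefits from running beside the NIC driver. */
    struct affinity hints[] = { { "nicdriver", +2 }, { "fs", -1 } };

    int nk = (int)(sizeof kernels / sizeof kernels[0]);
    int best = 0;
    for (int i = 1; i < nk; i++)
        if (score(&kernels[i], hints, 2) > score(&kernels[best], hints, 2))
            best = i;
    printf("place process on: %s\n", kernels[best].name);
    return 0;
}
```

In a scheme like this, a single signed value per (process, service) pair can express both "co-locate me with this service" and "keep me away from it," which suggests how an offload decision could hinge on one line of metadata.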

We developed satellite kernels for an XScale programmable I/O card and for cache-coherent NUMA architectures. We offloaded several applications and operating system components, often by changing only a single line of metadata. We show up to a 28% performance improvement by offloading tasks to the XScale I/O card. On a mail-server benchmark, we show a 39% improvement in performance by automatically splitting the application among multiple NUMA domains.

Bio: Ed Nightingale is a researcher at Microsoft Research in Redmond, WA. His interests include just about anything involving operating systems, distributed systems, or mobile computing.

Tuesday
October 6th
4:00 PM
2310 CS
Tolerating Hardware Device Failures in Software

Hardware devices can fail, but many drivers assume they do not. When confronted with real devices that misbehave, these assumptions can lead to driver or system failures. Although major operating system and device vendors recommend that drivers detect and recover from hardware failures, we find that many drivers will crash or hang when a device fails. Such bugs cannot easily be detected by regular stress testing because the failures are induced by the device, not by the software load.

This talk presents Carburizer, a code-manipulation tool and associated runtime that improves system reliability in the presence of faulty devices. Carburizer analyzes driver source code to find locations where the driver incorrectly trusts the hardware to behave. Carburizer identified almost 1000 such bugs in Linux drivers, with a false positive rate of less than 8 percent. With the aid of shadow drivers for recovery, Carburizer can automatically repair 840 of these bugs with no programmer involvement. To facilitate proactive management of device failures, Carburizer can also locate existing driver code that detects device failures and insert missing failure-reporting code. Finally, the Carburizer runtime can detect and tolerate interrupt-related bugs, such as stuck or missing interrupts.
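
The shape of the hardening is easy to see in miniature. In the sketch below (hypothetical driver code; readl() is stubbed to model a failed device whose registers read back as all ones), the first wait loop trusts the hardware and hangs on failure, while the second bounds that trust with a timeout and reports the failure, in the style of the fixes Carburizer inserts:

```c
#include <stdint.h>
#include <stdio.h>

#define STATUS_BUSY 0x1u

/* Stub standing in for a memory-mapped register read; a failed PCI
 * device commonly reads back as all ones. */
static uint32_t readl(volatile void *addr)
{
    (void)addr;
    return 0xFFFFFFFFu;
}

/* Unhardened: trusts the device to eventually clear BUSY, so it
 * spins forever once the device fails. */
int wait_ready_unsafe(volatile void *status_reg)
{
    while (readl(status_reg) & STATUS_BUSY)
        ;                                   /* hangs on device failure */
    return 0;
}

/* Hardened: the same wait, but with bounded trust in the hardware and
 * a failure report (a real driver would log via the kernel and hand
 * off to recovery, e.g. a shadow driver). */
int wait_ready_safe(volatile void *status_reg)
{
    for (int tries = 0; tries < 1000000; tries++)
        if (!(readl(status_reg) & STATUS_BUSY))
            return 0;
    fprintf(stderr, "device failure: status register stuck busy\n");
    return -1;
}

int main(void)
{
    volatile uint32_t fake_reg = STATUS_BUSY;
    if (wait_ready_safe(&fake_reg) != 0)
        printf("failure detected and tolerated\n");
    return 0;
}
```

The point of the transformation is that it is mechanical: a static analysis can find the unbounded loop over a hardware-dependent value and rewrite it, which is why repairs need no programmer involvement.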

This is a practice talk for SOSP'09.

Friday
December 4th
10:30 AM
2310 CS
Group File Operations for Scalable Tools and Middleware

(Practice Talk for HiPC 2009 - 25 minutes)

Group file operations are a new, intuitive idiom for tools and middleware - including parallel debuggers and runtimes, performance measurement and steering, and distributed resource management - that require scalable operations on large groups of distributed files. The idiom provides new semantics for using file groups in standard file operations, eliminating costly iteration. A file-based idiom promotes conciseness and portability, and eases adoption. With explicit semantics for aggregation of group results, the idiom addresses a key scalability barrier. We have designed TBON-FS, a new distributed file system that provides scalable group file operations by leveraging tree-based overlay networks (TBONs) for scalable communication and data aggregation. We integrated group file operations into several tools: parallel versions of common utilities, including cp, grep, rsync, tail, and top, and the Ganglia Distributed Monitoring System. Our experience verifies that the group file operation idiom is intuitive and easily adopted, and that it enables a wide variety of tools to run efficiently at scale.
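
For a sense of the idiom, here is a hedged sketch of how a monitoring tool might read one file across every host with a single call instead of a per-host loop. The gopen() call is illustrative shorthand for a group-open operation, and the mount path and matching syntax are invented for this example:

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical group-open: binds every file matched under a TBON-FS
 * mount into one group descriptor. Not a real system call. */
extern int gopen(const char *group_path, int flags);

int main(void)
{
    /* One descriptor standing for /proc/loadavg on every host,
     * instead of one open() per host. */
    int gfd = gopen("/tbonfs/hosts/*/proc/loadavg", O_RDONLY);
    if (gfd < 0)
        return 1;

    /* A single read covers the whole group; the per-host results are
     * aggregated in the tree rather than iterated over by the tool. */
    char buf[65536];
    ssize_t n = read(gfd, buf, sizeof buf);
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(gfd);
    return 0;
}
```

Routing the operation through a TBON is what removes the scalability barrier: the aggregation happens inside the tree, so the tool's cost stays nearly constant as hosts are added.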