Stack Trace Analysis for Large Scale Debugging
Dorian Arnold, Dong H. Ahn, Bronis R. De Supinski, Gregory Lee, Barton Miller, Martin Schulz
Few runtime tools exist even for modestly sized computing systems with 10^3 processors, and those that do work poorly above this scale. We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce the problem exploration space from thousands of processes to a few by sampling application stack traces to form process equivalence classes, that is, groups of processes exhibiting similar behavior. In typical parallel computations, large numbers of processes exhibit a small number of different behavior classes, manifested as common patterns in their stack traces. The problem space is thus reduced to representatives of these common behavior classes, on which we can use full-featured debuggers for root cause analysis.
STAT scalably collects stack traces over a sampling period to assemble a profile of the application's behavior. STAT routines process the trace samples to form a call graph prefix tree that depicts the program's behavior over the program's process space and over time. The prefix tree encodes common behaviors among the various stack samples, distinguishing classes of behavior from which representatives can be targeted for deeper analysis. STAT leverages MRNet, an infrastructure for tool control and data analyses, to overcome scalability barriers encountered by heavy-weight debuggers.
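The prefix-tree idea can be illustrated with a minimal single-node sketch in Python. It is not STAT's implementation: the names `Node`, `build_prefix_tree`, and `equivalence_classes`, the sample traces, and the rule of grouping processes by the exact set of call paths they exhibited are all assumptions made for illustration, and the distributed merging performed through MRNet is not modeled.

```python
from collections import defaultdict


class Node:
    """One call-graph frame, labeled with the ranks whose stacks pass through it."""

    def __init__(self, frame):
        self.frame = frame
        self.ranks = set()
        self.children = {}


def build_prefix_tree(traces):
    """Merge (rank, stack) samples into one prefix tree.

    Each stack is a list of function names ordered from outermost to innermost.
    """
    root = Node("<root>")
    for rank, stack in traces:
        root.ranks.add(rank)
        node = root
        for frame in stack:
            if frame not in node.children:
                node.children[frame] = Node(frame)
            node = node.children[frame]
            node.ranks.add(rank)
    return root


def equivalence_classes(traces):
    """Group ranks that exhibited exactly the same set of call paths (assumed rule)."""
    paths_by_rank = defaultdict(set)
    for rank, stack in traces:
        paths_by_rank[rank].add(tuple(stack))
    classes = defaultdict(list)
    for rank, paths in paths_by_rank.items():
        classes[frozenset(paths)].append(rank)
    return sorted(sorted(ranks) for ranks in classes.values())


# Hypothetical samples: three ranks wait in MPI while one is still computing.
traces = [
    (0, ["main", "solve", "MPI_Waitall"]),
    (1, ["main", "solve", "MPI_Waitall"]),
    (2, ["main", "solve", "MPI_Waitall"]),
    (3, ["main", "solve", "compute"]),
]

tree = build_prefix_tree(traces)
classes = equivalence_classes(traces)
# One representative per class is a candidate for a full-featured debugger.
representatives = [ranks[0] for ranks in classes]
```

In this toy input the four processes collapse into two classes, so only two representatives would need to be examined in detail.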
We present STAT's design and an evaluation that shows STAT gathers informative process traces from thousands of processes with sub-second latencies, a significant improvement over existing tools. Our case studies of production codes verify that STAT supports the quick identification of errors that were previously difficult to locate.