The Design, Modeling, and Evaluation of the Relax Architectural Framework
Marc de Kruijf, Shuou Nomura, Karthikeyan Sankaralingam
As transistor technology scales ever further, hardware reliability is becoming harder to manage. The effects of soft errors, variability, wear-out, and yield are intensifying to the point where it is difficult to harness the benefits of deeper scaling without mechanisms for hardware fault detection and correction. We observe that the combination of emerging applications and emerging many-core architectures makes software recovery a viable and interesting alternative to traditional, hardware-based fault recovery. Emerging applications tend to have few I/O and memory side-effects, which limits the amount of information that needs checkpointing, and they tolerate discarding individual sub-computations, typically with minimal qualitative impact. Software recovery can harness these properties in ways that hardware recovery cannot. Additionally, emerging many-core architectures composed of many simple, in-order cores pay heavily in power and area for hardware checkpointing resources. Software recovery can be more efficient while also reducing hardware design complexity.
In this paper, we describe Relax, an architectural framework for software recovery of hardware faults. We describe Relax's language, compiler, ISA, and hardware support, develop analytical models to project performance, and evaluate an implementation of the framework on the compute kernels of seven emerging applications. Applying Relax to counter the effects of process variation, we find that Relax can enable a 20% energy-efficiency improvement for more than 80% of an application's execution with only minimal source code changes.
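The two recovery behaviors the abstract alludes to, re-executing a side-effect-free block and discarding a faulty sub-computation, can be illustrated with a minimal sketch. This is not Relax's language, compiler, or ISA interface, none of which are specified here; every name below (`HardwareFault`, `make_block`, `run_with_retry`, `run_with_discard`) is hypothetical, and hardware fault detection is simulated with a Python exception.

```python
class HardwareFault(Exception):
    """Hypothetical stand-in for a fault reported by hardware detection."""


def make_block(values, faults_before_success):
    """Build a side-effect-free block that sums `values`.

    Fault injection is simulated: the first `faults_before_success`
    executions raise HardwareFault instead of returning a result.
    """
    remaining = [faults_before_success]

    def block():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise HardwareFault("fault detected during block")
        return sum(values)

    return block


def run_with_retry(block, max_attempts=10):
    """Retry policy: re-execute the block until it completes fault-free.

    Safe only because the block has no I/O or memory side effects,
    so nothing needs to be checkpointed or rolled back.
    """
    for _ in range(max_attempts):
        try:
            return block()
        except HardwareFault:
            continue
    raise RuntimeError("block faulted on every attempt")


def run_with_discard(blocks):
    """Discard policy: drop the result of any block that faults."""
    results = []
    discarded = 0
    for block in blocks:
        try:
            results.append(block())
        except HardwareFault:
            discarded += 1
    return results, discarded
```

Under these assumptions, the retry policy fits computations whose results must be exact, while the discard policy fits applications that can lose an individual sub-computation with little qualitative impact.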