Sampling + DMR: Practical and Low-overhead Permanent Fault Detection
Shuou Nomura, Matthew D. Sinclair, Chen-Han Ho, Venkatraman Govindaraju, Marc de Kruijf, and Karthikeyan Sankaralingam. Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), June 2011.
Abstract
With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
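The example parameters in the abstract (a 5-million-cycle window with 1% of it spent in DMR mode) can be turned into a small arithmetic sketch. The function names below are hypothetical, and placing the DMR portion at the start of each window is an assumption made only for illustration; the abstract does not say where in the window the sampled portion falls.

```python
def dmr_cycles_per_window(window_cycles: int, sample_fraction: float) -> int:
    """Cycles spent in DMR mode within one periodic execution window."""
    return round(window_cycles * sample_fraction)


def in_dmr_mode(cycle: int, window_cycles: int, sample_fraction: float) -> bool:
    """Return True if `cycle` falls in the DMR portion of its window.

    Assumption (not from the abstract): the DMR portion occupies the
    first cycles of every window.
    """
    dmr_cycles = dmr_cycles_per_window(window_cycles, sample_fraction)
    return cycle % window_cycles < dmr_cycles


WINDOW = 5_000_000  # example window length quoted in the abstract
FRACTION = 0.01     # example 1% sampling rate quoted in the abstract

print(dmr_cycles_per_window(WINDOW, FRACTION))  # 50000
```

With these example values, each window would spend 50,000 cycles in DMR mode and the remaining 4,950,000 cycles running without redundancy.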
BibTeX
@inproceedings{isca11:sdmr,
author={Shuou Nomura and Matthew D. Sinclair and Chen-Han Ho and Venkatraman Govindaraju and Marc de Kruijf and Karthikeyan Sankaralingam},
title={{Sampling + DMR: Practical and Low-overhead Permanent Fault Detection}},
booktitle={{Proceedings of the 38th International Symposium on Computer Architecture (ISCA)}},
year={2011},
abstract = {
With technology scaling, manufacture-time and in-field permanent faults
are becoming a fundamental problem. Multi-core architectures with spares can
tolerate them by detecting and isolating faulty cores, but the required
fault detection coverage becomes effectively 100\% as the number of
permanent faults increases. Dual-modular redundancy (DMR) can provide 100\%
coverage without assuming device-level fault models, but its overhead
is excessive.
In this paper, we explore a simple and low-overhead mechanism we call
Sampling-DMR: run in DMR mode for a
small percentage (1\% of the time for example) of each periodic execution window
(5 million cycles for example). Although Sampling-DMR can leave some errors
undetected, we argue the permanent fault coverage is 100\% because it can detect
all faults eventually. Sampling-DMR thus introduces a system
paradigm of restricting all permanent faults' effects to small finite
windows of error occurrence.
We prove an ultimate upper bound exists on total missed errors and
develop a probabilistic model to analyze the distribution of the
number of undetected errors and detection latency. The model is
validated using full gate-level fault injection experiments for an
actual processor running full application software.
Sampling-DMR outperforms
conventional techniques in terms of fault coverage, sustains similar
detection latency guarantees, and limits energy
and performance overheads to less than 2\%.
},
bib_dl_pdf = {http://bit.ly/hALzOF},
bib_dl_ppt = {http://bit.ly/l8ZbQW},
bib_pubtype = {Refereed Conference},
bib_rescat = {proj-relax},
MONTH = {June}
}