Pubs

Sampling + DMR: Practical and Low-overhead Permanent Fault Detection


Shuou Nomura, Matthew D. Sinclair, Chen-Han Ho, Venkatraman Govindaraju, Marc de Kruijf, and Karthikeyan Sankaralingam. Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In Proceedings of the 38th International Symposium on Computer Architecture, June 2011.

Download

[PDF: http://bit.ly/hALzOF] [Slides: http://bit.ly/l8ZbQW]

Abstract

With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
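
To give some intuition for the eventual-detection argument, here is a minimal illustrative sketch (not the paper's probabilistic model): if a manifested permanent fault is assumed to be caught with a fixed, independent probability p during each sampled DMR window, then the chance it survives k windows undetected is (1 - p)^k, which shrinks toward zero, and the expected detection latency is 1/p windows. The Python below just evaluates these two quantities; p and the window counts are made-up parameters, not numbers from the paper.

  # Illustrative sketch only -- NOT the paper's model.
  # Assumption: a manifested permanent fault is detected with a fixed,
  # independent probability p in each sampled DMR window.

  def prob_undetected(p: float, k: int) -> float:
      """Probability the fault escapes detection in all of the first k windows."""
      return (1.0 - p) ** k

  def expected_latency_windows(p: float) -> float:
      """Expected number of windows until first detection (geometric, mean 1/p)."""
      return 1.0 / p

  if __name__ == "__main__":
      p = 0.05  # hypothetical per-window detection probability
      for k in (10, 100, 1000):
          print(f"P(still undetected after {k} windows) = {prob_undetected(p, k):.3e}")
      print(f"Expected detection latency ~ {expected_latency_windows(p):.0f} windows")

Under these assumed numbers, the escape probability already drops below 1% after 100 windows, which is the flavor of argument behind treating permanent fault coverage as eventually 100%.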

BibTeX

 @inproceedings{isca11:sdmr,
   author={Shuou Nomura and Matthew D. Sinclair and Chen-Han Ho and Venkatraman Govindaraju and Marc de Kruijf and Karthikeyan Sankaralingam},
   title={{Sampling + DMR: Practical and Low-overhead Permanent Fault Detection}},
   booktitle="{Proceedings of the 38th International Symposium on Computer Architecture}",
   year={2011},
   abstract = {
 With technology scaling, manufacture-time and in-field permanent faults
 are becoming a fundamental problem. Multi-core architectures with spares can
 tolerate them by detecting and isolating faulty cores, but the required
 fault detection coverage becomes effectively 100\% as the number of
 permanent faults increases.  Dual-modular redundancy (DMR) can provide 100\%
 coverage without assuming device-level fault models, but its overhead
 is excessive.
 In this paper, we explore a simple and low-overhead mechanism we call
 Sampling-DMR: run in DMR mode for a
 small percentage (1\% of the time for example) of each periodic execution window
 (5 million cycles for example). Although Sampling-DMR can leave some errors
 undetected, we argue the permanent fault coverage is 100\% because it can detect
 all faults eventually. Sampling-DMR thus introduces a system
 paradigm of restricting all permanent faults' effects to small finite
 windows of error occurrence.
 We prove an ultimate upper bound exists on total missed errors and
 develop a probabilistic model to analyze the distribution of the
 number of undetected errors and detection latency. The model is
 validated using full gate-level fault injection experiments for an
 actual processor running full application software.
 Sampling-DMR outperforms
 conventional techniques in terms of fault coverage, sustains similar
 detection latency guarantees, and limits energy
 and performance overheads to less than 2\%.
 },
   bib_dl_pdf = {http://bit.ly/hALzOF},
   bib_dl_ppt = {http://bit.ly/l8ZbQW},
   bib_pubtype = {Refereed Conference},
   bib_rescat = {Architecture},
   MONTH = {June}
 }
