Sampling + DMR: Practical and Low-overhead Permanent Fault Detection
Shuou Nomura, Matthew D. Sinclair, Chen-Han Ho, Venkatraman Govindaraju, Marc de Kruijf, and Karthikeyan Sankaralingam. Sampling + DMR: Practical and Low-overhead Permanent Fault Detection. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), June 2011.
Abstract
With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time, for example) of each periodic execution window (5 million cycles, for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
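The sampling mechanism described in the abstract lends itself to a simple back-of-the-envelope calculation. The sketch below is an illustrative geometric approximation, not the paper's actual probabilistic model: it assumes a hypothetical per-cycle error-manifestation probability `r`, a sampling fraction `s`, and a window size `w`, and treats each window's sampled portion as an independent detection trial.

```python
# Illustrative geometric approximation of Sampling-DMR detection latency.
# Assumptions (hypothetical, not from the paper): a permanent fault produces
# an observable error with independent per-cycle probability r, and each
# execution window of w cycles runs in DMR mode for a fraction s of its cycles.

def window_detection_prob(r: float, s: float, w: int) -> float:
    """Probability the fault is caught within one window's DMR sample."""
    return 1.0 - (1.0 - r) ** (s * w)

def prob_undetected_after(k: int, r: float, s: float, w: int) -> float:
    """Probability the fault escapes detection for k consecutive windows."""
    return (1.0 - window_detection_prob(r, s, w)) ** k

def expected_latency_windows(r: float, s: float, w: int) -> float:
    """Mean number of windows until detection (geometric distribution)."""
    return 1.0 / window_detection_prob(r, s, w)

# Example with the abstract's parameters (1% sampling, 5-million-cycle window)
# and an assumed error rate of 1e-6 per cycle:
q = window_detection_prob(r=1e-6, s=0.01, w=5_000_000)
latency = expected_latency_windows(r=1e-6, s=0.01, w=5_000_000)
```

Under these assumed numbers the per-window detection probability is roughly 5%, so detection is expected within a few dozen windows; the probability of remaining undetected decays geometrically with each window, which is the intuition behind the paper's claim that all permanent faults are detected eventually.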
BibTeX
@inproceedings{isca11:sdmr,
  author    = {Shuou Nomura and Matthew D. Sinclair and Chen-Han Ho and Venkatraman Govindaraju and Marc de Kruijf and Karthikeyan Sankaralingam},
  title     = {{Sampling + DMR: Practical and Low-overhead Permanent Fault Detection}},
  booktitle = {Proceedings of the 38th International Symposium on Computer Architecture (ISCA)},
  year      = {2011},
  month     = {June},
  abstract  = {With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100\% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100\% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1\% of the time, for example) of each periodic execution window (5 million cycles, for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100\% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2\%.},
  bib_dl_pdf  = {http://bit.ly/hALzOF},
  bib_dl_ppt  = {http://bit.ly/l8ZbQW},
  bib_pubtype = {Refereed Conference},
  bib_rescat  = {proj-relax}
}
Generated by bib.pl (written by Patrick Riley) on Sun Sep 26, 2021 16:14:28