A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors
Date
2007
Publisher
The Eurographics Association
Abstract
General purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating point capabilities, and Folding@Home has even released a GPU port of its protein folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5× performance penalty and saves energy for GPGPU but is completely transparent to general graphics and does not affect the performance of the games that drive the market.
Description
@inproceedings{:10.2312/EGGH/EGGH07/055-064,
booktitle = {SIGGRAPH/Eurographics Workshop on Graphics Hardware},
editor = {Mark Segal and Timo Aila},
title = {{A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors}},
author = {Sheaffer, Jeremy W. and Luebke, David P. and Skadron, Kevin},
year = {2007},
publisher = {The Eurographics Association},
ISSN = {1727-3471},
ISBN = {978-3-905673-47-0},
DOI = {10.2312/EGGH/EGGH07/055-064}
}