La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers
dc.contributor.author | Guo, Hanqi | en_US |
dc.contributor.author | Di, Sheng | en_US |
dc.contributor.author | Gupta, Rinku | en_US |
dc.contributor.author | Peterka, Tom | en_US |
dc.contributor.author | Cappello, Franck | en_US |
dc.contributor.editor | Hank Childs and Fernando Cucchietti | en_US |
dc.date.accessioned | 2018-06-02T18:02:44Z | |
dc.date.available | 2018-06-02T18:02:44Z | |
dc.date.issued | 2018 | |
dc.description.abstract | We design and implement La VALSE, a scalable visualization tool to explore tens of millions of records of reliability, availability, and serviceability (RAS) logs, for IBM Blue Gene/Q systems. Our tool is designed to meet various analysis requirements, including tracing causes of failure events and investigating correlations from the redundant and noisy RAS messages. La VALSE consists of multiple linked views to visualize RAS logs; each log message has a time stamp, physical location, network address, and multiple categorical dimensions such as severity and category. The timeline view features scalable ThemeRiver and arc diagrams that enable interactive exploration of tens of millions of log messages. The spatial view visualizes the occurrences of RAS messages on hundreds of thousands of elements of Mira (compute cards, node boards, midplanes, and racks) with view-dependent level-of-detail rendering. The multidimensional view enables interactive filtering of different categorical dimensions of RAS messages. To achieve interactivity, we develop an efficient and scalable online data cube engine that can query 55 million RAS logs in less than one second. We present several case studies on Mira, a top supercomputer at Argonne National Laboratory. The case studies demonstrate that La VALSE can help users quickly identify the sources of failure events and analyze spatiotemporal correlations of RAS messages at different scales. | en_US |
dc.description.sectionheaders | Session 4 | |
dc.description.seriesinformation | Eurographics Symposium on Parallel Graphics and Visualization | |
dc.identifier.doi | 10.2312/pgv.20181099 | |
dc.identifier.isbn | 978-3-03868-054-3 | |
dc.identifier.issn | 1727-348X | |
dc.identifier.pages | 91-100 | |
dc.identifier.uri | https://doi.org/10.2312/pgv.20181099 | |
dc.identifier.uri | https://diglib.eg.org:443/handle/10.2312/pgv20181099 | |
dc.publisher | The Eurographics Association | en_US |
dc.title | La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers | en_US |