Fault and efficiency prediction in high-performance computing
High resource usage is thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated its effect on system failures, largely because no comprehensive monitoring tool resolves resource use by job and node. This project studies log data from the DFKI Kaiserslautern high-performance cluster to assess the predictability of adverse events (node failure, GPU freeze) and of energy usage, and to identify the most relevant data within the logs.
The second supervisor for this work is Joachim Folz.
Keywords
- deep learning
- event data modelling
- time series
- survival modelling
Data available
A Prometheus-compatible system is in place and we have