Fault and efficiency prediction in high-performance computing

High resource usage is thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated its role in system failures, largely because there is no comprehensive resource monitoring tool that resolves resource use by job and node. This project studies log data from the DFKI Kaiserslautern high-performance cluster to assess the predictability of adverse events (node failure, GPU freeze) and of energy usage, and to identify the most relevant data within those logs.

The second supervisor for this work is Joachim Folz.


  • deep learning
  • event data modelling
  • time series
  • survival modelling
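
To make the survival modelling angle concrete, the following is a minimal sketch of a Kaplan-Meier estimate of node survival. It is illustrative only: the function name `kaplan_meier` and the uptime values are hypothetical and not taken from the cluster logs. A node that was still running when observation ended is treated as censored.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one failure.

    durations: time until failure or end of observation, per node
    observed:  True if the node failed, False if it was censored
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # failures and censorings per time point
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: fraction of at-risk nodes surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone who failed or was censored at t leaves the risk set
        at_risk -= leaving[t]
    return curve


# Hypothetical uptimes in hours; False marks a node still running at cut-off
uptimes_h = [5, 8, 8, 12, 20, 20]
failed = [True, True, False, True, False, True]
curve = kaplan_meier(uptimes_h, failed)
for t, s in curve:
    print(f"t={t:>3} h  S(t)={s:.3f}")
```

In a real study the durations would be derived from failure events in the logs, and covariates such as resource usage would call for a regression-type survival model rather than this unconditional estimate.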

Data available

A Prometheus-compatible system is in place, and we have …
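
Time series from a Prometheus-compatible system are usually retrieved through the Prometheus HTTP API, whose `/api/v1/query_range` endpoint returns a JSON "matrix" of labelled series. The sketch below builds such a request URL and parses a response without contacting any server. The host `prometheus.example`, the metric name `gpu_power_watts` and the `node` label are assumptions made for illustration; the actual metrics exposed by the cluster are not stated here.

```python
import json
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step):
    """Build a Prometheus HTTP API range-query URL."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Turn a query_range JSON response into {labels: [(timestamp, value), ...]}."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query did not succeed")
    series = {}
    for item in body["data"]["result"]:
        # Use the sorted label pairs as a hashable key for each series
        key = tuple(sorted(item["metric"].items()))
        # Prometheus encodes sample values as strings
        series[key] = [(float(ts), float(v)) for ts, v in item["values"]]
    return series


# Hypothetical endpoint and metric name, for illustration only
url = build_range_query_url(
    "http://prometheus.example:9090", "gpu_power_watts", 0, 60, "30s"
)

# Hand-written sample response in the documented "matrix" shape
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {"__name__": "gpu_power_watts", "node": "node01"},
            "values": [[0, "120.5"], [30, "118.0"]],
        }],
    },
})
series = parse_matrix(sample)
print(url)
print(series)
```

Keeping the parsing separate from the HTTP call makes it easy to test against stored responses before pointing the code at a live monitoring system.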


Sebastian Vollmer
Professor for Applications of Machine Learning

My research interests lie at the interface of applied probability, statistical inference and machine learning.