In this talk we will present an HPC framework that provides new strategies for resource monitoring and job scheduling.
This framework includes a scalable, lightweight monitoring tool that analyzes the platform's compute nodes and detects risks of contention between them. The tool is designed for large-scale systems: it can be mapped statically to the system topology, but it can also self-organize in a way that is transparent to system users. This capability, together with its fault tolerance, makes the monitor highly resilient.
Our framework also includes an application scheduler that can subscribe to monitor events, such as congestion thresholds being crossed, and combine this information with application-level information to improve application execution through dynamic process migration and malleability.
A description of the architecture, as well as a practical evaluation of the proposal, will be presented in the talk.