14th Scheduling for Large Scale Systems Workshop

26-28 Jun 2019 Bordeaux (France)

sciencesconf.org:scheduling2019:271456

In this talk we will present an HPC framework that provides new strategies for resource monitoring and job scheduling.

This framework includes a scalable, lightweight monitoring tool that analyzes the platform's compute nodes and detects risks of contention between them. The tool is designed for large-scale systems. It can be mapped to the system topology statically, but it also has a self-organizing capability that is transparent to system users. This capability, together with fault tolerance, makes our monitor a highly resilient tool.

Our framework also includes an application scheduler that can subscribe to monitor events, such as congestion thresholds, and combine this information with application-level information to improve application execution by applying dynamic process migration and malleability.
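As an illustration only, the subscription mechanism described above can be sketched as a small publish/subscribe loop. This is a minimal sketch under our own assumptions, not the framework's actual interface: the class names `Monitor` and `Scheduler`, the methods `subscribe`, `report` and `on_congestion`, the 0.8 threshold, and the node and process names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Callback = Callable[[str, float], None]


class Monitor:
    """Hypothetical monitor: notifies subscribers when a node's load
    reaches the threshold they registered."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[float, Callback]] = []

    def subscribe(self, threshold: float, callback: Callback) -> None:
        self._subscribers.append((threshold, callback))

    def report(self, node: str, load: float) -> None:
        # Fire every callback whose threshold is reached by this sample.
        for threshold, callback in self._subscribers:
            if load >= threshold:
                callback(node, load)


class Scheduler:
    """Hypothetical scheduler: migrates one process away from a
    congested node to a spare node, if one is available."""

    def __init__(self, placement: Dict[str, str], spare_nodes: List[str]) -> None:
        self.placement = dict(placement)      # process name -> node name
        self.spare_nodes = list(spare_nodes)  # nodes with free capacity

    def on_congestion(self, node: str, load: float) -> None:
        victims = [p for p, host in self.placement.items() if host == node]
        if not victims or not self.spare_nodes:
            return  # nothing to move, or nowhere to move it
        self.placement[victims[0]] = self.spare_nodes.pop(0)


monitor = Monitor()
scheduler = Scheduler({"rank0": "node-a", "rank1": "node-a"}, ["node-b"])
monitor.subscribe(0.8, scheduler.on_congestion)

monitor.report("node-a", 0.5)  # below the threshold: no migration
monitor.report("node-a", 0.9)  # threshold reached: one process is migrated
```

In this sketch, migration is reduced to updating a placement table. A real system would also have to move process state, and malleability (changing the number of processes at run time) is not modeled here.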

A description of the architecture, as well as a practical evaluation of the proposal, will be presented in the talk.

Subject	:	oral
Topics	:	Scheduling
Keywords	:	Adaptive scheduling ; monitoring ; malleability