On average, as much as 10% of the computational resources of a supercomputer may be idle. These resources can be used by low-priority jobs, provided that the number of computational nodes and the computation time they require are small enough to fit into the supercomputer's schedule. Many problems can be solved with non-parallel methods, such as Monte Carlo simulation in physics, so the corresponding jobs require at most one node. Such jobs can therefore run on idle supercomputer nodes if their execution time is sufficiently short. We propose a system that executes non-parallel jobs in containers and uses container migration tools to save the current state of the containers, interrupt the execution, and resume it on different nodes. The system has two components: an agent program that runs on computational nodes, executes non-parallel jobs inside containers, and saves them before the allotted time is over; and a control program that gives the jobs to instances of the agent program and keeps track of their status. Our simulation shows that the system can utilize most of the idle resources with relatively small impact on regular jobs. We implemented a prototype of the system using Docker containers and the checkpoint mechanism available in Docker as an experimental feature. The work was supported by RFBR grant 18-37-00502.
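The division of work between the agent and the control program can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` class, the `run_in_window` and `schedule` functions, the safety `margin`, and the first-come-first-served queue are all hypothetical choices made for illustration. The two command builders use the experimental `docker checkpoint create` and `docker start --checkpoint` CLI commands; the sketch only constructs the command lines and does not execute them, and it assumes the checkpoint data is somehow made available on the node where the job resumes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining: float          # compute time still needed, in minutes
    status: str = "queued"    # queued -> checkpointed -> done
    checkpoints: int = 0      # how many times the job was saved and interrupted


def checkpoint_cmd(container: str, checkpoint: str) -> list[str]:
    # Experimental Docker CLI: save the container state under the given name.
    return ["docker", "checkpoint", "create", container, checkpoint]


def restore_cmd(container: str, checkpoint: str) -> list[str]:
    # Experimental Docker CLI: start an existing container from a checkpoint.
    return ["docker", "start", "--checkpoint", checkpoint, container]


def run_in_window(job: Job, window: float, margin: float) -> float:
    """Agent side (hypothetical): run `job` in an idle window of `window` minutes.

    The last `margin` minutes are reserved for saving the container, so the
    job never runs past the allotted time. Returns the compute time used.
    """
    usable = max(window - margin, 0.0)
    used = min(usable, job.remaining)
    if used == 0:
        return 0.0                      # window too short, job is untouched
    job.remaining -= used
    if job.remaining <= 0:
        job.status = "done"
    else:
        job.status = "checkpointed"     # state saved, to be resumed elsewhere
        job.checkpoints += 1
    return used


def schedule(jobs: list[Job], windows: list[float], margin: float = 1.0) -> list[Job]:
    """Control side (hypothetical): give queued jobs to idle windows in order
    and put unfinished jobs back at the end of the queue."""
    queue = deque(jobs)
    for window in windows:
        if not queue:
            break
        job = queue.popleft()
        run_in_window(job, window, margin)
        if job.status != "done":
            queue.append(job)
    return jobs


if __name__ == "__main__":
    jobs = schedule([Job("mc-a", 12.0), Job("mc-b", 3.0)], [6.0, 5.0, 10.0, 4.0])
    for job in jobs:
        print(job.name, job.status, job.checkpoints)
```

In this toy run the longer job does not fit into the first idle window, so it is checkpointed once and finishes in a later window, while the shorter job completes without interruption. A real agent would also have to account for the time needed to write and transfer the checkpoint, which the fixed `margin` only approximates.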