I've been peripherally involved in our institution's HPC cluster recently. One of the persistent issues we've faced is resource allocation on the cluster, in two ways: constraining jobs to only the resources allocated to them, and being able to migrate jobs between nodes for optimal node usage and overall cluster throughput.
For example, when scheduling an HPC job, the job requester is asked to give an estimate of what resources they think they'll consume. Let's say that job A is scheduled with a request for 8 CPUs and 64 GB of RAM. The job scheduling software then has to find a spot on the cluster where it can reserve that capacity for job A. Job A is started, but so far it's only used 2 CPUs and 8 GB.
The scheduling software knows this, and it'd be nice if it could start another job, say job B, in that reserved but currently underutilized resource slot. However, the scheduler can't be sure that job A won't ramp up and use its full allocation at some point in the future, and if it tries to slip a small job into that underutilized spot, it risks starving both job A and job B.
If, however, the scheduler could migrate job B to another spot, or even checkpoint job B to disk if there isn't another spot available and restart it later, then it could start job B and know it could still run job A to completion even if job A suddenly starts using its full resource allocation.
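To make that concrete, here's a toy sketch of the decision I'd like the scheduler to be able to make: backfill job B into job A's under-used reservation only if B can be migrated or checkpointed out of the way when A finally claims what it asked for. All names and numbers here are invented for illustration, not taken from any real scheduler:

```python
# Hypothetical backfill check: all names and numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    reserved_cpus: int   # what the requester asked for
    reserved_gb: int
    used_cpus: int       # what the job is actually consuming right now
    used_gb: int
    migratable: bool     # can we move/checkpoint it if the node fills up?

def can_backfill(node_cpus, node_gb, running, candidate):
    """Start `candidate` in the slack left by over-reserving jobs, but only
    if the candidate can be migrated or checkpointed away when the original
    reservations are actually claimed."""
    used_cpus = sum(j.used_cpus for j in running)
    used_gb = sum(j.used_gb for j in running)
    slack_cpus = node_cpus - used_cpus
    slack_gb = node_gb - used_gb
    fits_now = (candidate.reserved_cpus <= slack_cpus and
                candidate.reserved_gb <= slack_gb)
    # Without migration, every existing reservation has to be honoured in
    # full, which usually leaves no room; with migration, fitting into the
    # current slack is enough.
    return fits_now and candidate.migratable

# Example: job A reserved 8 CPUs / 64 GB but only uses 2 CPUs / 8 GB.
job_a = Job("A", 8, 64, 2, 8, migratable=False)
job_b = Job("B", 2, 8, 0, 0, migratable=True)
print(can_backfill(node_cpus=8, node_gb=64, running=[job_a], candidate=job_b))  # True
```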
Also a problem is the case of a job trying to use more than its requested resource allocation. Let's say that job A requested 2 CPUs and 8 GB, and then it suddenly starts to consume 12 CPUs and 64 GB. The scheduler is aware of this and can go in and kill the job, but this is a reactive process, and in the meantime job A may be starving other jobs that were placed on the same node on the assumption that job A would only use the resources it originally requested. There are certain hard constraints that can be applied (ulimits, for instance), but they only work for certain types of resources (e.g. memory) and not others (e.g. CPU).
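As a concrete illustration of that limitation (a sketch only, with a made-up job command): the POSIX rlimit interface can put a hard cap on a child process's address space, but RLIMIT_CPU only caps total CPU seconds consumed, so there's no way to express "use at most 2 CPUs at once" with it:

```python
# Sketch: rlimits can cap memory, but can't cap concurrent CPU usage.
import resource
import subprocess

MEM_LIMIT = 8 * 1024**3  # 8 GB address-space cap

def limit_child():
    # Hard cap on virtual memory: works as a per-job memory constraint.
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
    # RLIMIT_CPU caps *total CPU seconds*, not how many CPUs are used at
    # once, so it can't stop a job from fanning out over 12 cores:
    # resource.setrlimit(resource.RLIMIT_CPU, (3600, 3600))

# Run the job's command under the memory limit ("./job_a" is illustrative).
subprocess.run(["./job_a"], preexec_fn=limit_child)
```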
Being able to place hard limits around a job would have several advantages: the job could be prevented from interfering with other jobs; approaching a hard limit could be used as a signal to the scheduler to migrate one or more jobs to better balance resource consumption, or to freeze one or more jobs; and approaching a hard limit could also be used as a signal to allocate more resources to a job.
Basically, hard resource limits for all job resources, plus the ability to migrate jobs around and/or freeze/restore them, makes an HPC cluster scheduler look a lot more like a single-machine NUMA scheduler, with all the advantages thereof.
I've been thinking a bit about this. On Linux, a couple of ideas which might be useful are:
a) Use containers to constrain jobs, and use process checkpoint / restart to migrate them between cluster nodes.
- Problem: both the container and process checkpoint / restart technologies are still pretty immature, and still require out-of-tree kernel patches (see the cgroup sketch after this list for the resource-constraint side).
b) Use virtualization, and run jobs inside virtual machines:
- Virtualization is very mature and well tested. Cloning virtual machines can be very fast indeed. Migrating running virtual machines around a virtualization cluster is mature and well supported. Even checkpointing / restarting machines is well supported.
- Resource constraints are a natural fit: limiting a virtual machine's resources is very mature and baked into the virtualization model. There are also some pretty good tools for dynamically adding memory and/or vCPUs to a virtual guest if that facility is desired. So being able to either constrain or dynamically grow per-job (per-VM) resources is well supported (see the libvirt sketch after this list).
- Question: exactly how much of a full virtual machine would be needed? Or put another way, how minimal an environment and/or kernel would be needed? Could we run with a really minimal user environment, maybe consisting of just a few libraries, a minimal filesystem, and no processes other than those required for the job itself? It'd be instructive here to look at some different chroot sandboxing solutions. Also, how minimal a kernel could be used? Assume fully paravirtualized guest I/O here.
- Question: how much performance impact would there be? There are a lot of recent performance improvements in KVM's networking and I/O infrastructure, e.g. the virtio and device passthrough work. Put another way, what percentage performance hit would be tolerable in exchange for the potential extra job manageability?
- Question: with these resource allocation and migration abilities, scheduling jobs on a cluster starts to sound more and more like NUMA scheduling. Could a cluster job scheduler start using more NUMA-derived code? What about things like MPI jobs, where dozens or hundreds of sub-processes are scheduled to run in parallel, all as part of the same job?
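On the option (a) side, the resource-constraint piece can at least be sketched with the cgroup cpuset and memory controllers where a kernel provides them; it's the checkpoint / restart piece that still needs out-of-tree patches. This is a rough sketch only: the mount points, job name, PID, and limits are all assumptions, not a tested recipe:

```python
# Sketch: constrain a job with cgroup v1 controllers (paths, names, and
# limits assumed, not verified on any particular distro). Run as root.
import os

JOB = "jobA"                      # illustrative job name
CPUS = "0-1"                      # pin the job to 2 CPUs
MEMS = "0"                        # allowed NUMA memory node(s)
MEM_LIMIT = str(8 * 1024**3)      # 8 GB memory cap

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# Assumes the cpuset and memory controllers are mounted at these paths.
cpuset_dir = f"/sys/fs/cgroup/cpuset/{JOB}"
memory_dir = f"/sys/fs/cgroup/memory/{JOB}"
os.makedirs(cpuset_dir, exist_ok=True)
os.makedirs(memory_dir, exist_ok=True)

# cpuset needs both cpus and mems set before tasks can be attached.
write(f"{cpuset_dir}/cpuset.cpus", CPUS)
write(f"{cpuset_dir}/cpuset.mems", MEMS)
write(f"{memory_dir}/memory.limit_in_bytes", MEM_LIMIT)

# Attach the job's process (PID is illustrative) to both groups.
pid = "12345"
write(f"{cpuset_dir}/cgroup.procs", pid)
write(f"{memory_dir}/cgroup.procs", pid)
```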
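On the option (b) side, the migration and dynamic-resize pieces are already scriptable through libvirt. Here's a rough sketch using the libvirt Python bindings; the node names, domain names, and new sizes are placeholders, and growing a guest live assumes it was defined with enough maximum memory/vCPU headroom and supports ballooning / vCPU hotplug:

```python
# Sketch: live-migrate a job VM and grow another one in place via libvirt.
# Host/domain names and sizes are placeholders, not a tested recipe.
import libvirt

src = libvirt.open("qemu+ssh://node01/system")
dst = libvirt.open("qemu+ssh://node02/system")

# Live-migrate job B's VM off node01 to make room for job A's reservation.
dom_b = src.lookupByName("jobB")
dom_b.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

# Alternative, if no other node has room: checkpoint job B to disk instead,
# e.g. dom_b.save("/scratch/jobB.sav"), and restore it later.

# Or grow job A's VM in place rather than moving anything.
dom_a = src.lookupByName("jobA")
dom_a.setMemoryFlags(16 * 1024 * 1024, libvirt.VIR_DOMAIN_AFFECT_LIVE)  # KiB
dom_a.setVcpusFlags(4, libvirt.VIR_DOMAIN_AFFECT_LIVE)
```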
Designs for testing:
- Ceph is getting to be quite an interesting distributed filesystem for HPC, and even has an explicit facility (RBD) designed for running KVM machines on top of it.
- Use Opengrid or the like for scheduling
-- Pat