Thursday, December 27, 2012

Idea: HPC cluster using virtualization to run jobs...


I've been peripherally involved in our institution's HPC cluster recently.  One of the persistent issues we've faced is resource allocation on the cluster, in two ways: constraining jobs to only the resources allocated to them, and being able to migrate jobs between nodes for optimal node usage and overall cluster throughput.

For example, when scheduling an HPC job, the job requester is asked to give an estimate of what resources they think they'll consume.  Let's say that job A is scheduled with a request for 8 CPUs and 64 GB of RAM.  The job scheduling software then has to find a spot on the cluster where it can reserve that capacity for job A.  Job A is started, but so far it has only used 2 CPUs and 8 GB.

The scheduling software knows this, and it'd be nice if it could start another job, say job B, in that reserved but currently underutilized resource slot.  However, the scheduler can't be sure that job A won't ramp up and use its full allocation at some point in the future, and if it slips another small job into that underutilized spot, it risks starving both job A and job B.

If, however, the scheduler could migrate job B to another spot, or even checkpoint job B to disk and restart it later if no other spot is available, then it could start job B and still know it could run job A to completion even if job A suddenly starts using its full resource allocation.
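
As a rough illustration, the backfill decision changes roughly like this (a toy sketch in Python; the Job/Node fields and the can_preempt flag are invented for this example, not taken from any real scheduler):

    from collections import namedtuple

    # Toy sketch of the backfill decision; field names are illustrative only.
    Job  = namedtuple("Job",  "name req_cpus req_mem_gb")
    Node = namedtuple("Node", "cpus mem_gb reserved_cpus reserved_mem_gb "
                              "used_cpus used_mem_gb")

    def can_place(job, node, can_preempt):
        """May this job be started on this node right now?"""
        # Capacity left if every earlier job eventually uses its full request.
        if (job.req_cpus <= node.cpus - node.reserved_cpus and
                job.req_mem_gb <= node.mem_gb - node.reserved_mem_gb):
            return True

        # Capacity left based on what the earlier jobs are actually using now.
        fits_now = (job.req_cpus <= node.cpus - node.used_cpus and
                    job.req_mem_gb <= node.mem_gb - node.used_mem_gb)
        # Over-committing the reservation is only safe if we can later migrate
        # or checkpoint/restart a job once the idle reservation is claimed.
        return fits_now and can_preempt

    # Example: node with 16 CPUs / 128 GB; job A reserved 8/64 but uses 2/8.
    node  = Node(cpus=16, mem_gb=128, reserved_cpus=8, reserved_mem_gb=64,
                 used_cpus=2, used_mem_gb=8)
    job_b = Job("B", req_cpus=12, req_mem_gb=96)
    print(can_place(job_b, node, can_preempt=False))  # False: might starve A
    print(can_place(job_b, node, can_preempt=True))   # True: B can be evicted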

Also a problem is the case of a job trying to use more than its requested resource allocation.  Let's say that job A requested 2 CPUs and 8 GB, and then suddenly starts to consume 12 CPUs and 64 GB.  The scheduler is aware of this and can go in and kill the job, but that's a reactive process, and in the meantime job A may be starving other jobs that were placed on the same node on the assumption that job A would only use the resources it originally requested.  There are certain hard constraints that can be applied (ulimits, for instance), but they only work for certain types of resources (e.g. memory) and not others (e.g. CPU).
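
For instance, a memory cap can be bolted on from user space with setrlimit before the job process starts, but there's no equivalent rlimit for how many CPUs a job may use concurrently (RLIMIT_CPU only caps total CPU seconds).  A rough Python sketch, with a made-up job command:

    import resource
    import subprocess

    def run_with_memory_cap(cmd, mem_bytes):
        """Launch a job with a hard address-space limit -- a rough stand-in
        for a per-job memory cap.  Note there is no rlimit that caps how many
        CPUs a process may use at once."""
        def apply_limits():
            # Runs in the child between fork() and exec().
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        return subprocess.Popen(cmd, preexec_fn=apply_limits)

    # Hypothetical job command; cap it at 8 GB of address space.
    proc = run_with_memory_cap(["./job_a", "--input", "data.in"], 8 * 1024**3)
    proc.wait()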

Being able to place hard limits around a job would have several advantages: it could be prevented from interfering with other jobs; approaching a hard limit could be used as a signal to the scheduler to migrate one or more jobs to better balance resource consumption, or to freeze one or more jobs; and approaching a hard limit could also be used as a signal to allocate more resources to a job.

Basically, hard resource limits for all job resources, plus the ability to migrate jobs around and/or freeze/restore them, makes an HPC cluster scheduler look a lot more like a single-machine NUMA scheduler, with all the advantages thereof.

I've been thinking a bit about this.  On Linux, a couple of ideas that might be useful are:

a) Use containers to constrain jobs, and use process checkpoint / restart to migrate them between cluster nodes.
  • Problem: both the container and the process checkpoint/restart technologies are still pretty immature, and still require out-of-tree kernel patches.

b) Use virtualization, and run jobs in virtual containers:
  • Virtualization is very mature and well tested.  Cloning virtual machines can be very fast indeed.  Migrating running virtual machines around a virtualization cluster is mature and well supported.  Even checkpointing / restarting machines is well supported. 
  • Resource constraints are a natural fit: limiting a virtual machine's resources is very mature and baked into the virtualization model.  There are also some pretty good tools for dynamically adding memory and/or vCPUs to a virtual guest if that facility is desired.  So being able to either constrain or dynamically grow per-job (per-VM) resources is well supported (see the sketch after this list).
  • Question: exactly how much of a full virtual machine would be needed?  Or, put another way, how minimal an environment and/or kernel would suffice?  Could we run with a really minimal user environment, maybe consisting of just a few libraries, a minimal filesystem, and no processes other than those required for the job itself?  It'd be instructive here to look at some of the different chroot sandboxing solutions.  Also, how minimal a kernel could be used?  Assume fully paravirtualized guest I/O here.
  • Question: how much performance impact would there be?  There have been a lot of recent performance improvements in KVM's networking and I/O infrastructure, e.g. the virtio and device passthrough work.  Put another way, what percentage performance hit would be tolerable in exchange for the potential extra job manageability?
  • Question: with these resource allocation and migration abilities, scheduling jobs on a cluster starts to sound more and more like NUMA scheduling.  Could a cluster job scheduler start using more NUMA-derived code?  What about things like MPI jobs, where dozens or hundreds of sub-processes are scheduled to run in parallel, all as part of the same job?
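
As a quick illustration of how much of this is already exposed by existing tooling, the libvirt Python bindings can do live resource changes, live migration, and save/restore today.  This is only a sketch: the host and guest names are made up, and the resize calls assume the guest was defined with higher maximums and a working memory balloon.

    import libvirt

    # Hypothetical host and guest names, purely for illustration.
    src  = libvirt.open("qemu:///system")
    dest = libvirt.open("qemu+ssh://node02.cluster/system")
    dom  = src.lookupByName("jobvm-a")

    # Grow the guest's resources if the job approaches its current limits.
    dom.setMemory(64 * 1024 * 1024)                        # KiB, i.e. 64 GB
    dom.setVcpusFlags(8, libvirt.VIR_DOMAIN_AFFECT_LIVE)   # hot-add vCPUs

    # Live-migrate the job's VM to another node to rebalance the cluster.
    dom.migrate(dest, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

    # Or checkpoint it to shared storage and restore it later:
    #   dom.save("/cluster/checkpoints/jobvm-a.img")
    #   dest.restore("/cluster/checkpoints/jobvm-a.img")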

Designs for testing:
  • Ceph is getting to be quite an interesting distributed HPC filesystem, and even has an explicit facility designed for running KVM machines on top of it (a rough sketch of an RBD-backed guest disk follows below the diagram).
  • Use Opengrid or the like for scheduling
Something like this:

[Image]
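
For the Ceph piece, a guest disk can be backed directly by an RBD image; here's a rough sketch via the libvirt Python bindings (the pool, image, monitor, and guest names are all made up, and Cephx authentication is omitted):

    import libvirt

    # Hypothetical RBD-backed disk definition for a job VM.
    rbd_disk_xml = """
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='hpc-pool/jobvm-a'>
        <host name='ceph-mon1.cluster' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
    </disk>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("jobvm-a")
    # Hot-attach the Ceph-backed disk to the running guest.
    dom.attachDeviceFlags(rbd_disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)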



-- Pat

Idea: profiling gcj vs. JVM memory use on Linux

So, gcj is supposedly a compiler which compiles Java code to standard ELF-format libraries and executables.  How does this affect memory use in Java code?  Specifically, each instance of a JVM would normally load local copies of all its Java libraries, right?  And since standard Java libraries are just other jars and/or class files, they wouldn't be shared.  So multiple JVM instances would load multiple copies of the Java libs, one copy per JVM.

But since gcj can compile libraries to standard Linux-style shared libraries, a.k.a. .so's, this might be different.  Remember that a .so is only loaded into memory once, no matter how many processes link it into their local address spaces.

Ergo, if I'm correct about this, there could be a significant memory savings for gcj-compiled libraries vs. standard JVM instances when many instances are running.  This could affect total memory use, the amount of I/O to start processes, the amount of I/O for paging and swapping, and possibly memory cache locality.
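
One way to test the hypothesis would be to sum the shared vs. private mappings for each process out of /proc/<pid>/smaps and compare a set of plain JVM instances against the same code compiled with gcj.  A quick sketch (the gcj binary name in the usage comment is hypothetical):

    import re
    import sys

    def smaps_summary(pid):
        """Sum shared vs. private memory (in kB) from /proc/<pid>/smaps."""
        shared = private = 0
        with open("/proc/%s/smaps" % pid) as f:
            for line in f:
                m = re.match(r"(Shared|Private)_(Clean|Dirty):\s+(\d+) kB", line)
                if m:
                    if m.group(1) == "Shared":
                        shared += int(m.group(3))
                    else:
                        private += int(m.group(3))
        return shared, private

    # Usage: compare plain JVM instances against a gcj-compiled binary, e.g.
    #   python smaps_compare.py $(pgrep -d' ' java) $(pgrep -d' ' myapp-gcj)
    for pid in sys.argv[1:]:
        shared, private = smaps_summary(pid)
        print("pid %s: shared %d kB, private %d kB" % (pid, shared, private))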

This could have an impact on things like HPC (yeah, I know, HPC in Java?  I've seen it, though ...), or maybe even allow something interesting like a web/app service portal where different EARs or WARs are loaded as different Linux processes.

I'll see if I can do something around this in the next couple of months.

BTW -- I don't know offhand how practical this would be, but I rather like the idea of isolating different EARs/WARs into different Linux processes.  Right now it's annoying to look at a thrashing JBoss instance, say, and try to isolate performance effects to a particular EAR.  It'd be notably easier to do this if each EAR were a different process.  Comments on the practicality of this?

-- Pat


Avadon: The Black Fortress review

[Image]

So I just spent some time playing Avadon: The Black Fortress.

This is one of the games I bought from a Humble Indie Bundle: neat stuff!  The games are always DRM-free, you pay what you decide to pay, and you decide how to split your payment between the game devs, a suite of charities, and Humble itself.  The games are cross-platform, including Mac and Android, and recently got integration with your Steam account.   They offer a new bundle a couple of times a year; please check them out!

Type of game: fantasy-style computer RPG.  You control a main character and up to three companions, whom you choose from a pool and whose development you largely control.

Gameplay: combat is turn-based, and the controls are pretty easy to use.   I like how the turn-based system makes combat less stressful.  There's a hard cap on character growth, so figuring out how to train and develop your characters can be important to getting the best end result, although later in the game you get an option to rebuild (retrain) your characters.

World: The world is reasonably large and complex.  More areas open up for exploration as you do different quests and develop the storyline.   In general, monsters do not respawn, which means opportunities to farm experience points for character growth are limited.  There are more locked doors and chests than you'll be able to open, so save lockpicks for occasions where there's a plot consequence.  Also, it's easy to miss some hidden options or to just plain not find everything; I recommend going back over areas a second time using one of the online maps.

Storyline: it feels like a bait and switch, growing more and more morally ambiguous.  The dialog options along the way hint at changing the outcome of events, but things seem to happen nearly the same no matter which option you choose, especially in the early and mid game.  This makes the game feel very railroaded.

It seems like the only real effect of your choices is whether or not you keep the loyalty of your companions.  Doing so means making some pretty morally ambiguous choices yourself, such as slaughtering a clan who are the blood enemies of one companion, in vengeance for what they did to him.  Again, this feels railroaded.

It turns out you can choose among a couple of different endgame options, and you really want to keep your companions' loyalty for certain choices to be practical in the endgame.


Summary:  good gameplay, but a bit of a frustrating experience.   Still worth the time spent.

-- Pat