Just another voice

Thursday, December 27, 2012

Idea: HPC cluster using virtualization to run jobs...


I've been peripherally involved in our institution's HPC cluster recently.  One of the persistent issues we've faced is resource allocation on the cluster, in two ways: constraining jobs to only the resources allocated to them, and migrating jobs between nodes for optimal node usage and overall cluster throughput.

For example, when scheduling an HPC job, the job requester is asked to give an estimate of what resources they think they'll consume.  Let's say that job A is scheduled with a request for 8 CPUs and 64 GB of RAM.  The job scheduling software then has to find a spot on the cluster where it can reserve that capacity for job A.  Job A is started, but so far it's only used 2 CPUs and 8 GB.

The scheduling software knows this, and it'd be nice if it could start another job, say job B, in that reserved but currently underutilized resource slot.  However, the scheduler can't be sure that job A won't ramp up and use its full allocation at some point in the future, and if it slips a small job into that underutilized spot, it risks starving both job A and job B.

If, however, the scheduler could migrate job B to another spot, or even checkpoint job B to disk and restart it later if no other spot is available, then it could start job B and still know it could run job A to completion even if job A suddenly starts using its full resource allocation.

The reverse case is also a problem: a job trying to use more than its requested resource allocation.  Let's say that job A requested 2 CPUs and 8 GB, then suddenly starts to consume 12 CPUs and 64 GB.  The scheduler is aware of this and can go in and kill the job, but that's a reactive process, and in the meantime job A may be starving other jobs placed on the same node on the assumption that job A would only use what it originally requested.  There are certain hard constraints that can be applied (ulimits, for instance), but they only work for certain types of resources (e.g. memory) and not others (e.g. CPU).
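As a rough illustration of that gap, about the most you can do per process with plain ulimits is something like the following sketch (the job name is a placeholder):

# Cap address space and cpu *time* for a job before exec'ing it.
# Note there's no ulimit for cpu *count* or cpu share -- that's the gap.
ulimit -v $((64 * 1024 * 1024))   # 64 GB virtual memory cap, in KB
ulimit -t 86400                   # total cpu seconds, not a concurrency cap
exec ./job_a                      # hypothetical job binary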

Being able to place hard limits around a job would have several advantages: the job could be prevented from interfering with other jobs; approaching a hard limit could signal the scheduler to migrate one or more jobs to better balance resource consumption, or to freeze one or more jobs; and approaching a hard limit could also signal that a job should be allocated more resources.

Basically, hard resource limits for all job resources, plus the ability to migrate jobs around and/or freeze and restore them, would make an HPC cluster scheduler look a lot more like a single-machine NUMA scheduler, with all the advantages thereof.

I've been thinking a bit about this.  On Linux, a couple of ideas which might be useful are:

a) Use containers to constrain jobs, and use process checkpoint / restart to migrate them between cluster nodes.
  • Problem: both containers and process checkpoint / restart are still pretty immature, and still require out-of-tree kernel patches.  (The resource-constraint half does have in-tree building blocks, though; see the sketch just below.)
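A minimal sketch of that constraint half, driving the kernel's cgroup interface directly (the paths, cpu/memory numbers, and job name are all placeholders):

mkdir /sys/fs/cgroup/cpuset/jobA
echo 0-1 > /sys/fs/cgroup/cpuset/jobA/cpuset.cpus     # pin to 2 cpus
echo 0   > /sys/fs/cgroup/cpuset/jobA/cpuset.mems     # and one memory node
mkdir /sys/fs/cgroup/memory/jobA
echo 8G  > /sys/fs/cgroup/memory/jobA/memory.limit_in_bytes
echo $JOB_PID > /sys/fs/cgroup/cpuset/jobA/tasks      # then enroll the job
echo $JOB_PID > /sys/fs/cgroup/memory/jobA/tasks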

b) Use virtualization, and run jobs in virtual containers:
  • Virtualization is very mature and well tested.  Cloning virtual machines can be very fast indeed.  Migrating running virtual machines around a virtualization cluster is mature and well supported.  Even checkpointing / restarting machines is well supported. 
  • Resource constraints are a natural fit: limiting a virtual machine's resources is very mature and baked into the virtualization model.  There are also some pretty good tools for dynamically adding memory and/or vCPUs to a virtual guest if that facility is desired.  So being able to either constrain or dynamically grow per-job (per-VM) resources is well supported (see the sketch after this list).
  • Question: exactly how much of a full virtual machine would be needed?  Or put another way, how minimal an environment and/or kernel would suffice?  Could we run with a really minimal user environment, maybe consisting of just a few libraries, a minimal filesystem, and no processes other than those required for the job itself?  It'd be instructive here to look at some of the different chroot sandboxing solutions.  Also, how minimal a kernel could be used?  Assume fully paravirtualized guest I/O here.
  • Question: how much performance impact would there be?  There have been a lot of recent performance improvements in KVM's networking and I/O infrastructure, e.g. the virtio and device passthrough work.  Put another way, what percentage performance hit would be tolerable in exchange for the potential extra job manageability?
  • Question: with these resource-allocation and migration abilities, scheduling jobs on a cluster starts to sound more and more like NUMA scheduling.  Could a cluster job scheduler start using more NUMA-derived code?  And what about things like MPI jobs, where dozens or hundreds of sub-processes are scheduled to run in parallel, all as part of the same job?
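Much of the per-VM plumbing is already off-the-shelf.  A sketch with stock libvirt/KVM tooling (the domain name, target host, and paths are made up, and exact option spellings may vary by libvirt version):

virsh setmem jobA 16G --live                       # balloon guest memory up/down
virsh setvcpus jobA 4 --live                       # hot-add vCPUs, if the guest allows it
virsh migrate --live jobA qemu+ssh://node2/system  # live-migrate to another node
virsh save jobA /var/lib/jobs/jobA.sav             # checkpoint to disk ...
virsh restore /var/lib/jobs/jobA.sav               # ... and restart later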

Designs for testing:
  • Ceph is getting to be quite an interesting distributed HPC filesystem, and even has an explicit facility (the RBD block device) designed to run KVM machines on top of it; a rough sketch of that follows this list.
  • Use Open Grid Scheduler or the like for scheduling
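The Ceph piece might look roughly like this, assuming qemu's rbd driver (the pool and image names are made up):

qemu-img create -f raw rbd:hpc-pool/jobA.img 10G   # the VM disk lives in the Ceph pool
qemu-kvm -m 8192 -smp 2 -drive file=rbd:hpc-pool/jobA.img,if=virtio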
Something like this:

[Image]



-- Pat

Idea: profiling gcj vs. JVM memory use on Linux

So, gcj is supposedly a compiler which compiles Java code to standard ELF format libraries and executables.  How does this affect memory use in Java code?  Specifically, each instance of a JVM would normally load local copies of all its Java libraries, right?  And since standard Java libraries are just other jars and/or class files, they wouldn't be shared.  So multiple JVM instances would load multiple copies of the Java libs, one per JVM.

But, since gcj can compile libraries to standard Linux-style shared libraries, aka .so's, this might be different.  Remember that .so's are only loaded into memory once, no matter how many processes link them into their local address space.

Ergo, if I'm correct about this, there could be a significant memory savings for gcj-compiled libraries vs. standard JVM instances when there are many instances running.  This could affect total memory use, the amount of I/O to start processes, the amount of I/O for paging and swapping, and possibly memory cache locality.
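This should be straightforward to poke at.  A hypothetical sketch (Foo.java and Main.java are placeholder names, and I haven't verified the exact flags):

# Compile a Java class to a regular ELF shared object ...
gcj -shared -fPIC -o libfoo.so Foo.java

# ... link a natively compiled main program against it ...
gcj --main=Main -o main Main.java -L. -lfoo

# ... then run two instances and compare shared vs. private pages.
LD_LIBRARY_PATH=. ./main & LD_LIBRARY_PATH=. ./main &
pmap -x $(pgrep -x main) | grep libfoo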

This could have an impact on things like HPC (yeah I know, HPC in Java?  I've seen it, though ...), or maybe even allow something interesting like a web / app service portal where different EARs or WARs are loaded as different Linux processes.

I'll see if I can do something around this in the next couple of months.

BTW -- I don't know offhand how practical this would be, but I rather like the idea of isolating different EARs / WARs into different Linux processes.  Right now it's annoying to look at a thrashing JBoss instance, say, and try to isolate performance effects to a particular EAR.  It'd be notably easier to do this if each EAR were a different process.  Comments on the practicality of this?

-- Pat


Avadon: The Black Fortress review

[Image]

So I just spent some time playing Avadon: The Black Fortress.

This is one of the games I bought from the Humble Indie Bundle: neat stuff!  The games are always DRM-free, you pay what you decide to pay, and you decide how to split your payment between the game devs, a suite of charities, and Humble itself.  The games are cross-platform, including Mac and Android, and have recently gained integration with your Steam account.  They offer a new bundle a couple of times a year; please check them out!

Type of game: fantasy-style computer RPG.  You control a main character and up to three companions, whom you choose from a pool and whose development you largely control.

Gameplay: combat is turn-based, and the controls are pretty easy to use.  I like how the turn-based system makes combat less stressful.  There's a hard cap on character growth, so figuring out how to train and develop your characters is important to get the best end result, although later in the game you get an option to rebuild (retrain) your characters.

World: the world is reasonably large and complex.  More areas open to exploration as you complete quests and advance the storyline.  In general, monsters do not respawn, which means options for experience-point farming for character growth are limited.  There are more locked doors and chests than you'll be able to open, so save lockpicks for occasions where there's a plot consequence.  Also, it's easy to miss some hidden options or to just plain not find everything.  I recommend going back over areas a second time using one of the online maps.

Storyline: it feels like a bait and switch, ending up more and more morally ambiguous.  The dialog options along the way offer hints of changing the outcome of events, but things seem to happen nearly the same no matter which option you choose, especially in the early and mid game.  This makes the game feel very railroaded.

It seems like the only real effect of your choices is whether or not you keep the loyalty of your companions.  Doing so means making some pretty morally ambiguous choices yourself, such as slaughtering a clan who are the blood enemies of one companion, in vengeance for what they did to him.  Again, this feels railroaded.

It turns out you can choose between a couple of different endgame options, and you really want to keep your companions' loyalty for certain choices to be practical in the endgame.


Summary:  good gameplay, but a little bit of a frustrating experience.   Still worth the time spent.

-- Pat

Monday, November 5, 2012

I've been looking at mapping tools for tabletop RPG gaming, and by far the most capable one I've found to date is Maptool, from rptools: http://www.rptools.net/index.php?page=maptool.  I've been hoping to make use of it with a couple of laptops in a face-to-face game.  While it's conceived as an aid to long-distance gaming, its capabilities (especially the fog of war and vision handling) seem useful even for an in-person game.

It's really quite a powerful tool, and offers:
  • shared maps visible to all players
  • movement measuring
  • movable tokens representing players and non-players
  • vision blocking
  • light sources
  • fog of war with automatic reveal
The demos at their tutorial site are pretty exciting, and got me really pumped to start using this.

I've spent quite a bit of time working to convert a traditional module, which I have in paper form, into a Maptool campaign.  I chose one of the basic dungeon-crawl-style adventures, the old D&D module B2, Keep on the Borderlands.

I started by scanning the two-page map from the module, stitching the scans together, and then creating maps in the Maptool campaign.  Actually, I created two maps: one for the surface, and one for the dungeon layers.

Then, for the dungeon layer in particular, I thought the easiest path forward was to just import the scan as a background / map image.  This risks revealing information to the players, like traps or secret doors, which I'm dealing with by either covering up key stuff on the map or just extending trust to the players.

I then drew a huge vision-blocking rectangle over the whole dungeon map, and carved out exceptions tunnel by tunnel and room by room.  I further covered the whole map with fog of war.

That said, there have been a number of issues that've prevented me from using this map so far:

  • The user interface is, at best, clunky and unfriendly:
    • Consistently, operations that should be obvious in the UI are non-obvious.
      • Scrolling the map is right-click-drag, instead of left-click-drag on any open map area.  There are no visual controls to scroll (e.g. the Google Maps rosette).
      • To reverse the sense of an operation, you hold down the shift key.  For example, drawing a vision-blocking rectangle with shift held down removes vision blocking from an area; holding shift while clicking and dragging a token changes its facing.  Things like this should be separate controls.
    • The UI is inconsistent.  For example, clicking a token to select it doesn't always work on the first click.
    • When moving tokens, enabling snap-to-grid visually displaces the token destination from the apparent token position, and tokens can end up somewhere different than expected.  With large vision-blocking areas set, players can easily lose a token by accident in an area where they can't see it to click on it again.
    • If there are multiple 'player' tokens on the map, it's not always obvious which token's vision is being used to reveal things on each player's map.  Nor does there appear to be a way to select which player token is used for which connected player.
  • I've had problems with different Java versions on Linux.
  • I've had problems loading a large campaign map from network-connected clients on a LAN.
  • The interactions between vision blocking and fog of war are weird.  I have instances on my map where stuff completely covered by a vision-blocking rectangle is shown as having been revealed to the players.  Reapplying fog of war to these areas has no effect.  Note that players can't actually see into the area.

Unfortunately, the problems above have been significant enough that so far they've stopped me from using this.  I'm really bummed, 'cause I'm enthused about the possibilities here and have put a lot of time into making this map.  I'm still working on it, and hope to be successful at some point.  I'm willing to put in quite a bit of time and learning for my own sake; however, I'm not going to impose a tool that requires a considerable UI learning curve on my players.  We'll have to see how this goes.


-- Pat

Sunday, October 28, 2012

Port Knocking


I just implemented port knocking on one of my internet facing servers.  A decent article on it is here: http://www.linuxjournal.com/magazine/implement-port-knocking-security-knockd

The server setup was pretty dang simple.  Make sure your firewall has a few fallback rules before you implement this, of course.  In particular, hand-enter one for the subnet of the workstation you're ssh'ing in from, and make sure you have an ESTABLISHED,RELATED rule.  My default rules look like this:

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p icmp -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state NEW -m tcp -p tcp --source MYNET/MYNETMASK --dport 22 -j ACCEPT
iptables -A INPUT -j REJECT --reject-with icmp-host-prohibited


The only wrinkle I found was writing scripts to do the knocking.  I used the package recommended in the article, 'knockd', which comes with a client, 'knock'.  Unfortunately, I found that the 'knock' utility knocked too fast.  Additionally, MacPorts didn't appear to have a port of it for OS X.  What I did instead was write a script around "nc -z" with some sleeps, which is both portable and works.

Here's a (sanitized) copy of my knock script.  No, these aren't the real ports I'm using :-)

#!/bin/sh
# Knock on each port in sequence.  The sleep keeps the knocks far
# enough apart that knockd reliably sees them in order.
for p in 12345 23456 34567 45678 56789 ; do
    nc -v -z myserver.domain.top $p
    sleep 1
done



Just remember to put enough sequence time (parameter 'seq_timeout') in your /etc/knockd.conf for the above to finish.  With a 1-second sleep, try a seq_timeout of about 2-3 times the number of ports you're knocking.
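For reference, a minimal sketch of the matching /etc/knockd.conf, using the same (sanitized) port sequence; the iptables command is just one way to open the door:

[options]
    logfile = /var/log/knockd.log

[openSSH]
    sequence    = 12345,23456,34567,45678,56789
    seq_timeout = 15
    tcpflags    = syn
    command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT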

Luck!



Friday, October 26, 2012

My review of Square / SquareUp

So, a while ago I downloaded a copy of the Square payment application for my Android phone.  For those unfamiliar with it, it's a small credit card payment app for your Apple or Android smartphone.  The home page is here: https://squareup.com/.  I should note that Intuit also has a competing product.

The basic idea is that you set up an account with Square, link it to a bank account, and accept payments via your phone.  They send you a little card reader that plugs into your smartphone's microphone/headset jack, but you can also manually enter a credit card number, which is fortunate.

With the default account, they charge a small per-transaction fee, I think 2.75%.  That's less than traditional credit card processors charge, but they limit how much you can process per week.

I generally like it - my only complaint is that I've never gotten the funky card reader to work, and have always had to manually key in the credit card numbers.  I've tried this with 3 different card reader widgets (two from Square, and one I bought at Radio Shack) and on three different Android devices: my HTC Evo 4G, my wife's HTC Evo Slide, and a hacked Nook Color tablet running CyanogenMod.  This is doubly a bummer, 'cause Square charges you an extra 1/4% on manually entered card numbers.

Nonetheless, give it a try.  It's a nifty tool.

-- Pat

Thursday, October 18, 2012

A short rant about Fedora:

Sorry Fedora, you've lost me. 

I've been a Linux devotee since around '93 or so; my first install was the SLS distro, from a stack 'o floppies downloaded painfully via modem.  I've been a professional sysadmin specializing mostly in Linux (with some other Unix flavors) for about a decade and a half now.

Ergo, I've gotten very familiar with lots of different linux setups.

Unfortunately, though, Fedora is moving sharply away from what I consider to be one of the best features of Linux and Unix-like OSes overall: discoverability.

Don't know where something is configured?  A "grep -R" or "find . -type f | xargs grep" has a very high likelihood of finding it.  Once you get a hit, you have some context you can use in a "man -k" or Google search.  Voila, one more configuration issue solved.
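For example, tracking down where the hostname gets set used to be as easy as:

grep -R $(hostname) /etc 2>/dev/null
man -k hostname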

Unfortunately, with the sharp turn toward opaque tools (journald, systemd, NetworkManager), Fedora is breaking away from this, and sadly the rest of Linux land appears to be following.

I always rather despised having to use special tools to find errors in Windows (event viewer); now I have to do the same in Fedora (journalWhateverTheHeckItsNamed).  I hate not having the system config and startup in an obvious, known, transparent, and readable location (what, /etc/rc.d/init.d/ is empty?!  Blargh, I hate having to find, and then try to read, XML.)  Whups -- I need to do something slightly outside the 'standard' laptop network setup, like a static IP and a bridge for a VM setup; uh, where do I configure that again?  Oh, my system moved and needs to be renamed; uh, where is that at?

Sure, change is good, yadda yadda.  Transparency, though, is better.

-- Pat

p.s. and da** solaris for doing the same, but earlier and worse.  Can't even edit /etc/nsswitch.conf on sol 11 anymore ...