Monday Mar 16, 2009

In my previous post, I talked about the new installer that is included with Sun Grid Engine 6.2u2. Lubos, one of our core team (as opposed to Service Domain Manager or QA) engineers in Prague, has just posted a couple of videos of the new installer. The first one shows how to make sure the new installer can be used with the machines you're planning to use for your cluster. Because the new installer can install an entire cluster at once, it has to be able to contact all the machines destined for the cluster, and that's where the setup comes in. The second one actually shows off the new installer. Lubos also has some screenshots of the new installer posted.

Thursday Mar 05, 2009

Sun Grid Engine 6.2u2 is now available. If you're not excited, you should be. First off, don't let the name fool you. 6.2u2 is not just bug fixes. It's a full feature release, and contains some great features. What features? Glad you asked.

First and foremost, job submission verifiers (JSVs). It's a feature we added specifically for TACC, but it's one that will be useful for almost everyone. In fact, I suspect that we'll discover it's the answer to some of the classic Sun Grid Engine problems. What is it? Before 6.2u2, there was no way to prevent a job from being submitted. It was (and still is) possible to choose not to schedule a job after it's been submitted, but before 6.2u2, that's all you could do. With 6.2u2 and JSV, you now have the option to insert a step between submission and acceptance. With that step, you can choose to accept or reject the job submission, but you can also choose to modify the job before accepting it, and that's where the magic comes in.
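To make the accept, reject, or modify choice concrete, here is a minimal Python sketch of the kind of decision a JSV makes. It is not the real JSV interface: the `verify` function, the dictionary of job parameters, the verdict strings, and the policy itself (a slot limit and a default project) are all hypothetical and only illustrate the idea.

```python
from typing import Dict, Tuple

MAX_SLOTS = 64               # hypothetical site limit
DEFAULT_PROJECT = "general"  # hypothetical default project


def verify(job: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return a verdict plus the (possibly modified) job parameters."""
    job = dict(job)  # work on a copy so the caller's request is untouched

    # Reject: the submission never becomes a job at all.
    if int(job.get("slots", "1")) > MAX_SLOTS:
        return "reject", job

    # Modify: fix up the request, then let it through.
    if "project" not in job:
        job["project"] = DEFAULT_PROJECT
        return "modify", job

    # Accept: nothing to change.
    return "accept", job


for request in (
    {"name": "sim", "project": "physics"},
    {"name": "sim"},
    {"name": "big", "slots": "128"},
):
    print(verify(request))
```

The interesting case is the middle one: the job is accepted, but only after the verifier has filled in what the user left out.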

The verification step is handled through scripts or binaries. There's a new submission option, -jsv, that adds a JSV to the submission. That means you can pick up JSVs from anywhere that you can stash a submission option: most notably the global sge_request file, your user sge_request file, and the directory's sge_request file, but also DRMAA native specification, DRMAA job category, the enigmatic -@ switch, and, of course, the command line itself. The -jsv switch is cumulative, so if you have one in several of those places, several JSVs will be run for your submission. It's worth noting that all of the above listed JSV sources are controlled by the user, except the global sge_request file, and even that can be overridden with the -clear switch.

So far, we've only talked about the client side. JSVs can also come in on the server side. In the global host configuration an administrator can configure a single JSV. Unlike on the client side where every JSV is started from scratch with every job submission, on the server side the JSV is started once and queried repeatedly. The reason is that on the client side, performance isn't a big issue, but on the server side, the cost of forking and execing the JSV for every job submission can have a huge impact. By keeping the JSV running, we save that cost. The big advantage of the server-side JSV is that users can't circumvent it. If you really need to enforce a policy with a JSV, the server side is the place to do it.
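The difference between the two models can be sketched in a few lines of Python. This is only an illustration of "start per submission" versus "start once and query repeatedly"; it does not speak the real JSV protocol. The child process here is a stand-in that upper-cases a job name, and every name in the sketch is hypothetical.

```python
import subprocess
import sys

# Stand-in verifier: reads one job name per line and echoes it upper-cased.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def verify_with_fresh_process(job_name: str) -> str:
    """Client-side style: pay for a new process on every submission."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=job_name + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout.strip()


class PersistentVerifier:
    """Server-side style: start the process once, then query it repeatedly."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def verify(self, job_name: str) -> str:
        # One line out, one line back; the child stays alive in between.
        self.proc.stdin.write(job_name + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


verifier = PersistentVerifier()
print([verifier.verify(name) for name in ("alpha", "beta")])
verifier.close()
print(verify_with_fresh_process("gamma"))
```

Both paths give the same answers. The persistent version simply pays the process start-up cost once instead of once per job.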

Now, if you're thinking fast, you might question the point of the server-side JSV when users can change everything about the job using qalter after it's submitted. Well, so did we. When you configure a server-side JSV, users are no longer allowed to modify jobs after submission unless you specifically grant the ability to do so, and even then it's limited to the job attributes that you allow them to modify.

JSV is a huge topic, and I could probably go on for days about it. Instead I'll save it for a white paper and move on.

The next big feature in 6.2u2 is the new installer. You now have the option of using the old interactive text-based installer or a new graphical installer. The graphical installer has several important advantages. First, it lets you install an entire cluster at once. It actually sits on top of the auto-installer and reuses that same functionality to install remote nodes. The graphical installer, however, will first verify that all the nodes are reachable before the installation starts, so the installation won't quietly hang on an unreachable node. It also accepts wildcarded host name and IP address ranges, which makes installing a huge cluster much simpler.

The third major feature is that we've added support for Microsoft Windows Vista (Ultimate and Enterprise) and Server 2003R2 and 2008. Both 32-bit and 64-bit versions are available. Harald (who you should encourage to start blogging!) worked really hard on ironing out the issues with the changes in the OS. We still rely on SFU for the Windows execution daemons, except that it's now called SUA.

The fourth big feature is job-level parallel job resource requests. Before 6.2u2, whenever a parallel job requested a resource, SGE would implicitly multiply that resource request by the number of assigned slaves (because each slave requests the resource on the host where it runs). That makes sense with, say, memory, where requesting 4GB really means that every slave should have 4GB. It doesn't make any sense for other things, like some software licenses. Now with 6.2u2, the administrator can flag a resource as job level, meaning that it is not multiplied by the number of assigned slaves when requested by a parallel job. In most cases, a resource that shouldn't be multiplied for one job shouldn't be multiplied for any job. There may be exceptions to the rule, but I doubt there will be many. I'd love to hear your feedback, though.
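The arithmetic is simple enough to sketch in Python. The function and parameter names below are illustrative, not SGE configuration keywords.

```python
def total_request(amount: float, slaves: int, job_level: bool) -> float:
    """Total amount of a resource consumed by one parallel job.

    A job-level resource is counted once per job; any other resource is
    multiplied by the number of assigned slaves.
    """
    return amount if job_level else amount * slaves


# 4 GB of memory per slave across 8 slaves: multiplied.
print(total_request(4, slaves=8, job_level=False))  # 32
# One software license for the whole job: not multiplied.
print(total_request(1, slaves=8, job_level=True))   # 1
```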

The last two new features aren't so much features as improvements. Starting with 6.2u2, the 64-bit Linux binaries use the jemalloc library instead of the default Linux malloc. The performance and memory footprint impact is significant, in some cases as much as 20% improvement. Also, starting with 6.2u2, the Linux binaries use poll() instead of select() in the commlib. For some flavors of Linux, the use of select() made it difficult to scale past a couple thousand hosts. With the commlib now using poll(), I've seen SGE scale well over 6000 Linux nodes.
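As general background rather than a description of the commlib code: select() works on fixed-size descriptor sets (FD_SETSIZE, commonly 1024 on Linux), while poll() takes an array of whatever length the caller provides. Here is a minimal Python sketch of the poll() pattern, using a socket pair as a stand-in for a real connection. It assumes a Unix-like system, since `select.poll` is not available on Windows.

```python
import select
import socket

# A connected pair of sockets stands in for one client connection.
reader, writer = socket.socketpair()

# Register interest in "readable" events; the list can grow without a
# compile-time cap on the number of descriptors.
poller = select.poll()
poller.register(reader.fileno(), select.POLLIN)

writer.sendall(b"load report")

# poll() returns (fd, event mask) pairs for descriptors that are ready.
message = b""
for fd, events in poller.poll(1000):  # timeout in milliseconds
    if fd == reader.fileno() and events & select.POLLIN:
        message = reader.recv(1024)

print(message)

poller.unregister(reader.fileno())
reader.close()
writer.close()
```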

And on top of all that, there is the usual pile of bug fixes. A handful of qmaster and scheduler issues cropped up recently in 6.2 and 6.2u1, but with 6.2u2 those should all now be resolved.

I highly recommend giving 6.2u2 a try, if for no reason other than JSV. Let me know what you think!

Thursday Feb 05, 2009

One of the new features coming in Grid Engine 6.2u2 is job submission verification (JSV). The basic idea is that on both the client side and the server side, you have the ability to add scripts that can read through all the job submission options and accept, reject, or modify the job. JSV will open up a whole new world of possibilities that didn't exist before, and it will largely end the need for qsub wrapper scripts.

Because the server-side JSV scripts are executed by the qmaster for every job, there are performance considerations that must be taken into account. In order to limit the performance impact, the qmaster will manage the JSV scripts the same way load sensor scripts are handled, i.e. they are started once and kept alive as a separate process instead of starting them once per job. Nonetheless, what happens inside the scripts can still have a big impact on qmaster performance.

In a test, Roland (who still isn't blogging!) set up some DRMAA submission clients to hammer the master with job submissions. With no server-side JSV scripts, the clients were able to do 900 job submissions per second\*. With a simple server-side Perl JSV script to change the job name, the clients were only able to submit 700 jobs per second. A similar JSV script written in Tcl yielded the same results. With a similar JSV script written in Bourne shell, however, the clients were only able to submit 3 jobs per second. No, that isn't a typo. While languages like Perl and Tcl are able to process numbers and strings natively, Bourne shell has to rely on forking off other commands. Those forks are expensive and, even in a simple JSV script, yield major performance penalties. For these reasons, I actually recommend the Java™ language for writing server-side JSV scripts. Not only do you get access to all the great built-in and external libraries, but you also get access to JGDI, letting you talk to the qmaster without forking an SGE command-line tool. (Thanks to Jython, JRuby, Rhino, et al., you can get the same benefits from languages other than just the Java language.)

Let me repeat that point to make sure it comes through loud and clear. If you use a shell script as a server-side JSV script, you will trash your cluster's job submission rate. That's not just for DRMAA jobs or for certain users. That's for the entire cluster.

On the client side, the story is a bit simpler but still similar. For every job submission, each client-side JSV will be started. (An array job counts as a single job in this regard.) That makes sense because qsub is started once for every job submission, and the JSV scripts can't outlive the qsub that launched them.

For DRMAA the implications are a little different. The JSV scripts are still started for every job submission, even though the DRMAA client remains running between submissions. (A DRMAA bulk job is an array job and hence still counts as a single job in this respect.) Roland used DRMAA clients in his test because they're very fast at job submissions. Using client-side JSV scripts affects that in much the same way as on the server side. And as with the server-side scripts, shell scripts have more of an effect than scripts written in a higher-level language. If you figure there's about 200ms overhead for every fork & exec, you could easily add several seconds to each job submission. A DRMAA client without client-side JSV scripts can easily submit over 100 jobs per second. With even a single client-side JSV script that runs no further commands, your submission rate drops to less than 5 jobs per second. Use with caution!
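The arithmetic behind those numbers is easy to check. Here is a rough Python sketch, assuming submissions happen one after another and each fork & exec adds a fixed cost on top of the normal submission time. The function name and the simple additive model are illustrative only, not a measurement.

```python
def submissions_per_second(base_rate: float, fork_cost_s: float, forks: int = 1) -> float:
    """Submission rate once every job pays for `forks` fork & exec calls."""
    seconds_per_job = 1.0 / base_rate + fork_cost_s * forks
    return 1.0 / seconds_per_job


# 100 jobs/s without a JSV, 200 ms per fork & exec, one JSV, no further commands.
print(round(submissions_per_second(100, 0.2), 2))            # 4.76
# A shell JSV that forks 15 more commands adds over 3 seconds per job.
print(round(submissions_per_second(100, 0.2, forks=16), 2))  # 0.31
```

At 200ms per fork, a single JSV alone caps the rate just under 5 jobs per second, no matter how fast the client was before.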

The JSV feature in 6.2u2 is extremely powerful, but as I've explained, you have to use it with care. When used with appropriate caution, however, JSV provides a fairly easy answer to some of the traditionally thorny issues for SGE administrators.

\* Roland achieved that submission rate with a Sun x4100M2 with two dual core AMD 2.8GHz processors running Solaris 10 as the master.

Wednesday Jan 07, 2009

Owen Taylor (formerly) of GigaSpaces has put together an excellent proof of concept using GigaSpaces XAP and Sun Grid Engine. Using Sun Grid Engine, the PoC is able to grow and shrink the size of the GigaSpaces cluster dynamically according to changing load conditions. The PoC monitors GigaSpaces via JMX and then uses DRMAA to submit new instances to SGE or stop existing ones. Read more about it.

Tuesday Jan 06, 2009

The last couple of weeks before the holidays I worked on an interesting project. It involved assembling pretty much everything Sun offers for HPC into a single coherent demo and throwing in Amazon EC2 to boot. This post will explain what I did and how I did it. Let's start at the beginning.

One of the new offerings from Sun is the Sun HPC Software. Beneath the excessively generic name is a complete, integrated stack of HPC software components. Currently there are two editions: the Sun HPC Software, Linux Edition (aka Project Giraffe) and the Sun HPC Software, Solaris Developer Edition. (A Sun HPC Software, Solaris Edition and Sun HPC Software, OpenSolaris Edition will be following shortly.) The Linux edition is exactly what the name implies. It's a full stack of open source HPC tools bundled into a CentOS image, ready to push out to your cluster. The Solaris developer edition is a slightly different animal. It is targeted at developers interested in writing HPC applications for Solaris. The Solaris developer edition is a virtual machine image (available for VMware and VirtualBox) that includes Solaris 10 and a pre-installed suite of Sun's HPC products, including Sun Grid Engine, Sun HPC ClusterTools, Sun Studio, and Sun Visualization, all integrated together.

For this demo, I used the Solaris developer edition. The end goal was to produce a version of the virtual machine image that was capable of automatically borrowing resources from a local pool or from the cloud in order to test or deploy developed HPC applications. Inside the developer edition virtual machine, there are already two Zones that act as virtual execution nodes for testing applications. That's a nice start, but what about testing on real machines or a larger number of machines? That's where the resource borrowing comes in. In the end, I had a VM image that was capable of automatically borrowing and releasing resources first from a local pool and later from the cloud, on demand.

The first step was to get the developer edition running as-is. Sounded simple enough. The first wrinkle was that I was doing this demo on a Mac. The regular VMware Player is not available for Mac, so I had to download an eval copy of VMware Fusion. Once I had Fusion installed, I was able to bring up the developer edition VM without a hitch.

Step 2 was to get the VM networked. The network configuration for the developer edition beta 1 is such that the global and non-global Zones can see each other, but nobody can get into or out of the VM. Getting the networking working was probably the hardest part of the demo, and honestly, I can't tell you how I finally did it. Per the suggestion of the pop-up dialogs from VMware, I installed the VMware Tools in the VM's Solaris instance. That changed the name of the primary interface from pcn0 to vmxnet0, but didn't actually help. Solaris was still unable to plumb the interface. After twiddling the VM's network settings several times and doing several reconfiguration boots, I eventually ended up with a working vmxnet1 interface (and a dead pcn0 and vmxnet0). As usual in such adventures, I'd swear that the last thing I did before it started working should not have had any appreciable effect. Oh, well. It worked, and I wasn't interested in understanding why.

Now that I had a functional network interface, the next step was to reinstall the Sun Grid Engine product. The VM comes with a preinstalled instance, but this demo requires features not enabled in a default installation, like what the VM provides. I left the original cell (default) intact and installed a new cell (hpc) with the -jmx and -csp options. -jmx enables the Java thread in the qmaster that serves up the JGDI API over JMX. I needed JGDI so that the demo GUI that I was building could receive event updates from the qmaster about job and host changes. With Sun Grid Engine 6.2, I was unable to successfully connect to the JMX server unless I installed the qmaster with certificate-based security, hence the -csp option. After the installation was complete, I then had to do the usual CSP certificate juggling, plus a new wrinkle. In order to connect to the JMX server, I also had to create a keystore for the connecting user with $SGE_ROOT/util/sgeCA/sge_ca -ks <user>. There's a quirk to the sge_ca -ks command, though. By default, it fails, explaining that it can't find the certificates. The reason is that the path to the certificates is hard-coded in the sge_ca script to a ridiculous default value. To change it to the correct value, I had to use the -calocaltop switch. After the certificates were squared away, I installed execution daemons in both Zones. At least that part was easy.

The next thing I did was to create some more Zones. Yes, I know this demo was supposed to be using real machines from a local pool and the cloud. Because it's a demo on a laptop, the "local machines" had to be equally portable. Because of firewall issues, I also wanted to have a backup for the cloud. In an effort to be clever, I moved the file systems for the two existing Zones onto their own ZFS volumes. I wanted to create the new Zones as cloned snapshots of the old Zones. Unfortunately, it turns out that even though the man page for zfs(1M) says that it's possible, the version of Solaris installed in the VM is the last version on which it isn't possible. After chasing my tail a bit, I decided to just do it the old fashioned way instead of trying to force the new fangled way to work.

Now that I had six non-global Zones running, the next step was to get Service Domain Manager installed. It is neither installed nor included in the developer edition VM, so I had to scp it over from my desktop. Technically, I could probably have managed to download it directly from the VM, but I had already downloaded it to my desktop before I started. For the Service Domain Manager installation, I followed Chansup's blog rather than the documentation. Chansup's blog posts detail exactly what steps to follow without the distraction of all the other possibilities that the docs explain. Following the steps in the blog, I was able to get the Service Domain Manager master and agents installed with little difficulty. The hardest part is that the sdmadm command has extremely complicated syntax, and it took a while before I could execute a command without having the docs or blog in front of me as a reference. To prove that the installation worked, I manually forced Service Domain Manager to add one of the new Zones to the existing Sun Grid Engine cluster, and much to my shock and wonderment, it worked.

The last step of VM (re)configuration was to configure the Service Domain Manager with a local spare pool and a cloud spare pool and a set of policies to govern when resources should be moved around. This step proved about as tricky as I expected. As one of the original architects and developers of the product, I had a good idea of what I wanted to do and how to make it happen, but the syntax and the details were still problematic. The syntax was the first hurdle. The docs have issues with both understandability and accuracy, and Chansup's blog was too narrowly focused for my purposes. After I poked around a bit, I figured out how to do what I wanted, but actually doing it was the next challenge. What I wanted to do was create two MaxPendingJobsSLO's...

We interrupt your regularly scheduled blog post to bring you a public service announcement. Please, for your own well being and the well being of others who might use your software, test all of your code contributions thoroughly on all supported platforms, and have them reviewed by an experienced member of the development team before committing, especially if you're working on the Firefox source base. This point in the blog post is the last time I saved my text before completing the post. Before I could save it again, Firefox segfaulted, causing me to lose a significant amount of work. What follows is a downtrodden, half-hearted attempt to complete the post again. We now return you to your regularly scheduled blog post.

What I wanted to do was create two MaxPendingJobsSLO's for the Sun Grid Engine instance. The first would post a moderate need (50) when the pending job list was more than 6 jobs long. The second would post a high need (99) when the pending job list was more than 12 jobs long. I also wanted to have a local spare pool with a low (20) PermanentRequestSLO and a low FixedUsageSLO, and a cloud spare pool with a moderate (60) PermanentRequestSLO and a moderate FixedUsageSLO. The idea was that when the Sun Grid Engine cluster was idle, all the resources would stay where they were. When the pending job list was longer than 6 jobs, resources would be taken from the local spare pool. When the pending job list was longer than 12 jobs, additional resources would be taken from the cloud spare pool. When the pending job list grew shorter, the resources would be returned to their spare pools. In theory. (The philosophy of setting up Service Domain Manager SLOs is a full topic unto itself and will have to wait for another blog post.)
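The plan can be modeled in a few lines of Python. This is a sketch of the intended behavior only, assuming that a pool gives up a resource solely to a need higher than its own. It is not Service Domain Manager configuration syntax, and the function and constant names are hypothetical.

```python
from typing import List, Tuple

# (pending-job threshold, need posted) for the two MaxPendingJobsSLO's
CLUSTER_SLOS: List[Tuple[int, int]] = [(6, 50), (12, 99)]

LOCAL_POOL_NEED = 20  # low
CLOUD_POOL_NEED = 60  # moderate


def cluster_need(pending_jobs: int) -> int:
    """Highest need among the SLOs whose threshold is exceeded, else 0."""
    return max((need for limit, need in CLUSTER_SLOS if pending_jobs > limit), default=0)


def pools_that_lend(pending_jobs: int) -> List[str]:
    """Spare pools whose own need is outranked by the cluster's need."""
    need = cluster_need(pending_jobs)
    lenders = []
    if need > LOCAL_POOL_NEED:
        lenders.append("local")
    if need > CLOUD_POOL_NEED:
        lenders.append("cloud")
    return lenders


for pending in (0, 6, 7, 12, 13):
    print(pending, cluster_need(pending), pools_that_lend(pending))
```

With 7 pending jobs only the local pool is outranked (50 beats 20 but not 60). With 13, the need of 99 outranks both pools.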

The first problem I ran into was that Service Domain Manager does not allow a spare pool to have a FixedUsageSLO. An issue has been filed for the problem, but that didn't help me set up the demo. The result was that I had no way to force Service Domain Manager to take the local spare pool resources before the cloud spare pool resources. The best I could do was set the averageSlotsPerHost value for the MaxPendingJobsSLO's to a high number so that Service Domain Manager would only take hosts one at a time, rather than one from each spare pool simultaneously.

The next problem was quite unexpected. With the SLOs in place, I submitted an array job with 100 tasks. I waited. Nothing happened. I waited some more. Still nothing happened. It turns out that the MaxPendingJobsSLO only counts whole jobs, not job tasks like DRMAA would. The work-around was easy. I just had to be sure the demo submitted enough individual jobs instead of relying on array tasks.
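The difference between the two ways of counting is easy to see in a small Python sketch. The data layout and function names are hypothetical; the threshold of 6 comes from the first MaxPendingJobsSLO described earlier.

```python
from typing import Dict

THRESHOLD = 6  # pending jobs needed before the first SLO posts a need


def count_whole_jobs(pending: Dict[str, int]) -> int:
    """Count each job once, however many tasks it has."""
    return len(pending)


def count_tasks(pending: Dict[str, int]) -> int:
    """Count every task of every job."""
    return sum(pending.values())


# One array job with 100 tasks: a long queue by tasks, a single job by jobs.
array_submission = {"array_job": 100}
print(count_whole_jobs(array_submission) > THRESHOLD)  # False
print(count_tasks(array_submission) > THRESHOLD)       # True

# Eight individual jobs cross the threshold either way.
individual = {f"job_{i}": 1 for i in range(8)}
print(count_whole_jobs(individual) > THRESHOLD)        # True
```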

The last problem was one that I had been expecting. After a long pending job list had caused Service Domain Manager to assign all the available resources to the cluster, when the pending job list went to zero, the borrowed resources didn't always end up where they started. Service Domain Manager does not track the origin of resources. Fortunately, the issue is resolved by an easy idiom. I created a source property for every resource, and I set the value of the property to either "cloud", "spare", or "sge". I then set up the spare pools' PermanentRequestSLO's to only request resources with appropriate source settings. I also added a MinResourceSLO for the cluster that wants at least 2 resources that didn't come from a spare pool, just to be complete.
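Here is a small Python model of that idiom. It only illustrates the filtering logic; the `Resource` class, the service names, and the `request_back` function are hypothetical and are not Service Domain Manager APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    host: str
    source: str    # where the resource came from: "cloud", "spare", or "sge"
    location: str  # service currently holding the resource


def request_back(resources: List[Resource], wanted_source: str, pool: str) -> List[str]:
    """Reclaim only the resources whose source property matches the pool."""
    reclaimed = []
    for resource in resources:
        if resource.source == wanted_source and resource.location != pool:
            resource.location = pool
            reclaimed.append(resource.host)
    return reclaimed


# Everything has been lent to the cluster during a busy period.
hosts = [
    Resource("node1", "sge", "sge"),
    Resource("zone3", "spare", "sge"),
    Resource("ec2-a", "cloud", "sge"),
]

print(request_back(hosts, "spare", "local_pool"))  # ['zone3']
print(request_back(hosts, "cloud", "cloud_pool"))  # ['ec2-a']
print([(h.host, h.location) for h in hosts])
```

Because each pool asks only for resources carrying its own source value, nothing can end up in the wrong pool, and the cluster's own hosts are never claimed by either.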

With the SLOs in place, the configuration actually did what it was supposed to. When the cluster had enough pending jobs, hosts were borrowed first from the local spare pool and then from the cloud. When the pending jobs were processed, the resources went back to the appropriate spare pools. To make the configuration more demo-friendly, I changed the sloUpdateInterval for the Sun Grid Engine instance to a few seconds (from the default of a few minutes). I also changed the quantity for the spare pools' PermanentRequestSLO's to 1 so that they would only reclaim their resources one at a time, rather than all at once. With those last changes made, I was ready to move on to the UI.

The idea of the demo was to present a clear graphical representation of what was going on with Sun Grid Engine and Service Domain Manager. From past experience building a similar demo for SuperComputing, I knew that JavaFX™ Script was the best tool for the job. (OK. It's not the best tool for the job in a general sense, but I'm a long-time Java™ geek, I don't know Flash, and I didn't have any budget to buy tools. Under those constraints, it was the best I could do.) Before I could get to building the UI, though, I first needed a JGDI shim to talk to the qmaster. Richard kindly provided me with some JGDI sample code, and from there it was pretty easy. The hardest part was figuring out what the events actually meant. In the end, my shim registered for job add events (to recognize job submissions), task modified events (to recognize job tasks being scheduled), and job deleted events (to recognize job completions). It also registered for host added and deleted events to recognize when Service Domain Manager reassigned a host.

With the shim working smoothly, I turned to the actual UI. Given the complexity of the animations that I wanted to do, it was shockingly simple to achieve with JavaFX Script, especially considering that there was not yet a graphical tool equivalent to Matisse for Swing. Every bit of it was hand-coded, but it still was fast, easy, and came out looking great. In the end, the whole UI, counting the shim, was about 1500 lines of code, and about 500 lines of that was the shim. (JGDI is rather verbose, especially when establishing a connection to the qmaster.)

And with that, I ran out of time. The next step would have been to actually populate the cloud spare pool with machines provisioned from the cloud. Torsten graciously provided me a Solaris AMI that included Sun Grid Engine and Service Domain Manager. The plan was to pre-provision two hosts to populate the pool and then create a script that would provision an additional host each time the cloud pool dropped below two hosts and release a host every time it grew larger than two hosts. Now that the demo has been presented, the pressure is off, and other things are higher priority. I do plan, however, to eventually come back and put the last piece of the puzzle in place.
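The control logic for that script would have been tiny. Here is a Python sketch of just the decision step, with the actual provisioning and release calls left out, since that part was never written. The function name and the action strings are hypothetical.

```python
TARGET_POOL_SIZE = 2  # hosts to keep waiting in the cloud spare pool


def next_action(pool_size: int, target: int = TARGET_POOL_SIZE) -> str:
    """Decide what to do given how many hosts the cloud pool holds now."""
    if pool_size < target:
        return "provision"  # a host was borrowed: start a replacement
    if pool_size > target:
        return "release"    # a host came back: shut one down
    return "none"


# Walk through a borrow-and-return cycle.
for size in (2, 1, 2, 3, 2):
    print(size, next_action(size))
```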

Below is a video of the demo, showing how jobs can be submitted from the Sun Studio IDE, and how Sun Grid Engine and Service Domain Manager work together with the local spare pool and the cloud to handle the workload. The job that is being submitted is a short script that submits eight sleeper jobs. Because the MaxPendingJobsSLO ignores array tasks, I needed to submit a bunch of individual jobs, but I didn't want to have to click the submit button multiple times in the demo.

Filming the video turned out to be an interesting challenge unto itself. I did the screencap using Snapz Pro on the Mac. It has no problem with JavaFX Script or with VMware VMs, but it apparently can't film JavaFX Script running inside a VMware VM. I ended up having to twiddle the UI a bit so that I could run it directly on the Mac. That's why in the demo, when I switch from Sun Studio to the UI, I swap Mac desktops instead of Solaris workspaces. The voice over and zooming effects are courtesy of Final Cut, by the way.

Thursday Dec 18, 2008

Wondering what to get for that special someone who has everything? How about a sneak peek at soon-to-be-released Sun Grid Engine 6.2 update 2? That's right! Nothing says 'I love you,' like the SGE 6.2u2 Beta, and it's available just in time for the holidays. It makes a great stocking stuffer, and it's fun for the whole family. Download the SGE6.2u2 Beta today!

Monday Nov 03, 2008

There's a new Grid Engine blog aggregator on planets.sun.com. The idea is to capture all of the relevant Grid Engine blogs in a single place for easy access. It's similar to the aggregator on the OpenSolaris HPC Community site, except that the HPC one also contains general HPC blogs and blogs on other Sun HPC products as well. If you have suggestions for a blog that should be included in either, let me know.

Wednesday Oct 01, 2008

I recently rediscovered a hidden qconf option. I remember talking with the engineer when he implemented the option years ago, but because it was never documented, I forgot that it existed. A recent customer eval reminded me that it's there, and I think it's one worth sharing.

The hidden option is qconf -bonsai. It is a human-readable equivalent of qconf -sstree, which if you've looked at you'll know isn't even remotely human-readable. It prints the current share tree configuration using spacing to represent hierarchy.

Let's look at an example. This is the output from qconf -sstree for my home test cluster:

Now, as for why it's an undocumented feature, I suspect it's historical. It was originally added on a whim by one of the engineers and was just never fully embraced. I remember there being talk about changing the name of the switch and making it a documented feature, but I suspect that plan just got lost in the shuffle.

Thursday Sep 18, 2008

We've just added pricing to the Sun Store for Sun Grid Engine 6.2. Just go to the Get It tab, scroll down to the media kit, perpetual licenses, or subscription licenses section, and click the Get It button. On the Sun Store page you'll find the complete pricing information for that option.

Monday Aug 11, 2008

Since there's actually quite a lot of information out there about the Sun Grid Engine 6.2 release, I thought it might be useful to provide a single source for where to find it. (Actually, the completely revamped Sun Grid Engine is already a single source for this information, but you have to browse a bit to find it all.) Here ya go:

Friday Aug 08, 2008

Congratulations to the Sun Grid Engine team! The new 6.2 release is finally out. (Actually, it's been out for a couple of days. I'm just a tardy blogger.) To find out more about what's in it, check out my previous post, see the new, improved Sun Grid Engine product page, or listen to the podcast that Miha and I recorded. To download a copy of the software, pop on over to the Sun Download Center. The open source courtesy binaries will be made available shortly.

As if that were not exciting enough, there's also a chance to win a free t-shirt! Andy, our non-blogging engineering manager (Encourage him to start blogging next time you see him!), has put a bounty on 6.2 production clusters. For more information, see Chris' post on gridengine.info or Andy's original email. Act now! Supplies are limited!

Thursday Jul 31, 2008

In a previous post I gave a high-level overview of what features each new release of Grid Engine has brought to the table, including what's coming in 6.2. Since 6.2 is now just around the corner, I wanted to go into a bit more detail on why you want to be the first kid on your block to upgrade.

Let's just go through the features in detail, one by one:

Advance Reservation

The reason for advance reservation is that sometimes it's important to coordinate the availability of compute resources with external factors, such as people, facility, and/or equipment availability. If, for example, you're trying to process data from some celestial event during the event to help further focus the data gathering, you want the compute resources available while the event is occurring. That is exactly what advance reservation enables.

With 6.2, we introduce three new commands: qrsub, qrdel, and qrstat. qrsub lets users create new advance reservations. A reservation must have a duration or an end time. If a reservation does not request a certain start time, the start time is assumed to be now. When a user runs qrsub, the scheduler will attempt to insert the reservation into its resource schedule. If there's room, the reservation will be granted and assigned an id. If the resources are not available at that time, the reservation will be denied.

Once a user has been granted a reservation, there are several things he can do with it. qsub now has an option that allows users to submit a job to a given reservation. If the reservation is not yet active, i.e. it's for a future time, the job will remain pending until the reservation's start time. A job submitted to a reservation can only run on the resources that were assigned to the reservation. If a job submitted to a reservation is still running when that reservation ends, it will automatically be terminated. When the reservation is first requested, the requesting user can include a list of users and groups who are also allowed to use the reservation. Any user in that list is allowed to submit jobs to a reservation. An advance reservation could alternatively be used to block off a set of machines for some out-of-band purpose, such as taking them down for maintenance or logging into them directly to do some work.

Once a reservation is no longer needed, the creating user can delete it using the qrdel command. Once a reservation is deleted, it's gone. If a user needs to recreate the reservation, she will have to effectively create a new reservation requesting the same (or similar) resources.

In order to see the scheduler's master reservation plan, users can run the qrstat command. qrstat shows what resources are reserved when.

In the time between when a reservation is created and when the reservation becomes active, the scheduler will attempt to backfill the resources with jobs with durations that fit into the available time window. By default, the scheduler will not backfill with jobs that do not specify a wallclock time limit.

There are a couple of limits on users' ability to create reservations. First, a new scheduler parameter controls the maximum number of allowed reservations. Second, reservations can only be made on resources that the scheduler can determine will be available at the desired time of the reservation. The scheduler knows that the resource will be available either 1) because the resource is currently unused, or 2) because the job currently running on the resource has a wallclock time limit that says the job will end before the reservation is supposed to begin.
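
As a rough illustration of those two limits, here is a small Python model of the admission decision. Everything in it is hypothetical (the `Resource` class, `available_at`, `can_grant`, and the use of plain numbers as timestamps), and it deliberately ignores other reservations that may already hold the same resources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    busy: bool = False
    # Projected end time of the running job, derived from its wallclock
    # limit. None means the job has no limit, so its end is unknown.
    job_end: Optional[float] = None


def available_at(resource, start):
    """True if the scheduler can tell the resource will be free at start."""
    if not resource.busy:
        return True  # case 1: the resource is currently unused
    # Case 2: the running job's wallclock limit ends by the start time.
    # Treating "ends exactly at start" as acceptable is an assumption.
    return resource.job_end is not None and resource.job_end <= start


def can_grant(resources, start, existing_reservations, max_reservations):
    """Apply both limits: the reservation cap and provable availability."""
    if existing_reservations >= max_reservations:
        return False
    return all(available_at(r, start) for r in resources)


if __name__ == "__main__":
    idle = Resource("node1")
    limited = Resource("node2", busy=True, job_end=100.0)
    print(can_grant([idle, limited], 100.0, 0, 5))
    print(can_grant([idle, limited], 50.0, 0, 5))
```

The practical upshot is that a cluster full of jobs without wallclock limits leaves the scheduler very little room to grant reservations.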

Multi-clustering with Service Domain Manager (Project Hedeby)

Service Domain Manager (SDM) or Project Hedeby is a framework for managing resource sharing among services. It enables an administrator to define service level objectives (SLOs) that govern the distribution of resources. As workloads change, resources are automatically migrated from one service to another, in order to continue satisfying the SLOs. A service in this context is any application that can scale across multiple nodes.

With Grid Engine 6.2 we're including a feature-limited version of SDM to enable a form of multi-clustering. Using SDM, several Grid Engine clusters can share their resources. The clusters' users continue to use the individual clusters as before. Some just get larger, while others get smaller, as workloads change.

The multi-clustering capability of 6.2 has multiple applications. Any time that you need to have multiple masters for any reason, 6.2's multi-clustering will enable you to combine the individual clusters into a larger "meta-cluster," which will help you keep your resource utilization up.

Scalability to 63,000 cores

A tremendous amount of work has gone into scalability improvements for 6.2. Let's talk about them one at a time.

Scheduler as a thread

Perhaps the biggest change with 6.2 is that the scheduler is no longer its own process. Instead, it's another thread in the qmaster. By bringing the scheduler into the qmaster, we've laid the groundwork for significant scalability improvements. Instead of having to communicate all of the necessary data over the wire between the qmaster and scheduler, the scheduler is able to simply share the qmaster's internal data structures. For now, the performance impact is very modest, but as we're able to refine the data locking, we should be able to squeeze out some significant performance gains.

Improved interactive job support

Prior to 6.2, interactive jobs required external binaries to run. By default, qrsh used rlogin/rsh and rlogind/rshd to run an interactive job. For example, the command form of qrsh would submit rshd as a job and then fork off an rsh to connect to that rshd. The actual running of the command is handled by rsh/rshd rather than Grid Engine. That has several disadvantages. First, even if Grid Engine is installed securely, the rsh/rshd connection isn't secure. Second, rsh has a limit of 512 ports, meaning that a single machine cannot start more than 512 interactive jobs. Because Grid Engine handles tight integration of parallel jobs via the interactive job framework, that means rsh limits the size of parallel jobs to 512 slave tasks.

We do, however, let you configure which interactive job utilities to use. For example, you can use ssh/sshd to overcome the two problems mentioned above, but that creates new problems. First, because ssh is secure, it's slower. All communications have to be encrypted and then decrypted, meaning more time is spent just processing the traffic. Also, in order for Grid Engine to keep accurate accounting logs, the sshd binary has to be patched for Grid Engine. (Grid Engine actually uses its own patched rshd by default.)

With 6.2, we offer a new option for interactive job support. By default with 6.2, interactive jobs are handled through a built-in process. Instead of submitting an rshd and forking off an rsh to connect to it, all of the communications are handled internally by Grid Engine. qrsh talks to the Grid Engine daemon on the execution node, which forwards the traffic to/from the job shell. No external binaries, no external communications. All of the above problems go away. As an added bonus, interactive jobs now get a PTY, which will make a lot of people's lives easier. The only downside to the new interactive job support is that X11 forwarding is not yet supported. (I should point out that X11 forwarding is different from xhosting. xhosting is supported.) Using the new interactive job support, 10k+ task parallel jobs should be no problem.

Streamlined communications

When you're trying to support a cluster with thousands or tens of thousands of nodes, even the most innocuous network chatter can become a big problem. With 6.2 we've done our best to reduce that chatter to a minimum. One thing that has been done is a review of the qmaster/execd communications to eliminate any unnecessary messages. Another big change is that the execution daemons now only report resource state diffs rather than reporting the entire state of all resources, even the ones that never change, every load report interval. In small clusters, you may not see the difference, but in huge clusters, the difference is noteworthy.
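
The idea behind diff-based load reports can be sketched in a few lines of Python. This is a conceptual model only, not the actual execd/qmaster protocol: the function names are hypothetical, the resource names and values are illustrative, and the sketch does not handle resources that disappear between reports.

```python
def load_report_diff(previous, current):
    """Return only the entries of current that are new or have changed."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or previous[name] != value
    }


def apply_diff(known_state, diff):
    """Receiver side: merge a diff into the last known state of a host."""
    updated = dict(known_state)
    updated.update(diff)
    return updated


if __name__ == "__main__":
    last_sent = {"load_avg": 0.5, "mem_free": 2048, "arch": "sol-amd64"}
    now = {"load_avg": 0.7, "mem_free": 2048, "arch": "sol-amd64"}
    diff = load_report_diff(last_sent, now)
    print(diff)  # only the changed value travels over the wire
    print(apply_diff(last_sent, diff) == now)
```

Static values such as the architecture are sent once and never again, which is where the savings come from once you multiply by tens of thousands of hosts.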

Other "large cluster" improvements

A variety of other scalability enhancements have been done, mostly with regards to reducing memory consumption, reducing qmaster startup time, and eliminating unnecessary overhead. Again, the effects on small clusters will be small, but large clusters will benefit tremendously.

Array Task Dependency

Since before I joined the team, Grid Engine has been able to manage job dependencies. A user can submit a job and specify that the job cannot be started until a set of other jobs have exited. This works for batch jobs, array jobs, parallel jobs, and even interactive jobs. In the case of array jobs, a job dependent on an array job must wait for all the array job's tasks to exit, and an array job that is dependent on another job cannot start any tasks until that other job has exited. If an array job depends on another array job, no task of the second array job can start until every task of the first array job has exited. For most purposes, that behavior is sufficient.

Imagine for a moment that you work for a visual effects company that uses Grid Engine to render video effects. (If your imagination is vivid, imagine you work for an Australian visual effects company that has done work for several blockbuster films.) In your day-to-day rendering, you have two choices for how to approach the task given the way Grid Engine works (before 6.2). One option is to have an array job per rendering step, with each job task representing a frame. You could then use job dependencies to make sure that step 2 doesn't start until step 1 finishes. That works, but if one frame takes a lot longer than the others to render, all the other frames are stuck in the current step when they could have moved on to the next step. Another option would be to have a batch job for each frame. That way, as soon as a frame finishes a step, it can move on to the next step, regardless of what step the other frames are on. That's less wasteful, but it's also considerably more difficult to manage (millions of jobs instead of tens), and it makes it hard to take advantage of special resources for individual steps. Yet another option would be to do the rendering as an array job of array jobs. That solves all the technical issues, but is practically impossible to manage.

What you'd really want if you were that visual effects company is the ability to have a task in one array job depend on a task in another array job. That way, you could submit each step as an array job where each task represents a frame, and each task could depend on the corresponding task in the previous step. That feature is exactly what 6.2 provides. (Actually, the feature was implemented and contributed by that not-so-imaginary Australian visual effects company.)

With 6.2 a user can declare that an array job's tasks are dependent on the tasks of another array job. Each task of the second job will then depend on the task of the first job with the same task number, i.e. job 2 task 1 will depend on job 1 task 1. In addition, array task dependencies support "chunking." Chunking means grouping tasks together for efficiency. For example, step one might be really lightweight, making it more efficient to have each task process three frames instead of just one. The way chunking is represented in an array task dependency is by the array job's step size. By default, array job tasks are numbered in increments of 1, i.e. 1, 2, 3, 4, 5, etc. It is possible, however, to declare a step size for the task numbers other than 1. A step size of 3 would result in tasks numbered 1, 4, 7, etc. In an array task dependency, if the corresponding task number in the previous job doesn't exist because of chunking, the dependency falls to the chunked task that contains the corresponding task number. For example, tasks 1, 2, and 3 from an array job with a step size of 1 might all depend on task 1 of the previous array job with a step size of 3. It works the other way around as well. Task 1 of an array job with a step size of 3 might depend on tasks 1, 2, and 3 of the previous array job with a step size of 1. It even works for uneven combinations, such as task 1 of an array job with a step size of 3 depending on tasks 1 and 3 of the previous array job with a step size of 2.
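
The chunking rules are easier to follow with a small model. The Python sketch below computes which tasks of the previous array job a given task would wait on, assuming both jobs number their tasks starting at 1 and that a task with step size s covers s consecutive frames. The function names are hypothetical, and the code reflects my reading of the rules described above rather than the scheduler's actual implementation.

```python
def chunk_start(frame, step, first=1):
    """Return the ID of the chunked task that contains the given frame."""
    return first + ((frame - first) // step) * step


def predecessor_tasks(task, step, prev_step, last_frame, first=1):
    """List the previous job's task IDs that the given task depends on.

    task       -- task ID in the dependent array job
    step       -- step size of the dependent array job
    prev_step  -- step size of the previous array job
    last_frame -- highest frame number, which caps the final chunk
    """
    low = task
    high = min(task + step - 1, last_frame)
    # Every previous chunk that overlaps frames low..high is a dependency.
    start = chunk_start(low, prev_step, first)
    end = chunk_start(high, prev_step, first)
    return list(range(start, end + prev_step, prev_step))


if __name__ == "__main__":
    # The three combinations from the text, with 9 frames in total.
    print(predecessor_tasks(2, 1, 3, 9))  # step 1 depending on step 3
    print(predecessor_tasks(1, 3, 1, 9))  # step 3 depending on step 1
    print(predecessor_tasks(1, 3, 2, 9))  # step 3 depending on step 2
```

Under those assumptions the three calls reproduce the examples in the paragraph above: task 2 waits on task 1, task 1 waits on tasks 1, 2, and 3, and in the uneven case task 1 waits on tasks 1 and 3.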

ARCo enhancements

One of the major areas of focus for 6.2 was improving the Accounting and Reporting Console (ARCo). In previous releases, the ARCo infrastructure was a little pokey, and it was not very difficult to produce a stream of accounting data fast enough to completely swamp the DBWriter component. (The DBWriter's job is to transfer data from the accounting logs into the ARCo database.) With 6.2 that has been fixed, along with a number of other performance-related issues. ARCo is now fast, and it will continue to get better. Another important change for ARCo is that you can now have more than one cluster write into the same database without conflict. ARCo will even let you run queries against the data from all the clusters. That is important, of course, because of the new multi-clustering support that was also added in 6.2 (as described above).

Solaris Enhancements

Every release we add a few more features to take advantage of what the Solaris 10 operating environment has to offer. In 6.1 we added a DTrace script and declared support for Solaris Zones and ZFS. With 6.2 we're adding support for Service Tags and the Service Management Framework.

Service Tags are a way for you as an administrator to keep track of everything in your network. When a machine has service tags enabled, it responds to broadcast requests for information from the service tag client. When you install 6.2, you have the option of allowing Grid Engine to register a service tag on the master machine to indicate that you have Grid Engine running in your network. You can then see that information from the service tags client. You can also upload that information to Sun's service management repository, and we'll keep track of it for you.

The Service Management Framework (SMF) is a replacement for the traditional UNIX init scripts. Instead of startup and shutdown scripts, services get an entry in a services database that lists how to start and stop the service, among other things. When 6.2 is installed on a Solaris host that supports SMF, if you choose to have Grid Engine start when the machine boots, the installer will create an SMF entry instead of an init script. If you need to make changes to the way Grid Engine is started, you can edit the $SGE_ROOT/$SGE_CELL/common/sgemaster file just like you would have with the old init scripts. Perhaps the most useful part of SMF is that you get an automatic watchdog for your services. If one of your Grid Engine daemons dies or is killed (not using qconf or the sgemaster and sgeexecd scripts), the watchdog process will restart the service automatically.

Are you totally stoked now, or what?

Pretty impressive feature list, eh? And that list didn't even include the myriad major and minor bug fixes that are delivered with 6.2. If you can't wait to try it out, you have two options. First, the beta2 courtesy binaries are still available on the open source site. Second, you can grab the V62_TAG tag from the CVS repository and build it yourself. Have fun, and let us know how it turns out!