Oracle Blog

It's hip to be square!

Tuesday Jan 18, 2011

Last week I had the opportunity to do a webcast with Moe Fardoost, our marketing director, on the future direction for the Oracle Grid Engine product. If you're curious about where Grid Engine is headed, take a look. For the very lazy among you, the summary is that we're focused on three major themes: core infrastructure and feature improvements, tighter integrations with other Oracle products, and a richer cloud feature set.

Thursday Dec 23, 2010

For the past decade, Oracle Grid Engine has been helping thousands of customers marshal the enterprise technical computing processes at the heart of bringing their products to market. Many customers have achieved outstanding results with it via higher data center utilization and improved performance. The latest release of the product provides best-in-class capabilities for resource management including: Hadoop integration, topology-aware scheduling, and on-demand connectivity to the cloud.

Oracle Grid Engine has a rich history, from helping BMW Oracle Racing prepare for the America’s Cup to helping isolate and identify the genes associated with obesity; from analyzing and predicting the world's financial markets to producing the digital effects for the popular Harry Potter series of films. Since 2001, the Grid Engine open source project has made Oracle Grid Engine functionality available for free to open source users. The Grid Engine open source community has grown from a handful of users in 2001 into the strong, self-sustaining community that it is now.

Today, we are entering a new chapter in Oracle Grid Engine’s life. Oracle has been working with key members of the open source community to pass on the torch for maintaining the open source code base to the Open Grid Scheduler project hosted on SourceForge. This transition will allow the Oracle Grid Engine engineering team to focus their efforts more directly on enhancing the product. In a matter of days, we will take definitive steps in order to roll out this transition. To ensure on-going communication with the open source community, we will provide the following services:

The Oracle Grid Engine engineering team will be available to answer questions and provide guidance regarding the open source project and Oracle Grid Engine via the online product forum.

The Open Grid Scheduler project will be continuing on the tradition of the Grid Engine open source project. While the Open Grid Scheduler project will remain independent of the Oracle Grid Engine product, the project will have the support of the Oracle team, including making available artifacts from the original Grid Engine open source project.

Oracle is committed to enhancing Oracle Grid Engine as a commercial product and has an exciting road map planned. In addition to developing new features and functionality to continue to improve the customer experience, we also plan to release game-changing integrations with several other Oracle products, including Oracle Enterprise Manager and Oracle Coherence. Also, as Oracle's cloud strategy unfolds, we expect that the Oracle Grid Engine product's role in the overall strategy will continue to grow. To discuss our general plans for the product, we would like to invite you to join us for a live webcast on Oracle Grid Engine’s new road map. Click here to register.

Tuesday Nov 30, 2010

I am very pleased to announce that we've signed up our first partner to the Oracle Validated Integration program for Oracle Grid Engine. Jaryba's SmartSuspend product is a clever way to allow jobs suspended by Grid Engine to release all of the resources they're holding, even memory and FLEXlm licenses. And it works without requiring any changes to the applications. You don't even have to recompile.

If you've ever run into the issue of running out of swap space because of preempted jobs holding onto their memory, SmartSuspend might be the answer you're looking for. It works by inserting itself between the application and the OS so that it can track the memory and license usage. When a job is suspended, SmartSuspend first uses its knowledge of the resources requested by the application to let all of those resources go. When the job is resumed, SmartSuspend first attempts to recapture those resources before allowing the application to run. From the application's perspective, nothing changes. From the administrator's perspective, the difference is huge.

Friday Nov 12, 2010

After much internal... discussion, Oracle has decided to have a booth at SC10 after all, and as usual, I will be there waving the Grid Engine banner. If you're at the show, please come by and say hi. I believe they've scheduled some office hours of sorts for me on Tuesday afternoon, but I should be hanging around the Oracle booth for most of the show. (Except Thursday, so don't wait until the last minute!) I think I'll also be making an appearance at the Univa UD booth on Tuesday morning at 11:00.

I also want to mention the RCE Podcast that Brock Palen and Jeff Squyres were kind enough to invite me to record. If you're interested in an intro to OGE or a high-level status check, go have a listen.

I guess since I have your attention, I should also point out that the presentation I did at Oracle OpenWorld '10 about using Grid Engine for large-scale data-oriented computing (e.g. Hadoop) with Tom White from Cloudera is now available on the Grid Engine OTN page.

Wednesday Oct 06, 2010

I've said it before: being adopted into the Oracle family has been a great thing for the Oracle Grid Engine product. One of the many reasons is that we get to take advantage of the amazing partner program that Oracle has, the Oracle Partner Network.

Over the years, a number of companies have built products that include, build on, or use either the Grid Engine product or the Grid Engine open source project. While we were Sun, there really was little that we could offer these companies in terms of useful partnership opportunities. Now that we're Oracle, there are actually several very active, very interesting programs available for partners. If your company is working with Grid Engine, and you'd like to investigate a closer relationship with Oracle, there's never been a better time!

Here's just a quick overview of some of the programs Oracle has to offer:

Oracle Validated Integration -- I love this program. It's a way to have Oracle certify and swear to the fact that your product is validated on Grid Engine and that the combination works as designed. It gives your customers an extra boost of confidence in your product, and it gets your product listed on the OVI partner solutions page. (Note that the program information says it's only for a limited set of Oracle products. Since Grid Engine is now under the Oracle Enterprise Manager product family, we do indeed qualify.)

Application-Specific Full Use & Embedded licensing -- We now have the ability to negotiate OEM contracts to include or embed Grid Engine in your product. It was possible before, but now it's actually a normal thing to do. There's even a standard program and process for it, including some very nice discounts. You can find out more about the program on page 54 of the Software Investment Guide.

Oracle Partner Network -- The OPN is your one-stop shop for hitching your wagon to the Oracle engine. With multiple levels and a huge number of benefits, the OPN is a great way to develop a closer relationship with Oracle.

OPN Specialization for Cloud computing and SaaS -- OPN has this concept of partner specializations. It's a way for you to distinguish yourself by demonstrating your deeper knowledge in specific areas. There's now a specialization for the cloud and SaaS.

If any of these programs sound interesting, you know where to find me. You can also send a Tweet or DM to my partner partner, Susan Wu, susanwu88 on Twitter.

I was really surprised at the turn-out for JavaOne this year. Judging by the packed halls and empty goodie carts, I think the conference organizers were a little surprised as well. Excellent! Well done.

As you may have noticed, I always seem to have my fingers in the JavaOne hands-on labs pie. This year my contribution was to bring Cloudera into the fold to run a Hadoop lab. Needless to say, that generated a lot of interest. Well before the conference, the slot we had for the lab was booked solid. Taking that as a sign, I had the conference organizers give us a second slot on Monday for the lab. That slot was also booked solid before the conference even began. Unfortunately, however, that Monday lab slot ended up getting canceled for <INSERT OFFICIAL REASON HERE>. As a concession to the folks who didn't get to attend because of the cancellation, I got the conference organizers to give me permission to have Cloudera host the lab materials from their site before it's available from the official Oracle JavaOne site.

You can find the semi-official JavaOne Hands-on Lab S314413: Extracting Real Value from Your Data With Apache Hadoop here under the training section of the Cloudera site. The file is not yet linked from anywhere but here, but they're working on it.

If you download the zip file, in it you will find a lab workbook. At the back of the workbook, you will find an appendix that describes how to set up your own lab environment. The lab was written for Solaris 11 Express and NetBeans, but the OS and IDE really play little role in the lab. If you refuse to see the light and accept Solaris as the one true OS, you can still do the lab on some other OS with some other IDE (but it won't be as satisfying).

The lab did run in its originally assigned slot at JavaOne, and it was really successful. Turnout was good and the comments were great! I've already incorporated lots of great feedback from that session into the lab materials that Cloudera is now hosting, but I'm always happy to hear any additional comments and/or feedback. Happy coding!

Tuesday Sep 21, 2010

Just wanted to point out this interview that came out yesterday. The summary is: really, honestly, really, Grid Engine is alive and well and has a bright future in front of it. The rumors of Grid Engine's death have been greatly exaggerated.

Wednesday Sep 15, 2010

In case any of you will be visiting Oracle Open World next week, be sure to come check out my sessions. I have two OpenWorld sessions and one JavaOne hands-on lab. (The lab isn't actually directly related to Grid Engine, but there's a tie-in via our Hadoop support.)

First, I have to apologize for my long lack of blog updates. Now that I've taken over the Grid Engine product management role, I've been up to my elbows non-stop. Maybe this post will get me back into the habit of blogging regularly. I still have one more post to write about what's new in 6.2u5.

Second, the Grid Engine team is still here, as is the Oracle Grid Engine product. In fact, in my almost a decade working on this product, we've never been in a better position. One thing Oracle does very well is to be clear about their intentions. Either your product has a road map, or it doesn't. We have a road map. We have a rather exciting road map, in fact, and I'm looking forward to using our new home in Oracle as a launching pad for the next generation of the Grid Engine technology.

Lastly, just to add a little credence to the above statement, let me share a little about where we have landed in Oracle. The Oracle Grid Engine team now sits in the Oracle Enterprise Manager organization, directly under the Ops Center team. Enterprise Manager is Oracle's product for managing the data center from top to bottom, the entire software stack, down through the OS, all the way to the hardware and storage. Software. Hardware. Complete. Interestingly, the Enterprise Manager group would seem to be one of the key components in Oracle cloud strategy. Hmmm... Cloud... Grid Engine... One could imagine there being some kind of fit there. Odd that we should land in the same group...

The technology that Grid Engine brings to the Oracle product family is unique. Not only do we not compete with any other existing Oracle product, there are several other Oracle products with which Grid Engine has a very natural synergy. I have very high hopes for the role Grid Engine will play at Oracle going forward. Without getting into any details, look for good things coming from our direction in the future.

Oracle policy prevents me from saying anything concrete or specific about our plans or positioning or anything else, really, but I hope I've been able to give you confidence that 1) we're alive and doing quite well, thank you, and 2) we have a long and exciting road ahead of us.

Thursday Feb 11, 2010

Let's take a break from the Sun Grid Engine 6.2u5 feature posts and talk about something that's been in the product since 6.2. (It's actually the foundation of two of the remaining three features, so consider this ground work for finishing my u5 features series.)

Service Domain Manager (or the open source Project Hedeby (formerly Project Haithabu)) is an add-on component for Sun Grid Engine that enables multiple clusters to share resources. It was designed to allow for services of all types to share resources with each other. The basic idea is this: each cluster has a set of performance metrics specified via service level objectives (SLOs). If at any point a cluster is in violation of its SLOs, it appeals to the SDM resource provider service for additional resources. The resource provider will look for resources wherever they're available: in spare resource pools, from cloud service providers, or from other less-loaded clusters. If resources are available, the resource provider will (re)assign the resources to the cluster in need. From the users' perspective, nothing really changes, except that the overloaded cluster is now feeling better. Let's get into a little more detail.

A Little More Detail

The resource provider is the heart and brain of SDM. Its job is to keep track of services and resources and adjust resource assignments as needed. At the level of the resource provider, everything is very abstract. It doesn't know (or care) what any of its managed services do, as long as they implement the required interface. It also doesn't care about the details of the resources it's managing, beyond the fact that there are details, and that the services it's managing may care about those details.

One other abstract concept that the resource provider understands is a need. When a service managed by the resource provider needs more resources, it tells the resource provider about its need. That need is expressed as a description of the desired resources to satisfy the need (including quantity), and how important the need is. For example, a managed service might say to the resource provider, "Hey! I want two OpenSolaris x86 resources with at least 4GB memory each. This need is critical to me continuing to service my users!" To satisfy this request, the resource provider will look around at the other services it's managing to see who could potentially give up the requested resources. Among the other services there might be spare pools (basically just holding tanks for idle resources), cloud service providers (e.g. Amazon EC2), or other services. If the requested resources are free, they will be reassigned to the requesting service. With a spare pool, the decision is easy: any resources in the spare pool are fair game. Same for the cloud. With other services, though, it's not so simple. In general, if a service is still holding a resource, that's because it's still using it to some degree. How do we know when it's OK to take a resource away from a service? Well, the resource provider has a set of policies that govern the relative importance of the services. Using those policies, the resource provider will decide if the importance of the requesting service plus the criticality of its need outweighs the importance of the potential donor service and how much it's using the resources in question. If, in the end, there are no resources that can reasonably be reassigned to the needy cluster, then the request stays pending and will be reevaluated again later.
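To make the weighing described above a little more concrete, here is a minimal Python sketch. It is not SDM code: the `Need` and `Candidate` classes, the integer scores, and the simple "requester importance plus urgency must exceed donor importance plus usage" rule are all assumptions made for illustration, as is the choice to leave a need pending unless it can be satisfied in full.

```python
from dataclasses import dataclass


@dataclass
class Need:
    """A request for more resources (hypothetical structure)."""
    service: str
    quantity: int
    urgency: int  # how critical the need is; higher means more critical


@dataclass
class Candidate:
    """A resource currently held by a potential donor (hypothetical structure)."""
    resource: str
    donor: str
    usage: int  # how heavily the donor is using it; 0 means idle


def can_reassign(need, candidate, importance):
    """Assumed policy: the requester's score must outweigh the donor's score."""
    requester_score = importance[need.service] + need.urgency
    donor_score = importance[candidate.donor] + candidate.usage
    return requester_score > donor_score


def satisfy(need, candidates, importance):
    """Return the resources to reassign, or [] if the need has to stay pending."""
    eligible = [c for c in candidates if can_reassign(need, c, importance)]
    eligible.sort(key=lambda c: c.usage)  # take the least-used resources first
    if len(eligible) < need.quantity:
        return []  # nothing reasonable to give up; reevaluate later
    return [c.resource for c in eligible[:need.quantity]]


# A spare pool is modeled as a donor with importance 0 and usage 0,
# so anything in it is always fair game.
importance = {"sge": 5, "appserver": 5, "spare_pool": 0}
candidates = [
    Candidate("host1", "spare_pool", 0),
    Candidate("host2", "appserver", 9),
    Candidate("host3", "appserver", 1),
]
granted = satisfy(Need("sge", 2, 3), candidates, importance)
```

In this toy run the busy host2 is left alone, while the idle spare-pool host and the lightly used host3 are handed over.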

On the service side of things there is a service adapter. The job of the service adapter is to be the shim between the service itself and the resource provider. It implements that abstracted service interface that the resource provider expects and translates those abstract concepts we just talked about into concrete artifacts understood by the service. In particular, it's up to the service adapter to define and implement the SLOs for the service. Why? Well, consider this use case. Imagine you have a cluster of application servers and a Sun Grid Engine cluster, and you want to share resources between them. The service level criteria will be very different between them, and it wouldn't make any sense to expect the resource provider to understand them all. Instead, it's more flexible and more scalable to allow the service adapters to manage the SLOs and only report the results (e.g. needs) to the resource provider.

Let's use the Sun Grid Engine adapter to illustrate how a service adapter works. Starting with 6.2, the Sun Grid Engine qmaster includes a JMX interface known as JGDI. (While JGDI is openly accessible, we don't really advertise it because it's not really abstract enough for public consumption.) The Sun Grid Engine service adapter uses the JGDI interface to monitor the state of the qmaster. The service adapter implements one unique policy: maximum number of pending jobs. (It actually inherits a couple of other policies from the service adapter SDK that are universally applicable, such as the minimum number of resources that should be assigned.) When the state of the cluster changes, the qmaster sends an event to the service adapter. The service adapter then checks the new cluster state against the SLOs that have been configured to see if any SLO has been violated. If an SLO has been violated, the SLO configuration specifies what kind of resource is needed to address the issue. For example, suppose there's an SLO that states that there should never be more than 100 pending Solaris x86 jobs. If the service adapter finds out that the 101st Solaris job is pending, it will appeal to the resource provider and request an additional Solaris x86 resource.
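The pending-jobs check from that example boils down to a threshold comparison. The sketch below is only an illustration in Python: it does not use JGDI or the real service adapter SDK, and the function name, the dictionary it returns, and the "solaris-x86" label are all made up for the example.

```python
def check_max_pending_slo(pending_jobs, max_pending, resource_type):
    """Return a need description if the SLO is violated, otherwise None.

    Hypothetical helper: a real adapter would be driven by qmaster events
    and would report the need to the resource provider.
    """
    if pending_jobs > max_pending:
        # Ask for one more resource of the type named in the SLO configuration.
        return {"resource": resource_type, "quantity": 1}
    return None


# 100 pending jobs is still within the SLO; the 101st triggers a request.
within_slo = check_max_pending_slo(100, 100, "solaris-x86")
violated = check_max_pending_slo(101, 100, "solaris-x86")
```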

When the resource provider assigns a resource to the service, the service adapter is responsible for prepping the resource and adding it into the service. Now, here's the interesting part. After the new resource takes on its share of the workload and the service is happy again, we don't take the resource away. The resource stays with the service until someone else needs it more. Resources are shared, not leased. It is possible to configure SDM to behave in a fashion that is in effect leasing, but it's something you have to explicitly set up.

On the other side of the coin, when the resource provider is asked for a resource, it talks to the service adapters for the managed services to find out who has something that can be borrowed. The resource provider keeps a map of where all the resources are assigned, so it can immediately tell which services are currently holding resources that are candidates for reassignment. It then contacts those services' service adapters to find out whether the resources are in use. The service adapter's job is to look at the service and place a numerical value on how well the resources are being used by the service. Once the resource provider has collected the usage values for all the candidate resources, it applies policies (such as relative importance of the services) and picks the resources that seem most available. This process applies equally to services, spare pools*, and cloud service providers. (* There is a built-in spare pool in the resource provider that doesn't actually have its own service adapter, but it works as though it did.)

With the 6.2u5 release, we have two service adapter implementations. One is for the Sun Grid Engine software itself. The other is a generic cloud adapter that comes with integration scripts for use with Amazon EC2 and for use with IPMI power management. Out of the box, you can use SDM to manage Sun Grid Engine clusters and to resource those clusters on demand from EC2. You can also configure a spare pool* that powers down idle or underutilized machines. (* It's not technically a spare pool, but it behaves like one.) The intention is to add additional service adapter implementations as we uncover the concrete demand for them. In addition, the original plan was to make the service adapter API clean, public, and well-documented. So far, it's fairly clean, fairly well documented, but only public insofar as the Hedeby Project is open source. If you have interest in seeing or (even better) developing a service adapter for a particular service, please do let us know, and we'll see what we can do to help.

Hopefully this overview gives you a pretty good idea of what SDM does and at least an inkling of how it does it. If not, let me know!

Wednesday Feb 03, 2010

Good day, and welcome to week four of my continuing attempt to cover all the features added in the latest release (6.2u5) of Sun Grid Engine. This week we'll talk about array task throttling.

Sun Grid Engine supports four classes of jobs. Interactive jobs are the equivalent of doing an rsh/rlogin/ssh to a node in the cluster, except that the connection is managed by Sun Grid Engine. Batch jobs are your traditional "go run this somewhere" type of job. They represent a single instance of an executable. Parallel jobs consist of multiple processes working in collaboration. All of the processes need to be scheduled and running at the same time in order for the job to run. Parametric or array jobs are like what you see in Apache Hadoop, where multiple copies of the same executable are run across multiple nodes against different parts of the data set. The important characteristic that distinguishes array jobs from parallel jobs is that the tasks of an array job are completely independent from each other and hence do not need to all be running together.

The way that Sun Grid Engine processes array jobs is particularly efficient. In fact, a common trick to improve cluster throughput is to bundle many batch jobs together to be submitted as a single array job. Because array jobs are so efficient, users use lots of them, sometimes with huge task counts. There is no explicit limit on the number of tasks that an array job can contain. Hundreds of thousands of tasks in a single array job are not uncommon.

There is a problem, however. From the Sun Grid Engine scheduler's perspective, all of the tasks of an array job are equal. That means that if the highest priority job waiting to execute is an array job, then all of that job's tasks are higher priority than any other job (or task) waiting to run. If that job has a million tasks, then the cluster is going to have to process all million of those tasks before anything else will be executed. Now, the policies do come into play here, and if a higher priority job is submitted or if the array job loses priority through some policy (like the fair share policy), then it and its remaining tasks will fall back in the execution order. Nonetheless, this approach makes it possible for a user to unintentionally execute a denial of service attack on the cluster.

For quite some time there has been an option that an administrator can configure to set a limit on the maximum number of tasks that can be simultaneously executed from a single array job (max_aj_instances in sge_conf(5)). That solves the problem, but only in a very general and somewhat suboptimal way. As with any such global setting, the administrator has to make a trade-off between having a limit that works well for the majority and having a limit that doesn't unduly restrict certain users. (The default is 2000 tasks per array job.) Well, it turns out that given the opportunity, most users will willingly set such a limit themselves, both to avoid being bonked on the head by the administrator for abusing the cluster, and for reasons of self interest, such as by allowing several of their array jobs to share cluster time rather than being processed sequentially. So, with 6.2u5, we've given users exactly that ability.

Let's look at an example:

% qsub -t 1-100000 myjob.sh

will submit an array job that will run the myjob.sh script one hundred thousand times. Each time it runs, an environment variable ($SGE_TASK_ID) will be set to tell that instance which task number it is. The myjob.sh script must be able to translate that task ID into a pointer to its portion of the data set. In a cluster with default settings, up to 2000 of the tasks of this job will be allowed to be running at a time. If the cluster only has 2000 slots, that could be a bad thing.
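How a task translates its ID into "its portion of the data set" is up to the job. One possible approach, sketched here in Python rather than as the shell script from the example, is to split a known number of records into near-equal contiguous ranges. The `chunk_for_task` helper and the record count of one million are assumptions made purely for illustration.

```python
import os


def chunk_for_task(task_id, total_tasks, total_records):
    """Return the half-open [start, end) record range for a 1-based task ID."""
    base, extra = divmod(total_records, total_tasks)
    index = task_id - 1
    # The first `extra` tasks each take one additional record.
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


if __name__ == "__main__":
    # Fall back to task 1 when SGE_TASK_ID is missing or not numeric,
    # for example when the script is run outside of an array job.
    raw = os.environ.get("SGE_TASK_ID", "")
    task_id = int(raw) if raw.isdigit() else 1
    start, end = chunk_for_task(task_id, 100000, 1000000)
    print(f"task {task_id} handles records {start} to {end - 1}")
```

Because every task computes its own range from nothing but its ID, the tasks stay completely independent of one another, which is exactly the property that makes them an array job.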

% qsub -t 1-100000 -tc 20 myjob.sh

submits the same job, except that it places a limit of 20 on the number of tasks allowed to be running simultaneously. In our fictitious 2000-slot cluster, that's a quite neighborly thing to do. If you try to set the limit above the global limit set by the administrator, the global limit prevails.
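The interaction between the per-job limit and the global limit amounts to taking the smaller of the two. The Python sketch below is a simplification with made-up function names; it ignores any special values the real configuration may accept and just shows the arithmetic for the example above.

```python
import math


def effective_task_limit(user_limit=None, global_limit=2000):
    """Concurrent-task cap for one array job: the lower of the two limits wins."""
    if user_limit is None:
        return global_limit
    return min(user_limit, global_limit)


def minimum_waves(total_tasks, limit):
    """Fewest rounds needed if every round runs `limit` tasks at once."""
    return math.ceil(total_tasks / limit)


# With a per-job limit of 20, at most 20 of the 2000 slots are ever occupied
# by this job, at the cost of needing many more rounds to finish.
neighborly = effective_task_limit(20)
greedy = effective_task_limit(5000)
```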

While this feature is pretty simple, it can mean a large difference in job throughput for some clusters. I know one customer in particular that went way out of their way to implement this feature themselves using clever configuration tricks. The massive headache of hacking together a solution was worth it to them to be able to set per-job task limits.

Thursday Jan 28, 2010

Continuing with the new feature theme, this week we're talking about slotwise subordination (AKA slotwise preemption). Preemption is the notion that a higher priority job can bump a lower priority job out of the way so it can execute. Pretty simple notion. Some workload managers have an implicit concept of preemption. Sun Grid Engine does not. We have what I like to call "after-market preemption". The net result is the same. In a workload manager with "built-in" preemption, like Platform LSF, it works by temporarily relaxing the slot count limit on a node and then resolving the oversubscription by bumping the lowest job on the totem pole to get the number of jobs back under the slot count limit. In Sun Grid Engine, the same thing happens, except that instead of the scheduler temporarily relaxing the slot count limits, you as the administrator configure the host with more slots than you need and a set of rules that create an artificial lower limit on the job count that is enforced by bumping the lowest priority jobs. It nets out to the same thing. With Sun Grid Engine you have a little more control over the process, but you pay for it with some added complexity.

That set of rules that defines the artificial limit is called subordination. By subordinating one queue to another, you tell the master that jobs running in the subordinated queue are lower priority and should be preempted when necessary. Specifically, all jobs in the subordinated queue are suspended when a threshold number of jobs (usually 1) are scheduled into the queue to which it is subordinated.

Queue subordination in Sun Grid Engine was implemented long ago, when single-socket, single-core machines still roamed the Earth. Back in those days, there was generally only one job running per host, so the queuewise subordination scheme worked out just fine. Now that we're in the era of multi-core machines, suspending the entire subordinate queue tends to be a bad idea. Enter slotwise preemption. In a nutshell, slotwise preemption lets you set a specific limit on the number of jobs allowed to be running on a host, regardless of how many queues and slots there are. If too many jobs land on the host, jobs in the lowest ranking queue(s) will be suspended until the number of running jobs is under the limit.

(Note that slotwise subordination deals only with the running job count. If you want to limit the active job count (running + suspended), you can do that by making the slots complex a host-level resource and setting it to the desired limit.)

Consider a first example configuration with two queues, A.q and B.q, where B.q is subordinated to A.q with a slotwise limit of 2. This configuration says that there are four slots available on each host (2 in each queue), but that only 2 jobs may be running on any host at any given time. If more than 2 jobs end up on a node, it will result in the excess jobs being suspended. Because B.q is subordinated to A.q, the excess jobs will always come from B.q.

Let's talk about the difference between queue-wise and slot-wise suspension for this example. With queue-wise suspension, you'd have two choices: either a single job in A.q would suspend all jobs in B.q, or two jobs in A.q would suspend all jobs in B.q. The choice is either undersubscribing (with one running job in A.q and two suspended jobs in B.q) or oversubscribing (with one running job in A.q and two running jobs in B.q). With slot-wise suspension, a job running in A.q will only suspend a job running in B.q if there are more than two running jobs on the host. We will therefore never oversubscribe, and we'll never undersubscribe as long as there's a job available to run.

In the second example, we've added a third queue, C.q, and we now have a very simple tree. Both B.q and C.q are subordinated to A.q, but there are still only 2 slots available for running jobs. If a host is scheduled with more than two running jobs, jobs will be suspended until we get down to two, just like before. What's different is that there's now a pecking order for the subordinated queues. Because B.q has a lower sequence number (1) than C.q (2), it is higher priority. That means we'll suspend jobs from C.q first, before suspending jobs from B.q. What's also different is how we pick the job to suspend. In B.q in both examples, the action is listed as "sr", which means to suspend the shortest running job. In C.q in this example, the action is "lr", which means to suspend the longest running job.
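The selection rules for this three-queue case can be modeled in a few lines of Python. This is a simplified sketch of the behavior described here, not the scheduler's actual algorithm; the `Job` class, the `jobs_to_suspend` function, and the runtimes are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue: str
    runtime: int  # seconds the job has been running so far


def jobs_to_suspend(running, limit, subordinates):
    """Pick jobs to suspend so that at most `limit` jobs keep running on a host.

    `subordinates` maps a subordinated queue to (sequence number, action),
    where the action is "sr" (shortest running) or "lr" (longest running).
    Queues that are not listed, such as A.q, never lose a job.
    """
    excess = len(running) - limit
    victims = []
    # A higher sequence number means lower priority, so visit those queues first.
    ranked = sorted(subordinates.items(), key=lambda kv: kv[1][0], reverse=True)
    for queue, (_seq_no, action) in ranked:
        if excess <= 0:
            break
        in_queue = [job for job in running if job.queue == queue]
        in_queue.sort(key=lambda job: job.runtime, reverse=(action == "lr"))
        chosen = in_queue[:excess]
        victims.extend(chosen)
        excess -= len(chosen)
    return [job.name for job in victims]


running = [
    Job("a1", "A.q", 500),
    Job("b1", "B.q", 300),
    Job("b2", "B.q", 100),
    Job("c1", "C.q", 50),
    Job("c2", "C.q", 400),
]
subordinates = {"B.q": (1, "sr"), "C.q": (2, "lr")}
suspended = jobs_to_suspend(running, 2, subordinates)
```

With five jobs and a limit of two, both C.q jobs go first (longest running first), followed by the shortest running job in B.q.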

Now we have a tree with more than two levels: C.q is subordinated to B.q is subordinated to A.q. Between B.q and C.q up to two jobs are allowed to be running, with B.q's jobs taking priority. Among A.q, B.q, and C.q, up to three jobs are allowed to be running, with A.q's jobs taking priority over B.q's jobs, and B.q's jobs taking priority over C.q's jobs. Now look carefully. Where did I specify that C.q should be subordinated to A.q? I didn't. It's implicit. Whenever you have a multi-level subordination tree, a node has its entire subtree subordinated to it, whether it's explicitly specified or not, with priority handled between nodes according to depth in the tree and priority within levels handled according to sequence numbers. Because of this implicit subordination, it does not make sense to ever have a higher slot limit lower down in the tree. The higher-level lower slot limit will always take precedence.

Hopefully slotwise subordination now makes sense, and you can see why it's important. Basically it brings Sun Grid Engine's preemption capabilities up to date with modern hardware, making it more efficient and more useful.

There is, however, one notable caveat I have to point out. With queue-wise suspension, when a subordinated queue has its jobs suspended, the queue itself is also suspended, preventing any other jobs from landing in that queue. That's not the case with slotwise subordination. It's possible for the scheduler to place a job into a subordinated queue where that job will immediately be suspended. Imagine in our first example above that A.q has two running jobs in it while B.q is empty. B.q remains a valid scheduling target, and any job that lands there will immediately be suspended because it violates the slotwise limit. The workaround is to use job load adjustments to make sure that hosts with running jobs are appropriately unattractive scheduling targets. Not a show-stopper, but definitely important to be aware of. We will address the issue in the next couple of releases.

Wednesday Jan 20, 2010

Continuing in my feature deep dives, let's talk about topology-aware scheduling. Some applications have serious resource needs. Not only do they need raw CPU cores, but they also beat the snot out of the local cache or burn up the I/O channels. These sorts of applications don't play well with others. In fact, they often don't play well with themselves. For these applications, how the threads/processes are distributed across the CPUs makes a huge difference. If, for example, all the threads/processes have their own core but are all sharing a socket, they might end up fighting over cache space or I/O bandwidth. Depending on the CPU architecture, the conflicts may be more subtle, such as only the processes on specific groups of cores colliding. The price for making a bad choice of how to assign these applications to cores is poor performance, in some cases doubling the time to completion.

It's not just the powerhouse apps that care about CPU topology, though. Most operating systems will schedule processes and threads to execute on available cores rather willy-nilly, with no sense of core affinity. Because an average OS does context switches at a rather high frequency, an application may find itself executing on a different CPU and core every time it gets the chance to run. If that application makes any use of the CPU cache, for example, its performance will suffer for it. The performance might not suffer much, but the difference is usually measurable.

For these reasons, we've added topology-aware scheduling to Sun Grid Engine 6.2 update 5. With topology-aware scheduling, the user who submits the job can specify how that job should be laid out across a machine's CPUs. Users are allowed to specify three different flavors of distribution strategy: linear, striding, or explicit. In linear distribution, the execution daemon will place the job's threads/processes on consecutive cores if possible. If it can't fit the entire job on a single socket, it will span the job across sockets. The striding strategy tells the execution daemon to place the job on every nth core, e.g. every 4th core or every other core. The explicit strategy lets the user decide exactly which cores will be assigned to the job. Note that the core binding is a request, not a requirement. If for some reason the execution daemon can't fulfill the request, the job will still be executed; it just won't be bound.

In addition to the three binding strategies, there are also three possible binding mechanisms. You can either allow Sun Grid Engine to do the binding automatically as part of the job execution, or you can have Sun Grid Engine add the binding parameters to the machines file for OpenMPI jobs, or you can have Sun Grid Engine just describe the intended binding in an environment variable with the expectation that the job will bind itself based on that information. When the job is bound by Sun Grid Engine during execution, the job will be tied to specific CPU cores using an OS-specific system call. On Linux, the bound processors may be shared with other processes. On Solaris, the bound processors are used exclusively for the job. In either case, the job will only be allowed to execute on the bound processors.
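For the third mechanism, where the job binds itself, a job written in Python might do something like the sketch below. The variable name `SGE_BINDING` and the space-separated list of OS processor ids are assumptions here; check the qsub documentation for your release. `parse_binding` and `bind_self` are hypothetical helper names.

```python
import os


def parse_binding(value):
    """Turn a space-separated list of processor ids into a set of ints."""
    return {int(tok) for tok in value.split()}


def bind_self(env_var="SGE_BINDING"):
    """Bind the current process to the cores named in env_var, if any.

    Returns the set of cores requested, or an empty set when the variable
    is missing or empty (the job then simply runs unbound).
    The variable name and its format are assumptions, not confirmed API.
    """
    cores = parse_binding(os.environ.get(env_var, ""))
    # os.sched_setaffinity exists only on some platforms (e.g. Linux).
    if cores and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)
    return cores
```

Calling `bind_self()` at the top of the job script keeps the binding decision with the scheduler while leaving the actual system call to the job.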

In order to allow users to tell what kinds of topologies are provided by the machines in the cluster, some new default complexes have been added that describe the socket/core/thread layouts of the machines. These new complexes can be used during job submission to request specific topologies, or they can be used with qhost to report what's available.

This example will look for a machine with 8 cores and 2 sockets (i.e. dual-socket, quad-core) and try to bind to four consecutive cores. The execution daemon will try to put all four cores on the same socket, but if that's not possible, it will spread the job out over as many sockets as required (but as few as possible).

This example will again look for a dual-socket, quad-core machine, but this time the job will occupy the third core on both sockets. (The first core is number 0.) If the third core on either socket is occupied, the job will not be bound.

This last example will yet again look for a dual-socket, quad-core machine. This time the job will be bound to the first and fourth cores on both sockets. Again, if any of those cores are already bound to another job, the job will not be bound.
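The three requests described above can be modeled with a small Python sketch. This is only an illustration of the selection logic as described in this post, not the execution daemon's actual algorithm: a machine is a list of sockets, each socket a list of booleans saying whether a core is free, and the function names are hypothetical. Every function returns `None` when the request can't be met, which corresponds to the job running unbound.

```python
def explicit(free, wanted):
    """Accept exactly the requested (socket, core) pairs, or nothing."""
    return list(wanted) if all(free[s][c] for s, c in wanted) else None


def linear(free, amount):
    """Pick `amount` free cores, preferring consecutive cores on one socket."""
    for s, cores in enumerate(free):
        for start in range(len(cores) - amount + 1):
            if all(cores[start:start + amount]):
                return [(s, c) for c in range(start, start + amount)]
    # Otherwise span sockets, taking free cores in order.
    # Simplification: this does not try to minimize the number of sockets.
    picked = [(s, c) for s, cores in enumerate(free)
              for c, ok in enumerate(cores) if ok]
    return picked[:amount] if len(picked) >= amount else None


def striding(free, amount, step, start=(0, 0)):
    """Pick every `step`-th core, beginning at `start` = (socket, core)."""
    flat = [(s, c) for s, cores in enumerate(free) for c in range(len(cores))]
    picked = flat[flat.index(start)::step][:amount]
    return explicit(free, picked) if len(picked) == amount else None


# A dual-socket, quad-core machine with every core free.
machine = [[True] * 4, [True] * 4]
print(linear(machine, 4))                                  # four consecutive cores
print(striding(machine, 2, 4, start=(0, 2)))               # third core on each socket
print(explicit(machine, [(0, 0), (0, 3), (1, 0), (1, 3)])) # first and fourth cores
```

Marking one of the requested cores as busy (for example `machine[1][2] = False`) makes the striding call return `None`, matching the "job will not be bound" behavior.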

It's clear that jobs that benefit from specific process placement with respect to CPU cores will perform much better in a 6.2u5 cluster, thanks to this new feature. Even for regular old run-of-the-mill jobs, though, submitting with -binding linear:1 should provide a small performance bump because it will keep them from being jostled around between context switches. In fact, I won't be surprised if 12 months from now I include adding that switch to the sge_request file in my top 10 list of best practices.

Friday Jan 15, 2010

I've always had a tendency to say "tomorrow," when I really mean "next working day." Most of the time there's no difference, but Fridays and the day before a holiday, people look at me funny. I find it silly that I have to figure out the next day I'll be at work so I can reference it by name. Instead, I'm now coining a new word:

to⋅wor⋅row [tuh-wawr-oh, -wor-oh]–noun

the working day following today: Toworrow I have a big meeting.

Your homework is to use "toworrow" in a sentence at least three times this weekend. We'll compare results toworrow.

I'm going to assume that you already understand the many virtues of Hadoop. (And if you don't, Cloudera will be happy to tell you all about it.) Instead, to set the stage, let's talk about what Hadoop doesn't do so well. I currently see two important deficiencies in Hadoop: it doesn't play well with others, and it has no real accounting framework. Pretty much every customer I've seen running Hadoop does it on a dedicated cluster. Why? Because the tasktrackers assume they own the machines on which they run. If there's anything on the cluster other than Hadoop, it's in direct competition with Hadoop. That wouldn't be such a big deal if Hadoop clusters didn't tend to be so huge. Folks are dedicating hundreds, thousands, or even tens of thousands of machines to their Hadoop applications. That's a lot of hardware to be walled off for a single purpose. Are those machines really being used? You may not be able to tell. You can monitor state in the moment, and you can grep through log files to find out about past usage (Gah!), but there's no historical accounting capability there.

Coincidentally, these two issues are things that most workload managers (like Sun Grid Engine) do really well. And I'm not the first to notice that. The Hadoop on Demand project, which is included in the Hadoop distribution, was an attempt to integrate Hadoop first with Condor and then with Torque, probably for those same reasons. It's easy enough to have the Hadoop framework started on demand by a workload manager. The problem is that most workload managers know nothing about HDFS data block locality. When a typical workload manager assigns a set of nodes to a Hadoop application, it's picking the nodes it thinks are best, generally the ones with the least load, not the ones with the data. The result is that most of your data is going to have to be shipped to the machines where the tasks are executing. Since the great innovation of Map/Reduce is that we move the execution to the data instead of vice versa, bringing a workload manager into the picture shoots Hadoop in the foot.

Enter Sun Grid Engine 6.2 update 5. One of the main strengths of the Sun Grid Engine software is its ability to model just about anything as a resource (called a "complex" in SGE terms) and then use those resources to make scheduling decisions. Using that capability, we've modeled HDFS rack location and the locally present HDFS data blocks as resources. We then taught SGE how to translate an HDFS path into a set of racks and blocks. Finally, we taught SGE how to start up a set of jobtrackers and tasktrackers, and voila! Ze Hadoop integration iz born. More detail? Glad you asked.

The Gory Details

The Hadoop integration consists of two halves. The first half is a parallel environment that can start up a jobtracker and tasktrackers on the nodes assigned by the scheduler. (HDFS is assumed to already be running on all the nodes in the cluster.) In the end, it's really no different than an MPI integration (especially MPICH2). The second half is called Herd. It's the part that talks to HDFS about blocks and racks, and it has two parts. The first part is the load sensor that reports on the block and rack data for each execution host (hosts that are running the SGE execution daemon). The second part is a Job Submission Verifier (JSV) that translates requests for HDFS data paths into requests for racks and blocks.

The process for running a Hadoop job as an SGE job looks something like this:

At regular intervals, all of the execution hosts report their load. This includes the rack and block information collected by the Herd load sensors.

The jsv.sh script starts the Herd JSV that talks to the HDFS namenode. It removes the hard hdfs_input request and replaces it with a soft request for hdfs_primary_rack, hdfs_secondary_rack, and a set of hdfs_blk<n><n> resources. (A hard request is one that is required for a job to run. A soft request is one that is desired but not required.) The primary rack resource lists the racks where most of the data lives. The secondary rack resource lists all of the racks where any of the data lives. Because the HDFS data block id space is so large, we aggregate blocks into 256 chunks by their first two hex digits.

The scheduler will do its best to satisfy the job's soft resource requests when assigning it nodes, 128 in this example. It's probably not going to be able to assign the perfect set of nodes for the desired data, but it should get pretty close.

After the scheduler assigns hosts, the qmaster will send the job (everything between the "echo" and the "|") to one of the assigned execution hosts.

Before the job is started on that host, the Hadoop parallel environment kicks in. It starts a tasktracker remotely on every node assigned to the job (all 128 in this example) and a single jobtracker locally. An important point here is that the tasktrackers are all started through SGE rather than through ssh. Because SGE starts the tasktrackers, it is able to track their resource usage and clean up after them later.

After the Hadoop PE has done its thing, the job itself will run. Notice in the example that I told it to look in $TMP/conf for its configuration. The Hadoop PE sets up a conf directory for the job that points to the jobtracker it set up. That conf directory gets put in the job's temp directory, which is exported into the job's environment as $TMP.

After the job completes, the Hadoop parallel environment takes down the jobtracker and tasktrackers.

Information about the job, how it ran, and how it completed is logged by SGE into the accounting files.
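Two of the steps above lend themselves to a small sketch: folding block ids into 256 chunk resources, and preferring hosts that hold the requested chunks. The Python below is an illustration only. It assumes a block id is a signed 64-bit integer rendered as 16 hex digits, and it ranks hosts by a simple overlap count, which is much cruder than what the real scheduler does with soft requests. `chunk_resource` and `rank_hosts` are hypothetical names.

```python
def chunk_resource(block_id):
    """Name the chunk resource for one HDFS block id.

    Assumes the id is a signed 64-bit integer rendered as 16 hex digits;
    the first two digits select one of 256 chunks.
    """
    hex_id = format(block_id & 0xFFFFFFFFFFFFFFFF, "016x")
    return "hdfs_blk" + hex_id[:2]


def rank_hosts(host_chunks, wanted, count):
    """Return the `count` hosts holding the most of the wanted chunks.

    host_chunks maps a host name to the set of chunk resources it reports.
    Ties are broken by host name so the result is deterministic.
    """
    def missing(host):
        # Fewer missing chunks means a better match for the soft request.
        return len(wanted - host_chunks[host])

    return sorted(host_chunks, key=lambda h: (missing(h), h))[:count]


wanted = {chunk_resource(b) for b in (0x1A2B3C4D5E6F7081, -1)}
hosts = {
    "node1": {"hdfs_blk1a"},
    "node2": {"hdfs_blk1a", "hdfs_blkff"},
    "node3": set(),
}
print(rank_hosts(hosts, wanted, 2))
```

Because the requests are soft, a host with none of the data (node3 here) is still a valid choice when nothing better is available; it just sorts last.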

Since everyone loves pretty pictures, here's the diagram of the process:

The Hadoop integration will attempt to start a jobtracker (and corresponding tasktrackers) per job. For most uses, that should be perfectly fine. If, however, you want to use the HoD allocate/deallocate model, you can do that, too. Instead of giving SGE a Hadoop job to run, give it something that blocks (like "sleep 10000000"). When the job is started, the address of its jobtracker is added to its job context. Just query the job context, grab the address, and build your own conf directory to talk to the jobtracker. You can then submit multiple Hadoop jobs within the same SGE job.
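As a rough sketch of that last workflow, the Python below pulls a jobtracker address out of job-detail text and writes the one property a client conf needs. Several things here are assumptions rather than documented behavior: that the job details contain a `context:` line of comma-separated `key=value` pairs, that the key is called `hdfs_jobtracker` (a hypothetical name), and that the client reads `mapred.job.tracker` from a mapred-site.xml file. Check the integration docs for the real names.

```python
def jobtracker_from_context(job_details, key="hdfs_jobtracker"):
    """Extract key's value from a 'context:' line of job details, or None.

    The line format and the default key name are assumptions.
    """
    for line in job_details.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() != "context":
            continue
        for pair in rest.strip().split(","):
            k, _, v = pair.partition("=")
            if k.strip() == key:
                return v.strip()
    return None


def mapred_site_xml(address):
    """Build a minimal client config pointing at the given jobtracker."""
    return (
        '<?xml version="1.0"?>\n'
        "<configuration>\n"
        "  <property>\n"
        "    <name>mapred.job.tracker</name>\n"
        f"    <value>{address}</value>\n"
        "  </property>\n"
        "</configuration>\n"
    )


sample = "job_number:   42\ncontext:      hdfs_jobtracker=node17:9001,other=x\n"
address = jobtracker_from_context(sample)
print(mapred_site_xml(address))
```

Writing the returned string into a conf directory of your own and passing that directory to the Hadoop client would then let several Hadoop jobs share the one jobtracker.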

Hopefully this gives you a clear picture of how the Hadoop integration works. You can find more information in the docs. I think it's a testament to the flexibility of Sun Grid Engine that the integration did not require any changes to the product. All I did was add in some components through the hooks that SGE already provides. One more thing I should point out: this integration is in the Sun Grid Engine product, but not in the Grid Engine courtesy binaries that we just announced.