Oracle Blog

news from within darkstar

writing services

Lately there's been a lot of discussion on the forums about how to write Services for Project Darkstar. Specifically, there seems to be some confusion about some pretty fundamental issues around transactions and how to actually participate in this model. Part of this confusion is undoubtedly due to the lack of tutorials (although there are lots of good examples and javadocs available to get folks started). So, I thought I'd spend a little time laying out some basics of Services, and how you go about writing them. Note that this example is written against the 0.9.6 APIs, and while I haven't actually tested/compiled all the code snippets, I'm pretty sure they work (he says hopefully).

Before I begin, a warning: you really shouldn't be writing Services. Or, rather, you usually shouldn't be writing them. The whole point of the Darkstar project is to make it easy to write server-side logic for games by hiding the individual nodes in the cluster, handling all the threading, doing persistence for you, etc. When you write an application, you can ignore all the "hard stuff," but that's because there are Services in the system supporting you. In some cases you will need to get into the lower levels of the system, but it really should be avoided, because, well, writing this kind of code is hard. You will be exposed to multi-threaded code. You will have to deal with failure. You will have to model your own way of working between all the nodes on a cluster. You will have to understand the transaction model and how to handle things like two-phase commits and aborts. So, my advice is that you don't treat the Service APIs as "just another API to use when writing Darkstar games." You have been warned :)

Like I said above, the reason for Services is because we need a place to do much of the hard stuff. Perhaps most importantly, we need a clear layer that sees both the transactional and the non-transactional nature of the system. Services fill this role. They are an abstraction that supports the application and ties the cluster together. They are designed to be pluggable, so that you can swap in and out different implementations. There are a handful of "standard" Services (meaning that they will always be available in the system). Beyond that, you can write any number of additional Services that you need.

An application, of course, doesn't see these Services directly. This is by design, to provide some isolation boundary (think user-land versus kernel code in an operating system). In this way, Services can define pretty complex interfaces that any other Service can take advantage of, without exposing this to the application. What the application sees is a set of Managers. These are effectively the bridge between Services and applications, and are used to provide whatever subset of the Service API makes sense, pre or post-process application inputs and outputs, etc. Some Services don't even have Managers, and instead just define an interface for other Services to use. But, maybe I'm getting a little ahead of myself here.

Here's the basic model: when an instance of the Darkstar stack starts up, some core components are created (we'll get to this in a minute). After that, all of the standard Services are loaded and initialized. Once these are in place, any custom Services are started. If you're writing a Service, it will almost always be a custom Service (as in the example below), which means that you will be able to take advantage of all the other Services in the system. First all of the Services are constructed, and then once all Services have successfully been created, they are told that the system is ready. Remember that this happens in each instance of the stack (i.e., on each node), so while an application is only initialized once, your Service will get created on each node.

Let's take a simple example. To start, all Services must implement the com.sun.sgs.service.Service interface, and must implement a constructor with a specific set of parameters:
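A minimal sketch of that skeleton, assuming the 0.9.6 package layout (the class name MyService is just a placeholder for this example):

```java
import java.util.Properties;

import com.sun.sgs.kernel.ComponentRegistry;
import com.sun.sgs.service.Service;
import com.sun.sgs.service.TransactionProxy;

public class MyService implements Service {

    private final TransactionProxy txnProxy;

    // Every Service must provide a constructor with exactly this signature;
    // the kernel invokes it when the node starts up.
    public MyService(Properties properties,
                     ComponentRegistry systemRegistry,
                     TransactionProxy txnProxy) {
        this.txnProxy = txnProxy;
        // non-transactional setup goes here
    }

    // the remaining Service methods are discussed next
    public String getName() { return getClass().getName(); }
    public void ready() throws Exception { }
    public boolean shutdown() { return true; }
}
```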

You can look at the javadocs for more detail, but basically the properties are all properties associated with the application, the registry gives you access to core components, and the transaction proxy gives you access to individual transactions. More on all of this in a minute. You should treat this constructor as any other Java programming language constructor: it's a chance to do any initial setup you need to do. Once you return from your constructor, other Services may call you, so plan accordingly. Note that this is also your chance to decide that there's something wrong with the setup of the system, and throw a RuntimeException. If this happens, startup will fail and the node will shut down. This constructor is not invoked in a transaction, so you can spend as much time as you want initializing. Just remember that you can't invoke the AppContext, many other Services, etc. without setting up a transaction. Again, more on this in a minute.

In addition to a specific constructor, you need to implement a couple other methods:
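Sketched against the same 0.9.6 Service interface, those methods look something like this:

```java
/** A unique identifier for this Service; the class name is a good choice. */
public String getName() {
    return getClass().getName();
}

/** Called once every Service has been constructed; throwing aborts startup. */
public void ready() throws Exception {
    // last chance to fail if the system isn't set up the way we need
}

/** Called when the local node is shutting down. */
public boolean shutdown() {
    // clean up; return false if shutdown could not be completed
    return true;
}
```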

The getName() method is just an identifier for your Service, typically the fully-qualified class name or something similar that will be unique and easy to use in identifying your Service. The ready() method is called on all Services in turn once all the Services have been constructed. It's basically a notification that the system is finished setting up, and your final chance to bail if there are any last-minute problems getting set up. The shutdown() method, unsurprisingly, is called when the local node is shutting down. You can take as long as you need to shut down, but if for any reason you can't finish shutting down, then you can return false.

The only thing left is to get your Service started. To do this, you use the com.sun.sgs.services property, which is a colon-separated list of Services to include on startup. Make sure your Service implementation is in your classpath, and then add your Service's fully qualified class name to the property, either in your application's property file or on the command line:

com.sun.sgs.services=MyService

That's it. You've got a Service implemented, setup and running in the stack. Any other Service can resolve it, and use the functionality that it exports. Of course, our Service doesn't do much at this point. Let's work on that.

Suppose you are building some infrastructure around Darkstar, including a web interface where players can login to chat, post to forums, etc. It would be nice to let players who are in-game know when friends are logged into the web site, since then they could chat back and forth, invite the friends to login to the game, etc. There are lots of ways to accomplish this, but in the spirit of trying to come up with an example that shows most aspects of writing a Service, let's assume that you want to do this by calling out to the web site to get the player's status. The model we'll assume here is that there's a known URL you can query that will return a boolean representing the status. Pretty simplistic, but not too unreasonable, eh? (work with me here...)

First off, your Service will need to know where to go to make this query. Since you're already getting properties as an input, this is a good place to define the server end-point:
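One way this might look is sketched below; the property key com.example.status.url and the shape of the web site's reply (a single boolean on one line) are assumptions made up for this example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Properties;

import com.sun.sgs.kernel.ComponentRegistry;
import com.sun.sgs.service.Service;
import com.sun.sgs.service.TransactionProxy;

public class MyService implements Service {

    /** Hypothetical property naming the status query end-point. */
    public static final String URL_PROPERTY = "com.example.status.url";

    private final String statusUrl;
    private final TransactionProxy txnProxy;

    public MyService(Properties properties,
                     ComponentRegistry systemRegistry,
                     TransactionProxy txnProxy) {
        this.txnProxy = txnProxy;
        statusUrl = properties.getProperty(URL_PROPERTY);
        if (statusUrl == null)
            throw new IllegalArgumentException(URL_PROPERTY + " must be set");
    }

    /** Blocking query: asks the web site whether the player is logged in. */
    public boolean isLoggedIn(String username) throws IOException {
        URL url = new URL(statusUrl + "?user=" +
                          URLEncoder.encode(username, "UTF-8"));
        BufferedReader in =
            new BufferedReader(new InputStreamReader(url.openStream()));
        try {
            return Boolean.parseBoolean(in.readLine());
        } finally {
            in.close();
        }
    }

    // getName(), ready(), and shutdown() as shown earlier
    public String getName() { return getClass().getName(); }
    public void ready() throws Exception { }
    public boolean shutdown() { return true; }
}
```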

Easy, right? Now, note that none of what we've done so far has been transactional. This means that the call to isLoggedIn() can take as long as we want, and nothing will time out. Of course, this means that we can't call this method from a transaction, and it also means that we have to be pretty careful calling this method, since it could block something else that needs to run in a timely manner. So, while we've got some nice basic logic that other Services may be able to use, we don't have something we can export to the application.

This gets to the core of perhaps the hardest problem with writing code at this layer: you are working on the boundary between transactional and non-transactional code, and often have to switch between these two worlds. The key is to document your methods carefully, and keep track of what state you're in at any given time. It can twist your mind around, but once you get into the zen of how this works, it's a lot of fun (where I define "fun" to mean "fun to crazy people like me who like hurting their brains on occasion"). In case you're wondering, no, this isn't specifically an artifact of how we do things in Darkstar. Pretty much any transaction-driven system has this layer, and it's always difficult to work here.

So, what's the trick to writing good code at this level? You need to work asynchronously. If you look at how the standard Services are implemented, you'll see a lot of hand-off and Future-like interfaces. The isLoggedIn() method above is synchronous: it blocks until a result is available or an error occurs. With this in mind, let's add a new method that hands off control:
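For instance (the LoginStatusListener callback interface is an invention for this example, not part of the Darkstar APIs):

```java
/** Hypothetical callback interface used to deliver the result. */
public interface LoginStatusListener {
    void notifyLoggedIn(String username, boolean loggedIn);
    void notifyFailed(String username, Throwable cause);
}

// ...and on MyService, alongside the blocking version:

/** Returns immediately; the listener is invoked once the result is known. */
public void isLoggedIn(String username, LoginStatusListener listener) {
    // hand the blocking query off to a task; implementation to follow
}
```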

The new method is defined to return immediately, and takes a callback object that is called when a result is ready. Now we have a method that will take a small, bounded amount of time to run, and can therefore be called from within a transaction. Better still, this is something that can easily be exposed to application code, since the application can make this call and then wait to be notified with a result. The only thing left to do now is implement the query method (details, details..). Usually in Java this would be a place where you'd create a new Thread to do the work, and that would be fine here. But, there's another option that has some nice benefits. Darkstar is built on a task model, with core schedulers that schedule and run the tasks, report profiling data, etc. When you write a Service, you can use this core facility:
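Getting at the scheduler is just a matter of asking the ComponentRegistry that was handed to your constructor; a sketch:

```java
import com.sun.sgs.kernel.TaskScheduler;

// a field on MyService...
private final TaskScheduler taskScheduler;

// ...assigned in the constructor:
taskScheduler = systemRegistry.getComponent(TaskScheduler.class);
```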

This interface will give you a scheduleTask method that you can use to submit a task to run. The interface you use is KernelRunnable, which has a run() method as well as a method for identifying the type of task (which is really useful when you look at the profiling output). Just implement the run() method to call the original isLoggedIn() method, and then invoke the callback when it's finished. If you look at the scheduler methods, you'll see that they take an Identity as well as a task. This is the owner of the task, or the entity who is actually doing the work (for all the gory details, check out my last blog entry).

The easiest thing to use here is the identity of the calling task, which can be fetched from the TransactionProxy provided to the constructor (yes, I know, you can get the current identity even when you're not in a transaction...it's weird, and something of an historical artifact of the system but we're unlikely to fix it by changing the name now). Putting it all together:
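A sketch of the assembled pieces; LoginStatusListener is a hypothetical callback interface (with notifyLoggedIn/notifyFailed methods) invented for this example, and doNotify() is a small helper that invokes it (we'll revisit that helper shortly):

```java
import com.sun.sgs.auth.Identity;
import com.sun.sgs.kernel.KernelRunnable;

/** Asynchronous query: returns immediately, notifying the listener later. */
public void isLoggedIn(final String username,
                       final LoginStatusListener listener) {
    // the owner of the new task is whoever is running the current task
    Identity owner = txnProxy.getCurrentOwner();
    taskScheduler.scheduleTask(new KernelRunnable() {
        public String getBaseTaskType() {
            return MyService.class.getName() + ".LoginQueryTask";
        }
        public void run() throws Exception {
            try {
                // call the blocking version from earlier
                doNotify(listener, username, isLoggedIn(username), null);
            } catch (Exception e) {
                doNotify(listener, username, false, e);
            }
        }
    }, owner);
}

/** Invokes the callback; for now this is a direct call. */
private void doNotify(LoginStatusListener listener, String username,
                      boolean loggedIn, Throwable cause) {
    if (cause == null)
        listener.notifyLoggedIn(username, loggedIn);
    else
        listener.notifyFailed(username, cause);
}
```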

Sweet. We now have a Service that does some setup when the node starts up, and provides two methods for querying the status of a user at a web site: one synchronous and the other asynchronous. The asynchronous one uses a call-back interface, so that the caller returns immediately and is later notified about the result. This hand-off is done using one of the core components of the system for scheduling tasks, so you'll get to collect profiling details about this task each time it runs. So, we're done, right?

Almost. In spite of everything we've done, we still haven't actually run any transactions. We do have a method that can be called within a transaction (although it doesn't need to be) because it returns immediately, but if we want to let application code call down into this method, we'll need a way to get "back into" a transaction to call back up to the application when the query finishes. In other words, when we're ready to deliver the notification, we want to do that in a new transaction.

The way to do this is by using the other scheduler. Just as there's a TaskScheduler for scheduling non-transactional tasks, there's also a TransactionScheduler that has similar methods, but runs its tasks within a transactional context. This is actually all it takes in Darkstar to start a new transaction. So, when your Service starts up, get the transactional scheduler the same way you got the non-transactional one:
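A sketch, with doNotify() reworked to run the callback inside a new transaction (LoginStatusListener remains the hypothetical callback interface used throughout this example):

```java
import com.sun.sgs.kernel.KernelRunnable;
import com.sun.sgs.kernel.TransactionScheduler;

// another field, fetched in the constructor just like the TaskScheduler:
private final TransactionScheduler txnScheduler;
// ...in the constructor body:
txnScheduler = systemRegistry.getComponent(TransactionScheduler.class);

/** Now schedules a transactional task instead of calling back directly. */
private void doNotify(final LoginStatusListener listener,
                      final String username, final boolean loggedIn,
                      final Throwable cause) {
    txnScheduler.scheduleTask(new KernelRunnable() {
        public String getBaseTaskType() {
            return MyService.class.getName() + ".NotifyTask";
        }
        public void run() throws Exception {
            // this runs in a new transaction, so the listener can safely
            // interact with application code, the AppContext, etc.
            if (cause == null)
                listener.notifyLoggedIn(username, loggedIn);
            else
                listener.notifyFailed(username, cause);
        }
    }, txnProxy.getCurrentOwner());
}
```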

Here's why the doNotify() was included in the example above. To get into a transaction, rather than calling the callback object directly, now you can use one of the scheduleTask methods on the transactional scheduler just like you did to run the non-transactional task. Your run() method is now running in the context of a new transaction, meaning that it can interact with application code, access the AppContext, etc. Now you have a method that can be called from a transaction, and will provide notification of the result in a new transaction. So, we're done, right?

Well, not exactly. There's one more piece of this that needs to be taken care of before you can expose this functionality to your application. Recall that one of the nice things about programming to transactional systems is that transactions can be aborted and re-tried, but as a developer all you ever see is the final, successful run. From the failed transactions, there are no side-effects. Of course, the underlying infrastructure needs to support this model; in this case, that's our Service. Since the isLoggedIn() method will actually query a web server and then call back to application code, we really only want to do this operation if the calling transaction commits. This is much like the networking model for application code, where Session and Channel sends only actually happen if the calling transaction succeeds.

To support this model, our Service needs to add one extra layer of indirection. When the isLoggedIn() method is called, rather than actually scheduling the task, we want to delay until the current transaction commits. Then we want to schedule the tasks to run. This involves what's called participation in the transaction. By participating in the transaction, the Service will know when the various stages of a transaction happen, and can act accordingly. This also gives the Service a chance to abort the transaction if we get to the end and there's any trouble, but for this simple example we don't need to worry about that. Note that there are some utility classes for writing participants in the com.sun.sgs.impl.util package, but again, this example is small enough that we'll stick with the basic APIs.

You can participate as a durable or a non-durable participant. The former is something that actually stores data persistently, and needs to maintain consistency of the data (e.g., the DataService). The latter is something that may use durable Services to store data, but doesn't maintain data itself. In our current system we only allow one durable participant per transaction, so unless you're replacing the DataService implementation, you'll always be writing a non-durable participant:
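Declaring participation is just a matter of implementing the right interface; a sketch:

```java
import com.sun.sgs.service.NonDurableTransactionParticipant;
import com.sun.sgs.service.Service;

public class MyService
    implements Service, NonDurableTransactionParticipant {
    // prepare(), commit(), prepareAndCommit(), abort(), and getTypeName()
    // are filled in below
}
```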

Now that the Service implements the participant interface, it can, well, participate in transactions (we'll look at the implementation of this interface in a minute). To do so, it needs to get the current transaction and join it. You can join a single transaction as many times as you like, but as long as you call join() at least once, you'll start participating in the given transaction. In addition to joining the transaction we'll need to keep some state associated with each transaction, in this case the loggedIn queries that we want to make. This can be done any number of ways, so for this example we'll just use a map (again, the details of the map will get filled in a little later). Adding the code for joining a transaction and maintaining state, we end up with:
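A sketch of the joining logic; PendingQuery is a placeholder for whatever per-query state we decide to store (the map's value type gets pinned down a little later in the text):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.sun.sgs.service.Transaction;

// per-transaction state for all the queries requested in each transaction
private final ConcurrentMap<Transaction, Set<PendingQuery>> txnMap =
    new ConcurrentHashMap<Transaction, Set<PendingQuery>>();

public void isLoggedIn(String username, LoginStatusListener listener) {
    Transaction txn = txnProxy.getCurrentTransaction();
    Set<PendingQuery> set = txnMap.get(txn);
    if (set == null) {
        // first call in this transaction, so join as a participant
        txn.join(this);
        set = new HashSet<PendingQuery>();
        txnMap.put(txn, set);
    }
    set.add(new PendingQuery(username, listener));
}
```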

Before we continue, there are (at least) two things to note here. First, while the system is multi-threaded (thus the concurrent map), a given transaction always runs in a single thread. This means that, within the context of work done for that transaction, you know there won't be any contention. That's why it's safe to add the value to the map as above, and why no extra synchronization is needed. Second, this method now assumes that a transaction will always be active when it's called. Otherwise the call to getCurrentTransaction would throw an exception. In a full version of this Service you should catch that exception, and use it to signal that there's no transaction so you can just schedule the task directly. As an aside, note that the join() call can be done as often as you like, but as an optimization (and since we're already implicitly checking to see if we've joined the transaction by seeing if we're maintaining any state for that transaction yet), each given transaction is only joined once.

Good so far? To review, the code above has setup state unique to each running transaction, and made sure to join each transaction so that our Service can act as a participant in any transaction where it's doing any work (obviously if the Service's method is never called, then it will never join the transaction, so it won't add any processing overhead to other transactions). Now, what about that value in that map? Well, what we want is some kind of Set to keep track of each of the queries that we'll be making. The question is, what goes in that Set?

We want to keep track of the queries we're planning to run, without actually running them until the transaction commits. We also want (for reasons that will become clear later) to be guaranteed that when it comes time to commit, we can run those tasks. Conveniently, the scheduler interfaces provide methods with the same inputs as the scheduleTask() methods, but for requesting reservations to run tasks. This means that we can make a non-binding reservation, and then decide later if we want to use it. Neat, huh? With this in mind, we can update the map:
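With reservations in hand, the map's values become sets of TaskReservations. In this sketch, queryTask() stands for a hypothetical helper that builds the same KernelRunnable we were scheduling before:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.sun.sgs.kernel.TaskReservation;
import com.sun.sgs.service.Transaction;

private final ConcurrentMap<Transaction, Set<TaskReservation>> txnMap =
    new ConcurrentHashMap<Transaction, Set<TaskReservation>>();

public void isLoggedIn(String username, LoginStatusListener listener) {
    Transaction txn = txnProxy.getCurrentTransaction();
    Set<TaskReservation> set = txnMap.get(txn);
    if (set == null) {
        txn.join(this);
        set = new HashSet<TaskReservation>();
        txnMap.put(txn, set);
    }
    // reserve, but don't schedule: the reservation is used at commit,
    // or cancelled on abort
    set.add(taskScheduler.reserveTask(queryTask(username, listener),
                                      txnProxy.getCurrentOwner()));
}
```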

Note also that in the case of failure the doNotify() method will have to be updated to do this kind of delayed logic, but since the TransactionScheduler has the same reservation mechanism, this is easy (just do it only where an Exception is caught, and not in the notification from the running query itself...I'll leave this as an exercise for the reader...heh).

Ok. So, now we've updated our Service to participate in transactions, track state associated with any transaction it's participating in, and delay actually running any queries until the calling transaction commits. The last piece left is to actually implement the participation methods, and use the reservations:
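A sketch of those participant methods, operating on the per-transaction map of reservations described earlier:

```java
import com.sun.sgs.kernel.TaskReservation;
import com.sun.sgs.service.Transaction;

/** Nothing to verify; returning false means we still want commit or abort. */
public boolean prepare(Transaction txn) throws Exception {
    return false;
}

/** The transaction succeeded, so use the reservations to run the queries. */
public void commit(Transaction txn) {
    for (TaskReservation reservation : txnMap.remove(txn))
        reservation.use();
}

/** Single-participant optimization; in practice not called on our Service. */
public void prepareAndCommit(Transaction txn) throws Exception {
    prepare(txn);
    commit(txn);
}

/** The transaction failed, so cancel the reservations; nothing will run. */
public void abort(Transaction txn) {
    for (TaskReservation reservation : txnMap.remove(txn))
        reservation.cancel();
}

/** Identifies this participant in profiling and management output. */
public String getTypeName() {
    return getClass().getName();
}
```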

Ok, so what just happened? When a transaction commits, it actually uses a 2-phase commit protocol. First, all the participants are asked to prepare for the commit operation. This is your last chance to complain and cause the transaction to abort. Once you return from prepare(), you may not fail to commit your state. Returning false from prepare() means that you still need to get called for the other stages of the transaction.

Once all of the participants are prepared, then they are all called to actually commit their state. This is the point where we know that the transaction is going to succeed, and so now we can actually use those reservations and schedule the tasks that will do the queries. Remember earlier when I said we wanted to make sure that we can run the tasks later? This is why. Once we get to the commit phase, we're not allowed to fail, so we need these reservations to make sure that we proceed. Note that in practice prepareAndCommit() will never be called on your Services, since this is an optimization used in special cases, and typically only on the DataService.

If at any time during the running of the transaction, or during the prepare phase, some fatal error occurs, then the transaction will be aborted. If your Service has joined the transaction, then it will get notified. This means that the transaction is failing, and state needs to be rolled back. For our Service, this is just a matter of canceling the reservations. This ensures that for any transactions that don't commit, we don't ever schedule any queries or notifications to the caller about errors. Note that the final method is used to identify the participant in a way that's useful when looking at profiling data or other management interfaces.

One final thing to note here is that once prepare() or abort() has been called, the transaction is over. This means that you can't query for the current transaction state, or do anything that involves open transactions, including calling other Services. Our example Service hasn't actually made use of any other Service (you're likely to use the DataService at the very least), but had we done so, we couldn't invoke them at this point. Keep this in mind as you design your Services.

One other final thing to note is that I wanted a simple example of some pending operation so I used tasks. In practice, you may find it easier to use the TaskService which provides a much richer interface than the TaskManager. It is designed to handle delaying operations, persisting tasks to guarantee they run (with our Service, if the current node fails then the query operation is lost), etc.

This brings us to the final piece in all of this. One of the reasons for having Managers is to selectively decide what interfaces to expose only to low-level code, and what methods the application has access to. The final step in making our Service useful in supporting applications is writing a Manager. In our case, this Manager should be pretty simple, with just one method that can be called to make our query. Managers don't implement any specific interface, but they need a Constructor that will accept the Service instance so that they can call through. You should look at the implementation of the standard Managers for details on the full pattern we use for separating interfaces and implementations, but for the sake of simplicity, here's a fully implemented Manager for our Service:
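A sketch of such a Manager. To keep the snippet self-contained, it defines trivial stand-ins for the callback interface and for the interface the Service exports; in a real stack the constructor would accept the actual MyService (or, better, an interface it implements):

```java
/** Stand-in for the hypothetical callback interface used in this example. */
interface LoginStatusListener {
    void notifyLoggedIn(String username, boolean loggedIn);
}

/** Stand-in for the interface our Service exports to the Manager. */
interface LoginStatusService {
    void isLoggedIn(String username, LoginStatusListener listener);
}

public class LoginStatusManager {

    private final LoginStatusService service;

    // Managers must provide a constructor accepting their backing Service,
    // so the kernel can wire the two together at startup
    public LoginStatusManager(LoginStatusService service) {
        this.service = service;
    }

    /** The one method exposed to application code: an asynchronous query. */
    public void isLoggedIn(String username, LoginStatusListener listener) {
        service.isLoggedIn(username, listener);
    }
}
```

The Manager simply calls through to the Service, exposing only the asynchronous method; the blocking isLoggedIn() and the participant machinery stay hidden from application code.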

This was definitely not an exhaustive guide to Services. I didn't go into any detail on using other Services, some of the Service-level interfaces that aren't exposed to applications, etc. I didn't talk about node-local versus cluster-wide design, and how to use the Watchdog and Node Mapping Services. I didn't talk further about the details of Identity. I didn't get into the various design considerations for caching and working with external databases or other similar services. I figure this entry is already long enough (sorry about that), and those topics can wait for the next installment.

In spite of these omissions, I hope this was a useful introduction to some of the key concepts and details involved in writing your own Service, I hope you'll ask questions, and please, if you see an error in anything I've written, let me know! Most importantly, I hope you'll experiment, and let us know what you'd like to see added, or what utilities you think would help at this level. Finally, note that in the coming weeks, as we push our codebase and development activities into the open, we're planning on publishing a bunch of utility Services, so I hope you'll look at those as examples, and either suggest new Services or contribute your own (as some folks have already started doing...thanks!). It's my real hope that most people will never have to write Services, but that's only going to happen if there are enough pieces already in place that folks have the utility they need to really focus on game development. Thanks again to the whole community for your creativity and curiosity in the space, and please let me know how to help going forward!

"I didn't talk about node-local versus cluster-wide design, and how to use the Watchdog and Node Mapping Services. I didn't talk further about the details of Identity. I didn't get into the various design considerations for caching and working with external databases or other similar services."

Personally, I would love to read more about Identity and working with external databases (or similar services); I hope your next post is about those topics.