21 Jan 2007

In Nebula3, there are two fundamentally different scenarios where code runs in parallel. The first scenario is what I call "Fat Threads". A Fat Thread runs a complete subsystem (like rendering, audio, AI, physics, resource management) in its own thread and is basically locked to a specific core.

The second type of thread is what I call a "Job". A job is a piece of data, plus the code which processes that data, packed into a C++ object. Job objects are handed to a job scheduler, which tries to keep cores busy by distributing jobs to cores which currently have a low workload.

The challenge is, of course, to design the overall system in a way that keeps all cores evenly busy at all times. It's likely that bursts of activity will alternate with phases of inactivity during a game frame, so job objects will very likely have to be created well in advance (e.g. one frame ahead), allowing them to fill the gaps in the current frame where the various Fat Threads are idling.

This is where I expect a lot of experimentation and fine tuning.

The second challenge is to make a programmer's life as simple as possible. A game application programmer shouldn't have to care all the time that he's running in a multi-threaded environment. He shouldn't be afraid of creating deadlocks or overwriting another thread's data. He shouldn't have to mess around with critical sections, events and semaphores. Also, the overall engine architecture shouldn't be "fragile". Most traditional multi-threading code is fragile in the sense that race conditions may occur, or a forgotten critical section may corrupt data.

Multi-threading becomes tricky when data needs to be shared and when communication between threads needs to happen. These are the two critical areas where a solution to the fragile-code-problem must be found.

On the large scale, Nebula3 solves these two problems with a concept called "Parallel Nebulas". The idea is that each Fat Thread, which runs a complete subsystem, has its own minimal Nebula runtime, consisting of just the components required for that subsystem. So if a subsystem running in its own thread needs file access, it has its own file server which is completely separate from the file servers in the other Fat Threads. The advantage of this solution is that most of the code in Nebula doesn't even have to be aware that it is running in a multi-threaded environment, since no data at all is shared between Fat Threads. Every minimal Nebula kernel runs in complete isolation from the other Nebula kernels. The disadvantage is, of course, that some memory is wasted on redundant data, but we're talking a couple of kilobytes, not megabytes.

This data redundancy eliminates the need for fine grained locking, and frees the programmer from having to think about multi-threading safety at each line of code.

But of course, communication between Fat Threads must happen at some point, otherwise the whole concept would be useless. The idea here is to establish one and only one standard mechanism of communication, and to make really sure that this communication system is bullet-proof and fast. This is where the messaging subsystem comes in. Communication with a Fat Thread is only possible by sending a message to it. A message is a simple C++ object which holds some data, along with setter and getter methods. With this standard means of communication, only the actual messaging subsystem code has to be thread-safe (also, access to resources associated with messages, like memory buffers, must be restricted, because they represent shared data).
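To make this concrete, a message along these lines is little more than a data container with accessors. The following sketch is purely illustrative (the class and method names are my own, not Nebula3's actual API):

```cpp
#include <string>

// Hypothetical sketch of a message object: a simple C++ object
// holding some data, along with setter and getter methods.
// A common base class lets the messaging subsystem handle all
// message types uniformly.
class Message
{
public:
    virtual ~Message() { }
};

// an illustrative concrete message, e.g. sent to a rendering Fat Thread
class LoadTextureMessage : public Message
{
public:
    void SetResourceName(const std::string& name) { this->resName = name; }
    const std::string& GetResourceName() const { return this->resName; }
private:
    std::string resName;
};
```

The sender fills such an object and hands it to the target thread's message port; only that hand-off needs to be thread-safe.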

This solves much of the multi-threading trouble in the Fat Thread scenario, but doesn't solve anything for Job objects. Nebula3 will very likely need to put restrictions in place on what a Job object may and may not do. The simplest approach is to restrict jobs to doing simple computations on memory buffers. That way, no complex runtime environment needs to exist for jobs (no file i/o, no access to rendering, and so on). If this isn't enough, a "job runtime environment" must be defined, which would be its own minimal, isolated Nebula runtime, just as in the Fat Threads. Since a job doesn't start its own thread, but is scheduled into an existing thread from a thread pool, this shouldn't be much of a problem in terms of runtime overhead.
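Under the most restrictive approach, a job could be sketched roughly like this (the names and details are hypothetical; the real scheduler interface will certainly differ):

```cpp
#include <cstddef>

// Hypothetical sketch of a restricted Job object: it owns nothing but
// a pointer to a memory buffer and the code to process it, so no
// runtime environment (file i/o, rendering access, ...) is needed
class Job
{
public:
    Job(float* buf, std::size_t num) : buffer(buf), numElements(num) { }

    // the scheduler would invoke this on some worker thread;
    // the job only ever touches its own buffer, so no locking is needed
    void Execute()
    {
        for (std::size_t i = 0; i < this->numElements; i++)
        {
            this->buffer[i] *= 2.0f;   // some simple computation
        }
    }
private:
    float* buffer;
    std::size_t numElements;
};
```

As long as no two jobs share a buffer, any number of such jobs can run on different cores without synchronization.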

So far, only Nebula3's IO subsystem has been implemented in a Fat Thread as a proof-of-concept, and it is working satisfactorily. For traditional synchronous IO work, a Nebula3 application can simply call directly into the thread-local IO subsystem. So for simply listing the contents of a directory, or deleting a file, a simple C++ method call will do. For asynchronous IO work, a well-defined set of messages exists for common IO operations (i.e. ReadStream, WriteStream, CopyFile, DeleteFile, etc...). Doing asynchronous IO takes just a few lines of code: create the message object, fill it with data, and send the message to an IOInterface singleton. If necessary, it is possible to either wait or poll for completion of the asynchronous operation.
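The calling pattern can be sketched with a small self-contained mock. The class names follow the text, but the bodies are stand-ins: the real IOInterface enqueues the message to the IO Fat Thread and returns immediately, rather than handling it inline as done here:

```cpp
#include <string>

// mock of an asynchronous IO message: create, fill with data, send,
// then poll the Handled flag for completion
class ReadStreamMsg
{
public:
    ReadStreamMsg() : handled(false) { }
    void SetURI(const std::string& u) { this->uri = u; }
    const std::string& GetURI() const { return this->uri; }
    void SetHandled(bool b) { this->handled = b; }
    bool Handled() const { return this->handled; }
private:
    std::string uri;
    bool handled;
};

class IOInterface
{
public:
    // stand-in: the real implementation would enqueue the message for
    // the IO thread, which sets the handled flag when the read is done
    static void Send(ReadStreamMsg& msg)
    {
        // ... asynchronous file read would happen here ...
        msg.SetHandled(true);
    }
};
```

Usage then is exactly the few lines described above: fill a `ReadStreamMsg`, call `IOInterface::Send()`, and poll `msg.Handled()` (or wait) until the operation completes.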

The good thing is that the entire IO subsystem doesn't contain a single line of multi-threading-aware code, since the various IO subsystems in the different Fat Threads are totally isolated from each other (of course, synchronization must happen at SOME point for IO operations, but that is left entirely to the host operating system).

20 Jan 2007

C++ only offers automatic lifetime management for objects created on the stack. When the C++ context is left, stack objects are destroyed automatically:

{
    // create a new object on the stack
    MyObject obj;

    // do something with obj...

    // current context is left, obj is destroyed automatically
}

When creating an object on the heap, the object has to be destroyed manually, otherwise a memory leak will result:

{
    // create an object on the heap
    MyObject* objPtr = new MyObject;

    // do something with objPtr...

    // need to manually destroy the object
    delete objPtr;
}

This all gets much more complicated when more than one "client" needs access to a C++ object, because then ownership rules must be defined (the owner is responsible for deleting the object, all other clients just "use" it).

In a complex software system, this ownership management gets tricky very quickly. An elegant solution to this problem is refcounting. With refcounting, no ownership needs to be defined: each "client" increments a reference count on the target object, and decrements the refcount by calling a Release() method when it no longer needs to access the object. When the refcount reaches zero (meaning no client accesses the object any more), the object is destroyed. This fixes the multiple-client scenario, but still requires the programmer to manually call the Release() method at the right time.
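A minimal intrusive refcounting base class might look like this (deliberately simplified and not thread-safe; Nebula3's actual implementation has more to it):

```cpp
// minimal sketch of an intrusive refcounting base class:
// clients call AddRef()/Release(), the object deletes itself
// when the last reference is released
class RefCounted
{
public:
    RefCounted() : refCount(0) { }
    void AddRef() { this->refCount++; }
    void Release()
    {
        if (--this->refCount == 0)
        {
            delete this;   // last client gone, destroy self
        }
    }
    int GetRefCount() const { return this->refCount; }
protected:
    // protected destructor prevents accidental stack creation
    // and direct delete from the outside
    virtual ~RefCounted() { }
private:
    int refCount;
};
```

The remaining weakness is visible right here: every client must remember to pair each AddRef() with exactly one Release(), which is the problem smart pointers then take off the programmer's hands.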

Smart pointers fix this second problem as well. A smart pointer is a simple templated C++ object which points to another C++ object and manages the target's refcount on creation, destruction and assignment. Other than that, a smart pointer can be used just like a raw pointer, except that it fixes all the dangerous stuff associated with raw pointers:
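A minimal sketch of such a smart pointer class follows; it is simplified, and the `Demo` class is only there to give the pointer a refcounted target to manage (it is not part of any real API):

```cpp
// minimal smart pointer sketch: manages the target's refcount on
// creation, destruction and assignment (assumes the target provides
// AddRef()/Release() methods)
template<class T>
class Ptr
{
public:
    Ptr() : ptr(0) { }
    Ptr(T* p) : ptr(p) { if (this->ptr) this->ptr->AddRef(); }
    Ptr(const Ptr<T>& rhs) : ptr(rhs.ptr) { if (this->ptr) this->ptr->AddRef(); }
    ~Ptr() { if (this->ptr) this->ptr->Release(); }
    Ptr<T>& operator=(const Ptr<T>& rhs)
    {
        // AddRef before Release makes self-assignment safe
        if (rhs.ptr) rhs.ptr->AddRef();
        if (this->ptr) this->ptr->Release();
        this->ptr = rhs.ptr;
        return *this;
    }
    T* operator->() const { return this->ptr; }
    T& operator*() const { return *this->ptr; }
private:
    T* ptr;
};

// a tiny refcounted demo class to exercise the smart pointer
class Demo
{
public:
    Demo() : refCount(0) { }
    void AddRef() { ++this->refCount; }
    void Release() { if (--this->refCount == 0) delete this; }
    int refCount;
};
```

Note how every copy bumps the refcount and every destroyed copy drops it, so the last Ptr going out of scope releases the target automatically.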

{
    // create a new object through a smart pointer
    Ptr<MyObject> obj = MyObject::Create();

    // do something with obj...

    // at the end of context, the smart pointer object is destroyed
    // and will release its target object
}

With smart pointers, a heap object behaves exactly like a stack object; no extra care is needed to release the object at the right time. Smart pointers also fix the cleanup problem with arrays of pointers. If you create a dynamic array of raw pointers to heap objects, you must take care to delete the target objects manually before destroying the array, because a raw pointer has no destructor which could be called when the array is destroyed. An array of smart pointers solves this problem as well: when the array is released, it calls the destructors of the contained smart pointers, which in turn release their target objects:

{
    // create a dynamic array of smart pointers
    Array<Ptr<MyObject> > objArray;

    // create objects and append to array
    int i;
    for (i = 0; i < 10; i++)
    {
        Ptr<MyObject> obj = MyObject::Create();
        objArray.Append(obj);
    }

    // when the current context is left, the array is destroyed,
    // which destroys all its contained smart pointers, which in turn
    // destroy their target objects...
}

That's it, simple and elegant. With smart pointers you can work with heap objects just as if they were stack objects!

Refcounting and smart pointers are not perfect, however. They fail on cyclic dependencies (when two objects point at each other, directly or indirectly). There seems to be no clean and easy fix for this. Thankfully, in a well-designed software system, cyclic dependencies are rarely necessary.

19 Jan 2007

The Nebula3 Foundation Layer provides common platform abstraction services below rendering and audio. It's the lowest layer in the Nebula3 layer model and can be used on its own for applications which require an easy to use low level framework but don't need to do 3d rendering.

The Foundation Layer is already pretty much finished, with a complete set of test classes and performance benchmark classes.

The Foundation Layer consists of the following subsystems:

Core: this implements the basic Nebula3 object model, with support for ref-counting, smart pointers, RTTI and creating objects by class name or a class FourCC identifier.

Memory: provides memory management and wrapper functions.

Util: a collection of utility classes, mainly different types of containers, a powerful string class, a guid wrapper class and so on...

Timing: this subsystem offers classes for measuring time and profiling.

Messaging: This is an improved version of Mangalore's messaging subsystem. Messages are a standardized way to communicate between objects in the same thread, between threads, or between applications on the same machine or across a network.

Math: A standalone vector math library. The idea here is to make math code look much like HLSL shader code, and provide the highest performance possible across different platforms and compilers.

Scripting: Nebula3's scripting subsystem is much less obscure than Nebula2's, and easier to extend. The standard scripting language is now Lua (compared to TCL in Nebula2), reducing the memory footprint drastically. Scripting is now no longer tied into the architecture; it is easy to disable scripting altogether and reduce the memory footprint even more. It's still possible to add support for other scripting languages in Nebula3.

Attr: Implements the concept of dynamic attributes from Mangalore. Attributes are compile-safe key/value pairs, thus offering compile-time validation of attribute names and data types. Attributes are the foundation for the database subsystem.

Db: The database subsystem is an improved version of Mangalore's database subsystem, offering an abstract, performance-optimized interface to a SQL database for storing all types of application data. The standard implementation uses SQLite as its database backend, the wrapper code has been carefully tuned for best database performance.

Naming: Implements a hierarchical naming service for Nebula3 objects. Naming is no longer hard-coded into the base class, instead objects can be registered with the naming service at runtime, adding the additional overhead only when needed.

Nebula3 will be divided into 3 layers, each layer building on top of the one below:

Foundation Layer: the lowest level layer, this provides a basic platform abstraction layer below rendering and audio. The Foundation Layer can be used as a platform abstraction for any type of application, not just real-time 3d apps.

Render Layer: this is the medium layer, which adds more meat to the Foundation Layer, like 3d rendering, audio, physics, and scene management.

Application Layer: this is the highest layer and provides a complete game framework which lets the programmer concentrate on the game logic instead of caring about all the little details necessary for being a well-mannered game application.

Nebula3 will integrate Mangalore completely, the subsystems from Mangalore will be integrated into different Nebula3 layers where they fit best.

Nebula3 will be more "C++-ish" than Nebula2, and won't try as hard to use C++ for things it wasn't intended for.

Nebula2 was mainly a rewrite of Nebula's higher-level areas. The kernel and low-level code were largely unchanged from Nebula1, so some of the low-level code is nearly 8 years old, and it shows.

Some Nebula2 features which were "cool" in their day have become irrelevant (at least for Radon Labs), for instance being able to switch between OpenGL and D3D rendering at runtime, or the fine-grained scripting support.

More real-world experience shows how to better arrange certain subsystems, moving them up or down in the Nebula layer model.

Nebula is hard to grasp for beginners, partly caused by its somewhat esoteric object model and other design decisions. Also, experience shows that application programmers work with the high level game framework interfaces (Mangalore), and hardly work with Nebula directly. Thus, Nebula becomes more of a platform abstraction layer for the high level game framework code. Nebula3 will respect this paradigm shift.

Nebula2 is hard to scale upwards and downwards (modern multi-core hardware and DirectX10 on the upper end, Nintendo DS on the lower end). Now, it's probably not a good idea to try to write an engine that scales unchanged from a next-gen console down to a Nintendo DS, but it should be possible to at least use a common engine core which is slim enough for handhelds while still being a good foundation for a next-gen engine (a small memory footprint and good performance don't hurt on bigger hardware either).

Better multithreading infrastructure. Nebula3 is designed with multi-core hardware in mind, and provides a programming model where the programmer doesn't have to care too much about running in a multithreaded environment.

Better networking infrastructure. Networking was bolted into Nebula2 as an afterthought. Nebula3 will offer networking support from the ground up (from direct TCP/UDP communication and direct support for HTTP and FTP on the low-end to session management and builtin multiplayer support for Nebula applications on the high end).

Nebula2 doesn't provide a proper high-level game framework; that's why we wrote Mangalore. This split approach caused much confusion. Nebula3 will be designed in 3 layers, where the highest layer provides a complete application framework, thus integrating Mangalore back into Nebula.

Development focus has shifted to Mangalore, our high-level game framework, and Nebula3, the next big version of the Nebula Device.

We created a stripped down Nebula2 version at Radon Labs (dubbed Nebula Embedded), and ported it to the Nintendo DS (internally called Nebula DS). There is currently a team at Radon Labs working on a title for the Nintendo DS.

Nebula3 will have 3-layer architecture, with the core and graphics subsystems rewritten completely, and Mangalore integrated as the high-level application layer.

16 Jan 2007

The last post of this series dealt with version control, now it's time for bug tracking. In any non-trivial project it's important to have a good plan, or design document, which defines the final product, and a means to keep track of how close the current state of the project is to the final product. Also, in a team, it is important to define the work items each team member should work on at a given time, and to keep track of what work items have been finished, and how this affects the current overall progress versus the planned progress. Finally, during the development of any project, hundreds of unexpected little annoying bugs creep in. These bugs must be collected, analyzed, fixed and squashed.

If this sounds terribly complicated to you, that's because it is. During a game production, many plans and deadlines are made on various levels of abstraction. A simplified list from most abstract to most detailed may look like this:

Game Proposal: This is a very abstract document often used while pitching a new game to publishers. A game proposal is somewhere between 4 and 10 pages and gives a quick overview of the game. The point of the proposal is to get a publisher interested and to convince him that the game will make him a shitload of money.

(Disclaimer: for the rest of the post, when I write "bug", I really mean "work item").

The short answer is: usability. Try to enter a new bug into a tool like Bugzilla or Mantis. It is a nightmare of web forms, filled with dozens of fields which must all be clicked and filled out. It's confusing, takes up to several minutes per bug, and generally isn't a lot of fun. Now, when we were a small company with one project at a time and a single project manager, we already had a system in place which was a joy to work with and where entering a bug only took somewhere between 10 seconds and half a minute. All bugs were laid out in a single table, every line one bug. Bugs could be filtered in an instant, statistics could be generated, and several users could work on the bug database in parallel. It was almost perfect. The name of that magic bug tracking tool? An Excel sheet with a few macros!

So this is where an evaluation odyssey began. We looked at several readily available bug tracking tools and decided to switch to Mantis, while still continuing to look for better alternatives. When we switched to Mantis an interesting effect occurred: projects which used Mantis as their bug tracking tool had dramatically fewer bugs than projects which used our traditional Excel sheet! Obviously, the Mantis projects didn't magically obtain better code quality. People simply entered fewer bugs because entering a bug in Mantis was more complicated than before and took much more time. Unfortunately we weren't really aware of the importance of this fact until late into the project, and we had to start slipping deadlines (something unheard of so far at Radon Labs). This really taught us a lesson how important a good bug tracking tool is for the existence of the company.

We knew exactly what we wanted from a bug tracking tool, we had tried for 2 years to find a better solution than Excel (imagine that!), and finally we decided to write our own tool. We hired a programmer with C# and SQL experience, and within one month we had a first version of a working bug tracker which did everything exactly the way we imagined and fixed all the shortcomings of our Excel sheet!

The requirements for our bug tracker are as follows:

entering a new bug and filtering existing bugs must be simple, intuitive and fast

must support multiple projects

must work over a DSL line

must support user roles (access rights, users assigned to projects and departments)

should use a true SQL database backend

it must be possible to extract certain statistics

must support parent-child and follow-up dependencies between bugs

it must be possible to add attachments to a bug

We decided that it would be best to implement the front end in C#, since GUIs and working with databases is what C# does really well. It's also relatively easy to find programmers experienced in C# and SQL.

The default view of the bugtracker basically looks like an Excel sheet. Each line represents one bug. Every column represents a bug attribute:

State: the current state of the bug, one of: open, in the works, fixed, duplicate, nice idea, or obsolete

Creator: who entered the bug?

Assigned To: who's the bug assigned to?

Date Created: when has the bug been entered?

There are several other fields which are usually hidden, but can be configured to be visible.

Below the table view is a large text entry field for the bug description. This should contain a detailed description of the bug, and at least the steps to reproduce it. The description field also receives automatically generated log messages when the state of the bug changes. The attachment list contains all attached files, which can be inspected by double-clicking on them. It's also possible to save the attachments to the client machine.

At the top of the table view is a row of drop-down boxes which allow quickly filtering the list of displayed bugs (for instance, "show me all my open bugs of priority 1" takes only two mouse clicks). More complex filters with boolean operations can also be created very easily and saved for re-use. There's also a pre-defined standard filter called "My Bugs", which displays all open and work-in-progress bugs assigned to me.

Now, how does bug tracking work in practice?

Alice from QA finds a crash bug in project A (hopefully just an assert() that got triggered). After some trial and error she finds out how to reproduce the bug. She goes into the bug tracker (which is usually open all the time) and creates a new bug in project A, which adds a new empty line to the bug table. She fills out the fields right in place; she knows that the bug should go to programming, but isn't sure who will be working on it. That's why she assigns the bug to the lead programmer of project A, which is Bob. Since it's a crash bug she will definitely set the priority to 1. Usually she also nags Bob directly about the bug if she thinks it should be fixed immediately.

Bob checks his open bugs and looks at the bug description. From the description, looking at the source, and maybe checking previous versions of the source code, he's pretty sure that fixing the bug is programmer Carl's job.

Now he re-assigns the bug to Carl and also tells him to have a look at it ASAP.

Now Carl checks his open bugs and finds Alice's bug. Looking at the description he's pretty sure what's wrong and fixes the bug. Once he's sure the fix works, by trying to reproduce the bug following Alice's repro steps, he commits the changes into version control.

He sets the bug to fixed and tells Bob and Alice that the fix should be in the next build.

Now, in the "official" Life-Of-A-Bug, the fixed bug would be re-assigned to Alice automatically, and when the new build is available, Alice would have to accept the fix from Bob by checking that the bug is indeed fixed, and if that is indeed the case, set the bug to closed. At Radon Labs we omit this final step, and let the bug's life end at the fixed stage.

During production, thousands of bugs are entered into a project's bug database (remember, these are not just critical programming bugs, but all types of work items for the entire team). Often, non-critical bugs will remain unfixed for some time, duplicate bugs will be entered, or bugs become no longer reproducible for some reason. That's why the bug list must be maintained and kept tight. Also, sometimes bug priorities must be decided "by committee": maybe the graphics department thinks some bug is highly critical, while the project manager thinks it isn't. That's why the project manager, the lead programmer and the heads of the graphics and level design departments gather every one or two weeks for a bug triage. This is just a short meeting where the list of open bugs is reviewed, and bugs are re-assigned, re-prioritized, or set to obsolete or duplicate. This is necessary housekeeping to keep the bug list clean, and to set the right priorities in order to hit the next milestone on time.

That's it for a basic overview of bug tracking! However, there's much more to planning and bug tracking than could be covered in a single post, so maybe I'll come back to this topic at some later time.