When doing a real-world game this simple approach has a couple of problems:

blocking: The above code is blocking, when reading from a fast hard disk this is probably not even noticeable, but try loading from a DVD or Bluray disk or some sort of network drive over a slow connection and the game loop will stutter

hard-coded paths: The concept of a current directory is often not portable, you can’t depend on the current directory being set to where your executable is. It is better to establish an absolute root location and have all filename paths in the game relative to that (of course how to establish this root location is platform-dependent again, for instance get the absolute path to the executable, and go on from there)

can’t use different transfer protocols: the above code works fine for local filesystems, but not loading data from a web- or ftp-server, and operations like creating a new file, or randomly seeking in a file may not be available with other protocols.

It is a good idea to restrict the type of file operations that a game can use, e.g.:

do we really need write and create access? An offline game may need to write save-game files and options, while an online game probably doesn’t need access to the local file system at all.

do we really need random seek? Randomly seeking in a file can be either impossible (HTTP) or slow because some mechanical device must be moved around, it’s often better to read a file straight into memory and seek there or to avoid such operations at all.

do we really need to iterate directory content? again, this can be either expensive (mechanical storage device) or impossible (in plain HTTP for instance)

do we really need free-form file paths? Games usually need to access very few places in the file system (the asset directory which is usually read-only, and maybe some sort of per-user writable location for settings and save-games)

do we really need access to file attributes? Stuff like last modification time, ownership, readable/writable. Usually this is not needed.

do we really need the concept of a “current directory”? This can be tricky for portability, and some platforms don’t have the concept of a current working directory at all

That’s a lot of features we don’t need in a game and which are also often not provided by web-based runtime platforms like PNaCl and JS. It helps to look at the HTTP protocol for inspiration, since that is where we need to load our data from anyway in the web scenario:

file system paths become URLs

only one read operation GET, which usually provides an entire file (but can also load a part of a file)

no directory iteration

no “write access” unless specifically allowed by the server

state-less, no current directory or current read position

operations can take very long (seconds or even minutes)

For a game which wants to load its asset from the web the IO system should be designed around those restrictions.

As an example, here’s an overview of the Nebula3 IO system:

all paths are URLs: Not much to say about this :)

a single root location: At application start, a root location is established, this is usually a file:// URL pointing to the app’s installation directory, but can be overriden to point (for instance) to an http:// URL. Loading all data from a web server instead of the local hard disk is done with a single line of code which sets a different root location.

Amiga assigns as path aliases: A filesystem path to a texture looks like this in N3: tex:walls/brickwall.dds, where the tex: is an “AmigaOS assign” which is replaced with an absolute path, incorporating the root directory.

all paths are absolute: there is no concept of a “current directory” in Nebula3, instead all paths resolve to an absolute location at runtime by replacing assigns in the path.

pluggable “virtual filesystem” modules associated with the URL scheme: URLs starting with file:// are handled by a different file system module than http://, plus Nebula3 apps can plug in their own filesystem modules if they want

stream objects, stream readers and stream writers: this is interesting in the web context only because there’s a MemoryStream object which is used to store and transfer downloaded data in RAM

asynchronous IO is really simple: more on that later in this post :)

Since Nebula3 is also used as a command-line-tools framework, the IO subsystem is a bit of a hybrid, which in hindsight was a design fault. There are still all these writing and file creation operations, blocking IO, directory walking etc… which makes the API quite bloated. In a new engine I would probably strictly separate the two scenarios, use the engine as a game framework only, which only supports very simple asynchronous read operations, and write the tools with another framework (or even other language, like python).

Asynchronous IO in Nebula3

Let’s look at async IO in Nebula3 a bit closer since this is the most interesting feature for web-based platforms. This is based on the “non-blocking future” pattern (or whatever you wanna call it) and depends on a frame-driven instead of event- or callback-driven application architecture.

Here’s some pseudo code:

voidStartLoading(){// To start loading data we need to create an // IO request object and "send it off" to the// IoInterface singleton for asynchronous processingPtr<IO::ReadStream> req = IO::ReadStream::Create();
req->SetURI("tex:walls/brickwall.dds");IoInterface::Singleton()->Send(req);// The IoRequest is now "in flight" and will contain// a result at some point in the future. Because we need// to check for completion in some later frame we need to// store the smart pointer somewherethis->pendingRequest = req;// ok, we're done for this frame...}voidHandlePendingRequest(){// this function must be called regularly (e.g. per// frame) to check whether the async loading operation// has finishedif(this->pendingRequest.isvalid()&&this->pendingRequest->Handled()){// ok, the request has been completed, if // the file was loaded successfully we get// a MemoryStream object with its contentif(this->pendingRequest->GetSuccess()){// actually load the data from the memory// stream and throw the request object away,// since all file data is in memory, we can// actually use the normal open/seek/read/close// pattern on the stream objectthis->LoadFromStream(this->pendingRequest->GetStream());// delete the request object, // remember, this is a smart pointer :)this->pendingRequest =0;}}}

There may be less verbose or more elegant versions of this code of course, but the basic idea is that you start loading a file in one frame, and then need to check in the following frames if loading has finished (or failed), and get the completely loaded data in a memory buffer which can be parsed with “traditional” read and seek functions (and which is very fast since everything happens in memory).

This implies that the engine needs to know what to do while some required data has not been loaded yet. For a graphics pipeline this is quite simple by either rendering nothing or some placeholder while the data is still loading.

But there are cases where the code cannot progress without important data being loaded, or where it would be very tricky or impossible to implement asynchronous IO (for instance when integrating complex 3rd party libraries like sqlite).

If we could simply block this wouldn’t be a problem: the worst thing that would happen is that our game loop would stutter, but on web platforms we cannot simply block the main thread (it is easier on PNaCl where it is recommended to move the game loop into a separate thread, which then can block waiting for the main thread to process asynchronous IO requests).

For Nebula3 I fixed this with an additional application object state called the “Preloading Phase”. The idea is that the app enters this state outside of the normal game loop (for instance while displaying a loading screen), and during this state, populates a simple in-memory filesystem (basically just a lookup-table with URLs as keys and MemoryStream objects as values) with the asynchronously loaded data. When all data has been loaded (or failed to load), the app will leave the preloading phase (and hide the loading screen) and synchronous loader code will transparently get the data from the in-memory file system instead of starting an actual asynchronous IO request. Since all this preloaded data resides in memory this means of course that only small data and few files should be preloaded, and the majority of data should be asynchronously streamed on demand during the game loop. It’s really only a workaround for the few cases where synchronous access is absolutely necessary.

emscripten and PNaCl details

Ok, almost done!

For the emscripten and PNaCl platforms I basically wrote a simple Nebula3 filesystem module which fires HTTP GET requests through he respective emscripten and PNaCl API calls, and copies the received data into MemoryStream objects, it’s only a few hundred lines of code each.

The main difference between the two platforms lies in the use of threading:

PNaCl works like “traditional” platforms, there are a number of IO threads (about 10, but that’s tweakable) each of them processes one IO request at a time, so that as many IO requests can be in flight as there are IO threads. Those threads also directly handle processing of the received data like decompression.

In emscripten, the IO calls (sending a HTTP request, and the callback when the response has been received) is handled on the main thread, but the expensive processing (e.g. decompression) of the received data is handed over to a WebWorker pool (usually 4 WebWorker threads). There can still be multiple IO requests in flight because the IO system doesn’t “wait” for an IO request to finish before firing a new one (but it is still throttled to restrict the number of requests in flight in case a lot of requests arrive in a very short time period).

The actual code implementation is straightforward so I’ll spare you the source code samples. The respective class in PNaCl is called pp::URLLoader, and emscripten offers a whole set of rather specialized C functions which all start with emscripten_async_wget. Both fire an HTTP request (emscripten does an XmlHttpRequest, and PNaCl presumably under the hood as well - this has some unfortunate cross-domain implications), and invoke callbacks on failure or when data has arrived. PNaCl needs a bit more coding work since data is received in chunks (and the receive callback can be called multiple times), while emscripten waits until all data is received before calling the received-callback once.

emscripten has more options to integrate the data with the web page DOM (for instance it can automatically create DOM image objects from downloaded image files), and it also has a very advanced CRT IO emulation layer (so you actually can directly use fopen/fclose after the data has been downloaded or preloaded), but I haven’t looked into these advanced concepts very closely since Nebula3 already does a lot of this layering itself.

There’s a similar filesystem layer for NaCl called nacl-mounts, but similarly to emscripten I didn’t look into this very closely since the low-level URL loading functions were a better fit for N3.

3 Nov 2013

Since I’m dabbling with emscripten I’ve had this idea in my head to write or port a KC85/3 emulator, so that I could play the games I wrote as a kid directly in the browser. The existing KC85 emulators I was aware of are not trival to port, they either depend on x86 inline assembly, or are hardwired to a specific UI framework (if you read German, here’s an overview on what’s out there: http://www.kc85emu.de/Emulatoren/Emulatoren.htm )

About 2 weeks ago I started to look around more seriously for a little side project to spent my 3 weeks of vacation around Christmas (I need to burn my remaining vacation days, in Germany employees are basically required by law to take all their vacation - tough shit ;) My original plan was to cobble together a minimal emulator just enough to run my old games: Take an existing Z80 CPU emulator like the one from FUSE, hack some keyboard input and video output and go on from there.

Thankfully I then had a closer look at MESS. I always thought that MESS could only emulate the most popular Western game machines like the C64 or Atari 400, but it turns out that this beast can emulate pretty much any computer that ever existed (between 600 and 1700, depending on how you count), it even has support for the PDP-1 from the early 60’s! When searching through the list of emulated systems here (http://www.progettoemma.net/mess/sysset.php) I stumbled over the following entries:

These were GDR office computers. The 1715 was a CP/M compatible 8-bit PC, and the A7150 was a not-quite-compatible x86 IBM-PC clone. I’m actually not sure what the 5120 was, just that it was a big ugly box with built-in mono-chrome monitor.

Since all those systems are marked as “not working” in this list I wasn’t too enthusiastic yet, but I had to be sure. The latest MESS compiled out of the box on OSX, and it was easy to find the right ROM images in the net. So I started MESS with:

./mess64 kc85_3 -window

To my astonishment I watched a complete boot sequence into the operating system:

Excite!

I also came across the JSMESS project before, which is a port of MESS to Javascript using emscripten. So my next step was to compile JSMESS and see whether the KC emulator works there as well. It booted, but didn’t accept any keyboard input :( After comparing the source code it dawned on me that JSMESS was far behind MESS, about 2 years to be exact. But this was a good excuse to dive a bit deeper into how MESS actually works, and the deeper I crawled the more impressed I became.

MESS had been derived from the well known MAME arcade machine emulator project, with the goal to extend the emulation to “real computers”. Later MESS merged with MAME again, so that today both projects compile from the same code base.

A specific emulated machine is called a “system driver” and can be described by just a few lines of code listing what CPU to use, the RAM and ROM areas, what ROM image to load, and what memory-mapped IO registers exist. You’ll also have to provide several callback routines for handling reads and write to IO addresses and to convert the system’s video memory into a standardized bitmap representation. For a very simple computer built from standard chips a working emulator can be plugged together in a couple of hours, but writing a complete and “cycle-perfect” emulator is of course still a tough challenge, especially if custom chips are used. The overall amount of research and implementation work that went into MESS is almost overwhelming. Pretty much every computer, every mass-produced chip that ever existed is emulated in there, often with all of their undocumented quirks!

Ok, back to the KC85/3: after analyzing the source code of the KC driver it quickly became clear that the keyboard input emulation was the toughest part, since this was where the original KC engineers were very “creative”. As far as I understood the several pages of email exchange which are included as comment in the MESS KC driver, the KC keyboard used a very exotic TV remote control chip to send serial pulses to the main unit (the KC had an external keyboard connected with a “very thin” wire, so it was very likely a simple serial connection). The base unit which received the signal didn’t have a “decoder chip” however, but used its universal Z80-CTC (timer) and -PIO (in/out) chips to decode the signal. Emulating this behaviour seems to be very tricky since a lot of KC emulators have yanky keyboard input (not registering key presses, or inserting random key codes when typing fast, etc…).

Since I didn’t get this to work reliably even after back-porting the latest keyboard input code from MESS (which somewhat works, but still has problems with random keys triggering), I decided to be a bit naughty and implement a shortcut (the “cycle-perfect” emulator purists will likely kill me for this heresy):

After the KC-ROM reads a keyboard scan-code through this tricky serial-pulse decoding described above, it converts the scan code to ASCII and writes it to memory location 0x1FD, and then sets bit 0 in memory location 0x1F8 to signal that a new key code is available. It also maintains a keyboard repeat counter in address 0x1FA. All of this can be gathered from the keyboard handling code in ROM (and is also explained in that very informative, very long comment in the source code). I’m basically “shortcutting” this with C code and write the ASCII code directly to 0x1FD and also handle the key repeat directly in C. The tricky serial decoding stuff in ROM is never triggered this way. With this hack the keyboard input is fairly responsive (sometimes the first key is swallowed, don’t know yet what’s up with this).

Next I had to fix the RGB colors which were off both in MESS and JSMESS (bright yellow-green looked more like puke-yellowish, and all other “inbetween colors” were off too), and I finally back-ported (and also optimized a bit) the video memory mapping code from MESS to JSMESS.

You can check all my changes here on GitHub: https://github.com/floooh/jsmess/tree/floh Right now a “reboot” is going on in the JSMESS project to bring it uptodate with the latest MESS version. I’ll wait with any pull-requests until this is finished and I refreshed my own fork as well. Also I will not try to contribute my “dirty hacks” back to the main code base of course, the MESS guys are right to insist on perfect emulation instead of shortcut hacks like the keyboard hack described above. But my (rather egoistic) main goal is to get my own games running on my web page, so I think I can get away with such hacks in my own fork.

The next challenge is to get all of my games running in JSMESS. This is harder than I thought. Part of the problem is that there exist several memory dump files which are not original. I found dump files with the wrong entry address, and dumps where others have implemented cheats and trainers. So far I’ve got 3 out of 7 games working. Getting the remaining 4 games into working condition might take a while since I may have to do some hardcore assembly debugging to find out what’s wrong.

Thankfully MESS has a completely assembler-level debugger built-in:

Re-constructing the program flow of this 25-year-old game which I wrote in machine code (instead of using a assembler) is actually quite a lot of fun, much easier than trying to reconstruct a program which was written in a high-level language and compiled to machine code. Subroutines often start at “even” addresses, and have a block of NOP instructions appended, in case I needed to add instructions when fixing bugs, strings are usually embedded right into the instruction sequence instead of a central “string pool”. Analyzing the program flow comes down to figuring out what a given subroutine does (drawing a sprite? handling keyboard input? updating the hiscore display?), and what variables are stored at specific memory addresses (for instance current live counter, current position, and so on).

What’s remarkable is how small the game code actually is, even though it is not very dense with all those NOPs inbetween and a lot of redundant code segments (e.g. I didn’t specifically care about code size). Of the about 12kByte of my (very simple) Pacman clone, only about 3.5 kByte are actual code. The entire game code fits on a single screen (marked in yellow here):

Finally, here’s the current result of this work: a JSMESS KC85/3 and KC85/4 emulator, and 3 of my old games running directly in the browser. Don’t try this on an iPhone though (or generally Safari). Firefox or an uptodate Chrome works very well:

8 Oct 2013

Today I ported the OpenGL rendering code in Nebula3's bleeding edge branch back to Windows:

This is remarkable in 2 ways:

It's the first time since around 1997 that I ported a significant amount of code to Windows. Usually it was from Windows towards another platform.

This is also the end of DirectX in our code base (well almost, we're still using the xnamath.h header, which is a standalone header and doesn't require linking against any DX DLL).

Why do I think that this is remarkable:

It is the end of an era! In 1997 I ported Urban Assault from DOS to Windows95 and Direct3D5. This was just around the time when Windows started its career as a gaming platform. D3D5 was the first D3D version which didn't completely suck because it had the new DrawPrimitive API, before that, rendering commands had to be issued through an incredibly arcane "execute buffer" concept (theoretically a good idea if GPUs would have been able to directly parse this buffer, but terrible to use in real-world code). The Urban Assault port to D3D was pretty inefficient since we ported from a software rasterizer (with perspective correction and all that cool shit), and if I remember correctly we issued every single triangle through a single DrawPrimitive call (although that wasn't such a big deal at the time). And the only graphics card which had somewhat acceptable D3D support was the RIVA128 from an underdog company called nVidia (this was before their breakthrough TNT2), and the top dog was the 3dfx Voodoo2 which had much better support for Glide then for D3D. But since UA was published by Microsoft we had to be D3D-exclusive of course.

Since 1998 Direct3D was our primary rendering API, I dabbled around with OpenGL from time to time, but nothing serious. We made the jump to D3D7, D3D8, and finally D3D9. Each new version sucked less and less, and D3D9 is still a really good API. We never made the jump to D3D10 because of Microsoft's exceptionally stupid decision to not back-port D3D10 to Windows XP from Vista, and since Nebula was never about high-end rendering features but instead running on a broad range of old hardware we could never justify to add D3D10 support, since we couldn't give up D3D9.

And as silly as it sounds, this boneheaded Microsoft decision from 7 years ago is one important reason why I'm ditching D3D today. World-wide, WindowsXP is the fastest growing Windows version. It's growing a lot faster than Windows8. Don't believe me? See the Unity hardware stats page for a scary reality check:

The Chinese Dragon has awoken, and it is running exclusively on XP. WindowsXP is also very popular in Eastern Europe and the Middle East. So if you want to support markets east and south of Middle Europe you're basically fucked if you don't support XP.

Another important reason is streamlining the code base. The currently "interesting platforms" (browser and mobile) are all running some variant of POSIX+OpenGL. In this new world the Windows APIs are the exotics, and Microsoft doesn't exactly help the cause by repeating their errors of the past (limiting Windows Store apps to D3D11). By using a single rendering code base (and especially shader code base!) across all platforms we're reducing our technical debt in the future.

I have a fallback plan of course, because there are a few risks:

What if OpenGL driver quality on Windows is as bad as everybody says?

What if we need to support native Windows Store apps (as opposed to a WebGL version running embedded in a browser)?

The fallback plan has 2 stages:

Use ANGLE which layers OpenGL ES2 with some important extensions over D3D9 or D3D11, this is the preferred solution since we don't need to touch the render layer code and shader library.

If ANGLE isn't good enough, write native D3D9 and D3D11 ports of the CoreGraphics2 subsystem, and optimally use some API agnostic shader language wrapper. This wouldn't be as bad as it sounds, each wrapper would have around 7k lines of code, which is about 4.5% of Nebula3 in its minimal useful configuration (which is about 150k lines of code, depending on which other N3 modules are added this can go up to 500k lines of code).

OpenGL isn't perfect of course. It has some incredibly crufty corners, most of those have been fixed in through extensions and newer GL versions over time, but realistically we can't use anything newer then OpenGL ES2 with very few extensions for the renderer's base feature set.

When I removed the DirectX library stubs from the Nebula3 CMake files this afternoon I really had to stop and think for a moment. Who knows, maybe in a future blog post in about 15 years I will write "this was around the time when Windows became irrelevant as a gaming platform"? ;)

However under the hood, the startup process until the NebulaMain() function is entered is completely different from other platforms, since PNaCl doesn't have a main() function. Instead PNaCl has the concept of application Module and Instance objects. This is where the plugin-nature of a PNaCl app shines through. There is a single Module object created on a web page containing a PNaCl app, and for each <embed> element on the page, one Instance object. In reality though, most of the time there will be exactly one Module and one Instance object, so the distinction doesn't really matter.

PNaCl offers two different startup APIs for C and C++. The C++ API is easier to grasp IMHO, so I'll just concentrate on this (this dual C/C++ nature continues through the whole NaCl API, there's a pure C API, extended by a slightly higher-level C++ API.

Hooking up your code to NaCl basically means to write 2 subclasses, one deriving from pp::Module, and one deriving from pp::Instance, and the NaCl runtime will then call into these classes through virtual methods for initialisation and notifying the application about events.

But first things first:

Everything starts at a global C Function called pp::CreateModule() which you must provide, and which must return a new object of your pp::Module subclass (called N3NaclModule in this case):

Although this is the very first function that NaCl will call, you should be aware that initialisers in the global scope (static objects) will already be initialised and have had their constructors called at this point.

The main job of the derived Module class is to create Instance objects, but we can also put some one-time init code in there. There's a pair of functions to initialise and shutdown GL rendering called glInitializePPAPI() and glTerminatePPAPI(). The only rule is that no GL calls must be made outside these two functions, so I guess we could also put them somewhere else, as long as is guaranteed that they are not called multiple times.

But - the most important method in the derived Module class is the factory method for Instance objects called CreateInstance. In my case, I have created a subclass of pp::Instance called NACL::NACLBridge.

Now on to the NACLBridge class, this is (I know I'm repeating myself) derived from the pp::Instance class, but is called "Bridge" for a reason: in the PNaCl we're spawning a dedicated thread for the game loop, and leave the main thread (aka the Pepper thread) for event handling and rendering. Our derived pp::Instance subclass serves as a "bridge" between these 2 threads, that's why it's called NACLBridge.

The NaCl runtime will call into virtual methods of an pp::Instance object for handling events, the most important of these are Init(), DidChangeView(), HandleInputEvent(). For a complete overview and exhaustive documentation of those callback methods I recommend sifting directly through the SDK header: include/ppapi/cpp/instance.h

In the Init() method I'm only building a CommandLineArgs object from the provided raw arguments (these have been extracted from our <embed> element in the HTML page).

The actual initialisation work happens (in my case) in the first call to DidChangeView() by calling a Setup() method in the NACLBridge object. I choose this place because this is where I'm getting the current display dimensions of the <embed> element, which is required for the renderer initialisation (although now thinking about it, I might also be able to extract these from the arguments provided in the Init() method, need to try this out some time).

The NACLBridge::Setup() method only does one thing: create a thread with the NebulaMain() function as entry point, and then return to the NaCl runtime. The code inside NebulaMain() works just as on any other platform, with the only difference that it is not running on the main thread, but in its own dedicated game thread.

The big advantage to run the game loop in its own thread is that you "own the game loop", and you can perform blocking, for instance to wait for IO. The disadvantage is that you can't call any PPAPI (NaCl system functions) from the game thread, which is a blog-post-topic on its own.

So to recap: The ImplementNebulaApplication macro runs on the main thread, and creates one pp::Module and one pp::Instance object. The pp::Instance object creates the dedicated game thread, which calls into the NebulaMain() function, which from that moment on runs the game loop like on any other platform. With this approach we don't need to slice the game loop into frames like on the emscripten platform.

Now that you heroically worked your way through through all of this I'll tell you a secret: NaCl also provides a simple alternative to this complicated mess called the ppapi_simple library, which essentially provides a classic main() function running in its own thread, and because blocking is allowed on this thread, also provides normal POSIX fopen()/fclose() style blocking IO functions (sound familiar?).

Check out the header file include/ppapi_simple/ps.h as starting point.

Unfortunately this ppapi_simple library didn't exist when I started dabbling with NaCl about 2 years ago, certainly would have made life a lot easier. On the other hand, the work that had already gone into the NaCl port made the emscripten port easier, which wouldn't be the case had I used the ppapi_simple wrapper code.

Trying this in on one of the browser platforms like emscripten or PNaCl results in a freeze and after a little while the browser will kill your tab :(

The problem is that the browser won't "let you own the game loop", and this is a general problem of event- or callback-driven platforms (iOS and Android have the same problem for instance). On such platforms the execution flow of the main thread is not controlled by your game code, instead there's some outer event loop which will call into your code from time to time. If you spent too much time in your allotted slice of the pie you will drag the entire system event loop down and other important events (such as input events) can't be handled fast enough. Result is that the entire user interface will feel sluggish and unresponsive to the user (for instance, scrolling in your browser tab will stutter or even freeze for multiple seconds). And if you don't return for about 30 seconds, then the browser will kill your app (Aw Snap!).

This is all bad user experience of course, we want the browser to remain responsive, and scrolling smooth all the time, also during initialisation and load time.

The core problem is that your code must always return within a few milliseconds back to the browser (e.g. 16 or 33, depending on whether you're aiming for 60 or 30fps), and this is the big riddle we need to solve for a game application running in a browser.

For a Flash or Javascript coder, or someone who's mainly writing event-driven UI applications this will all be familiar, they are used to have all their code run inside event handlers and callbacks, but typical UI apps usually don't need to do anything continuous. Event-driven applications sleep most of the time, react to (mostly input-) events from the outside, and go to sleep again. But games need to do continuous rendering, and thus are frame-driven, not event-driven, and mixing these two programming models isn't a very good idea because its hard to follow the code-flow. The usual way to implement games on event-driven platforms is to setup a timer which calls a per-frame callback function many times per second. I think hacks like this is why game programmers have a deep hatred for UI-centric platforms (and why I still like Windows despite its other shortcomings, because the recommended event handling model in Windows for games (PeekMessage -> TranslateMessage -> DispatchMessage) actually lets you "own the game loop" in a very simple and elegant way through message polling).

There are a few different approaches to either get a true continuous game loop, or at least to create the illusion of a continuous game loop on platforms where polling isn't possible, mainly depending on whether "true" pthreads-style multi-threading is supported or not.

In a Nebula3/emscripten application this isn't the case, the actual game loop and the rendering code runs on the main thread. Reason for this is that emscripten's multithreading support is built on WebWorkers. pthreads emulation isn't possible in emscripten since WebWorkers can't share memory with the main thread, furthermore, WebWorkers can't call into WebGL. This puts a lot of restrictions on our "game loop problem", and it required to refactor Nebula3's application model: in all previous ports there was always a way to somehow run a continuous game loop, mostly by moving the game loop into its own thread, but we don't have this option in emscripten (yet ... but hopefully one day, with more flexible WebWorkers).

Traditionally, a Nebula3 application used to go through a simple "Open -> Run -> Close -> Exit" sequence. An N3 main file looked like this for instance:

Instead of a main() function, there's a NebulaMain() wrapper function and a macro called ImplementNebulaApplication(). These hide the fact that not all platforms have a standard main() (for a Windows application, one would typically use WinMain() for instance).

The actual system main function is hidden inside the ImplementNebulaApplication() macro, for a PC-like platform the macro code looks like this:

Now back up to the NebulaMain() function's content: the Application::Open() method could take a while to execute (couple of seconds, worst case), and the Application::Run() will contain the "infinite" game loop, which only returns when the application should quit.

Since this wasn't a very good fit for the emscripten platform (because of this "infinite" loop inside the Run() method), first step was to make the app entry even more abstract to give the platform-specific code more wiggle room:

The most obvious change is that there's only a single StartMainLoop() method instead of the Open->Run->Close->Exit sequence. And at closer inspection some strange stuff is going on here: The application object is now created on the heap, the pointer to the object lives in the global scope, and the app object is never deleted. WTF?!?

To understand what's going on we need to dive a bit deeper into the emscripten system API.

The StartMainLoop function is actually only a one-liner on the emscripten platform:

emscripten_set_main_loop(OnPhasedFrame, 0, 0);

This sets the per-frame callback (called OnPhasedFrame) which the browser runtime will call regularly, and we'll have to do everything inside this callback function. The first 0-arg is the intended callback frequency per second (e.g. 60). 0 has a special meaning: in this case emscripten is using the modern requestAnimationFrame mechanism to call our per-frame function (instead of of the old-school setInterval or setTimeout way). The second argument is called simulateInfiniteLoop, and to understand what this does it is first necessary to understand what happens when it is not used:

The emscripten_set_main_loop() function will simply return, all the way up to main(), which will also return right after it has started! WTF indeed...

In a normal C program, returning from the main() function means that the program is shutting down of course. Local-scope objects will be destroyed before leaving main(), then global-scope objects (static initialisers).

In emscripten's case, a program which has called emscripten_set_main_loop() continues to run after main() has returned. This is a bit of a strange design decision, but makes for familiar looking code (e.g. hello_world.cpp is the same as on any other platform). Objects in the global scope will continue to exist in emscripten after main() returns, but objects in the local scope of main() will be destroyed, thus this strange way to create our application object, to prevent the app object from being destroyed after main() is left:

static MyApplication* app = new MyApplication();

And now back to that simulate_infinite_loop argument: This is a new argument which was introduced after I started the Nebula3 emscripten port. Setting this argument to 1 will cause the emscripten_set_main_loop() function to not return to the caller, instead a Javascript exception will be thrown which essentially means that execution bails out of the C/C++ code without unwinding the (C/C++) stack, thus leaving local-scope objects of the main() function alive, everything after emscripten_set_main_loop() will never be called. So with this fix we could just as well write:

So this basically covered emscripten's application startup process, we now have a per-frame function (called OnPhasedFrame) which will be called back at 60 fps. We just need to cram everything the application has to do into these 1/60sec time slices. This is fine for the actual game loop after everything has been loaded and initialised, but can be a problem for stuff like loading a new level, which could take a couple of seconds. In a traditional game, worst thing that could happen in this case is that the loading screen animation (if there is any) may stutter, but in a browser environment, such pauses will affect the entire browser tab (freezing, no scrolling, etc...), which makes a very bad first impression to the user.

So what to do? For Nebula3 I created a new Application base class called "PhasedApplication". Such a phased application goes through different life time phases (== states), such as:

Initial -> app has just become alive
Preloading -> currently preloading data
Opening -> currently initializing
Running -> currently running the game loop
Closing -> currently shutting down
Quit -> shutting down has finished

Each of these phases (or states) has an associated per-frame callback method (OnInitial, OnPreloading, OnOpening, etc...). The central per-frame callback will simply call into one of those methods based on the current phase/state. Each phase method invocation must return quickly (the browser's responsiveness depends on this), and may be called many times until the next phase is activated. So instead of doing a lot of stuff in a single frame, we do many small things across many frames.

Best example to illustrate this is the OnOpening() method. Suppose we need to do a lot of initialisation work during the apps Opening phase. Files need to be loaded, subsystems must be initialised and so on. This may take a couple of seconds. But the rule is that we must ideally return within 1/60sec, and we also don't have an independent render thread which could hide the main-thread freeze behind a smooth loading animation. So we need to do just a little bit of initialisation work, possibly update rendering of the loading screen, and return to the browser runtime. But since we haven't switched to the next state yet, OnOpening() will be called back again, and we can do the next piece of initialisation work. Sounds awkward of course, and it is, but there's not a lot we can do about it.

A new Javascript concept called generators could help to clean up this mess, with these it should be possible to chop a long sequence of actions into small slices while leaving the function context intact (essentially like a yield() function in a cooperative multithreading system) - catapulting Javascript into the illustrious company of Windows1.x and Classic MacOS. But enough with the ranting ;)

A somewhat cleaner method for long initialisation work is starting asynchronous actions through a WebWorker job in the first call to OnOpening() and during the next OnOpening calls check for all of those actions to have finished, gather the results, and finally switch to the next state, which would be Running. In the worst case, initialisation code must literally be chopped into little slices running on the main thread.

So that's it for this blog post. Originally I wanted to compare emscripten's and PNaCl startup process, but this would be way too much text for a single posts, so next will very likely be a similar walk through of the PNaCl application start, and after that the next big topic: how to handle asset loading.

26 Aug 2013

I recently ported Nebula3 to Google's PNaCl. Main motivation was that I wanted to see how it compares to asm.js both for performance and "ease of use". This was basically a drive-by port, I didn't want to put too much effort into it. Thankfully I had old NaCl code lying around which I could reuse and after 2 or 3 afternoons (and some WTF-moments) I had a pretty clean port running which I'm planning to keep updated into the foreseeable future.

The big news about PNaCl is that deployment no longer has to go through the Chrome Web Store, instead it is now finally possible to host PNaCl applications from any URL.

You can check out the Nebula3 PNaCl demos here: http://www.flohofwoe.net/demos.html. Just make sure you're running the latest Google Chrome Canary, and if an error pops up that PNaCl isn't enabled, just restart Chrome, and wait a little bit. First start can take up to one minute, since PNaCl support is installed on demand which is a multi-MByte download.

Over the next few weeks I'm intending to write up a little series of blog posts comparing the PNaCl and emscripten Nebula3 ports. From a coder's perspective, the two systems are actually fairly close when seen from high above.

As a "pragmatic programmer", I don't really care about the political side. Both asm.js and PNaCl had to take a lot of flak from web purists. The only thing that counts to me is that both technologies provide a seamless software distribution channel directly from the coder to the user. No app shops, gate-keepers, code-signing-certificates or approval processes inbetween.

The Build System

First step is of course to get the SDKs. Both emscripten and PNaCl offer a GCC-style cross-compiling toolchain based on Clang-LLVM. Quick disclaimer: I'm running on OSX, haven't looked at the Windows side of things yet.

The emscripten SDK is simply installed and updated through a github repository. There's a stable master branch, and a bleeding-edge incoming branch. emscripten requires a couple of external tools, most notably Clang-LLVM, python and node.js. Even though clang is the standard compiler on OSX I installed a separate version because emscripten required a newer version then was installed on OSX 10.7. Paths to external tools must be provided through a .emscripten config file in your home dir.

The NaCl SDK is a normal download-archive which should be unzipped to a nacl_sdk directory in your home directory. This download only contains a script file called "naclsdk" which takes care of downloading and updating the actual SDK files in the future. The NaCl SDK contains versioned bundles, each of which is actually a complete SDK in itself, with tools, headers, libraries and examples. This is the same philosophy as the DirectX SDKs. You pick a version to work with and decide yourself when to switch to a newer version, this guarantees you a stable API, and gives the dev team the freedom to change APIs in new versions without breaking code compiled against older versions.

One challenge about the NaCl SDK is to find the right compiler tools and runtime libs since there are so many choices. The "classic" CPU-specific NaCl had different toolchains for ARM and Intel CPU architectures, and two different C runtime libs to choose from: newlib or glibc.

PNaCl is much simpler though: there are no longer different target CPU architectures since PNaCl executables are essentially LLVM bitcode, and the only available C runtime lib is newlib (which is the better choice anyway, since it is much slimmer then glibc).

In Nebula3 I'm using cmake to generate build files for different target platforms and build systems / IDEs. For each platform, you build a so called toolchain file which contains paths to the cross-compiling tools, search paths to headers and libraries, and compiler/linker settings.

Writing such a toolchain file can be a bit of guess work, but there are examples flying around the net, also emscripten comes with sample cmake toolchain files which might be helpful as a starting point.

Here are a couple of tips which might save you a some trouble:

don't set "ld" as the linker tool, in both toolchains the normal compiler tool also serves as linker (in emscripten this is emcc, in PNaCl use pnacl-clang++

PNaCl requires an additional post-build-step after linking, called pnacl-finalize, cmake has the add_custom_command macro for this

To properly separate the different build files I have a directory structure like this:

The -G option is the cmake "generator", we're telling cmake here that we want Eclipse project files using the ninja build tool (ninja is a more modern make alternative). *-DCMAKE_BUILD_TYPE* sets the AsmJS build config (cmake lets us define any number of custom configs, commonly just Release and Debug but in emscripten I have defined an extra AsmJS config), then -DNEBULA_PLATFORM=EMSCRIPTEN is one of our own custom symbol definitions, this simply tells our cmake files, that we're building for the emscripten target platform (actually this is redundant, a better place for this definition would be the toolchain file). Next we tell cmake which toolchain file to use, and finally where the source code is located (or more specifically: where to find the root CMakeLists.txt file - CMakeLists.txt files tell cmake what targets to build, and from what sources).

When cmake has run, we could import the generated project into Eclipse, or we can just run ninja from the command line:

Writing a proper cmake based build environment can be a lot of work, but it is definitely worth it. Managing a multi-platform build environment across Linux, OSX and Windows and probably several game consoles, spanning different IDEs like Visual Studio, Xcode and Eclipse would be a nightmare without a meta-build-tool like cmake.

Deployment

Big jump here, but no worries, I'll deal with all the inbetween-stuff in the following blog posts.

The common thing between emscripten and PNaCl when deploying is that the generated files are embedded into a web page, and thus can be easily integrated into existing web site build- and deployment-processes.

The details are a little bit different between the two though:

An emscripten "executable" is either a .js file or a complete HTML page (the so called shell page) which embeds the generated Javascript code. The emscripten linker looks at the output file extension to decide whether it should generate a .js or .html file. Emscripten comes with a default html shell file which should be used as starting point for a customised web page.

Integrating emscripten generated code into a web page is just the same as integrating any piece of complex Javascript code. Since emscripten-generated code is just Javascript, it is also very easy to interact with the rest of the page through direct JS function calls.

PNaCl on the other hand integrates like a plugin into the HTML page using the embed element:

Instead of the .pexe file, a .nmf manifest file is given to the embed element which contains the name of the .pexe file (this manifest file used to look more interesting in classic NaCl since it contained one entry for each target cpu architecture, but for PNaCl there's only one useful piece of information):

Finally, the type="application/x-pnacl" attribute is important for Chrome to recognise the embed element as a PNaCl application.

Interaction between a PNaCl application and the surrounding web page works through the Javascript messaging system. To get events from the PNaCl application, just add event listeners to the embed element:

6 Jul 2013

This old blog post about the Nebula3 Application Layer is the 3rd-popular-post on my blog, very likely because it was linked from Stack Overflow. I always wanted to write a followup to this post, because if I would design such a system again, it would look quite differently today.

First a quick recap of the original system:

the original system consists of the following classes:

Entity: a container for Properties and Attributes, can receive Messages which are distributed to its Properties

Property: attached to an Entity, implements some part of the entities "game logic", receives and processes messages

Message: a small object which is sent to an Entity and distributed to Properties which may handle them

Attribute: key/value pairs attached to entities

Manager: singletons which implement global game logic

the only pre-defined Manager is the EntityManager which is a container for Entities, and allows to query for entities

Entities and Properties have several per-frame callbacks and are called back by the EntityManager

the motivation behind this system:

to have a simple, extensible high-level framework for implementing game-play logic

fix extension-through-inheritance problems through composition

and the problems of the original system:

poor spatial locality: Entities, Properties and Messages are isolated heap objects and can be spread all over the address space in the worst case

high cost for creation and destruction: all objects are dynamically allocated, this is especially a problem for Messages, there may be thousands of Messages created and destroyed per frame

high cost for settings/getting Attributes: setting or getting an attribute value involves a O(log2 n) lookup

high overhead for on-frame callbacks: the EntityManager calls several callbacks every frame on each entity, with many entities the call-overhead is non-trivial

reliance on virtual methods: almost all public methods in properties are virtual, because the message handler and callback methods are implemented in a Property base class, with specialised properties as subclasses

In the old single-player Drakensang games we had up to two-thousand game entities in some bigger maps, and we ran into real performance problems because the entity system is so heavy-weight.

So here's how I would implement a similar system today, keep in mind that this is just a "Gedankenexperiment", and I will make up some stuff while I type (but most of it has been lingering in the back of my head for quite a while now).

The main goals are to improve performance by making the system less dynamic, reduce memory fragmentation and reduce message-passing and object creation overhead.

Here we go:

1. Move all the interesting code into separate subsystems

In the original entity system, Managers and Properties would often implement actual game logic, and could become big, complicated and unwieldy.

The new entity system would only be minimal glue code between (ideally autonomous) subsystems, each with a Facade singleton as its main public interface. Such subsystems could be: rendering, AI, physics, audio, and also anything else what makes up the game. The last point is important: Even when already using such autonomous subsystems for low-level stuff like rendering or audio it is tempting to write the actual game logic "along the way" inside Properties without separating it into additional "game logic subsystems", which is guaranteed to soon end in an unmaintainable mess.

Ideally, each of the autonomous subsystems can live (and be tested) on its own, and will not interact with other subsystems (the physics world must not know about the rendering world or the audio world and so on).

One of the main jobs of the entity-component-system is to control and coordinate the data flow between those autonomous subsystems, it glues the subsystems together (e.g. getting the desired motion from the AI/navigation system into the physics system, and getting position updates from the physics system into the rendering system).

The other job is to provide different types of game objects (for instance different unit types in a strategy games) by combining small, reusable Component objects which implement different aspects/behaviours of the game logic.

The important thing to keep in mind is that all the classes of the new entity system will only provide a slim layer of glue between subsystem which contain all the meaty stuff.

What's in the new entity system

Properties will now be called Components, but their role will be the same. Managers and Attributes will go away (reasons are detailed below). Entities and Messages will keep their names and roles.

Fixing the Spatial Locality and Cost of Creation

Entities and Components would be created from pre-allocated object pools. Live Entities and Components would ideally be located next to each other without big memory holes inbetween. As public handle to an Entity I would probably use an EntityID instead of a (smart) pointer, the EntityID would be a 32-bit integer, some bits used as index into the entity pool, and some bits as a unique wrap-around-counter to prevent that an old Id points to a recycled object in the pool.

Entities and Components

An Entity would be a template class which must be partly implemented by the game programmer tailored to his project. The max number of Components the Entity can hold is a template parameter. There's a private C array of raw pointers to Components contained inside the Entity class, and programmer-provided template-methods to gain safe access to those Component objects.

An example: let's say the components-access template method would be called Component(), then invoking a method "SetTransform()" on a component "Location" would look like this:

entity->Component<Location>()->SetTransform(m);

Hmm, this looks mighty ugly though... The advantage is that the Component<> method will resolve to a simple inlined pointer indirection, which is as cheap as it gets. But I will have to think of some nicer looking code...

Attributes

Attributes will very likely go away completely because the cost for setting/getting is too high (this involved a binary search). Instead entity state will be exposed through simple inline getter methods in Component classes. There are not setter methods, because direct, unchecked manipulation of internal entity state by an "outsider" would be too dangerous. Manipulating an entity is exclusively done by sending messages to the entity.

There must still be a more dynamic, generalised way to initialise and manipulate an entity (this was a nice side-effect of the general attribute-system), for instance to implement persistence or communicate with remote applications (like a level editor). For this, some general serialisation mechanism to and from a simple binary stream must be implemented.

The Entity Registry

This would be a singleton used as factory and container of entities (basically the facade of the entity system). It would allow creation of entities, resolve an EntityID into a pointer, probably lookup entities by name (if having human-readable entity names makes sense at all), and sending messages to entities. This would be similar to the old EntityManager, but it would not call any per-frame methods on entities (it would be desirable if the new entity-system wouldn't any type of per-frame-tick at all).

Components and Messages

Sending a message to an entity should not involve creating a message object, instead a message is just a simple, short-lived stream of plain-old-data bytes in some hidden memory buffer. There will be a unique message type identifier, which is a simple 32-bit integer value (or maybe an enum) at the front of the byte stream.

Messages are processed by Component objects, which can subscribe to specific message types at the central EntityRegistry by associating a message type with a handler method:

entityRegistry->Subscribe(msgType, componentType, methodPtr);

A message is sent to one or more entities through the central EntityRegistry by calling one of several "PushMsg" template methods which accept a variable number of arguments. Each combination of arg types will resolve to a template specialisation under the hood. The advantage is again, that none of this involves expensive "dynamic" code, each specific message signature will resolve to a piece of code which is very likely inlined and just consists of writing values to memory:

entityRegistry->PushMsg(entityId, msgType, arg0, ...);

This will write the args to an internal memory area (with proper alignment), and and call the handler method of subscribers, which will be provided with some sort of pointer to the start of the arguments, read/decode the arguments and perform some action with them. The disadvantage here is that there's no type-safety for the message arguments. If the caller and handlers don't agree about the order and types of arguments bad things will happen at run time, so it might still be better to use simple message classes instead of multiple typed arguments:

MyMsg msg(x, y, z);
entityRegistry->PushMsg(entityId, msg);

This would have the overhead of an extra object created on the stack (still better then on the heap), and would involve defining dozens or hundreds of message classes which would only consist of setters and getters, this should be a job for a code generator (we have something similar already called NIDL files, which are used to generate C++ message classes from a simple XML description). The advantage is type-safety and automatic agreement between sender and handler about the message arguments, plus the message class constructor can setup default argument values.

The default PushMsg() method will probably call the subscribers immediately. It might be desirable to also have deferred message handling, where the sender defines a time in the future when the message should be handled. It might also be possible to use this mechanism to send messages between remote objects across threads, processes and physical machines, but this might go a bit too far.

What about the Managers?

Managers don't really have a place in the new entity-system. Their role is taken over by the Facade singletons of the autonomous subsystems.

Conclusion

I think the original ideas behind the Nebula3 Application Layer as a flexible Entity-Component-System still make a lot of sense for a high level game framework, but today I look at the original implementation as too "heavy-weight" both in design and implementation. If I were to rewrite the system (and I'm tempted, but other stuff has higher priority) I would start as described here. What the end-result would look like is on another page, I tend to restart such systems from scratch several times if the code "doesn't look right" :)

21 Jun 2013

TL;DR: An attempt to outline the 'good parts' of C++ from my experience of porting Nebula3 to various platforms over the years. Some of it controversial.

Update: some explanation why STL and C++11 is currently "forbidden", see below!

C++

...is relatively famous for how easy it is to shoot yourself in the foot in many interesting ways. The types of bugs which are simply impossible in other languages is legion.
So then, why is C++ so damn popular in game development? One of the most important reasons (IMHO) is that C++ allows to write very high-level and very low-level code. If needed, you can have full control over the memory layout, when and how dynamic memory is allocated and freed, and how exactly memory is accessed. At the same time you can write very clean and high-level code with the right framework and don't care about memory management at all.
Especially the significance of low-level programming, e.g. controlling the exact memory layout of your data is often ignored by other, higher level languages, even though it can have a dramatic effect on performance.
One of the most common C++ newbie errors is to tackle a big software project without a proper high-level "toolbox". C++ doesn't come with a luxurious standard framework like all those fancy-pancy modern languages.
And with only hello_world.cpp under their belt newbies quickly end up with this typical mess of object ownership problems, spaghetti-inheritance, seg-faults, memory leaks and lots of redundant code all over the place after just a few ten-thousand lines of code.
On the other hand, it is incredibly easy to write really slow code in a high-level environment since you don't really know (or need to care) what's going on under all those layers of convenience.
The most important rule when diving into C++ is: Know when to write high-level and when to write low-level code, these are completely different beasts!
So what's the difference between high-level and low-level C++ code? I think there's no clear-cut separation line, but a good rule of thumb is: if it needs to run a few thousand times per frame, it better be really well optimised low-level code!

If you look at a typical rendering pipeline, there's this typical cascade where every stage in the pipeline is executed at least an order of magnitude more often then the previous one: outer-most there's stuff that happens only once per frame, next code is executed once per graphics object, then once per bone/joint, then per vertex, and finally per pixel. The realm of low-level code starts somewhere between per-object and per-bone (IMHO).

Typical high-level code to me is "game play logic". This is also were thinking object-oriented still makes the most sense (as opposed to a more data-oriented approach). You have a couple of "game objects" which need to interact with each other in fairly complex ways. On this level you don't want to think about object ownership or memory layout, and high-level concepts like events, delegates, properties etc... start to make sense. Shit starts to hit the fan when you have thousands of such game objects.

It is of course desirable to get the performance advantages of low-level code combined with the simplicity and convenience of high-level code. This is basically the holy grail of games programming. Hiding complex or complicated code under simple interfaces is a good start.

Ok, so before I drift completely into the metaphysical, here's a simple check-list:

Forbidden C++:

This stuff is completely forbidden in our coding-style:

exceptions

RTTI

STL

multiple inheritance

iostream

C++11

That's right, we're not using C++ exceptions, RTTI, multiple inheritance or the STL. C++11 is pretty cool, but still too fresh. Most of these restrictions will make your multiplatform-life a lot easier (and not much of importance is lost IMHO).

Update: I should have explained why the STL and C++11 is on this list. First the STL: Historically the STL came with a lot of problems because quality differed between compilers a lot, porting to non-PC platforms was difficult if your code depended on STL, and I am reluctant to include more complex dependencies into the engine (like boost for example). Today STL implementations are much better, so on most platforms this is probably no longer an issue.

Personally, I think the STL is an ugly library, *at least* the container classes. You'll have to admire its orthogonality and flexibility, but in reality one project ever only needs 3 or 4 specialisations. What we did was write a handful of container classes (Array, Dictionary, Queue, Stack, List) in the spirit of C#'s container classes (those are probably not as flexible as STL conteiners, but they do look nicer, and the generated code should be the same in most cases). Beautiful looking source code is important I think. This may all change with C++11 though. C++11 is extremely cool, but I think it is too early still to jump on if we need cover a lot of platforms. But C++11 together with the STL is much more powerful then those two alone, so I will very like revert my stance on STL once we switch to C++11.

But I think this switch should be done throughout the entire engine (starting at the core with the new move semantics which are really useful for containers, to the new threading support, lambdas, function objects and so on), so switching to C++11 will involve a major rewrite of Nebula3, maybe even justify a major version number switch. I think it doesn't make sense to sprinkle bits and pieces of C++11 and STL here and there into the code

Tolerated C++:

Use with care, don't go crazy:

templates

operator overloading

new/delete

virtual methods

Templates are very powerful, they can make your code both more readable, AND faster because more type information is known at compile time. But you really need to keep an eye on the generated code size. Don't nest them too deeply, and keep it simple.Operator overloading is restricted to very few places (containers and items in containers). We're NOT having operator overloading in our math library. dot(vec,vec) is much more readable then vec*vec.Not using new/delete in C++ code sounds a bit crazy, I know. But most of the time where you need to create an object on the heap you'll also want to hand its pointer somewhere else, which quickly introduces ownership problems. That's why we're using smart pointers to heap objects which hide the delete call. And since a new without its delete looks a bit silly, we're also hiding the new behind a static Create() method. It's better to avoid heap objects altogether though, especially in low-level code.Virtual methods are important of course, BUT: Just spend a second to think about whether a method really must be virtual (or more importantly: do you really need run-time polymorphism, or is compile-time polymorphism enough?). The more "static" your code is, the more optimisation options the compiler has.

Forbidden C:

Some unusual stuff here as well:

all CRT functions like fopen() or strcmp() are forbidden, except the math.h functions

directly calling malloc()/free() is forbidden

Most of the CRT functions are straight out terrible (strpbrk, strtok, ...) and/or dangerous (strcpy), so we're wrapping them all away and/or use better platform-specific functions under the hood (this can also reduce executable size, which is always good).
Overriding malloc/free with central wrapper functions is really useful once you need to do memory-debugging and -profiling, also makes it easier to try out different memory allocator libs.

Tolerated C:

Some "dangerous" stuff is only allowed in performance-critical low-level code:

raw pointers and pointer arithmetics

raw C arrays

raw memory buffers

These are all recipes for disaster in the hands of an unexperienced programmer (or an experienced programmer who needs to juggle too many things in his head). Instead of pointers, use smart-pointers to refcounted objects (see above), or indices into containers. Instead of raw arrays use containers. Never directly allocate and access memory buffers in high-level code.
All of these "dangerous techniques" are essential for really performance-critical low-level code though, but this is only at a handful places in the code, and when the really mysterious kind of crashes happen, at least you know where to look.

The End

One last point: our code is riddled with asserts which are also enabled in release mode (hardly makes a performance difference, but the uncompressed executable size is up to 20% larger because of the expression strings, thankfully those strings compress very well).
The essential, must-have assert checks are for invalid smart pointer accesses (null pointers), boundary checks in container classes and checking for valid method parameters.
With all of the above, we're rarely ever hitting a seg-fault (maybe twice a year on the server-side). If something breaks, then it is very likely an assertion check which got hit, and this is usually very easy to post-mortem-debug since it comes with a call-stack and method signature.

4 May 2013

I have removed the non-asm.js demos. Since the asm.js code generation in emscripten is now always faster then the "traditional" code generation, it doesn't make sense to have the non-asm.js code around. I'll keep support for the old code-generation in my build pipeline for now, to be able to run comparisons between the new and old code from time to time though.

The demos are now compiled with link-time-optimization enabled. Previously this had subtle and hard to debug code generation problems, but it looks like this is fixed now (fingers crossed). Performance or code size doesn't seem to be different that much however.

Demos have been recompiled with the latest emscripten incoming branch.

I added experimental support for uncompressed textures if the WebGL implementation doesn't support DXT textures (e.g. mobile platforms). This will decompress textures on the fly after download. For now this is just a workaround/hack and hasn't been tested that much. Also, since uncompressed textures are 4..8x bigger, this isn't really useful for complex games.

22 Mar 2013

I recently realized that I have spent much more time with emscripten then any other "weekend project" so far. At least the emscripten-based demos became the most advanced on any of my spare-time coding platforms in the past 2 years like iOS, Android, Google Native Client, flascc.

I think it comes down to "open, free and painless", for spare-time projects these are all extremely important points. I want to spend my free time with stuff that is fun.

Let's look why the other stuff isn't as much fun:

iOS: The tools you need for development are all free, XCode is a very slick IDE to work in, and unlike VisualStudio there's no artificial distinction between a (feature-cut) free and a (pricy) professional version. So far so good. The pain starts when you want to run your code on your actual iOS device. Welcome to provisioning profile hell. First you need to hand over $99 per year for the privilege to run your own code on you own hardware, but that's the least of it. Next you need to create "provisioning profiles" on Apple's developer portal, registering each team member, device and application and set up who may do what. In the end you essentially get per-app/per-device code-signing-certificates which expire every three months. So all the iOS demos which I did 2 years ago don't work anymore unless I go through all that hell again. Nope.

Android: Android C++ development sucks, plain and simple. It's a pain in the ass to set up (it's less painful if you use nVidias ready-made installer), remote debugging a native app is so slow it's essentially useless, and you can't use the cool new stuff since most of the world is still running an Android version from the stone age. To be fair, this was all 1.5 years ago, but I have little motivation to waste further weekends in finding out whether things have improved since then ;)

Google Native Client: The main reasons why I stopped dabbling with Native Client is that it is still not opened up (only works with Chrome Web Store bundled apps), and pNaCl seems to take forever to be finished. To be fair, Native Client has very good middleware support (like FMOD or RakNet), but it doesn't look like it will ever be implemented outside of Chrome.

flascc: I played around with flascc for a weekend or two, 2 main reasons why it didn't set my heart on fire: (1) Compiling/linking is extra-ordinarily slow AND/OR uses infinite amounts of RAM. For reasonably big code bases (like Nebula3) it's unusable because my 4GB Mac simply ran out of memory. (2) since working with flascc is so damn slow I wasn't motivated to actually go on with writing a Stage3D wrapper for N3's rendering layer.

So all in all, emscripten is the most frictionless way to write and and actually publish 3D demos for me. I can host the demos wherever I want, update them without a certification or signing process getting in the way, the demos won't expire, they are automatically multi-platform and finally, there's no vendor or platform lock-in. Most of the code I'm writing is platform-agnostic C++ and will compile and run anywhere, and the host platform's "API foot print" is minimal: a subset of POSIX and OpenGL, which will also compile almost anywhere else with minimal changes.

18 Mar 2013

Update 3: I replaced SQLite with a TableData addon, this reduces the map-viewer-demo size from 8 MB down to 5 MB (uncompressed), and reduces startup time dramatically.Update 2: Demos should now properly work on all WebGL configs again (which support DXT textures to be exact). I've been using more then 254 vertex shader uniforms, and at least ANGLE restrict this number even if the GPU could actually handle a lot more).Update: Demos don't work on Windows and some other configs since one of the new GLSL shaders doesn't compile. Tested configs are: OSX 10.7.5 with GeForce 9400M, Intel HD3000, HD4000 and Radeon HD 6770M. Fix is coming later today.

Finally a new demo update! If you're a Chrome user, please be aware that you need to run these demos in the very latest Chrome Canary (Version 27.0.1444.3 canary) since this contains a bugfix in the V8 Javascript engine (details are here: https://code.google.com/p/chromium/issues/detail?id=177883). This bug was also the reason why I held back updates for so long, I couldn't overwrite the version which reproduces this bug, but I also didn't feel like setting up yet another AppEngine project.

The DSO map viewer demo is now much closer to the actual map renderer of the Drakensang Online client:

The ground-decals system has been moved over which helps a lot in hiding the tiling structure of the level. The rendering pipeline now includes posteffects like bloom and color-balancing. You're now controlling a "player character", and I added a few more "NPCs" to the map in order to check performance with a couple of characters on screen.

All demos now come in 2 flavours: "regular" and "asm.js".

ASM.JS is a Mozilla project to define a small subset of Javascript which can be exceptionally well optimized. More about that here: http://asmjs.org/

And I identified the long pause at the start of the map viewer demo, originally I thought this would be caused by generating the collision mesh, which is built at startup from tens-of-thousands of very small mesh fragments, but surprisingly this is extremely fast. The pause is actually caused by parsing the structure of an SQLite database file and reading many small items from the database. Replacing this with a more efficient "table data" subsystem is the next thing on my weekend todo list. The SQLite stuff is really a left-over from the single-player Drakensangs where the world-state was loaded from and written back to SQLite database files.

10 Feb 2013

Weekend was kinda semi-successful as far as coding is concerned. I tried various ways to reduce GL calls further, and was able to reduce the number of GL calls by about 25%: from about 4100 down to about 3000 in the initial screen of the Drakensang Online map viewer demo. Although this sounds pretty good, I'm a bit disappointed because I was hoping that bundling vertex data chunks into big vertex buffer would have a bigger effect:

- Bundling vertex data into big vertex buffers cut the number of glVertexAttribPointer() calls by almost half from about 950 down to about 500. With the GL_vertex_array_object extension however, I could save double the GL calls for "free" (so the demo would be down to 3100 GL calls without any additional optimizations), and the savings would be more consistent (right now it depends a lot on the order of draw calls). The bundling added *a lot* of complex code, so it's probably not really worth it, since at least Chrome already supports OES_vertex_array_object in WebGL, so it would make more sense to support that.

- All the rest was gained by simply filtering redundant texture updates (glActiveTexture, glBindTexture, glUniform1i). This was a big win for very little code, but this also varies with the actual textures applied to the objects. Fewer shared textures means more updates.

I also tried to generally filter redundant shader uniform updates, but with little effect. Apart from the texture updates, an entire frame had less then 10 redundant uniform updates, so not worth it.

I'll give the GL call optimization a little rest for now and concentrate on adding features. There's still some untapped potential in grouping transform matrix updates into arrays, and by better sorting inside batches. But right now I've had enough ;)

23 Jan 2013

The Nebula3/emscripten demos (http://n3emscripten.appspot.com) had a serious performance problem on Macs with Radeon GPUs in the instancing demos. Problem was that my pseudo-instancing code used an additional vertex-buffer with 1-dimensional ubyte vertex components as fake InstanceIds. This worked fine on nVidia and Intel GPU, but triggered a horrible slow-path in the OSX Radeon driver. After replacing this with ubyte4 components everything worked fine on Radeons, but I wasn't happy that the InstanceId buffer would now be 4 times as large, with 3/4 of the the size dead weight. Then today in the train from Hamburg back to Berlin the embarrassingly obvious solution occured to me to stash the InstanceId in the unused w-component of the vertex normals. These are in packed ubyte4 format, with the last byte unused. And with this simple fix I could get rid of the second vertex buffer completely and actually throw away most of the pseudo-instancing code. Win-Win!

And now on to the actual issue: I didn't really pay attention to the code path which is used if the GL vertex array object extenion isn't available, and I was shocked when I discovered that the dsomapviewer demo performs 7000 GL calls per frame (not draw-calls, but all types of GL calls), and then I was astonished that Javascript+WebGL crunches through those 7k calls without a problem even on my puny laptop. But something had to be done about that of course.

OpenGL / WebGL without extensions is very verbose even compared to Direct3D9. To prepare the geometry for rendering, you need to bind an vertex buffer (or several), bind an index buffer, and for each vertex component call glEnableVertexAttribArray() and glVertexAttribPointer(), aaaand each unused vertex attribute must be disabled with glDisableVertexAttribArray(). Depending on the max number of vertex attributes supported in the engine, this can add up to dozens of calls just to switch geometry. And whenever a different vertex buffer is bound, at least the glVertexAttribPointer() functions must be called again and if the vertex specification has changed, vertex attribute arrays must be enabled or disabled accordingly.

With the vertex array object extension all of this can be combined into a single call.

This particular part of defining the vertex layout is by far the least elegant area of the OpenGL spec, and even the vertex array object stuff could be nicer. To me it doesn't make a lot of sense to include the buffer binding in the vertex attribute state, keeping the buffer separate from the vertex layout would make more sense IMHO. But enough with the ranting.

Other high-frequency calls are the glUniformXXX() functions to update shader variables, and the whole process of assigning textures to shaders. Un-extended WebGL doesn't provide functions to bundle these static shader updates into some sort of buffers.

These types of high-frequency calls is exactly what we don't want in Javascript and WebGL. In a native OpenGL app, these calls are usually extremely cheap, so it doesn't matter that much. But when calling a WebGL function from emscripten, there's quite a lot of overhead (at least compared to a native GL app). First, emscripten maintains some lookup tables to associate numeric GL ids with Javascript objects. Then the WebGL JS functions are called, in Chrome, these calls are serialized into a command buffer which is transferred to another process, in this GPU process the commands are unpacked, validated, and the actual GL function is called. But it doesn't end there. On Windows, the ANGLE wrapper translates the OpenGL calls to Direct3D9 calls. So what's an extremely cheap GL call in a native app, comes with some serious overhead in a WebGL app. Considering all this it is really mind-blowing that WebGL is still so fast!

All this means though, that it really makes a lot of sense to filter redundant GL calls, especially in a WebGL application, and every GL extension which helps to reduce the number of API calls is many times more valuable under WebGL!

So my mission in the train from Berlin to Hamburg and back today was to filter out those redundant GL calls.

First I wanted to know what calls are actually the problem. The OSX OpenGL Profiler tool can help with this. It records a trace of all OpenGL calls, can create a quick stat of the most-called functions, and the sequence of calls with their arguments reveals which calls suffer most from redundancy.

Which are in the dsomapviewer demo: glEnableVertexArray(), glDisableVertexArray(), glBindBuffer() and glUseProgram().

All in all the results where encouraging: per-frame GL calls dropped from 7k down to 4k. In comparison: when using the vertex array object extension the number of GL calls goes down to about 3k.

This could be improved even more by reducing the number of vertex buffers, and bundling the vertex data of many graphics objects into one or few big vertex buffers, since then much fewer buffer binds and vertex attribute specification calls would be needed (at least if they occur in the right sequence). But for this I would either need the glDrawElementsBaseVertex() function, which is not available in WebGL, or I would need to fix-up lots of indices whenever vertex data is created or destroyed (but this would limit the size of one compound vertex buffer to 64k vertices, and limit the efficiency of the bundling, hmm...).

Anyway, to wrap this up, Chrome already exposes the OES_vertex_array_object extension, and an ANGLE_instanced_arrays extension seems to be on the way. Both should help a lot to reduce GL calls already. Then the only remaining problem is texture assignment and uniform updates in scenes with many different materials.

But I think before working on reducing GL calls even more I'll try to do something about then stuttering when new graphics assets are streamed in.