23 Jan 2013

The Nebula3/emscripten demos (http://n3emscripten.appspot.com) had a serious performance problem on Macs with Radeon GPUs in the instancing demos. Problem was that my pseudo-instancing code used an additional vertex-buffer with 1-dimensional ubyte vertex components as fake InstanceIds. This worked fine on nVidia and Intel GPU, but triggered a horrible slow-path in the OSX Radeon driver. After replacing this with ubyte4 components everything worked fine on Radeons, but I wasn't happy that the InstanceId buffer would now be 4 times as large, with 3/4 of the the size dead weight. Then today in the train from Hamburg back to Berlin the embarrassingly obvious solution occured to me to stash the InstanceId in the unused w-component of the vertex normals. These are in packed ubyte4 format, with the last byte unused. And with this simple fix I could get rid of the second vertex buffer completely and actually throw away most of the pseudo-instancing code. Win-Win!

And now on to the actual issue: I didn't really pay attention to the code path which is used if the GL vertex array object extenion isn't available, and I was shocked when I discovered that the dsomapviewer demo performs 7000 GL calls per frame (not draw-calls, but all types of GL calls), and then I was astonished that Javascript+WebGL crunches through those 7k calls without a problem even on my puny laptop. But something had to be done about that of course.

OpenGL / WebGL without extensions is very verbose even compared to Direct3D9. To prepare the geometry for rendering, you need to bind an vertex buffer (or several), bind an index buffer, and for each vertex component call glEnableVertexAttribArray() and glVertexAttribPointer(), aaaand each unused vertex attribute must be disabled with glDisableVertexAttribArray(). Depending on the max number of vertex attributes supported in the engine, this can add up to dozens of calls just to switch geometry. And whenever a different vertex buffer is bound, at least the glVertexAttribPointer() functions must be called again and if the vertex specification has changed, vertex attribute arrays must be enabled or disabled accordingly.

With the vertex array object extension all of this can be combined into a single call.

This particular part of defining the vertex layout is by far the least elegant area of the OpenGL spec, and even the vertex array object stuff could be nicer. To me it doesn't make a lot of sense to include the buffer binding in the vertex attribute state, keeping the buffer separate from the vertex layout would make more sense IMHO. But enough with the ranting.

Other high-frequency calls are the glUniformXXX() functions to update shader variables, and the whole process of assigning textures to shaders. Un-extended WebGL doesn't provide functions to bundle these static shader updates into some sort of buffers.

These types of high-frequency calls is exactly what we don't want in Javascript and WebGL. In a native OpenGL app, these calls are usually extremely cheap, so it doesn't matter that much. But when calling a WebGL function from emscripten, there's quite a lot of overhead (at least compared to a native GL app). First, emscripten maintains some lookup tables to associate numeric GL ids with Javascript objects. Then the WebGL JS functions are called, in Chrome, these calls are serialized into a command buffer which is transferred to another process, in this GPU process the commands are unpacked, validated, and the actual GL function is called. But it doesn't end there. On Windows, the ANGLE wrapper translates the OpenGL calls to Direct3D9 calls. So what's an extremely cheap GL call in a native app, comes with some serious overhead in a WebGL app. Considering all this it is really mind-blowing that WebGL is still so fast!

All this means though, that it really makes a lot of sense to filter redundant GL calls, especially in a WebGL application, and every GL extension which helps to reduce the number of API calls is many times more valuable under WebGL!

So my mission in the train from Berlin to Hamburg and back today was to filter out those redundant GL calls.

First I wanted to know what calls are actually the problem. The OSX OpenGL Profiler tool can help with this. It records a trace of all OpenGL calls, can create a quick stat of the most-called functions, and the sequence of calls with their arguments reveals which calls suffer most from redundancy.

Which are in the dsomapviewer demo: glEnableVertexArray(), glDisableVertexArray(), glBindBuffer() and glUseProgram().

All in all the results where encouraging: per-frame GL calls dropped from 7k down to 4k. In comparison: when using the vertex array object extension the number of GL calls goes down to about 3k.

This could be improved even more by reducing the number of vertex buffers, and bundling the vertex data of many graphics objects into one or few big vertex buffers, since then much fewer buffer binds and vertex attribute specification calls would be needed (at least if they occur in the right sequence). But for this I would either need the glDrawElementsBaseVertex() function, which is not available in WebGL, or I would need to fix-up lots of indices whenever vertex data is created or destroyed (but this would limit the size of one compound vertex buffer to 64k vertices, and limit the efficiency of the bundling, hmm...).

Anyway, to wrap this up, Chrome already exposes the OES_vertex_array_object extension, and an ANGLE_instanced_arrays extension seems to be on the way. Both should help a lot to reduce GL calls already. Then the only remaining problem is texture assignment and uniform updates in scenes with many different materials.

But I think before working on reducing GL calls even more I'll try to do something about then stuttering when new graphics assets are streamed in.

19 Jan 2013

Update 2: The OSX/Radeon performance problem should be fixed now. See here: http://flohofwoe.blogspot.de/2013/01/a-radeon-fix-and-more.htmlUpdate: Just found out that the demo runs incredibly slow on a 15"Mac when running on the discrete AMD Radeon HD 6770M chip (it's actually much faster on the integrated Intel HD 3000). This is both on Chrome and Firefox, reason unknown yet. So if you have one of these, note that the demo runs actually a lot smoother ;)

I did a very simple proof-of-concept Drakensang Online map viewer in Nebula3/emscripten (as always, Chrome or Firefox required), to see how JS+WebGL can deal with a close-to-real-world 3D scenario:

This is work in progress and I will spend more time with optimizations before moving on to the next demo.

You'll notice that there's still frame-rate-stuttering when moving around the map (with left-mouse-button + dragging). The bad type of stuttering is caused by asset loading which happens on demand when new graphics objects are pulled in as they enter the view volume. I don't know yet what causes the lighter stuttering when moving around in areas which are completed loaded. I need to do a detailed profiling session to figure out what's going on there exactly. The stuttering also happens (to a lesser extend) in the native OSX version of the demo. It's most likely the preparation and creation of OpenGL resources, like vertex buffer, index buffers and textures. I will need to figure out how to move more of the asset creation stuff out of the main thread.

The demo is also quite demanding on WebGL. Despite the pseudo-instancing which I implemented recently there's still a lot of OpenGL calls per frame. Support for the OES_vertex_array_object (Chrome already exposes this) and something like ARB_instanced_arrays would help a lot to reduce the number of GL calls drastically (the JS profiler currently shows the vertex array definition as the most expensive rendering-related code, followed by the matrix array uniform updates for the pseudo instancing code).

Finally I've added a new Nebula3 code module to this demo: the ODE-based physics and collision subsystem is now also running in emscripten (no changes were necessary), the demo sets up a static collide world at startup and uses this to perform stabbing checks under the mouse pointer. Unfortunately adding ODE almost doubled the size the of the generated Javascript code. This is another incentive to finally get rid of our (somewhat bloated) physics wrapper code and ODE, and build a new slim collision system, probably on top of the Bullet collision classes (we're mainly using the current physics wrapper for simple collision checks on a static collide world in the live version of Drakensang Online, so not much of value will be lost).

Also, originally I wanted to include SQLite into the demo, since additional map info is currently stored in an additional SQLite file (lighting information, player start position, etc...). But this didn't work out of the box because SQLite's file i/o code must be adopted.

This wouldn't be hard to fix, but I actually want to get rid of SQLite for a long time. SQLite was really useful as save-game system in the single player Drakensang games, but if you don't need to save game world changes back, a complete SQL implementation in the client is just overkill. So this is another good reason to finally get started with a nice and small TableData-subsystem in Nebula3.

The frame-stuttering is a tiny bit disheartening, but on the other hand this is to be expected when bringing a complex code base over to a new platform. Most important right now is to really know what's going on, so I will probably spend some time adding profiling code and do some performance analysis next - together with text rendering to get some continuous debug statistics output on screen.

13 Jan 2013

Multithreading in emscripten is different from what us C/C++ coders are used to. There is no concept of threads with shared memory state in Javascript, so emscripten can't simply offer a pthreads wrapper like NaCl does. Instead it uses HTML5 WebWorkers and a highlevel message-passing API to spread work across several CPU cores.

You basically pass a memory buffer over to the worker thread as input data, the worker thread does its processing and passes a memory buffer with the result data back to the main thread.

The downsides are (1) you can't simply port your existing multi-threaded code over to emscripten, (2) it is (currently) somewhat expensive to pass data around since it involves copying, and (3) you cannot express all multithreading patterns in emscripten. The upside is though, that it's really hard to shoot yourself in the foot, since there's no shared state, and all the multithreading primitives you love to hate (like mutexes, semaphores, cond-vars, atomic-ops) simply don't exist.

Let's have a quick look at emscripten's worker API, only 4 API-functions and 2 user-provided functions are necessary:

worker_handle emscripten_create_worker(const char* url);

This create a new worker object, it takes the URL of a separate emscripten-generated Javascript file.

The worker file must export at least one C-function (name doesn't matter, but the function name must be explicitely exported using emscripten's new "-s EXPORTED_FUNCTIONS" switch so that it isn't removed by dead-code elimination. The worker function prototype looks like this:

This takes the worker handle returned by emscripten_create_worker(), the name of the worker function (in our case "dowork"), a pointer to and size of the input data, a completion callback function pointer, and finally a custom argument which is passed through to the completion callback to associate the completion call with the invocation call.

At some point after emscripten_call_worker() is called, the dowork-function will be called in the worker thread with a data pointer and size. Since the worker has its own address space, the actual pointer value will be different from the pointer value in the emscripten_call_worker call of course.

The worker function now uses this input data to compute a result, and (optionally) hands this result back to the main thread using this function:

void emscripten_worker_respond(char* data, int size);
The return-data will be copied inside the function, so if the worker function had allocated a result buffer it remains the owner of that buffer and is responsible to release it.

Finally, once the worker has finished, the completion callback will be called on the main thread with the result data, and the custom arg given in the emscripten_call_worker() call:

void completion_callback(char* data, int size, void* arg);

The callee does not gain ownership of the data buffer, thus it must read / copy the received data but not write to, or free the buffer.

Finally there's a function to destroy a worker:

void emscripten_destroy_worker(worker_handle worker);

As with threads, creating and destroying workers is not cheap, so you should create a couple of workers at the start of the application and keep them around, instead of creating and destroying workers repeatedly. It's also wise to batch as much work as possible per worker invocation to offset the call-overhead as much as possible (don't call a worker many times per frames, ideally only once), but this is all pretty much common sense.

The worker Javascript file must be created as a separate compilation unit, it's a bit like on the PS3 where the SPU code also must be compiled into small, complete "SPU executables". To keep the code size small I decided to keep the runtime environment in the worker scripts as slim as possible, there's no complete Nebula3 environment, only a minimal C runtime environment. But this is not a limitation of emscripten, only a decision on my part. Most of the time the workers will contain simple math code which loops over arrays of data instead of high-level object-oriented code. To avoid downloading redundant code it might also make sense to put several worker functions into a single JS file.

The updated Nebula3/emscripten demos at http://n3emscripten.appspot.com now decompress the downloaded asset files in up to 4 WebWorker threads in parallel to the main thread, this speeds up asset loading tremendously and avoids the excessive frame hickups which happened before. This is important, since real-world Nebula3 apps stream asset data on demand while the render loop is running. The whole stuff took me about half a day, but unfortunately I stumbled across a Chrome bug which required a small workaround (see here: http://code.google.com/p/chromium/issues/detail?id=169705).

It's not completely perfect yet. There's data copying happening on the main thread, and there's also some expensive stuff going on when creating the WebGL resources (for instance vertex and index data is unrolled for the instanced rendering hack). The ultimate goal is to move as much resource creation work off the main thread in order to guarantee smooth rendering while resources are created.

There are also browser improvements in sight which will make WebWorkers more efficient in the future, mainly to avoid extra data copies by transferring ownership of the passed data over to the web worker, basically a move instead of a copy.

4 Jan 2013

I've been playing around a bit more with the Nebula3/emscripten port over the holidays. Emscripten had some nice improvements during the past 2 months, mainly to generate smaller and faster code, and to drastically reduce code generation time in the linker stage (read this up on azakai's blog).

The work I did on my experimental Nebula3 branch were only partially emscripten-related: The biggest chunk of work went into refactoring to adapt the higher level parts of the rendering pipeline for the new CoreGraphics2 subsystem (lighting, view volume culling, and the highlevel graphics subsystem which is concerned about Stages, Views and GraphicsEntities). A lot of code was thrown away or moved around, but from the outside everything looks quite similar as before. External code which depends on the Graphics subsystem must be fixed-up, but not rewritten.

Another big chunk of work went into implementing instanced rendering for the new CoreGraphics2 system. OpenGL offers several extensions for instanced rendering, but since none of the current WebGL implementations support any of these extensions I first wrote a fallback solution which works without extensions, but uses bigger "unrolled" vertex- and index-data, and a instance-matrix palette in the vertex shader. With the current implementation, up to 64 instances can be collapsed into a single drawcall. This depends on the number of available vertex shader uniforms, and since the ANGLE wrapper used by Chrome and Firefox on Windows generally restricts the number of vertex shader uniforms to 254 I had go with only 64 instances per drawcall. This restricts the usage scenarios of this approach, but when rendering a Drakensang Online map (for instance), this comes pretty close to the average number of instances of environment objects in the view volume. For particle rendering this approach would be useless though.

I also rewrote the emscripten filesystem wrapper. The original implementation was only a quick hack to get data loaded into the engine at all. I wrapped this now into a proper subsystem which uses new emscripten API calls to directly download data into a memory buffer without mirroring the data into a "virtual filesystem", and the new implementation also accepts the file compression of Drakensang Online's HTTP filesystem (it's not the complete HTTP filesystem implementation yet though, the table-of-content-files are ignored, as well as the per-file MD5 hashes, and there's no local file cache apart from the normal browser cache). Also, while the emscripten filesystem wrapper is asynchronous, it is not yet multithreaded through the new WebWorker API. Decompression currently happens on the main thread and may lead to frame stuttering, but the plan is to move this into separate worker threads.

First, here's the old Dragons demo, recompiled with the latest emscripten version. Thanks to the improvements in emscripten, and the house-cleaning to remove old code, the (compressed) download size of the Javascript-code is now only 308kByte:

Next is a demo for the new instanced rendering. On startup, 1000 independently animated cubes are rendered, and by pressing cursor-up you can add 1000 more. There's also 128 point lights in the scene. Every 1000 cubes require about 32 draw-calls (that's (1000/64)*2, the instancing collapses 64 cubes into one draw call, and then *2 because of the NormalDepth- and Material-Passes of the Light Pre-Pass Renderer. For every cube, a world-space transform matrix is computed per frame on the CPU (a conversion from polar-coordinates to cartesian coordinates, involving two sin() and two cos(), and a matrix-lookat involving several normalizations and cross-products.

And finally I wrote a little Drakensang Online monster viewer. With cursor-up/down you can switch to the next/previous monster, with cursor-right you can flip between different skin-lists (appearances), and with cursor-left you can toggle a few animations (usually idle and running anims). Obviously the material shader is different from Drakensang Online (the color texture is replaced with just white, the specular effect is exaggerated (which actually is a nice show-case for the really good normal-maps of our character models). This is only a snapshot of what's currently in the game, especially most of the animations are not included. The strange cubes which are displayed sometimes are the mesh-placeholder objects, I think I'll remove them and just use no placeholder as long as the mesh is not loaded, at least it shows that the placeholder system is working right ;)