16 Dec 2012

I've basically written a vertical slice of the new Nebula3 Render Layer during the past few weekends, trying out a few ideas of how the Nebula3 rendering system might look in the future.

The lowest-level subsystem is CoreGraphics2, which I've already written about a little bit.

It wraps the host platform's 3D API (e.g. OpenGL or Direct3D); the rendering vocabulary is higher-level and less verbose than OpenGL/D3D. It runs the render thread, but can also be compiled without threading (for instance on the emscripten platform). There's a facade singleton object (CoreGraphics2Facade) which wraps the entire functionality into a surprisingly simple interface.

CoreGraphics2 works with only 5 resource types:

Texture: Just what the name implies, a texture resource object. This also includes render targets.

Mesh: Wraps the vertex and index data for drawing operations, divided into primitive groups.

DrawState: This wraps all the required shader and render-state data for a drawing operation: a reference to a shader object, shader constants (one-time-init, immutable), shader variables (mutable) and an (immutable) state-block for render-states.

Pass: A pass object holds all required data for a rendering pass: a render-target texture object, and a DrawState object which defines render state valid for the entire pass. All rendering must happen inside passes. Typical passes in a pre-light-pass renderer are for instance the NormalDepth pass, the Light pass, the Material pass and a Compose pass. The pass object also contains the information whether and how the render target should be cleared at the start of the pass.

Batch: A batch object just contains a DrawState object which defines render state for several draw operations, so this is just a way to reduce redundant state switches.

Resource objects are opaque to the outside. To the caller, these are just ResourceId objects, there's no way to directly access the data in the resource objects (since they actually live in the render thread).

Resource creation happens by passing a Setup object to one of the Create methods in the CoreGraphics2Facade singleton. There's one Setup class for each resource type (TextureSetup, MeshSetup, DrawStateSetup, PassSetup and BatchSetup). The Setup object describes how the resource should be created and shared (for instance, when creating a texture resource, the Setup object would contain the path to the texture file, whether the texture should be loaded asynchronously, whether the texture object should be a render target, and so on). The render thread keeps the Setup objects around, so it has all the information available to re-create the resource (for instance because of D3D's lost-device state, or for more advanced resource management where currently unused resources can be removed from memory and re-loaded later).
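As a sketch of this pattern (class and method names below are illustrative assumptions, not the actual Nebula3 interface), resource creation with retained Setup objects might look like this:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

typedef uint32_t ResourceId;

// hypothetical Setup object describing how a texture should be created
struct TextureSetup {
    std::string path;           // where to load the texture from
    bool asyncLoad = false;     // load asynchronously?
    bool renderTarget = false;  // create as render target?
};

class CoreGraphics2Facade {
public:
    // create a texture resource; the Setup object is retained so the
    // render thread can re-create the resource later (lost device, etc.)
    ResourceId CreateTexture(const TextureSetup& setup) {
        ResourceId id = nextId++;
        textureSetups[id] = setup;   // kept around for re-creation
        return id;
    }
    const TextureSetup& LookupSetup(ResourceId id) const {
        return textureSetups.at(id);
    }
private:
    ResourceId nextId = 1;
    std::unordered_map<ResourceId, TextureSetup> textureSetups;
};
```

Because the facade keeps the Setup object around, the render-thread side has everything it needs to silently re-create the resource later.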

All rendering happens by calling methods of CoreGraphics2Facade:

Begin / End methods:

These methods structure a frame into segments.

BeginFrame / EndFrame: Signal the start and end of a render frame.

BeginPass / EndPass: Signal start and end of a rendering pass. BeginPass takes the ResourceId of a Pass object, makes the render target of the pass active, optionally clears the render target, and applies the render state of the DrawState object of the pass.

BeginBatch / EndBatch: Signal start and end of a rendering batch. This simply applies the render state of the DrawState object of the batch.

BeginInstances / EndInstances: This is where it gets interesting. BeginInstances sets all the required state for a series of Draw commands. It takes a Mesh ResourceId, a DrawState ResourceId, and a "shader variation bitmask". The bitmask basically selects a "technique" from the shader (in D3DXEffects terms). For instance, to select the right shader technique for rendering the NormalDepth-pass of a skinned object, one would pass "NormalDepth|Skinning" as the bitmask.
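The bitmask idea can be sketched like this (flag names and the exact lookup scheme are assumptions, not the real CoreShader implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// assumed variation flags; each bit selects one shader feature
enum ShaderVariation : uint32_t {
    NormalDepth = 1u << 0,
    Skinning    = 1u << 1,
    Lighting    = 1u << 2
};

class Shader {
public:
    void AddTechnique(uint32_t featureMask, const std::string& name) {
        techniques[featureMask] = name;
    }
    // select the technique whose feature bitmask matches exactly
    const std::string& SelectTechnique(uint32_t featureMask) const {
        return techniques.at(featureMask);
    }
private:
    std::map<uint32_t, std::string> techniques;
};
```

A caller would then pass something like NormalDepth|Skinning to BeginInstances and the right technique would be picked from the shader.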

ApplyVariable: applies a shader variable value to the currently active DrawState object (which has been set during BeginInstances). This is a template method, specialized for each shader variable data type (float, int, float4, matrix44, bool).

ApplyVariableArray: same as ApplyVariable, but for an array of values.

Draw methods:

This method group performs actual drawing operations:

Draw: Performs a single draw call, must be called inside BeginInstances/EndInstances. Renders a PrimitiveGroup (aka material group) from the currently active Mesh, using the render state defined in the currently active DrawState. For non-instanced rendering one would usually perform several ApplyModelTransform() / Draw() pairs in a row.

DrawInstanced: Like Draw, but takes an array of per-instance transforms to render the same mesh at many different positions. Tries to use some sort of hardware instancing, but falls back to a "tight render loop" if no hardware instancing is available.

DrawFullscreenQuad: Simply renders a fullscreen quad with the currently set DrawState; this is used for fullscreen post-effects.

And that's basically it. I'm quite happy with how simple everything looks from the outside, and how straightforward the innards work. For instance, leaving the shader system aside (which is implemented in a separate subsystem, CoreShader), the OpenGL-specific code in CoreGraphics2 is just 7 classes, and the biggest file is around 600 lines of code.

And it's simple to use, for instance here's the render loop to render the point lights in the new LightPrePassRenderer (hopefully the Blogger editor won't screw up my formatting):
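Since the code sample didn't survive the formatting, here's a hedged reconstruction of what such a point-light loop might look like; the facade below is a mock that only records the call sequence, and all names, signatures and the variation mask are assumptions:

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef int ResourceId;
struct Matrix44 {};

// mock facade: records the call sequence instead of rendering
class CoreGraphics2Facade {
public:
    std::vector<std::string> log;
    void BeginInstances(ResourceId mesh, ResourceId drawState, int variationMask) {
        log.push_back("BeginInstances");
    }
    void ApplyVariable(const std::string& name, const Matrix44& value) {
        log.push_back("ApplyVariable " + name);
    }
    void Draw(int primGroup) { log.push_back("Draw"); }
    void EndInstances() { log.push_back("EndInstances"); }
};

struct PointLight { Matrix44 transform; };

// render all point lights as non-instanced ApplyVariable/Draw pairs
void RenderPointLights(CoreGraphics2Facade& gfx,
                       ResourceId lightMesh, ResourceId lightDrawState,
                       const std::vector<PointLight>& lights) {
    const int LightPassMask = 1;   // assumed shader variation bitmask
    gfx.BeginInstances(lightMesh, lightDrawState, LightPassMask);
    for (const PointLight& light : lights) {
        gfx.ApplyVariable("ModelTransform", light.transform);
        gfx.Draw(0);               // draw primitive group 0 of the light mesh
    }
    gfx.EndInstances();
}
```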

The only things still missing from CoreGraphics2 are dynamic resources and a plugin system to extend the functionality of the render-thread side with custom code (for instance for non-essential stuff like runtime resource baking).

As much as I'd love to have a rendering system where dynamic resources aren't needed at all, there's no way around them yet. We still need them for particle systems and UI rendering.

On the front-end of the render layer, there's the new Graphics2 subsystem. The changes are not as radical as in CoreGraphics2 (with good reason, because changes in this subsystem would affect a lot of high-level gameplay code). There are still the basic object types Stage, View, Camera, Light and Model. There's now a new GraphicsFacade object, which simplifies setup and manipulation of the graphics world drastically. And I tried out a new component system for GraphicsEntities (Models, Lights and Cameras). Instead of an inheritance hierarchy for the various GraphicsEntity types, there's now only one GraphicsEntity class which owns a set of Component objects. The combination of those components is what turns a GraphicsEntity into a visible 3D model, a light source, or a camera. The main driver behind this was that 90% of all data in a ModelEntity was character-related, but fewer than 10% of graphics objects in a typical graphics world are actually characters.

I've split the existing functionality into the following entity components:

TransformComponent: defines the entity's position and bounding box volume in world space.

This component model hasn't really been designed to allow strange combinations (you might be tempted to attach a CameraComponent to a character entity for a first-person shooter). Theoretically something like this might even be possible, but I don't think it is a good idea. The driving force behind the component model was cleaner code and better memory usage.
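A minimal sketch of the component idea (class and method names are assumptions, not the actual Graphics2 interface):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

class Component {
public:
    virtual ~Component() {}
    virtual std::string TypeName() const = 0;
};

class TransformComponent : public Component {
public:
    std::string TypeName() const override { return "Transform"; }
    float position[3] = {0, 0, 0};   // world-space position
};

class CameraComponent : public Component {
public:
    std::string TypeName() const override { return "Camera"; }
};

// one entity class; its attached components determine what it is
class GraphicsEntity {
public:
    void AttachComponent(std::unique_ptr<Component> comp) {
        components.push_back(std::move(comp));
    }
    bool HasComponent(const std::string& typeName) const {
        for (const auto& c : components)
            if (c->TypeName() == typeName) return true;
        return false;
    }
private:
    std::vector<std::unique_ptr<Component>> components;
};
```

A plain static model would only carry a TransformComponent plus whatever renders it; the heavy character-related data would live in its own component, attached only to the few entities that actually need it.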

23 Oct 2012

Ok, let me just say that I went from "Saulus to Paulus" (as we say in Germany) in the past few days. In my ongoing stealth mission to evaluate all the C++-to-web technologies currently available (Google Native Client, Adobe's Flash C/C++ compiler, and Mozilla's emscripten) I actually wanted to pick Adobe's solution next, since I didn't really believe that emscripten's approach of compiling C++ to Javascript could possibly work. I had a fixed idea in my mind of how fast a C++-to-bytecode VM solution would be (that's what Adobe is doing), and how fast Javascript could possibly be, and Javascript would lose by a long shot. No way did I think it possible to run really math-heavy code in a language with such a shitty type system (excuse my French).

There are numbers flying around like 25% to 50% of native performance (even up to 80% for Adobe's solution) which I thought to be extremely optimistic even for hand-picked benchmarks. For instance, if you look at Epic's famous Unreal Flash demo, there's not a lot of dynamic stuff happening in the 3D world you're moving through. Sure it looks impressive, but it mainly demonstrates a good art style and how fast your GPU is; it doesn't say much about how efficiently the "CPU code" runs in the Adobe VM.

Then I started to look closer at emscripten, spent a few days with porting, and imagine my surprise when I first started the optimized version of this:

...and I added dragons and more dragons, and even more dragons, until the frame rate finally started to drop. Of course it's not Native Client performance, but it is much (much!) better than I expected.

Let me explain what you're seeing:

The demo is built from a set of Nebula3 modules consisting of about 120k lines of C++ code, cross-compiled to Javascript+WebGL through the emscripten compiler infrastructure. There is a lot (really a LOT) of vector floating-point math C++ code running in the animation engine, because I must admit that I actually wanted to "break" the JS engine and show how incredibly much faster NaCl would be. Well, that didn't quite work out ;)

Of those 120k lines of code, only a few hundred lines are actually specific to the emscripten platform. So there's less than 0.5% of platform-specific code, and about 99.5% of the code is exactly the same as in the NaCl demo, or an actual "native" (OpenGL-based) desktop version of the demo. If you take all of Nebula3 (about half a million lines of C++ code), the ratio of platform-specific code is even more impressive.

Let this sink in for a while:

You can take a really big C++ code base with dozens of man-years of engineering effort and a mature asset pipeline attached to it, spend about 2 weekends(!) of tinkering, and run your code at a really good performance level in a browser without plugins! You still have to be realistic about CPU performance of course. It helps if a game is designed to make relatively little use of the CPU, and move as much work as possible onto the GPU, but these are the normal realities of game development. Of all the target platforms for a project you should choose the weakest as the "lead platform", make the game run well there, and use the extra power of the other platforms for non-essential, but pretty, "bells'n'whistles".

And you don't have to burn bridges behind you: You can still use the exact same code base and asset pipeline and create traditional native applications for mobile platforms, desktop apps, or game consoles. And all of this in a programming language and a graphics API which many considered almost on its death bed a few years ago.

I'd say that C++ and OpenGL are on a goddamn come-back tour right now :)

Not all is golden though, each of the C++-to-web solutions has at least one serious weakness:

Native Client: extremely fast and feature rich, but only supported by Google

emscripten (more specifically WebGL): not supported by Microsoft

Adobe flascc: Adobe wants a "speed tax" if you're using certain "premium features" required for high-performance 3D games and earn money with it

Which sucks badly, since most of this is purely politics-driven, not motivated by technological reasons.

So... originally I wanted this blog post to be a technical post-mortem of the emscripten port, but on one hand it's already quite long, and on the other hand there's really not much to write about, since it all went so smoothly.

I installed the SDK and a few required tools to my MacBook, wrote (yet another) simple CMake toolchain file, and was able to compile most of Nebula3 out of the box within an hour. The most exciting event was that I found a minor bug in the OpenGL wrapper code, which was fixed within the day by the emscripten team (kudos and many thanks to kripken (aka azakai) for being so incredibly fast and helpful).

The only area where I had to spend a bit more time was threading (or rather the lack thereof). Since emscripten (like NaCl) runs in the browser context, it suffers from many of the same limitations - like not being able to do synchronous IO, and not being able to "own the game loop"; instead you need to run in little per-frame time-slices inside callbacks from the browser. NaCl offers a working pthreads API to work around these limitations, but emscripten cannot support true threading since the underlying Javascript runtime doesn't allow sharing state between threads. This looked like a real show-stopper to me in the beginning, but after a few nights of sleeping on the problem I found a really simple solution by moving to a higher level, into Nebula3's asynchronous messaging system. Up there it was relatively easy to replace the multithreaded message-handler code with a frame-callback system with minimal code changes.
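The gist of that replacement can be sketched like this (names are illustrative, not Nebula3's actual messaging classes): instead of a handler thread blocking on a message queue, a single-threaded platform drains the same queue from a per-frame callback.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct Message { std::string payload; };

class FrameCallbackHandler {
public:
    // senders enqueue messages exactly as before
    void Send(const Message& msg) { pending.push(msg); }

    // called once per frame from the browser's frame callback; drains
    // the queue on the main thread instead of a dedicated handler thread
    void OnFrame() {
        while (!pending.empty()) {
            handled.push_back(pending.front().payload);
            pending.pop();
        }
    }
    std::vector<std::string> handled;   // stands in for "message was processed"
private:
    std::queue<Message> pending;
};
```

The sending side doesn't change at all; only the consuming side moves from a blocking thread loop to a per-frame drain, which is why the code changes could stay minimal.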

It sucks a bit right now that everything runs on the main thread (so the current N3 demo can't take advantage of multiple CPU cores), but a solution for a higher-level multithreaded API based on HTML5 web workers is in the works right now (I really can't believe how fast these guys are!).

I'm tempted to write a lot more about how incredibly clever the C++-to-JS cross-compilation is, how the generated JS code can be even faster than handwritten code, and how surprisingly compact the generated code is, but if you're getting all excited about this type of stuff it's better if you read it first-hand:

So where next? I'll polish the emscripten port a bit more and implement support for the new web worker API to load and uncompress asset files in the background, and then take a little detour and port the Dragons demo to Adobe's flascc (perfect timing since they just went into open beta). After that I need to do some cleanup work on all three ports, and on the higher level parts of the render pipeline since the low level rendering code has been replaced with the new "Twiggy" stuff.

25 Sep 2012

Ok, it's not really a post-mortem since I consider the work on the Native Client port far from finished. For my current "weekend stuff" NaCl actually provides a pretty cool distribution channel for simple tech demos, so this gives me enough motivation to keep the port fresh and up to date.

The Chrome Web Store lets me publish those small tech demos in an extremely simple way: just zip a directory and upload it to the web store dashboard, without any review or certification process getting in the way. The demos automatically work on Windows, OSX and Linux, and it's extremely easy for the user to install and run them (just a few seconds and 2 button clicks from start to launch).

Ok, let's go:

Build System and General Workflow

Native Client offers a standard GCC-cross-compiling toolchain and is using regular Makefiles (they switched away from scons some time ago). Nebula3 is using CMake as meta-build-system. CMake generates platform- and IDE-specific low-level-build files from a set of generic project description files.

For Native Client I wrote a very simple CMake toolchain file which points the build process to the right GCC tools and directories. This is the same general procedure I already used before for the Android and Alchemy platforms (both were just relatively short-lived experiments, so don't hold your breath). CMake generates Makefiles and an Eclipse wrapper project from the NaCl specific toolchain file and the generic project description files.

I'm currently doing most of my spare-time programming work on OSX, and the most practical option there was to generate Eclipse projects. I didn't manage to automatically generate an XCode project with cross-compiling support through CMake; the only option was to manually create an XCode project with an "external build system target" which points to the CMake-generated Makefiles.

A special feature of working with NaCl is the 32-bit vs. 64-bit support. NaCl projects are *required* to offer both a 32-bit and a 64-bit executable. I simply solved this by generating 2 separate sets of build files from two different CMake toolchain files.

Working with the NaCl projects still isn't as fluid as working on a native OSX project in XCode or a Windows project in VisualStudio. There's no IDE debugger, compile times are much longer (especially compared to XCode), and the edit-compile-test loop is more complicated because the NaCl executable can only be run inside Chrome.

That's why I'm still doing most of the general coding and debugging in a regular OSX application project inside XCode, and use the NaCl build system as a pure cross-compiling solution with command-line make. I only spend significant time inside Eclipse when there are bugs in the NaCl-specific code, side effects which only happen on the NaCl platform, or new NaCl functionality to implement.

NaCl will support a CPU-agnostic binary format called PNaCl in the (hopefully not too distant) future. The build process will then generate some sort of LLVM bitcode, and finish the compilation process on the client side for a specific CPU instruction set. This will probably speed up the build process and get rid of the several build passes (and hopefully doesn't increase the startup time in the browser).

Also there's a VisualStudio addin which seems to be nearing completion. I must admit that I didn't look at this yet. If this solution provides a usable UI debugger I'll be in heaven :)

GLIBC vs newlib

NaCl offers 2 options for the C runtime: the "fat" option is GLIBC, the slim option is newlib. I first went with GLIBC because it sounded like the more complete option and makes porting POSIX code easier. The problem with the GLIBC runtime is that it uses dynamic linking and needs to download a myriad of huge DLL files to the client before the actual client binary can be launched. After I saw the resulting download size just required to start the executable (I think it was near 20 MB?) I was scared and went with newlib.

Now, DLLs *may* be an option in the future, if the DLLs can be shared between different NaCl projects and NaCl becomes really popular (or most of the DLLs are already included in the standard Chrome installation), but as far as I understand, none of this is true yet.

With newlib you get an all-in-one executable, everything is nicely statically linked, and dead code stripped away. For a simple Nebula3 viewer I ended up with 6.8 MB for the 32-bit nexe, and 7.2 MB for the 64-bit nexe, still quite heavy, but ok. The compressed size for the Chrome Web Store packaged-app ended up around 4 MB.

Porting Fundamentals

You should start with a "POSIX + OpenGL + 64bit-clean" codebase and implement the NaCl specific stuff from there. If you're coming from "Win32 + DirectX" this is quite a bummer of course. Thankfully I had all basic requirements already in place. For the Drakensang Online server we had ported the Nebula3 Foundation Layer to 64-bit-Linux, and for my various other porting experiments (iOS, OSX, Android, etc...) I had the (old) Nebula3 render pipeline already ported to OpenGL.

NaCl Restrictions

Ok, this is where it gets interesting. NaCl has a very "interesting" application model. If you're familiar with co-operative multitasking (like back in the days of Windows 3.1 or the Wii) you'll feel right at home :D

The problem seems to be that Chrome can't give you complete control of your game loop for security reasons. Each Chrome tab is running in its own process, but a NaCl application shares this process' main thread (called the "Pepper Thread") with other tasks which need to run on a web page (like the Javascript engine, the page layout engine, the HTTP communication code, etc...). The details are not so important and may change between Chrome versions; the important part is that a NaCl application can completely block a Chrome tab or lead to unresponsive behaviour if one "NaCl frame" takes too long.

So that's the first problem: you're only allowed to spend very little time on this "Pepper Thread" and then need to give control back to Chrome.

And then there's the famous "Call-On-Main-Thread" restriction: most of the "system level stuff" needs to be called from the main-thread. If you want to download a file: main-thread. If you want to call an OpenGL function: main-thread. Getting input events: main-thread. And so on. Fortunately this restriction doesn't apply to the C runtime functions like malloc().

So problem #2 is: "Call-On-Main-Thread-Restriction".

Problem #3: Tasks can be scheduled to execute on the Pepper thread, but they are processed through a simple worker queue... which means: you can start an asynchronous task to execute on the Pepper Thread from *another* thread and then block that thread until the execution of the task has finished, to simulate a blocking operation. But you can't do this on the Pepper Thread itself! Since the asynchronous task will only execute after you give control back to the Pepper Thread, you just shut down the whole execution pipeline by waiting for an event which will never trigger.

All of these 3 problems combined result in a very nasty restriction for "real world code": you can't simply do an fopen() followed by an fread(), or a bit more general: you can't do any blocking operation from the Pepper Thread.

Unfortunately, the first thing the Nebula3 render thread does, is to load shader files in a synchronous manner, and it can't really continue without the shaders being setup. Ok, no problem, that's why it's a render *thread*, we can simulate blocking i/o if we're not on the Pepper thread, right? But wait a minute, to issue OpenGL calls we need to run on the Pepper thread...

And that's basically the core problem... I'll get to the solution later under "Rendering".

Application Structure

A Nebula3 desktop application on OSX has 3 "important" threads:

The "process main thread" (let's call it the event thread): this creates the app window (an OSX-specific restriction), spawns the "game loop thread" and then goes into an event loop, waiting for and occasionally waking up on operating-system events

The "game thread": this is where the game loop is running, the important part here is that it runs decoupled from the event thread. While the event thread is blocking, waiting for events, the game thread is running uninterrupted in a loop

The "render thread": with the new experimental "Twiggy" render pipeline this is a very simple thread behind a push-buffer which receives and executes high-level rendering commands from the main thread, the render thread will starve (currently) if the main thread doesn't issue rendering commands.

A NaCl application looks very similar, but doesn't have a dedicated render thread, instead all the render-thread code is actually running on the Pepper Thread:

The "Pepper Thread": This is the starting thread. There isn't a traditional event loop; instead there's a one-time Init method and a couple of callback methods which NaCl calls for various events (input, focus change, etc...). Also, this is where NaCl callback tasks from the other Nebula3 code are executed.

The "game thread": this works exactly like in a normal desktop application, except that it *doesn't* spawn a render thread; instead it enqueues a callback to the Pepper thread once a whole frame of rendering commands has been written to the push-buffer.

The "render thread" is actually running as a callback on the Pepper Thread. But the actual code is identical to other platforms: highlevel rendering commands are pulled from the push buffer and translated to low-level OpenGL calls.

The key idea here is that the game's main thread isn't identical with the process main thread. This is actually a pretty cool pattern for any event-driven platform (on Windows you get around this problem because the game loop may control the Windows event processing through PeekMessage()/DispatchMessage()). On other platforms like OSX, iOS or Android this isn't so simple, and the above pattern is a very elegant way to work around the "event-loop vs. game-loop" problem (IMHO).

Loading Data

NaCl offers 2 ways to get data into the engine. One is a wrapper class for loading data from an HTTP server, the other is read/write access to a local filesystem sandbox (basically a wrapper around HTML5's local storage system). A real-world application should use both: HTTP requests to get the data from a web server, and the local storage system for caching to become independent from the browser cache.

Nebula3 has a pretty nifty virtual file system layer which associates URL schemes (like http:, file:, ...) with pluggable file systems, and we wrote a very sophisticated "HTTP filesystem" for Drakensang Online. Apart from that the Nebula3 IO system is multithreaded. There's a single dispatcher thread which consumes IO requests and schedules them to a number of IO threads which process the request. There can be as many IO requests "in flight" as there are IO threads, which is important for data streaming over slow and unreliable HTTP connections.

The end result is that we can use the exact same code to load data from the local disc or from web servers (in fact there's only one line of code required to switch the "root location").

For the current NaCl port, I only implemented an extremely simple HTTP virtual filesystem using the NaCl URLLoader class without local caching which allows me to pull files from a web server, and which can also simulate blocking IO as long as we're not on the Pepper thread.

Handling Input

This is a no-brainer. The Pepper Thread gets called when input events are available for mouse and keyboard. These get converted to "Nebula3 events" and pushed onto a thread-safe queue. There's a design fault in my current implementation: the render thread pulls and handles these events. Don't do this, because in NaCl this will execute on the Pepper thread (this is still legacy code from the old render pipeline). What *should* happen is that the game-loop thread pulls and processes these events.
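The intended structure can be sketched with a simple thread-safe queue (event type and method names are illustrative): the Pepper thread pushes converted events, and the game-loop thread pulls everything once per frame.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct InputEvent { std::string type; int x, y; };

class ThreadSafeEventQueue {
public:
    // called from the Pepper thread when browser input arrives
    void Push(const InputEvent& e) {
        std::lock_guard<std::mutex> lock(mutex);
        events.push(e);
    }
    // called from the game-loop thread, typically once per frame
    std::vector<InputEvent> PullAll() {
        std::lock_guard<std::mutex> lock(mutex);
        std::vector<InputEvent> out;
        while (!events.empty()) {
            out.push_back(events.front());
            events.pop();
        }
        return out;
    }
private:
    std::mutex mutex;
    std::queue<InputEvent> events;
};
```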

Rendering

This is an interesting area again, because OpenGL calls must happen on the Pepper thread. This is where I spent the most time, and eventually this restriction became an important contributing factor to the new "Twiggy" render pipeline.

Traditionally, Nebula3 uses a "fat render thread". The render thread runs a completely autonomous graphics world, and only gets updates like position changes from the main thread. It works, but has led to several problems over time which I won't get into here...

When I first dabbled around with NaCl I kept this design and wrote an "OpenGL stream codec". Whenever the original code would issue an OpenGL call, on NaCl this call would be encoded into a binary stream, and when the render-thread frame had finished, a Pepper Thread callback would be issued which decoded the commands and executed the OpenGL calls.

This was basically the right idea but had 2 very serious downsides:

excessive granularity: a real-world game might have thousands of OpenGL calls per frame, and the overhead this would produce is just silly (especially since NaCl has to do the same encode/decode *again*).

excessive syncing: Every OpenGL call which produces a result needs to block until the decoder thread has issued the OpenGL call. This is a problem with resource creation and uniform location lookups.

I recently wrote a new render pipeline which gets rid of the "fat render thread". Instead, the new render thread just sits behind a push-buffer, consuming and executing rendering commands. This is the better design because:

command granularity is a lot lower since the command protocol is higher level than OpenGL

the main thread *never* blocks mid-frame for resource creation, only at most once per frame if no free push buffer is currently available (because the main thread has run too far ahead of the render thread)

For NaCl this is the perfect design: all OpenGL calls happen on the Pepper thread, and the Pepper thread doesn't do any heavy processing; just some simple translation from Nebula3's high-level rendering protocol to low-level OpenGL calls.

Oh, I forgot: remember when I said that the first thing the render thread does is to synchronously load shader files? I got around this problem in a radical way by compiling the shader code into the program. A little command line tool reads the shader files which are otherwise loaded as data files and converts them into a big monolithic C++ source file which sets up the complete shader "database". All other file accesses in the render thread had already been asynchronous, thankfully.
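Such a baking tool boils down to simple code generation. A hedged sketch (the real tool's input and output formats are unknown to me, and a production version would also escape quotes and backslashes in the source lines):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// convert one shader source file into a C++ string constant that can be
// compiled directly into the program, so no runtime file load is needed
std::string BakeShaderToCpp(const std::string& name, const std::string& source) {
    std::ostringstream out;
    out << "// generated file, do not edit\n";
    out << "static const char* shader_" << name << " =\n";
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        out << "    \"" << line << "\\n\"\n";   // one string literal per line
    }
    out << ";\n";
    return out.str();
}
```

The generated file would then be compiled into the shader "database" setup code, removing the synchronous load from the render thread's startup path.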

From Here Onward

The only big show-stopping must-have feature that NaCl doesn't provide yet (IMHO) is UDP sockets, or any other type of "lossy but fast" type of communication channel.

I don't care that much about the call-on-main-thread restriction any more. I know there's a solution on the way, but it is only a viable option if the overhead is smaller than the current solution's.

A proper sockets solution on the other hand is critical for action-oriented online games. Maybe I'm old-fashioned and TCP sockets are "good enough" these days, but a guaranteed messaging layer is still overkill for fast-paced action games. It's just not that important whether every movement command arrives, but it is a catastrophe if communication stalls *when* packets are lost.

I looked into WebSockets, and I don't have a lot of praise for them. I think I have a basic understanding of the security implications, but I will never understand why WebSockets work the way they do. It looks like they were only designed to make AJAX communication suck less, not with the type of network communication in mind that any type of multiplayer game requires (well, maybe turn-based games like Chess or Civilization).

So if I had one free wish for the NaCl team: give us proper sockets, with TCP and UDP support, and create a security model around them, and don't care about HTML5. Ideas: only allow outgoing connections (the "client" in a client-server model), restrict where a socket may connect to (something like crossdomains.xml), etc... I'm no security expert, but I think / hope that it should be possible to create a secure networking system under the standard socket API (or a subset thereof).

24 Sep 2012

After I got the new experimental "Twiggy" pipeline up and running I decided to give Google Native Client another shot, since Twiggy works much better with the "call-on-main-thread" restriction of NaCl. The NaCl port now only differs by a couple of hundred lines of code from the other platform ports which I'm very happy with.

Here's the link to the demo in the Chrome Web Store. You need a recent version of Chrome (at least v.21). So far I only tested this on my MacBook with an nVidia 9400M GPU under OSX and Windows:

19 Aug 2012

A list of embarrassing screw-ups I tend to waste time on (repeatedly) while writing OpenGL-based rendering code. The list will surely grow as I continue working on Twiggy :P
This is generally stuff which doesn't show up with glGetError() of course.

Sampling a texture just returns black pixels in the fragment shader:

Most likely the texture is incomplete because the texture sampler states GL_TEXTURE_WRAP_S, GL_TEXTURE_WRAP_T, GL_TEXTURE_MIN_FILTER and GL_TEXTURE_MAG_FILTER haven't been set up yet, or the FILTER states expect mipmaps but the texture has none.
Also, on some platforms / drivers it is more efficient to set up those states before calling glTexImage2D().

Clearing a render target doesn't work as expected (e.g. the color buffer is cleared, but the depth buffer isn't):

glClear() is affected by the state of glColorMask(), glDepthMask(), and glStencilMask(). If glDepthMask() is set to GL_FALSE, the depth buffer won't be cleared even though glClear() is called with the GL_DEPTH_BUFFER_BIT set.

17 Aug 2012

Twiggy's render thread is much simpler than Nebula3's current "fat thread". Instead of running a completely autonomous graphics world in the render thread it simply accepts rendering commands from a push buffer, fed by the main thread (and maybe other threads later on). Eventually the render thread will also be extensible through some sort of plugin-modules which may implement new low-level rendering commands.

Extra care has been taken to keep data structures and their relations simple, memory granularity low (arrays-of-structs instead of arrays-of-pointers), and to keep related data close to each other in memory (to increase CPU cache efficiency).

The push buffer is at least double-buffered (more buffers are possible but probably don't make much sense, since that would only increase latency when the main thread runs too far ahead of the render thread). The render thread will starve if the main thread stops feeding commands, but that's something which needs to be fixed for special cases like loading screens, probably by adding flow-control commands to the push buffer, so that the render thread can loop over a block of rendering commands.

When doing an experimental port of N3's current render pipeline to Native Client I implemented such a push-buffer driven render pipeline because of NaCl's "call-on-main-thread" limitation. Each OpenGL call would be serialized into a command buffer and pulled by the "Pepper Thread", where the actual GL calls would be executed. This worked - but it was a terrible hack, and it convinced me that it isn't a good idea to feed the render thread with such low-level commands (way too many commands per frame, and the feeder thread had to wait for the render thread each time a getter-function or resource-creation-function was called).

Thus the command protocol of Twiggy's render thread is higher-level, but not so high-level that it is hard-wired to one specific rendering technique.

Something that has worked very well since Nebula2 is the frame-shader system (in N2 these were called RenderPaths). Frame-shaders describe how a complete frame is rendered by dividing the frame into passes and batches. It's a nice and generic medium-level renderer architecture: much less verbose than D3D or OpenGL, but flexible enough to implement various rendering techniques (like forward rendering on low-end platforms versus pre-light-pass rendering on platforms which have a bit more fillrate). It was quite natural to use the frame-shader vocabulary for Twiggy's render command protocol.

A twiggy render frame is built from the following commands (this excludes display setup and resource creation):

1 BeginFrame
[UpdateProjTransform]
1 UpdateViewTransform
1..N BeginPass
1..N BeginBatch
1..N BeginInstances
1..N Draw
1..N DrawInstanced
1 DrawFullscreenQuad
EndInstances
EndBatch
EndPass
EndFrame

BeginFrame / EndFrame: this encapsulates an entire render frame. Calling EndFrame() signals the render pipeline that all rendering commands for this frame have been issued and that the result can be displayed.

UpdateProjTransform: updates the projection matrix; some rendering techniques may need to change the projection mid-frame (for instance before rendering to a shadow-map).

UpdateViewTransform: updates the view matrix; depending on the actual rendering technique used, this can be called several times per frame.

BeginPass / EndPass: a pass sets (and optionally clears) the active render target, and sets render states which should remain valid for the entire pass.

BeginBatch / EndBatch: a Batch only sets a couple of render states which should remain valid between BeginBatch/EndBatch, this is just a way to reduce redundant state switches when rendering instances. Typical batches are solid vs. alpha-blended objects for instance.

BeginInstances / EndInstances: this sets up all necessary data for a series of Draw commands for the same 3d object (geometry, textures, shaders, render states).

Draw / DrawInstanced: several ways to issue actual draw commands. The simplest version takes a world-space matrix and performs one draw call to render one instance. DrawInstanced will be used for rendering several instances with a single command, preferably using hardware-instancing.

DrawFullscreenQuad: Just draws a fullscreen quad, mainly used for post-effects.

I'll keep resource management out for now; this topic is interesting enough for its own post.

There will be a way to write dynamic resources from the main thread, and use them from the render thread, to prevent excessive data copying over the push buffer (it's a bit silly to move thousands of instance matrices over the push buffer, especially if the matrices are mostly static).

15 Aug 2012

I'm currently playing around with some ideas for a new rendering pipeline in Nebula3, strictly experimental stuff, based on the experience so far with Drakensang Online. I'll call it Twiggy because it's so damn slim :)

The driving force is that my focus has moved away from consoles towards more "interesting" and especially more open platforms in the past 2 years, namely the web, mobile, and desktop platforms with some sort of built-in app-shop (or good old Steam for that matter).

The line of thought towards the "grand vision" is basically:

Selling software in stores on clunky silver disks is sooo last century...

Downloading and installing a complete game for half an hour or more before being able to play is not much better (I tend to forget about half of the demos I start downloading on my 360, until I delete them at some later time to make room for new demos which I will also never play, ...)

Regardless of platform (web, mobile, or app-shops on desktop platforms) the experience should be true "click and play": click a button, and start playing a few seconds later, fast enough that the player isn't tempted to start another activity in the meantime.

This means a small initial download of a few megabytes (5..20 MB or so), and from then on it's all about:

How much new data can we present to the user per second?

Let's be optimistic and assume somewhere between 2 and 12 Mbit/sec for the civilized areas of the world - roughly 0.25 to 1.5 MB of fresh asset data per second...

This means the asset data which is streamed into the game still must be small. It's no longer about how much data fits on a DVD or on Blu-ray; we can store as much data as we want on web servers. It's all about how fast the user eats through the data while playing the game, and how this compares to the available bandwidth. Which is why data compression, sharing and reuse are so important today, and why the size of a Blu-ray disc, which Sony hyped so much a few years back, has become irrelevant.

It's a good thing if the game world is built from many small, similar building blocks which can be recombined as much as possible (of course without having an obviously repetitive game world). This is something which works very, very well in Drakensang Online.

With all those thoughts and the experience from the current rendering architecture (what works well and what doesn't), here are some starting points of what Twiggy will be:

It will initially be built on OpenGL instead of D3D, since the majority of target platforms have some flavour of OpenGL as their rendering API.

The feature base-line will be OpenGL ES 2.0

The performance base-line will probably be an iPad2

BUT: it should scale up with additional features, and especially additional fillrate on more powerful target platforms (up to desktop GPUs)

GPU performance is much more important than CPU performance. Some potential target platforms (like Adobe Alchemy 2) will cross-compile C++ to some VM byte code, while providing full access to the GPU's power. It's important that a game runs smooth even with such a limited "software CPU".

Optimized for rendering scenes built from an extremely high number of simple 3D objects (that's where the data size savings mainly come from)

Ditch the "fat render thread" in favour of a very slim render thread behind a simple push buffer. The fat-render-thread design in Nebula3 works, but it is complex. It needs to run in lock-step with the main-thread, needs to transfer data back to the main-thread, and has a lot of redundant code on the render-thread and main-thread side. That's a bit too much hassle for the advantages this design provides.

More orthogonality throughout the render pipeline: for instance hardware-instancing should be usable with less restrictions (currently, only simple, static 3D objects can be hardware-instanced), make skinned characters and "normal" 3D objects more alike and share more features between them (e.g. using the powerful character animation system for everything, not just characters).

The way there is long, since this is a weekend project, and I will happily throw away and rewrite large chunks of code if they don't feel right. The first step will be the new CoreGraphics2 and Resource2 subsystems, which will implement the new slim render thread, and a resource management system which is more powerful than the current one, yet with much less and simpler code. More on that later.