18 Feb 2007

Game applications often use file archives to reduce file system clutter and improve performance when many small files must be opened and read by the application. While Nebula2 used a proprietary archive format (NPK), Nebula3 uses standard Zip files. This has a number of advantages:

no self-written tools required to create the archives, just use the zipper of your choice

simple file encryption supported

smaller disc footprint

usually higher read performance because disc bandwidth is often the bottleneck, not decompression speed

The current implementation has a few disadvantages though:

no write support (not a big deal, NPK's didn't support writing either, and game resources are usually read-only anyway)

No random access (no seeks). This is a bit more critical, and could be solved with a more advanced implementation. Currently this is circumvented by decompressing the entire contents of an in-zip file into memory and allowing seeks on this in-memory copy. This approach basically rules out all types of file-streaming scenarios (especially streaming audio).

Accessing the content of Zip archives is completely transparent to the application once a Zip archive is mounted through the IO::Server::MountZipArchive() method. The IO::Server::CreateStream() method will check whether a URI actually is a file in one of the mounted Zip archives, and return a ZipFileStream object instead of a FileStream object if needed. The application just uses the returned Stream object as usual and doesn't need to care whether it's working with a "real" file, or a compressed file in a zip archive.
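The transparent switch between stream types can be illustrated with a small model. This is only a sketch under stated assumptions: the classes below are simplified stand-ins that borrow the names from the text, the archive is modeled as a plain set of paths, and none of this is the actual Nebula3 code.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

namespace zipsketch {

// Hypothetical stand-ins for the stream classes named in the text.
struct Stream {
    virtual ~Stream() = default;
    virtual std::string ClassName() const = 0;
};
struct FileStream : Stream {
    std::string ClassName() const override { return "FileStream"; }
};
struct ZipFileStream : Stream {
    std::string ClassName() const override { return "ZipFileStream"; }
};

// Simplified model of the IO server: it only remembers which paths
// live inside mounted archives (assumption: no real zip parsing here).
class IoServer {
public:
    // Pretend to mount an archive by registering the paths it contains.
    void MountZipArchive(const std::set<std::string>& containedPaths) {
        zipEntries.insert(containedPaths.begin(), containedPaths.end());
    }

    // Return a ZipFileStream if the path is inside a mounted archive,
    // otherwise fall back to a plain FileStream.
    std::unique_ptr<Stream> CreateStream(const std::string& path) const {
        assert(!path.empty());
        if (zipEntries.count(path) > 0) {
            return std::unique_ptr<Stream>(new ZipFileStream());
        }
        return std::unique_ptr<Stream>(new FileStream());
    }

private:
    std::set<std::string> zipEntries;
};

} // namespace zipsketch
```

The caller only ever sees the Stream base interface, so code that reads from the returned object looks the same in both cases.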

These are the URI schemes and associated stream classes that Nebula3 provides out of the box for now. Stream objects are usually created through the IO::Server::CreateStream() method, which takes a URI and returns a matching Stream object:

An application may associate its own stream classes with URI schemes using the IO::Server::RegisterUriScheme() method.
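One way to picture the scheme-to-class association is a table of factory functions keyed by scheme name. The following is a minimal sketch, not the real IO::Server: the SchemeRegistry class, the MakeFactory helper, the HttpStream class name and the method signatures are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

namespace schemesketch {

struct Stream {
    virtual ~Stream() = default;
    virtual std::string ClassName() const = 0;
};
struct FileStream : Stream {
    std::string ClassName() const override { return "FileStream"; }
};
struct HttpStream : Stream {  // hypothetical class name
    std::string ClassName() const override { return "HttpStream"; }
};

// Hypothetical registry mapping a URI scheme to a stream factory.
class SchemeRegistry {
public:
    typedef std::function<std::unique_ptr<Stream>()> Factory;

    // Associate a scheme such as "file" or "http" with a factory.
    void RegisterUriScheme(const std::string& scheme, Factory factory) {
        assert(factory);
        factories[scheme] = factory;
    }

    // Extract the scheme in front of "://" and create the matching stream;
    // returns a null pointer for malformed URIs or unknown schemes.
    std::unique_ptr<Stream> CreateStream(const std::string& uri) const {
        const std::string::size_type pos = uri.find("://");
        if (pos == std::string::npos) return nullptr;
        auto it = factories.find(uri.substr(0, pos));
        if (it == factories.end()) return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Factory> factories;
};

// Builds a factory for any default-constructible stream class.
template <class T>
SchemeRegistry::Factory MakeFactory() {
    return [] { return std::unique_ptr<Stream>(new T()); };
}

} // namespace schemesketch
```

With such a table, adding support for a new scheme is a single registration call and does not touch the code that opens streams.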

Streams only provide generic Read() and Write() methods to read and write chunks of memory. To access stream data in a more convenient way, StreamReader and StreamWriter objects are attached to the stream. Readers and writers provide specialized interfaces for reading or writing specific types of data. The Nebula3 IO subsystem comes with the following reader and writer classes:

BinaryReader/Writer: read/write a stream of typed data elements in binary form

Other Nebula3 subsystems may provide their own derived StreamReader and StreamWriter classes. For instance, the Messaging subsystem provides a MessageReader and MessageWriter class for serializing message objects.
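The split between raw streams and typed readers and writers might look roughly like this. It is a sketch under assumptions: MemoryStream is a hypothetical in-memory stream, the method names WriteInt, ReadFloat and so on are invented for the example, and values are stored in host byte order.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

namespace rwsketch {

// Hypothetical stream that only knows how to move raw bytes.
class MemoryStream {
public:
    void Write(const void* data, std::size_t numBytes) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        buffer.insert(buffer.end(), p, p + numBytes);
    }
    // Copies up to numBytes into out and returns the number of bytes read.
    std::size_t Read(void* out, std::size_t numBytes) {
        const std::size_t avail = buffer.size() - readPos;
        const std::size_t n = numBytes < avail ? numBytes : avail;
        if (n > 0) std::memcpy(out, buffer.data() + readPos, n);
        readPos += n;
        return n;
    }
private:
    std::vector<unsigned char> buffer;
    std::size_t readPos = 0;
};

// Writer attached to a stream: adds a typed interface on top of Write().
class BinaryWriter {
public:
    explicit BinaryWriter(MemoryStream& s) : stream(s) {}
    void WriteInt(int v) { stream.Write(&v, sizeof(v)); }
    void WriteFloat(float v) { stream.Write(&v, sizeof(v)); }
private:
    MemoryStream& stream;
};

// Reader attached to a stream: adds a typed interface on top of Read().
class BinaryReader {
public:
    explicit BinaryReader(MemoryStream& s) : stream(s) {}
    int ReadInt() { int v = 0; stream.Read(&v, sizeof(v)); return v; }
    float ReadFloat() { float v = 0.0f; stream.Read(&v, sizeof(v)); return v; }
private:
    MemoryStream& stream;
};

} // namespace rwsketch
```

Because the typed logic lives in the reader and writer, the same BinaryReader could sit on top of any stream class that offers Read().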

Assigns have been in use since Nebula1 (the concept and name actually stem from the original Amiga OS back in the 80's). They are aliases (or shortcuts) for file system locations, and are very handy for abstracting the actual location of files in a Nebula application.

For instance, instead of "C:/Program Files/MyApp/export/textures/mycategory/mytexture.dds", you would simply use "textures:mycategory/mytexture.dds". The assign "textures:" would be defined as "C:/Program Files/MyApp/export/textures".

Assigns are especially useful to describe locations like the application's installation directory or the user directory of the currently logged in user. For this reason, Nebula3 defines a few standard assigns which are initialized at application startup:

home: this points to the application's install directory

bin: this points to the directory where the application executable resides

temp: this points to a scratch directory which is guaranteed to be readable and writable for the current user

user: this points to the user directory; on an English-language Windows this is for example "My Files/CompanyName/AppName". The user: directory is guaranteed to be writable (unlike the installation directory), and this is the place where config data or save game files should reside.

In Nebula3, Assigns have been extended to work with URIs. For instance, the "textures:" assign could be defined as "http://www.radonlabs.de/myapp/textures", which would automatically cause textures to be loaded from a http server instead of the local file system. Assigns can be nested, so for instance the "textures:" assign could also be set to "home:export/textures", which would automatically resolve to the location "export/textures" in the installation directory.
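Nested assign resolution can be sketched as a loop that keeps replacing the leading "name:" prefix until no known assign is left. The AssignTable class and its method names below are hypothetical, and the depth limit is an assumption added to guard against cyclic definitions.

```cpp
#include <map>
#include <string>

namespace assignsketch {

// Hypothetical assign table; not the actual Nebula3 interface.
class AssignTable {
public:
    void SetAssign(const std::string& name, const std::string& path) {
        assigns[name] = path;
    }

    // Replace a leading "name:" with its definition, repeating so that
    // nested assigns such as "textures:" -> "home:export/textures" resolve.
    std::string Resolve(std::string path) const {
        for (int depth = 0; depth < 16; ++depth) {  // guard against cycles
            const std::string::size_type colon = path.find(':');
            if (colon == std::string::npos) break;
            auto it = assigns.find(path.substr(0, colon));
            // Unknown prefixes (a drive letter, "http", ...) end the loop.
            if (it == assigns.end()) break;
            path = it->second + "/" + path.substr(colon + 1);
        }
        return path;
    }

private:
    std::map<std::string, std::string> assigns;
};

} // namespace assignsketch
```

Since "http" or a drive letter is not a registered assign, a definition such as "http://www.radonlabs.de/myapp/textures" passes through unchanged once the "textures:" prefix has been replaced.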

11 Feb 2007

In Nebula2, platform abstraction was done through subclassing and virtual methods. For instance, the nGfxServer2 class implemented the platform-independent interface of the graphics server, and a specific subclass (for instance nD3D9Server) implemented the Direct3D9 version of the graphics server, overriding the virtual methods of the base class. Client code would then talk to the nGfxServer2 class interface and didn't need to care whether rendering was done through Direct3D or some other rendering API.

Depending on the host platform, the performance of virtual method calls ranges from slightly bad to very bad, because the additional memory lookup may thrash the cache, flush the instruction pipeline, defeat branch prediction, etc. Virtual method calls are still the fastest way to get runtime polymorphism, but that's usually not necessary for platform abstraction, where "compile-time polymorphism" is often enough.

Nebula3 uses typedef-ing for platform abstraction, and thus eliminates most virtual methods and even enables inlining for frequently called methods without sacrificing platform-independent code.

This is done by first writing a base class which defines the class interface. The base class usually doesn't have virtual methods (except the destructor, since the base class is usually derived from Core::RefCounted). From the base class, a platform specific class is derived, overriding most or all of the methods defined in the base class with platform specific code. Finally, the platform specific class is typedef-ed to the proper platform-independent class name, which is then used by the client code.

Here's an example: let's say we want to implement the RenderDevice class in the CoreGraphics namespace. First, the class interface is defined in the base class:

Client code just works with the RenderDevice class, and is completely unaware that it is actually using the D3D9RenderDevice or D3D10RenderDevice classes. All platform specific stuff is resolved at compile time, and all calls into the RenderDevice are normal method calls, not virtual method calls. This also lets the compiler do a much better optimization job (for instance inlining methods, better link-time code generation, and so on).
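The whole pattern, from base class through platform class to typedef, can be condensed into one sketch. The namespace names other than CoreGraphics, the method names, and the SKETCH_USE_D3D10 macro are assumptions made for illustration, and the Core::RefCounted base with its virtual destructor is left out to keep the example self-contained.

```cpp
#include <string>

namespace Base {
// Platform-independent interface; note that no method is virtual.
class RenderDeviceBase {
public:
    bool Open() { isOpen = true; return true; }
    void Close() { isOpen = false; }
    bool IsOpen() const { return isOpen; }
    std::string BackendName() const { return "none"; }
protected:
    bool isOpen = false;
};
} // namespace Base

namespace Direct3D9 {
// Platform class: redefines (hides) base methods with specific code.
class D3D9RenderDevice : public Base::RenderDeviceBase {
public:
    std::string BackendName() const { return "Direct3D9"; }
};
} // namespace Direct3D9

namespace Direct3D10 {
class D3D10RenderDevice : public Base::RenderDeviceBase {
public:
    std::string BackendName() const { return "Direct3D10"; }
};
} // namespace Direct3D10

namespace CoreGraphics {
// The typedef selects the implementation at compile time.
// SKETCH_USE_D3D10 is a hypothetical configuration macro.
#if defined(SKETCH_USE_D3D10)
typedef Direct3D10::D3D10RenderDevice RenderDevice;
#else
typedef Direct3D9::D3D9RenderDevice RenderDevice;
#endif
} // namespace CoreGraphics

// Client code only ever names CoreGraphics::RenderDevice.
inline std::string DescribeDevice() {
    CoreGraphics::RenderDevice device;
    device.Open();
    const std::string name = device.BackendName();
    device.Close();
    return name;
}
```

Every call in DescribeDevice() binds statically to the selected class, so the compiler is free to inline it.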

3 Feb 2007

I've been trying to wrap my mind around the PS3 hardware specs for some time now, and somehow I don't get the point.

In the beginning, when there was just the Cell in the game, it all made sense. The Cell was able to achieve an impressive vertex processing rate, at least compared to a traditional CPU. So Sony would use the Cell to implement a very flexible vertex processing pipeline, and connect it to a relatively simple graphics chip, which would basically just be a rasterizer... this could result in a relatively simple and cheap system, right? But then, surprisingly late, Sony announced that the PS3 would contain a traditional nvidia-made GPU and everything suddenly made less sense. Now there was a powerful and power-hungry CPU able to process a lot of vertex data per second, and another powerful and power-hungry GPU, also able to process a lot of vertex data per second. That's basically the main reason why I don't get it. Why are there two completely different and completely separate vector stream processors in the PS3 which must be programmed in completely different ways? Of course there must be a secret masterplan behind all this, soon to be revealed to us mere mortals. Or could it be that the Cell couldn't match a modern GPU in terms of vector processing power and Sony had to fall back to an emergency plan?

The PS3 is basically the Cell CPU, connected to 256 MB main RAM at roughly 25 GB/s, and the RSX GPU connected to 256 MB of video RAM at 22 GB/s bandwidth and connected to main RAM at 15..20 GB/s bandwidth, that's at least what the publicly available specs say. There's not much information on how fast the Cell can read from and write to GPU RAM. According to this, write speed is about 4 GB/s, while read speed is 16 *MB*/s. Now the latter is not as dramatically bad as the Inquirer article implies. Reading from video RAM is generally a bad idea on any architecture, because it stalls the GPU. But the 4 GB/s write speed isn't something to write home about either.

CPU-to-System-RAM bandwidth is pretty good compared to a modern PC, GPU-to-Video-RAM bandwidth seems to be about on par with nvidia 6800 reference boards.

The Cell runs at 3.2 GHz and has one general PowerPC core, and 7 specialized stream processing units. Just as with the Xbox360, the 3.2 GHz is slightly misleading. The Xbox360 and Cell cores are stripped down PowerPC cores and lack an out-of-order instruction scheduler, which seems to yield a real-world performance comparable to a 1.8 GHz P4 (that's hearsay information though, but sounds reasonable). The Xbox360 makes up for this by having 3 identical cores. But on the Cell, there's only one of those cores.

So without the 7 additional stream processors of the Cell, the PS3 is basically comparable to a PC from 2001 (albeit with very good memory bandwidth) and a 2004/2005 era GPU. Graphics-wise, that's ok. Consoles have the advantage of hardware that is set in stone, and a very thin software layer, so that on a console, programmers can play some tricks that are impossible on the PC with its many hardware and software configurations.

What concerns me is the weak general processing power of the Cell. By raw numbers, the Cell is a vector processing monster thanks to its 7 (on PS3) specialized stream processing units (SPEs). But if one starts to look at the details, the Cell looks more like a solution in search of a problem. Generally, an SPE should be capable of processing roughly 51 GB/s (3.2 GHz x 128 bit, assuming one vector operation per cycle). According to this, this is exactly the bandwidth to the local memory of an SPE. The bandwidth to the interconnect bus for a single SPE is 25 GB/s, and the Cell seems to be optimized especially for interchanging data between SPEs. There only seems to be a single channel to main memory with 25 GB/s. So the optimal scenario seems to be to chain the SPEs into a pipeline, and to stream a single data set through that pipeline. Feeding all SPEs in parallel from main memory would by far saturate the single 25 GB/s data channel. The second problem is the small local memory per SPE. SPEs don't have conventional caches to main memory, but 256 KB of embedded memory for both code and data. It looks like data transfers in and out of the SPE must be handled manually through a DMA engine. 256 KB is really not much. Assuming 64 bytes per vertex in a 3d model, that's just enough room for about 4000 vertices. And then the code must also fit somewhere into those 256 KB.
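The back-of-the-envelope numbers above can be written down as compile-time constants. This is just the arithmetic from the paragraph under its stated assumptions (one 128-bit vector operation per cycle, 64 bytes per vertex, a single 25 GB/s channel to main memory); the constant names are made up for the example.

```cpp
#include <cstdint>

namespace cellmath {

// Peak SPE throughput: 3.2 GHz x 128 bit (16 bytes) per cycle.
constexpr double kClockHz = 3.2e9;
constexpr double kVectorBytes = 128.0 / 8.0;
constexpr double kSpeBytesPerSecond = kClockHz * kVectorBytes;  // about 51.2 GB/s

// Local store capacity expressed in vertices of 64 bytes each,
// ignoring the space the code itself needs.
constexpr std::uint32_t kLocalStoreBytes = 256 * 1024;
constexpr std::uint32_t kBytesPerVertex = 64;
constexpr std::uint32_t kMaxVertices = kLocalStoreBytes / kBytesPerVertex;  // 4096

// How far 7 SPEs running at peak would oversubscribe the main memory channel.
constexpr int kSpeCount = 7;
constexpr double kMainMemoryBytesPerSecond = 25.0e9;
constexpr double kOversubscription =
    kSpeCount * kSpeBytesPerSecond / kMainMemoryBytesPerSecond;  // about 14x

} // namespace cellmath
```

The factor of roughly 14 shows why feeding every SPE straight from main memory cannot work at peak rate, and why chaining the SPEs into a pipeline looks like the intended usage.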

Now let's quickly compare the Cell to a modern graphics chip. A GPU is basically made of lots and lots of little shader units, each of which is comparable to an SPE. In terms of flexibility, a modern GPU shader isn't that much different from an SPE: both can be programmed in a C-like high-level programming language and can do branches and loops. A GPU has many more shader units than a Cell has SPEs (40..60 shader units compared to 7..8 SPEs on a Cell); however, a GPU is clocked much lower than a Cell (~0.5 GHz compared to ~3 GHz). The big difference is in the programming model. While the Cell exposes all of its internal complexity to the programmer, a GPU just looks like a very simple linear stream processor from the outside. A GPU consumes a single dataset, and all the complex parallelization happens inside the GPU, completely hidden from the programmer. From a programmer's point of view it doesn't make a difference whether there's only one shader unit in the GPU, or whether there are hundreds of them. Compare that to all the hoopla that's needed on the Cell. Data must be shuffled in and out of every SPE via DMA, there's an internal size limit on the data that can be processed at once, there are all types of bandwidth limitations to think of, and so on and on... Why oh why didn't the Cell designers take some inspiration from the graphics guys?

So what to do? There's only so much need for stream processing in a typical game, and that's already mostly handled by the GPU. One could build a pretty flexible and fast vertex processing pipeline with the Cell, but that's quite pointless, because the RSX already does that; GPUs have been specialized for this type of work for years. One could try to "prepare" vertices for the GPU, especially stuff that can't be done on the GPU, maybe some advanced dynamic LOD system, algorithmic geometry generation, etc... But then one also needs to continuously feed the vertex data to the GPU, and from the available bandwidth it looks like the Cell could hardly keep the GPU busy. And frankly, IMHO there's not much need anymore for doing any per-vertex stuff on the CPU. For years, graphics hardware has been pushed in the direction of moving vertex processing off the CPU and onto the GPU.

So what else to keep the SPEs busy? Coding and decoding of several video and audio streams in parallel would be a perfect task. But on a game console?

The only area which comes to mind where the Cell could shine in a game context is probably physics. Physics needs a lot of floating point processing power, and doesn't have to interact with the rest of the game code too much (basically writing updates from the game world into the physics world, evaluating the physics interactions, and reading back the results into the game world). So maybe we will see some great physics in PS3 games, but is there really a need for it? Look how successful the Ageia physics accelerator has been...

Somehow, the hype surrounding the Cell reminds me of the Itanium hype of the 90's. Today, we were all supposed to be sitting in front of cheap Itanium workstations with processing power unheard of in mere PCs. The Itanium boasted very impressive theoretical peak-performance numbers. The problem was that the Itanium relied on heavy parallelization on the instruction level, but most software just didn't have enough of that instruction-level parallelism to keep all execution units of the Itanium busy. The Cell could suffer from the same problem, just on a different level. There may just not be enough need for vector processing in games besides 3d graphics, which is already better handled by the GPU.