With the arrival of multicore CPUs for mobile devices, talk about threading the Cocos2D API has steadily risen, until it has reached a level where it seems to be consensus that without multithreading, we will never reach the Shangri-La of 2D rendering.

Let me briefly state my background on this. Before I became lead developer of Cocos2D-iphone, I was schooled in threaded programming, and have spent several years doing low-level threaded work: Windows drivers, PCI, serial, and threaded programming on e.g. high-end DSPs.

When venturing into multithreading, the first question you have to ask is: do I have code where the order of execution is irrelevant? If not, multithreading will not increase performance, only decrease it. This is described by Amdahl’s law, and while I do not fully agree with it, it still gives you a pretty good estimate of what to expect.

If your code can be executed in no particular order, the next step is performance. The overhead of synchronizing multithreaded code is quite considerable. While working for an American company building ticket scanners for horse racing, I spoke to some of their low-level network programmers, which is an area where threaded programming has been used a lot. They said that with a properly implemented first thread, their ultimate goal would be a 50% speed increase. A second thread was only expected to give a 10% gain. So in the best case, two cores are not 200% but 150%, and three cores are not 300% but 160%.

As for mobile game programming, performance has always been a key issue, so with an increased number of CPUs, it is natural to look into multithreading. This was also the case in the mid-2000s, when multicore CPUs became the standard in PCs. One of the companies exploring this was id Software, with the title Quake 4. They built a multithreaded render engine, which improved performance. The huge difference was that in the Quake 4 days, the CPU was the bottleneck.
This is not the case on mobile devices, where the GPU is severely hamstrung due to power consumption. It is also worth mentioning that id Software dropped multithreading again in Doom 3.

All in all, I do not think multithreading the Cocos2D API is worth it. On mobile devices the GPU will, for some time to come, be the limitation, so that is where I think we should look for performance improvements.

In the next release of cocos2d, I am planning to remove CCArray and replace it with NSArray (or NSMutableArray, actually), but for the sake of analogy, and since CCArray should really be named CCMutableArray, I will just call it NSArray in this article.

There has been a lot of controversy about the performance of NSArray, and claims that it is slow. “Use CCArray, my grandma always told me” seems to be the consensus. It lies so deep in the community that some even told me to expect a performance drop if switching to NSArray. Nothing could be further from the truth.

I will not turn this into a “how to set up a test”, but unlike others – and I will mention no names – let me try to flesh out the basics of how to test this.

First, I have created a load function making only one call: a function returning a random integer inside the range of array.count, as I need that value later for random access to the arrays.

for ( int i = 0; i < array.count; i++ ) {
    int randomPosition = [ self returnRandomIndex:array ];
    // test function goes here
}

The tests are performed on arrays with approximately 1000 objects. This is somewhat higher than what cocos2d will normally work with. Most arrays will be much smaller, but there seems to be little difference, unless you do linear searches of large arrays, which I will not cover, as that simply is bad design. The loops are repeated, with each run sampled. When the spread of the time samples drops below 1%, the test is stopped – usually after a few seconds. As the time of the basic loop is known, the actual time spent in the test function can be calculated accurately.

Every number is then put in relation to this load, so the mileage will not vary depending on your device. This is a normalised value: if a test has a load of 1.0, it takes the same time to execute as the load loop above. Hold it up against how much else you do in your update loops, and do the math yourself. There could be fluctuations for older devices sporting a much different hardware layout, but my guess is that the results will be consistent for anything running iOS 5 or better. For those who want to know, the load loop ticks in at around 250 ns per loop on an iPad 3 running iOS 6. The arc4random call in returnRandomIndex alone takes 200 ns. For now, this very light load will be our 100% load reference. For higher and more realistic loads … be patient, I will address that later.

Every number I present is rounded to 2–5% accuracy, as even this test setup probably cannot guarantee more. So I will not be writing “It took 138.1 ms”, as this is what we in Danish call “talknepperi” (no, Google has no clue what that is, but I bet you have an educated guess). I will be writing “It took 140 ms”.

Before I get to the raisin at the end of the hotdog, let me talk a bit about how cocos2d uses arrays. The fantastic four are:

1) Forward iteration

2) Backward iteration

3) Appending objects

4) Removing objects at a specific index

There will of course be exceptions to this, but this is basically what arrays do in cocos2d. I have not made estimates as to the balance of these, but it is clear that the number of times an object is iterated, compared to how many times objects are added and removed, could easily be a factor of 100, 1000, or much higher.

As for the timing results.

1)

For forward iteration, both NSArray and CCArray have a load overhead of less than 0.1 when switching from an integer loop – as in the load – to fast iteration. Basically this means that all that matters in loops is what is done inside the loop. But we all knew that.

In this case, it means that fast iteration was timed at 22 ns for NSArray and 24 ns for CCArray, compared to the 250 ns of the loop.

2)

For backwards iteration, NSArray spends a load of 2.8, and CCArray 2.5, meaning that CCArray is around 10% faster at backwards iteration.

This also equals the cost of getting an object from a random index. I will come back to that.
3)

For appending objects, CCArray comes out ahead, but only by roughly 0.5 load per added object.

Before we continue to 4: judging from these numbers, you would say that CCArray is the way to go, and the numbers certainly are against NSArray. However, in cocos2d, any object added will at some point be removed. So …

4)

For removing an object at an indexed position, NSArray spends 31 loads, while CCArray spends a whopping 150+ loads.

And yes, the loads are comparable, so the 0.5 load you earned when adding, you lose back 250-fold when removing.

The bottom line is that CCArray will be ~0.5 times an arc4random faster every time you add an object, but only if you plan to let your arrays grow forever. The fact that CCArray is slightly faster at backwards iteration is irrelevant, because the only reason you would ever want to iterate backwards is if you want to remove objects on the fly.

If you have any load just marginally more complex than what I used above, you can divide the importance of using XXArray by the magnitude of the complexity of your load. Meaning that at the end of the day, what kind of array you use is completely and utterly irrelevant.

The reason I am removing CCArray is not that cocos2d will become faster – that part will not be measurable. I am only removing it to make the cocos2d codebase cleaner and easier to access.

Headers. This finely tuned instrument – this Stradivarius of programming – this final frontier against the total apocalypse of Java – is starting to fall apart.

Let us face it: no one reads the manual first, except dorks – and those of us who occasionally get in doubt. That does not mean a manual is not a good idea. It just means that the manual should not be at the top of the package, or glued to our foreheads. You should not be forced to read the manual for the program selector each time you want to change the channel on your TV, or read the manual for Safari each time you want to read the news. That would quickly get pretty annoying.

Nonetheless, this is what is happening to one of the most powerful tools of C programming: the headers.

A good header is invaluable, and for that reason alone I will always despise Java. A good header tells you at a glance what a class contains, and what it can do. Naming conventions, order of appearance, and even subtle things like spacing or spacers will guide the reader through the class and present him with the essence of it all, right there at his fingertips.

Unfortunately somebody got the idea that if we inserted the documentation into the header file, he could write a small script and the documentation would be “for free”. What a load of … Not only are you without any kind of formatting when writing the documentation, you also ruin your header, and will have to spend countless hours trying to figure out why the auto-generated piece of cr.. does in no way look like what you had pictured in your head. The idea ranks amongst the worst in human history. It is that bad.

So do humanity a favor, and start writing the documentation as it is supposed to be. In a separate file.

Probably the biggest leap from cocos2d-v1 to cocos2d-v2 is understanding shaders, and as they are also the most powerful aspect of graphics programming, I thought I would briefly walk you through the absolute minimum required to get shaders working.

Everything displayed on the screen must run through a shader. The shader consists of two steps: a vertex shader (basically) calculating geometry (vertices), and a fragment shader calculating actual pixels (fragments). Nothing can be displayed on the screen without these two shaders present. As OpenGL is a state machine, it really only needs one set of shaders, but that would not be much fun, so cocos2d supports custom shaders for each and every node.

First, create a new cocos2d-v2 project, and add the following code at the end of HelloWorld’s init.
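The original listing is not reproduced here; below is a sketch of what it likely contained, reconstructed from the description that follows (a grey-scale fragment shader as one string of chars, followed by the 4 setup lines). The cocos2d-v2 names (CCGLProgram, kCCAttributeNamePosition, kCCVertexAttrib_Position, etc.) are from that API as I remember it, and the sprite variable, the u_texture sampler and the grey-scale weights are my assumptions – treat this as illustrative, not as the author’s exact code:

```objc
// Assumes a CCSprite *sprite already created in init.
const GLchar *vsh = "attribute vec4 a_position;                          \n\
                     attribute vec2 a_texCoord;                          \n\
                     uniform mat4 u_MVPMatrix;                           \n\
                     varying vec2 v_texCoord;                            \n\
                     void main( )                                        \n\
                     {                                                   \n\
                         gl_Position = u_MVPMatrix * a_position;         \n\
                         v_texCoord = a_texCoord;                        \n\
                     }                                                   \n";

const GLchar *fsh = "varying mediump vec2 v_texCoord;                    \n\
                     uniform sampler2D u_texture;                        \n\
                     void main( )                                        \n\
                     {                                                   \n\
                         mediump vec4 c = texture2D( u_texture, v_texCoord ); \n\
                         mediump float grey = dot( c.rgb, vec3( 0.3, 0.59, 0.11 ) ); \n\
                         gl_FragColor = vec4( grey, grey, grey, c.a );   \n\
                     }                                                   \n";

sprite.shaderProgram = [ [ CCGLProgram alloc ] initWithVertexShaderByteArray:vsh fragmentShaderByteArray:fsh ];
[ sprite.shaderProgram addAttribute:kCCAttributeNamePosition index:kCCVertexAttrib_Position ];
[ sprite.shaderProgram addAttribute:kCCAttributeNameTexCoord index:kCCVertexAttrib_TexCoords ];
[ sprite.shaderProgram link ];
[ sprite.shaderProgram updateUniforms ];
```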

Now, re-run the program, and watch the shader magic. The coconut is grey.

All the code in the beginning, with the \n\ line continuations, is the shader code, added as a single string of chars: first the vertex shader, then the fragment shader. This is the quick and dirty way to add shader code, but for large shaders it quickly becomes a mess. In that case the shaders are added as separate files, but for clarity I use the “simple” approach here.

The last 4 lines tell the shaders which data they will have to work with, and link them with OpenGL, so that they are usable. The data demonstrated here is the absolute minimum required to draw a textured quad: vertex information, texture coordinates, and which texture to use.

I will not go into detail about the shaders; there is plenty of information on the internet about this. A few things, though, will help you understand the shader code.

Attributes are data supplied to a vertex shader, one set per vertex. This could be vertex positions.

Varying data is data you want to share from your vertex shader with your fragment shader. Varyings are automatically interpolated across the primitive, so if assigned from an attribute, the fragment shader sees smoothly interpolated values.

Uniform data are constants (data which does not change within a draw call, like e.g. a constant colour), and can be used in both vertex and fragment shaders.

Furthermore there is an array of predefined variables, like gl_Position, which acts like a varying, or gl_FragColor, which assigns the actual colour to the fragment.

If you look at the last 4 lines again, notice that two of them set up the attributes needed for the shader, and one sets up the uniforms. As long as you use standard attributes and uniforms (normally colour, vertices, texture coordinates and texture), this is all you need.

The idea is to create a fully deformable terrain with a total of 5 parallax layers. With this many parallax layers, the final number of pixels rendered on the screen easily exceeds 10M pixels per frame on an iPad 3. The thing is, if you need a bit of freedom to move the camera around, the number of pixels drawn more than once almost explodes. I wanted to avoid that.

Furthermore, as the terrain is zoomable by a factor of 8, it puts quite a strain on the image sizes, as we of course did not want pixelation when zoomed all the way in. And even worse than pixelation: blending artefacts from scaled textures, causing dark outlines.

So the basic idea (to avoid drawing any pixel more than once) was to draw the entire game in only two triangles, drawing every pixel exactly one time. This of course needs a lot of textures for the shader – in this case, the maximum of 8:

1) Skybox.

2) Parallax 0

3) Parallax 1

4) Parallax 2

5) Object layer

6) Front layer

7) Terrain crust texture

8) Mask texture

The deformable terrain is made using Objective Chipmunk. The terrain mask is stored in a CGBitmapContext, allowing Chipmunk to scan the pixels to create the terrain. Deformation is done simply by drawing to the CGContext. Several colors are used: the red channel defines the actual terrain surface, and is drawn with a heavy blur. The blur is then used to create texture coordinates for the terrain crust. Once again, thanks to the guys behind Chipmunk for the inspiration for this. The crust is masked using the green channel. This is done programmatically, so that different types of crust can be applied. At this point, crust is added to terrain pieces under a certain slope.

Rendering the entire terrain in a single quad has the drawback that no matter where (and how much of) a texture is shown, it is drawn on the entire screen. After doing a bit of math on this, I realized that nearly 50% of the render time was spent mixing in pixels from textures that were not even visible.

So I dynamically broke the single terrain quad into horizontal strips, based on how many parallax layers cover each strip. The topmost strip will in most cases be only skybox, so that piece only needs to render the skybox texture.

Having 5 layers, this gives a total of 32 combinations. Even if some of these combinations will never be called, I created 32 shaders (basically creating one shader and then just commenting stuff out) and then use the correct shader based on the layer combination.

This nearly doubled the framerate, and iPad 2 and 3 run a solid 60 fps.

The last thing I wanted to avoid is the blending artefacts around scaled objects. Having had a close look at DreamWorks Dragons (which sports some really fancy parallax), I realized that they solved the problem by adding dark borders to close objects and disabling texture blending on far objects. This results in a less crisp look, and noticeable pixelation on far objects.

My approach of rendering the entire terrain in one quad (breaking it into strips doesn’t really change the basic idea) has the benefit that I have full control of how the rendering and the blending are done, for all layers, in a single shader. If I render front to back instead of back to front, I have the huge benefit that I can blend against the background, and not against some unknown layer behind. So by adjusting the background color, I can hit a point where the blending artefacts almost disappear. The video doesn’t really do it full justice, but you can see in some of the parts where I zoom in that the terrain is very crisp, and that there is no pixelation whatsoever.

I am working on some serious stuff, including a lot of parallax layers, and a deformable terrain. Screenies below.

Right now, I apply brute-force fragment shaders, rendering a total of 8 textures in one pass. This sounds efficient, but it really isn’t. While it runs 60 fps on iPad 2 and 3 (retina), and somewhat slower on iPad 1, the problem is that I render way too many transparent pixels. Even if parallax 2 is only 5% visible at the bottom, I render a nearly entire blank screen of parallax 2, because I am forced to render only one piece of geometry. While mix( ) is insanely fast, with 8 full screens of textures things will eventually start to slow down.

And then it struck me.

The parallax should be split into 5 quads, each being a band of overlapping parallax layers. The first quad will be pure skybox. The second quad will be skybox and the farthest parallax. The third quad will be etc etc etc. If you adjust your parallax textures to include the fewest possible transparent pixels, this will render it all with an absolute minimum of pixel operations.

You of course have to write 5 shaders instead of one, but that is a small price to pay.

All the images below are rendered in a single quad.

Enjoy the landscapes.

Notes:

The earth crust is made with inspiration from a demo I watched, and is fully deformable, meaning that if I dig a hole in the ground, the grass will disappear. Not by magic, but by the insane power of the shader.

Update:

I did some math, and calculated that I could save 35–80% of all texture look-ups (currently the shader does 25.2M look-ups per frame).

The final picture shows how I plan to implement it. The picture is a bit fuzzy, but so are my ideas on how to implement this.