April 2nd, 2012

as3corelib is the defacto library for all common utility functions in ActionScript 3. While it’s a reasonably good codebase, it’s also unforgivably slow sometimes, which led to the creation of actionjson. Its SHA-1 implementation leaves a lot to be desired, and this article takes us through the steps I took to write new one, to help understand what makes AS3 slow.

Understanding SHA-1

First we need to understand how SHA-1 works. SHA-1 takes a series of bytes, and processes them in chunks of 64 bytes at a time. There’s some extra padding and data added to the input as well. Lots of operations take place during each chunk that ensure even the most insignificant changes have a radical butterfly effect on some variables that are returned in the form of a SHA-1 hash.

Here’s some pseudocode.

add some extra data to the end of the input
set the initial sha-1 values

for each 64-byte chunk do
extend the chunk to 320 bytes of data

perform first set of operations on chunk (x20)
perform second set of operations on chunk (x20)
perform third set of operations on chunk (x20)
perform fourth set of operations on chunk (x20)
end

return sha-1 values as a hash

You don’t need to fully understand it, but just get the general idea. Get chunk, process chunk, move onto next chunk.

Round One

This code isn’t half bad. Everything is reasonably easy to understand and modify. This kind of code is usually how I like to do things at first, before I start optimizing. It’s pretty slow though, roughly a third of the speed of SHA1.hash. If you’re familiar with AS3 optimization you can probably spot a lot of straightforward optimizations.

Reuse the w array, inline the bitshifts

Much better. That made a huge difference and now it’s roughly on par with SHA1.hash. The bitshift function has been inlined, and the w array is being reused instead of recreated during each iteration. Reusing the w array barely has an impact on speed though, it’s the inlining of the bitshift function that accounts for the 3x speed boost. It’s still good to be careful with memory regardless, since passing off work onto the GC isn’t good for performance and is more difficult to profile.

Unroll those loops

There’s not much left to inline that’ll make a big impact, so how about unrolling the loops? I wrote a small Python script to generate the code in the loops, so I don’t have to write it out by hand. Any changes to those lines of code will come from that script from now on. Overall though, this doesn’t do much.

Turn the w array into local variables

That w array has a constant size, so why not just convert each entry to a local variables? It gives us an impressive speed boost. In case you’re wondering why I didn’t convert w to a Vector I would prefer to stay compatible with Flash 9. Vectors are also not as fast as local variables, they only made it 2-3 times faster while local variables made it 3-6 times faster.

Lesson Learned: Array accesses are much more expensive then local variables. Even Vectors (which are typed and pretty fast when fixed) can’t compete with local variables.

Process a ByteArray instead of a String

This trick I learned while making actionjson. Strings are generally slow, so putting data in a ByteArray and processing from there can often be orders of magnitude faster. sha1 gets a nice boost out of this.

Lesson Learned: ByteArrays are faster than Strings for data processing.

Reuse w variables

Looking carefully at the SHA-1 algorithm, it only needs the last 16 w variables. For example if it’s setting the value of w76, the value of w60 is needed, but w59 is not used and will never be used again. Since this pattern repeats itself, the w variables can be reused, and we can reduce the number of w variables from 80 to 16. This doesn’t help much though.

Lesson Learned: Flash is pretty good at optimizing local variables already.

Reuse result of an operation, rather then getting it from the local variable later

Stop shifting around the values stored in a, b, c, d, and e

This one is hard to explain, but basically each time the a-e variables are changed, it’s mostly data being moved around, while only two variables are actually changing. By changing which variables get modified and used after each iteration, we can only modify the ones that need to be changed. This just removes a lot of code that shifts around values (e.g. c = b, a = b). This has surprisingly little impact.

Lesson Learned: Flash is still pretty good at optimizing local variables.

Some misc improvements

I’ve been focusing on the inner SHA-1 loop, since that’s where most of the work is done, but let’s try improving some of the code outside of it. Removing an extra variable doesn’t help much. There is a small boost from inlining and optimizing intToHex. There’s not much of an impact on the “long string” test, since this reduces overhead, which the “long string” test has less of.

Reduce unnecessary conversions

Now let’s get even deeper. Using Apparat‘s dump tool to examine the raw ABC, and you may notice these everywhere…

PushDouble(4.023233417E9)
ConvertUInt()

PushInt(271733878)
ConvertUInt()

What? There are no floats and ints in this code. 4.023233417E9 is 0xEFCDAB89, one of the constants. The other number is one of the uints used. Shouldn’t both these variables be a uint already and not need conversion? It appears that Flash is encoding uints as ints and floats, then converting them to uints. Weird. But since ints/uints in Flash are stored with two’s complement encoding all the uints can be converted to their int equivalent and the operations needed for SHA-1 will behave exactly the same. This dramatically reduces the ConvertUInt’s and ConvertInt’s in our code, and appears to have a pretty big impact. There’s still plenty of ConvertInt’s left, but I’m not sure how to get rid of them all.

Lesson Learned: SWFs have a uint number pool, but may not use it. This can impact uint performance.

Reuse the ByteArray instead of creating a new one

The only thing preventing this code from using a constant amount of memory is that ByteArray. Let’s reuse it instead, so a ByteArray isn’t created each time. This helps GC performance, and also lowers the overhead of our parser.

Lesson Learned: Not creating new objects is almost always a good idea.

Grasping at straws.

Here’s where I start running out of ideas. These changes provide basically no improvement.

Final results

I did all my testing on on Linux with Flash 11.1. I spent enough time writing this article and the SHA-1 implementation that Adobe released a new version of the Flash player (11.2). Here are the final results.

TL;DR

February 6th, 2012

Adobe added native support for JSON in Flash 11, which was released a few months ago. I’ve added a new argument to the blocking JSON functions (decodeJson and encodeJson) that will use native JSON if it is available.

Basically, this allows anyone who wants to get a free speed boost among Flash 11 users, while still staying compatible (and fast) amongst users still using Flash 9 and 10. Read the documentation in encodeJson.as and decodeJson.as for more information on compatibility differences.

As much as I hate to say it, projects targeting Flash 11+ should not use blocking JSON functions. While they are still very fast, AS3 can’t compete with native code, and libraries like actionjson should only be used when necessary. There’s still no equivalent to the asynchronous JSON encoder and decoder, so they’re still useful, although this will also likely change with the release of Actionscript Workers.

January 13th, 2012

As I recently found out, Adobe is dropping support for fast memory opcodes in Flash 11.2, making projects that use tools like Apparat, haXe and Alchemy 0.3 potentially break in upcoming versions of Flash. This is the first time I’ve ever seen Adobe break compatibility intentionally for non-security reasons. It’s a pretty messed up thing to do, considering that many projects use these those opcodes.

For the pre-compiled version of actionjson (actionjson.swc), I used Apparat to provide a small boost in performance. So, if you downloaded it before now, you need to download it again or risk projects breaking in Flash 11.2. I’m very sorry that this happened, this is an unprecedented move, seeing how important compatibility has always been to Flash.

People who used the uncompiled actionjson files are unaffected by this problem, since the master branch of actionjson does not use Apparat. I would recommend making sure they are up to date, since there was a minor bugfix a few months back.

November 23rd, 2011

I’ve been working on a game engine recently, and here are some of my experiences and lessons learned. Despite the title, there are many ways to approach this problem, and this is just the one I took.

So, what’s massively cross-platform? It’s a rejection of the ideology of picking a single toolkit or environment (Flash, Unity, XNA, iOS, Android, etc) to base code in. It’s about making the game itself the model in MVC programming with the controller and view being handled by whatever environment I’m porting it to. Many of these toolkits are cross-platform, but sometimes they have poor performance, limited functionality or don’t support many of the targets. I wanted to support everything and have it perform well across the board, which involves four major areas…

The Desktop. Linux, OSX, and Windows. The easiest to target, due to the ubiquity of free and open-source tools for these platforms.

Mobile. Android and iOS (maybe Windows Mobile 7). More limited in options, and wildly different in some ways, but the basic set of tools are readily available.

The Web. IE, Chrome, Opera, Firefox, Safari. The most unusual of the four targets, because of the limited choice of languages.

Consoles. The Xbox 360 (and/or XNA), PS3, and Wii. Excluding XNA, expensive to target. Still, there’s a lot of similarities between them and the desktop target. I haven’t gotten around to this part yet because it’s expensive, so it’s not covered here.

So, basically I want to write a game engine that can support 9+ wildly different platforms, and have it be pretty easy as well. Turns out it can be done.

Choosing a language (or the core environment)

So, at the core of this game, I want to write game code once that can be shared amongst the different ports. I also wanted the nice warm embrace of a quality scripting language, with minimal impact on speed. Here’s some of the options I went through until I found the right one.

Javascript

The web target is basically the hardest to target, since there’s really only Javascript, or Flash, which is also basically Javascript. I could go the Unity route as well, but a good web developer should avoid requiring plugins whenever possible. There’s also Java applets, but I’ve had lots of problems with applets in the past and they’re not particularly user friendly.

So, why not use Javascript itself and clear up the web target problems easily? I tried finding a portable Javascript runtime but had trouble. Rhino, the Javascript interpreter for Java, seemed plausible for Android. I could probably manage with V8 on the desktop. Initial research suggested I couldn’t use iOS’s Javascript interpreter easily, and V8 wouldn’t meet iOS’s code execution guidelines. This seemed like a minefield of potential problems, plus I had a huge bias, I don’t like Javascript over some of the other possible choices. I decided to look elsewhere first, but ended up never looking back.

PyPy / RPython

At this point I felt if I could get something to compile to C or LLVM bitcode I could make it work. I found a project called emscripten that converts LLVM bitcode to Javascript. Additionally, if this didn’t work there was always Alchemy, which does basically the same thing for Flash.

I started checking out PyPy, or more specifically, RPython. Python being my favorite language to code in, it might be perfect for the job. I could even get PyPy to generate C that seemed vaguely usable. PyPy however seemed to be made solely for creating binaries, not C code or llvm bitcode. Additionally, many cool Python features were not available in RPython, so there was just no way I was going to get the full Python experience. I moved on.

Ruby

Haskell

I tried getting ghc to generate LLVM bitcode, but this was consistently troublesome. It could also generate vanilla C, but this was also difficult. I tried getting ghc to use Alchemy’s tools directly, but they just never worked.

Then… Lua

To me, Lua was a toy language, something that non-programmers used to program. This isn’t true. It ended up being my final choice and proved itself to be a top tier programming language. I was impressed by it quickly, and was confident I could get it onto my desktop, mobile, and console targets with ease. Still, there was the web target, but I found ways around this problem, which I detail below.

Choices I didn’t investigate fully

Lisp. A solid lisp implementation could be easily ported everywhere. I think this would’ve been my choice had I not found Lua.

Javascript. I abandoned this choice pretty early. While I think Lua is a better language to work with for this kind of thing, Javascript still remains a valid possibility.

haXe. Created by Flash demigod Nicolas Cannasse, it could potentially be compiled to every target mentioned. It didn’t fit in well with the manner in which I wanted to develop this game though, and the C++ targeting didn’t seem mature enough, so I looked checked out other options first.

EDIT: playn. This was suggested in the comments, I never tried it out during this project. It does not currently support iOS and Console environments, and relies on Java, but it’s open-source and so it’s possible I could do that myself. Worth investigating.

Porting Lua to everything

Each platform usually had it’s own quirks and needs, so I had to figure out the best way to make Lua work on each of them.

Lua on the Desktop

There were no real problems here. I used Lua 5.1, and it just worked. Eventually I switched to luajit 2. Not because I needed the performance boost, which luajit did give me, but to familiarize myself with luajit’s much more complicated build process so I could use it in other targets. Both are fantastic pieces of software, but I would say only use luajit if speed is very important.

Lua on the Web

I first tried compiling Lua using Alchemy. Lua compiled easily, but some hastily made speed tests placed it at a few hundred operations per second, which is extremely low. I decided to try working with emscripten instead. It was also pretty easy, but my first live test of lua code running via the lua runtime via emscripten via a Javascript interpreter was also extremely slow (EDIT: This may have changed, emscripten now has emcc, a tool which may offer significantly better speeds than what I experienced). It seems obvious in retrospect, but I was hoping for the best. In the end it could barely manage 10 fps, even with rendering turned off.

I still stuck with Lua however, and wrote a Lua->Javascript source code translator called lua.js. This would avoid any speed problems due to Alchemy and emscripten. Javascript turned out to be a good host for translated Lua applications, approaching near-Javascript speeds.

Lua on Android

Originally I used standard Lua which compiled easily for Android. When performance was a problem, and improvements to the rendering had already been made I switched to luajit. Luajit 2 is in beta right now, and for unknown reasons crashed on Android with JIT turned on, but it can be turned off. There was a slight speed boost, but overall the rendering was still the problem so it may not have been necessary. I talk more about that below.

Lua on iOS

I didn’t waste any time here and went straight to luajit. Not much needs to be said about it, although the JIT compiler cannot be used on iOS because of Apple’s code execution guidlines. I have seen some suggestions that this is not true in certain cases, but if it didn’t seem necessary anyway.

Graphics

The easiest path here is to keep the art simple, at least at first, so I decided to make a 2D game. Generally speaking 3D games are more time-consuming and expensive as well. Knowing what I know now, it’s very possible that each target could handle a simple 3D game. For my own sanity though, I kept it 2D. Take a source image, draw it to the screen at a location. That’s it.

Drawing on the Desktop

I first went with SDL 1.2. It’s stable, wildly popular and portable, and also surprisingly slow. It turns out 1.2 is pretty much exclusively a software-rendering system with no vsync. The result was choppy animation that tears, and has a lower framerate than I’d like. I tried SFML, but found the API lacking, and for a while settled on Allegro 5.0.4. Allegro 5.0.4 has a lot of potential, but is rough around the edges, little niceties like the transition to fullscreen on OS X were missing.

I then decided on SDL 1.3, which is still being developed, but I haven’t had any problems. The core set of features I wanted all have worked flawlessly. It basically combined all the nice things about SDL and Allegro, with none of the bad things. Performance improved and the game looked smooth on all platforms.

Drawing on the Web

Originally, I figured Flash was the best option for this, since traditionally it’s been much faster to render in Flash. As I discovered, this changed with the advent of Canvas and HTML5, but I still wanted to support Flash for any users that might not have Canvas available. I tried several different drawing methods (copyPixels, using Bitmaps) but performance was worse than Canvas in every browser I tested, regardless of the method used. Compared to Canvas on Chrome, it was around 4x slower. With some extreme effort, I’m sure Flash could improve, but even still it didn’t think I could ever reach the dizzying highs of 60fps in Chrome. I eventually dropped the Flash target entirely, since it couldn’t meet my standards. I figured letting users play a poorly performing game would give them a bad impression, and soliciting them to upgrade their browsers was actually a better choice.

Drawing on Android

I first used Android’s Canvas, but it was way too slow. Apparently there’s hardware acceleration for Canvas in Android 3+ but I couldn’t see a performance difference when I tried to enable it, and I still wanted to support 2.x if possible. I then wrote my own OpenGL renderer, that mostly relied on glDrawTexfOES to draw images. It was much faster but still too slow.

I managed to find libgdx, and was immediately impressed. The fps doubled immediately compared to my more naive solution. libgdx is so good, I’d use it on the desktop targets if it didn’t require the user to have a Java VM installed.

Drawing on iOS

I was expecting this to be easy since iOS is popular and libgdx left me feeling positive about rendering libraries for mobile platforms, but all the choices on iOS either didn’t fit into my display model or weren’t free. Mostly both. I reluctantly wrote my own OpenGL renderer for iOS, but this time I learned a little bit more about what keeps performance high on mobile devices and relied on a method that used glBufferData and glDrawElements instead. The performance ended up being what I wanted, even on an iPhone 3G.

Audio

Like the art, I needed to keep audio simple. There are event sounds, which play once, and background sounds, which loop forever but can be stopped at any time.

Audio on the Desktop

Originally I planned to use whatever audio system was available with my display library, but after switching around I disabled sound in whatever library I was currently using and looked elsewhere. The first was libao, but it was prohibitively licensed. I investigated a couple alternatives, including PortAudio, until I eventually I found OpenAL. Despite a high learning curve it met all my needs, including some I didn’t think I had. It also favored pushing data over polling data (callback-based audio playback being pretty common), which was great since I wanted event sounds to be as responsive as possible.

OpenAL just plays sounds, it doesn’t decode them, so I embedded libogg and libvorbis, so I could play Ogg Vorbis files. Unlike other formats, using Vorbis doesn’t require me to pay a license. I eventually switched to stb_vorbis though, which is an entire Ogg Vorbis decoder in a single file, because it simplified my build process and appeared to be faster as well.

Audio on the Web

There’s only one real choice here, the HTML5 audio tag. This was also the most worrying, since delays in sound playback can’t really be controlled and I don’t have the option to seek an alternative. Overall though, it seemed to work great across all browsers.

Audio on Android

Audio on iOS

I had some performance issues here when I used AVAudioPlayer, so I wrote an OpenAL version instead. It was better overall, but the game still runs significantly slower during sound playback. This is actually an ongoing problem, so I’d say my next option would be to try a good sound playback library for iOS, since the selection seems a lot better than the rendering libraries for iOS.

November 16th, 2011

I’ve been toying with Lua a lot lately. Lua is in some ways, the ultimate scripting language. It’s simple, effective, and supports a wide range of environments. The only missing environment, in my opinion, is the web itself, so I wrote a tool to convert Lua to Javascript.

Time passed and I kept updating it and fixing bugs, eventually adding support for ActionScript, and finally rewriting the entire thing in Javascript itself. It’s still experimental at this point, but I’ve open-sourced the project and released it onto github.

October 23rd, 2011

When I was young, whenever me and my brother would play video games and either of us would lose, we’d immediately accuse the game itself of cheating. Most of the time this was just us expressing our frustrations, but sometimes game creators give their AI all-knowing powers, which can seem supernatural and unfair. Eventually, after playing a game long enough, I’d start to predict the AI’s behavior and play it against itself. No program is omniscient, because it can’t truly understand what I, the user, am thinking.

I’m noticing lately that a lot of the services I use online are starting to cheat me out of what I wanted. It is getting to the point where a good third of my Google searches return results for things I did not search for. They’re close, sure, but that extra keyword or seemingly misspelled word was intentional. It used to be that the “did you mean” link at the top was as far as Google would go to manipulating the keywords themselves, time passed and they started sending me straight there if my search yielded no results, then later they started sending me straight to the alternative results with a reverse “did you mean” link to my actual results. Now they oftentimes they don’t even let me know they’re selectively changing my results.

Still, I could always add a + in front of keywords. Like the AI in games I’d play, I’m manipulating the system to get what I want. This is exactly what Google shouldn’t want me to do, I’m not playing a game, I’m searching for something, I’m supposed to get exactly what I want with the least amount of work. Recently, in their ongoing fight to give me less relevant results, they “deprecated” the + symbol, encouraging the use of double quotes instead.

Facebook is possibly the most guilty of this problem. It is best expressed in this TED Talk by Eli Pariser. In the video he talks about how friends with differing opinions from him slowly disappeared from his news feed, even though he didn’t specifically unsubscribe from them. I’m aware of this effect, so I try to visit profiles of friends who I don’t want to disappear. Facebook’s ability to predict what I want to read is pretty good, but it is flawed enough that I wouldn’t want it to make decisions for me, but it does anyway.

I know why Google and Facebook do this. They are constantly testing the success of their services and constantly trying to improve these metrics. These results undoubtedly show things like, if a user misspells a word in a search they are more likely to find what they are looking for if Google automatically corrects their spelling. This comes at the expense of the minority, who intended to search for the misspelled word, or hear the opinions of people they don’t agree with.

Basically, I want services like this should stop making implicit assumptions about my explicit interactions. No matter how advanced their ability to predict my needs are, it can never be perfect. In the end I searched for what I searched for, I subscribed to friends that I want to hear from. Google could, for example, give me the option to disable these kinds of presumptions. Facebook could still prioritize friends who it thinks I am interested in, but limit this to a certain number of users (which are clearly identified) so I can skip past its predictions and start seeing news from all my friends instead of just some of them.

A predictive algorithm is ultimately only as good as the data it has, and there is never enough data. This problem should be assumed to exist, even when it seems like it doesn’t, and accounted for in the UI design of services like Google and Facebook.

October 6th, 2011

January 1st, 2011

Now available is the new apparatmemory branch that includes an updated version of decodeJson that has additional memory optimizations, making it faster than every library I’ve tested it against. It uses apparat to accomplish this.

Since it’s a bit more annoying to compile (although you can, if you set up apparat properly) I’m including a pre-compiled swc that I’ll be keeping updated alongside the branch. You can download it here.

December 6th, 2010

This one is largely a response to the impressive performance of the JSON decoder in blooddy_crypto. It was clearly better performing than my own, but I’ve made a large round of optimizations to keep up with the library. I was able to improve the performance over as3corelib a few more notches, reaching 8x faster on large objects. Admittedly, blooddy_crypto’s decoder performs better in some of my own tests (good work BlooDHounD!). My encoder is still much faster though.

Unlike blooddy_crypto, the source is provided, it’s written in Flash 9-compatible AS3, and it is pretty much bug free (blooddy_crypto doesn’t pass the barrage of tests in TestJson.as).

There’s also a new encoder, JsonEncoderAsync. It’s probably one of the more surreal pieces of code I’ve written, but it can work asynchronously and it’s pretty fast.

EDIT: Couple interesting notes. I tried using apparat again to gain access to the alchemy memory bytecodes. They still didn’t perform well. It’s surprising, but it seems like array access on a ByteArray in a local variable is still faster than those bytecodes (EDIT 2: turns out they are not, but the performance boost is pretty modest). Other ByteArray operations, like readInt, performed even worse than my already low expectations, so I removed them entirely.

Also, stack underflow errors (which came up once during testing) seem to be related to errors in catch statements outside of debug mode.

December 4th, 2010

So, this blog was located on a shared host, and one guy on the host had the wrong version of some software, so one thing leads to another. and a script kiddie to replaces the index.php of this blog. I’ve moved to another host (a vpn), and it looks like I’ve got everything working now. Sorry for any confusion, it turns out configuring wordpress on a temporary domain configures it permanently for that domain, leading to some interesting problems. I think it might be time for a change, maybe I should drop wordpress as well.

EDIT: Looks like my email has been down for a while, the MX record was invalid. It should be working but it will take a bit longer to propagate.