TypedArray Tidbits

I’ve been working, off and on, on Retroputer, a faux “emulator” for hardware that never existed. The intent is to evoke the feel of older hardware without giving up some conveniences (like 24-bit color and absurdly limited register files).

My initial implementation wasn’t optimized at all, but over the last few sessions, I’ve been looking at the various places where performance could be improved. As it turns out, my use of TypedArray is one of those places, and my journey thus far has yielded some interesting observations. Most of this post will focus on one particular area: updating the canvas based on the emulator’s screen memory.

First, a quick note on the screen memory configuration:

There’s a single 64K page of memory devoted to 320 × 200 256-color graphics. Each color is sourced from a 256-entry palette, each entry consisting of red, green, and blue bytes. A fourth byte is ignored (essentially always 1.0 alpha).

There are four “tile” pages — essentially text-mode graphics. These pages use one byte to indicate the tile (character) to display, one byte to indicate the background color, and another byte to indicate the foreground color. Tiles consist of 64 256-color entries, resulting in an 8 × 8 character. This allows for full-color tiles as well as the more traditional text-mode visuals.

There are additional “layers” that are rendered, consisting of the border and background color.

Graphics and tile pages can be in any order when composited, allowing for easy transparency.

Everything is composited to a 320 × 200 RGBA buffer before being drawn to an HTML5 canvas.
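As a sketch of that final step (all of the names here are my own assumptions, not Retroputer’s actual code): the indexed framebuffer is expanded through the palette into RGBA bytes, which the browser can then hand to the canvas via ImageData.

```javascript
const WIDTH = 320, HEIGHT = 200;

// screenMem: one palette index per pixel; palette: 256 entries × 4 bytes
// (red, green, blue, unused). Expands the indexed framebuffer into RGBA.
function compositeToRGBA(screenMem, palette, out) {
  for (let i = 0, o = 0; i < WIDTH * HEIGHT; i++, o += 4) {
    const p = screenMem[i] * 4;   // offset into the 256-entry palette
    out[o]     = palette[p];      // red
    out[o + 1] = palette[p + 1];  // green
    out[o + 2] = palette[p + 2];  // blue
    out[o + 3] = 255;             // fourth byte ignored; force opaque
  }
  return out;
}

// In the browser, the result goes straight to the canvas:
// const image = new ImageData(rgba, WIDTH, HEIGHT); // rgba: Uint8ClampedArray
// ctx.putImageData(image, 0, 0);
```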

Ideally, compositing occurs 60 times a second. Obviously, compositing the screen shouldn’t take a long time — otherwise the simulated CPU won’t have any time left to do work.

The implementation uses TypedArray everywhere.

Clamped arrays are slow, especially in Chrome

Originally I was using a lot of clamped unsigned 8-bit arrays. Clamping makes obvious sense when you think about it, but clamped arrays are slower than their unclamped counterparts, since they have to ensure that no value goes out of range. If you need to mask values (rather than clamp them), it’s faster to use an unclamped array — for example:

let buffer = new ArrayBuffer(32);
let unclamped = new Uint8Array(buffer);
unclamped[16] = someBigValue & 0xFF; // will never be out of range; acts like modulo 256
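For contrast, here’s what each array type does with an out-of-range value — the clamped array saturates, while the unclamped array simply wraps:

```javascript
let clamped = new Uint8ClampedArray(32);
clamped[16] = 300;  // clamps to 255 — the array has to check the range

let wrapped = new Uint8Array(32);
wrapped[16] = 300;  // wraps to 44 (300 % 256) — cheap truncation instead
```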

As will quickly become a recurring theme, Safari’s performance was much better than Chrome and didn’t benefit much from this particular optimization. Chrome’s performance improved quite nicely.

Cache and reduce reads!

My original method for rendering tile pages was quite naive: loop over all the (x,y) pixel coordinates, converting each (x,y) into a (column, row) that corresponded to each 8 × 8 tile. Then the code determined the pixel we’re referencing inside each tile, and looked up all the color settings.

Funny enough, Safari handled this naive algorithm with no problem. Chrome, however, was most unhappy. Performance was roughly half that of Safari, and I isolated the cause to reading from various TypedArray instances five times per pixel.

It’s not hard to figure out a way to optimize this, though, since there are only 1,000 tiles per screen (40 × 25). Instead of iterating over the tile page in pixels, iterate by tile. The new algorithm proceeds to look at each character, column by column, row by row. Each character is then drawn in the corresponding 8 × 8 cell. We still draw the same number of pixels (64,000), but the order has been rearranged.

This new order allows us to reorder our reads. Instead of looking up which tile we need to render 64,000 times, we can do it 1,000 times. The same goes for the foreground and background colors. Once implemented, the number of reads plummeted from 320,000 to 131,256. Chrome was instantly happier, and performance was much closer to Safari’s.
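A sketch of the per-tile loop — every name here is an assumption, and the rule that a zero pixel shows the background color while anything else shows the foreground is mine, purely for illustration. The key point is that the tile index and colors are read once per 8 × 8 cell instead of once per pixel:

```javascript
const COLS = 40, ROWS = 25, TILE = 8; // 40 × 25 tiles of 8 × 8 pixels

// tileMap/fgMap/bgMap: one byte per cell; tileData: 64 bytes per tile;
// out: the 320 × 200 indexed framebuffer being composited.
function renderTilePage(tileMap, fgMap, bgMap, tileData, out) {
  for (let row = 0; row < ROWS; row++) {
    for (let col = 0; col < COLS; col++) {
      const cell = row * COLS + col;
      const tile = tileMap[cell]; // one read per tile, not per pixel
      const fg = fgMap[cell];
      const bg = bgMap[cell];
      const base = tile * TILE * TILE;
      for (let y = 0; y < TILE; y++) {
        for (let x = 0; x < TILE; x++) {
          const px = tileData[base + y * TILE + x];
          // assumed rule: zero pixels show background, nonzero foreground
          out[(row * TILE + y) * COLS * TILE + col * TILE + x] = px ? fg : bg;
        }
      }
    }
  }
  return out;
}
```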

The loser in this optimization was Safari, oddly enough. Apparently its optimizer saw a faster path using the naive algorithm. As such, I’ve coded two paths: Each browser gets the path with which it is most performant.

The downside is that this optimization only applies to tile pages. The 320 × 200 graphics page is still 50% slower in Chrome due to the fact that there’s no way to reduce the number of reads any further.

Builtins are slow, again, in Chrome

Being a “faux” emulator, I was able to add whatever features I wanted to the CPU’s machine language. To make some operations easier, I added some simple “blit” instructions — essentially faster ways to move memory around. With these, one can achieve exceedingly smooth scrolling of the screen contents in a few instructions (compared to the hundreds of thousands of instructions that would otherwise be executed over the course of a single frame).

Well, that was the theory. Safari was humming along at 60fps, but Chrome’s performance tanked.

No wonder Safari was beating Chrome by such a wide margin here — it was those very builtins that I was using to fill, copy, and swap typed array regions quickly. But in Chrome, it was actually worse than writing a for loop to do the same thing (I’m assuming Chrome’s code does more checks).
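The builtins in question are TypedArray’s fill, copyWithin, and set. Here’s a minimal sketch of both paths — the builtin calls and an equivalent hand-rolled loop (the function name is mine):

```javascript
const vram = new Uint8Array(65536);

// Builtin path: one call fills or copies an entire region.
vram.fill(0x20, 0, 1000);       // fill the first 1,000 bytes
vram.copyWithin(0, 320, 64000); // scroll a 320-wide bitmap up one row

// Manual path: a plain loop doing the same copy. At the time of
// writing, this outperformed the builtin in Chrome. (Note: unlike
// copyWithin, a forward loop is only safe when dest precedes start.)
function copyRegion(arr, dest, start, end) {
  for (let i = start; i < end; i++) {
    arr[dest + i - start] = arr[i];
  }
}
copyRegion(vram, 0, 320, 64000);
```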

Based on the issue thread tracking this, however, it looks like strides are being made, so hopefully it won’t be long before Chrome matches or beats Safari here. For now I’m not writing alternative paths, since Chrome should be improving soon.

Avoid when unnecessary

My original implementation of the register file used TypedArray as well. The logic was simple — four of the general-purpose sixteen-bit registers also have an eight-bit alias that reflects the lower eight bits. So register A has AL as an alias. When A contains 820, AL contains 52 (820 % 256). So using a TypedArray makes sense here: you can have a 16-bit view to read the full register, or an 8-bit view to read the lower portion.
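A minimal sketch of that layout, using the register names from above. Note that treating byte 0 as the low byte relies on a little-endian host, which every mainstream browser platform is:

```javascript
const regs = new ArrayBuffer(2);
const A = new Uint16Array(regs);  // the full 16-bit register
const AL = new Uint8Array(regs);  // AL[0] aliases the low byte (little-endian)

A[0] = 820;   // 0x0334
// AL[0] now reads 52 (0x34, i.e. 820 % 256)
```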

Well, my implementation went a step further. The CPU also accepted writes against the lower eight bits while preserving the upper eight. An array-based implementation suited this quite well.

Yes, well, it turns out that using regular variables and masking as appropriate works just as well and has simpler logic. Because none of the data that the emulator processes ever exceeds 32 bits, we are assured never to run afoul of JavaScript’s limited integer representation (except in multiplication). So it’s simpler to mask off the upper eight bits when needed than it is to muck about with TypedArray in the first place.
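A sketch of the variable-plus-masking approach (the accessor names are mine, for illustration):

```javascript
let A = 0; // a plain 16-bit register value in an ordinary variable

// Read the low eight bits.
function getAL() {
  return A & 0xFF;
}

// Write the low eight bits, preserving the upper eight.
function setAL(value) {
  A = (A & 0xFF00) | (value & 0xFF);
}

A = 820;
// getAL() returns 52 (820 % 256)
```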

And the logic is much easier to follow, which is a huge win. I won’t say it’s faster or slower (I didn’t notice much difference), just that it’s simpler, which I count as a win.

Not done yet…

Retroputer is far from being complete yet, and it will probably never be done! So there’s plenty of opportunities for optimization in the future, along with far more interesting algorithms that improve performance, especially as features are added (next up: sprites and sound).

Of course, most of this may not factor a great deal into your code — most apps aren’t emulators. But if you’re doing pixel manipulation (say, bitmap image editing) or audio waveform manipulation, it might be something to think about. Not that your optimizations would be identical to mine — but it’s worth remembering that different JavaScript engines can have unexpectedly disparate performance for something as simple as a typed array.

Of course, don’t prematurely optimize your algorithm. Your implementation may be easily optimized by all of the JavaScript engines, or my findings may no longer apply to the engines you are using. The best thing to do is write your algorithm first, make sure you have the process right, and then optimize if necessary (premature optimization often overcomplicates things).