Mastering C# and Unity3D

Stage3D Readback Performance

At long last, Flash Player 11 has been released and carries with it a raft of exciting new features. Perhaps most exciting is the inclusion of the new Stage3D class (and related libraries) to enable GPU-accelerated graphics rendering. Today’s article will be the first to cover this new API and discusses one of its features: reading back the rendered scene into a BitmapData that you can put on the regular Stage. Surely this will be a popular operation for merging 3D and 2D, so let’s see how fast it is!

If hardware acceleration is being used, the pixels will need to be sent back from video card memory (VRAM) into main system memory (RAM), which can be a very expensive operation. If the software renderer is being used instead of hardware acceleration, the pixels will already be in RAM so the transfer will be—theoretically—a much quicker memory copy operation.

To test this theory, I wrote a little performance app. It draws absolutely nothing with the Stage3D API and only displays a little UI for controlling the app. This way, we can isolate the performance of Context3D.drawToBitmapData, which is responsible for reading the Stage3D's pixels into a BitmapData.
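Stripped of its UI, the measurement loop amounts to something like the following sketch. The Context3D setup (requesting the 3D context, configuring the back buffer) is omitted, and the variable names are mine, not from the actual test app:

```actionscript
// Sketch of the per-frame measurement. Assumes "context3D" is an
// already-created, already-configured Context3D at the test resolution.
import flash.display.BitmapData;
import flash.events.Event;
import flash.utils.getTimer;

var readback:BitmapData = new BitmapData(1920, 1080, false);

function onEnterFrame(ev:Event):void
{
	context3D.clear(0, 0, 0); // the test draws nothing else

	var beforeTime:int = getTimer();
	context3D.drawToBitmapData(readback); // copy VRAM -> RAM
	var readbackTime:int = getTimer() - beforeTime; // milliseconds

	context3D.present();
}
```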

The app can use either the hardware renderer or the software renderer, and the readback can be done with or without an alpha channel. Times per frame, in milliseconds:

| Resolution | No Readback | Readback (no alpha) | Readback (alpha) |
|---|---|---|---|
| 640×480 | 1 | 2 | 2 |
| 800×600 | 1 | 4 | 4 |
| 1024×768 | 3 | 6 | 6 |
| 1280×720 | 3 | 7 | 7 |
| 1920×1080 | 7 | 15 | 15 |

Software rendering is clearly slower overall, even with a blank scene. Unfortunately, it seems no faster at reading the scene back into the BitmapData than the hardware-accelerated version. This would have been one of software rendering’s only performance advantages over hardware-accelerated rendering, but it seems as though this optimization is not (yet) in place.

Nonetheless, this test points out an important fact: reading the scene's pixels back into a BitmapData is very expensive and possibly not feasible in real time with large scenes. For example, a game attempting to run at a smooth 30 frames per second has only 33 milliseconds per frame to do its work. If reading the 3D scene back into RAM takes 15 milliseconds, the rest of the game (e.g. physics, sound, 2D rendering, networking) must be quite fast to accommodate it. It's also a good idea to consider systems older than my test machine, which is a relatively new MacBook Pro. Still, if adding 3D content to a 2D stage is very important, it seems like it can be accomplished so long as you limit the resolution of the 3D scene.
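For reference, layering the read-back 3D scene with regular 2D content is just a matter of wrapping the BitmapData in a Bitmap on the display list. A minimal sketch, again assuming an existing "context3D" and using a reduced resolution to keep the transfer cost down:

```actionscript
// Display the read-back 3D scene on the normal 2D stage.
import flash.display.Bitmap;
import flash.display.BitmapData;

var readback:BitmapData = new BitmapData(640, 480, false);
var bitmap:Bitmap = new Bitmap(readback);
addChild(bitmap); // other 2D DisplayObjects can now be layered above it

// Each frame, after drawing the 3D scene:
//   context3D.drawToBitmapData(readback);
```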

Spot a bug? Have a suggestion? Different results on a different OS or video card? Post a comment!

Comments

As you have done half the hard work already, do you think you could do me a favour and tell me how fast readback is for a 1×1 pixel BitmapData?

My reason is that I used a texture readback to handle mouse interactions with a complex scene by encoding object information into a colour texture, and as you have discovered this can only really be used for debugging due to the fact that it is quite costly :(

BUT, with regards to the mouse, in theory I only need to render 1 pixel of the whole screen (the pixel under the mouse), so I can use a teeny tiny frustum to cull away the vast majority of the scene and then render all objects that are in/intersecting that small frustum.

So can you get away with a 1 pixel readback? :D If it comes in sub-1ms then I think it has a use.

ben

(Might have double posted.)
Oh, and if it is fast (sub-1ms), how many can be done before one hits the 1ms mark?

This sounds kind of like using an object buffer: writing color-encoded pixels to a screen-size texture, reading that back, and querying the values (e.g. via BitmapData.getPixel32). The problem there is that you double your fill rate: the number of pixels drawn per frame, which is extremely expensive. The bonus is that you get per-pixel accuracy with the mouse. Most programmers choose to cast a ray from the camera “into” the scene and intersect with bounding boxes around the objects in the scene, or possibly even the object’s triangle mesh if the bounding box test passes. This is much faster but not necessarily pixel-perfect.
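In sketch form, the object-buffer approach looks something like this. The color-encoding scheme and all names here are illustrative, not from the commenter's actual code:

```actionscript
// Draw each object in a unique flat color, read the frame back, then
// decode the pixel under the mouse into an object ID.
import flash.display.BitmapData;

var idBuffer:BitmapData = new BitmapData(stage.stageWidth, stage.stageHeight, false);

// ... render the scene with per-object ID colors instead of materials ...
context3D.drawToBitmapData(idBuffer); // the expensive readback

var pixel:uint = idBuffer.getPixel32(stage.mouseX, stage.mouseY);
var objectID:uint = pixel & 0xFFFFFF; // strip alpha, keep the encoded ID
```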

As for your strategy, I’m not sure exactly what you’ve done so I’m having a hard time recreating it. You can’t have a 1×1 back buffer or read from a 1×1 texture (or any texture for that matter), so I’m not sure how you’re reading just one pixel. If you’re reading the whole screen back as in the “object buffer” approach above, the performance should be just as awful as in the article.

In any case, unless you really need per-pixel accuracy I would recommend going the “ray casting” approach with good-fitting bounding boxes. There are plenty of tutorials online covering this topic, which is called “picking”.
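As a rough illustration of the ray casting alternative, here is a generic ray-vs-AABB slab test. It isn't tied to any particular engine, the names are mine, and it assumes non-zero ray direction components for brevity:

```actionscript
// Slab test: intersect a ray (origin, dir) with an axis-aligned box
// given by its min and max corners. Returns true on intersection.
import flash.geom.Vector3D;

function rayIntersectsBox(origin:Vector3D, dir:Vector3D,
                          boxMin:Vector3D, boxMax:Vector3D):Boolean
{
	var tMin:Number = (boxMin.x - origin.x) / dir.x;
	var tMax:Number = (boxMax.x - origin.x) / dir.x;
	var tmp:Number;
	if (tMin > tMax) { tmp = tMin; tMin = tMax; tMax = tmp; }

	var tyMin:Number = (boxMin.y - origin.y) / dir.y;
	var tyMax:Number = (boxMax.y - origin.y) / dir.y;
	if (tyMin > tyMax) { tmp = tyMin; tyMin = tyMax; tyMax = tmp; }
	if (tMin > tyMax || tyMin > tMax) return false;
	if (tyMin > tMin) tMin = tyMin;
	if (tyMax < tMax) tMax = tyMax;

	var tzMin:Number = (boxMin.z - origin.z) / dir.z;
	var tzMax:Number = (boxMax.z - origin.z) / dir.z;
	if (tzMin > tzMax) { tmp = tzMin; tzMin = tzMax; tzMax = tmp; }
	return !(tMin > tzMax || tzMin > tMax);
}
```

If the box test passes, you can then test the ray against the object's triangle mesh for more accuracy, as described above.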

As a long-time OpenGL person, I thoroughly agree: if you want performance, don't read back.

If you don't care about graphics performance (and that's understandable for folks used to Flash), then it's fine, but if you want 60 FPS with significant complexity, you have to take the realities of a GPU into account.

Why do you misquote? You omitted an important part: “Ok – you can maybe do this ONCE per frame, if you are careful and build your renderer around it.”

I also have to disagree with the tip about capping the framerate. When you wrote this post there probably wasn’t a framerate police, but there is one now. Also, I heard TotalBiscuit didn’t play Hyper Light Drifter because it was capped to 30 FPS – you absolutely want to optimize your game for 60 FPS at the very least (which doesn’t give you a huge range in Flash, which is capped at 60 FPS).

This is very useful information. As you have shown, the size of the bitmap read affects the bottleneck. The question has been asked: what about just reading a small region, e.g. 1×1 under the mouse?

That suggests you could shift the viewing coordinates so that the mouse position (x, y) maps to (0, 0) and then pass a small (1×1) destination BitmapData into drawToBitmapData to get the back buffer color value under the mouse. Assuming the data transfer is linear in the size of the destination, this should be a lot quicker.
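In code, the idea might look like the following, though it is untested and hinges on whether drawToBitmapData actually transfers only the clipped region:

```actionscript
// Hypothetical 1x1 readback: the projection has (somehow) been shifted
// so the pixel under the mouse renders at (0, 0).
import flash.display.BitmapData;

var onePixel:BitmapData = new BitmapData(1, 1, false);
context3D.drawToBitmapData(onePixel); // output clipped to the 1x1 destination
var colorUnderMouse:uint = onePixel.getPixel32(0, 0);
```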

It’s worth trying out, but I’m guessing it’ll still be quite slow. Since the documentation says it’s “clipped”, that could mean that the whole back buffer is read and then simply discarded except for the pixel you care about. The only way to find out for sure is to set up a real performance test, so perhaps there will be a follow-up article.

All (multi-threaded) rendering must stop and a huge number of pixels (1920×1080 at 4 bytes per BitmapData pixel ≈ 8 MB) must be transferred from the frame buffer in video memory to a location in system memory where a BitmapData is allocated. The exact times will depend on your system (graphics system and driver, memory bus, memory architecture, etc.), but 30ms doesn’t sound unreasonable given that I got 15ms on my test machine. I could certainly see many Android devices with dedicated VRAM taking that long. In short, only do this when absolutely necessary. For example, it’s necessary for saving screenshots, but you shouldn’t try to record or stream video with it.