I'm on a Mac, which seems to lack most profiling tools (e.g. no gprof) and the existing one or two like XCode's Instruments seem fairly command-line-unfriendly and Makefile-unfriendly. But while I learn to use DTraceforprofiling and get a Linux VM set up with other profiling tools, I can also check out the Ruby-side performance. Let's talk about that.

(One confusing point: there's the performance of the Ruby interpreter itself, and then the performance of the Ruby code running in the interpreter. Today's post is about the code running in the interpreter. So the function names and durations are from the optcarrot NES emulator, not the internals of the core Ruby language - yet. Capisce?)

The beginning of that Japanese post mentions stackprof, which I hadn't used before. I'm enjoying it more than perftools.rb or ruby-prof, and it's been much easier to set up. Let's talk about how to use it.

I got stackprof set up in a locally-compiled development Ruby. I'm not going to say too much about that here - it's kind of painful, and I don't recommend it without a good reason. So maybe I'll post about it later. Today, let's just talk about what stackprof shows us about optcarrot.

Stackprof needs to be in the local ruby, so "gem install stackprof" to make that happen. You may need to put it into a Gemfile with "gem stackprof" -- I did that in optcarrot. Then it wants to be required and configured. In optcarrot, you can do that in the bin/optcarrot file. Here's the original file:

Okay. So what's it do? Well, run optcarrot again, just like before... If you're not running a funky local Ruby, that's probably just "ruby bin/optcarrot --benchmark examples/Lan_Master.nes". If you need to instead add a bunch of non-standard includes and stuff so that Ruby can find its local standard library, you have my sympathies :-)

Once the run works, you should see the file that's listed in the StackProf.run line. I used /tmp/stackprof_optcarrot.dump as the filename, but you can pick what you like.

Stackprof will also output files, graphviz output, flamegraphs, specific methods and more. See its GitHub page for more details.

So - what does that actually tell us? The PPU code is the "graphical" output (headless in this case) where it calculates all the output pixels. Optcarrot's documentation (see doc/benchmark.md) even warns us that the vast majority of the time will be spent in PPU#run and CPU#run. That makes a lot of sense for a game console emulator. The time is spent emulating the hardware.

The render_pixel function isn't terribly long (see lib/optcarrot/ppu.rb and search for "render_pixel"). But looking at that array append at the end, I had a sudden suspicion. I tried just commenting out that one last line of the function (yes, the emulator is useless at that point, but humor me.) Delete /tmp/stackprof_optcarrot.dump, run the emulator, run stackprof again and...

See the left columns? That single line is using 75.2% of the total runtime. So again, the array append is slow...

If I were trying to optimize the optcarrot NES emulator, I'd restructure that -- preallocate the array, so there's only one memory allocation? Don't use an array at all, and find some other way to handle pixel output?

But I'm not really trying to optimize optcarrot. Instead, I'm trying to optimize the Ruby internal operations that make optcarrot slow. And now I know -- those are dominated by arrays.

By the way - don't take the one-line-commented-out op count too seriously. That's dominated by PPU#vsync, but PPU#vsync winds up blacking the screen if render_pixel doesn't write output pixels. And that's an array append in a tight loop. In other words, if you take out the slow array appends in render_pixel, it will fill in with a bunch more array appends anyway. This is one of the hazards of modifying the code you're profiling - it changes the behavior.

In a later post, perhaps we'll return to that array append. I'm still figuring out how to optimize it. I suspect realloc() being slow, myself. And in the mean time, you have learned a bit about how to use stackprof, which is a pretty darned awesome tool.