I've written what amounts to a random access video codec for my texture assets. Whenever the same GL texture handle is used in consecutive draw calls, my renderer automatically batches quads.
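To make the batching rule concrete, here is a minimal sketch in C of how the draw call count falls out of the order of texture handles. The `Quad` struct and `count_batches` function are hypothetical names for illustration, not my actual renderer code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical quad record: only the texture handle matters for batching. */
typedef struct {
    unsigned int texture; /* GL texture name the quad samples from */
} Quad;

/* Returns how many draw calls would be issued if every run of
   consecutive quads sharing a texture handle is merged into one call. */
size_t count_batches(const Quad *quads, size_t count)
{
    assert(quads != NULL || count == 0);
    size_t batches = 0;
    for (size_t i = 0; i < count; ++i) {
        /* A new batch starts at the first quad and at every handle change. */
        if (i == 0 || quads[i].texture != quads[i - 1].texture)
            ++batches;
    }
    return batches;
}
```

With handles in the order 1, 1, 2, 2, 1 this yields three draw calls, which is why packing images into a shared atlas reduces the count.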

The original implementation calls glTexImage2D on textures from a pool for each loaded image. Images whose textures have not been used by the end of the current frame are released back to the pool for reuse. This raised my draw call count to ~70 per frame, mostly because of the font strings I draw (batched glyph quads). Even so, it gives me 29-34 fps according to the Instruments GL Driver profiler.

The implementation that I tried yesterday uses glTexSubImage2D to update portions of dynamically created atlases. Even though the draw call count dropped to ~10, and even though Instruments shows that my atlas rectangle choosing is not the cause of the slowdown, I am seeing a drop from ~30fps to ~22fps.

Is glTexImage2D faster than glTexSubImage2D? Does the iPad's tiled renderer hardware operate more efficiently with smaller textures? Would it be reasonable to expect that GPU caching issues are in play and that the constant trimming down of in-memory GPU assets is making up the difference? If I queue up render calls in order to batch the glTexSubImage2D calls, could my on-the-fly atlas approach be faster?
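As a rough sketch of what I mean by batching the uploads: instead of one glTexSubImage2D call per image, each changed rectangle could be merged into a single dirty region per atlas, which is then uploaded once per frame. The `Rect` and `DirtyRegion` types and the function names are hypothetical, and I have not measured whether this actually helps on the SGX.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rectangle in atlas pixel coordinates. */
typedef struct { int x, y, w, h; } Rect;

/* Accumulates the bounding box of everything changed this frame. */
typedef struct {
    Rect bounds;
    int  dirty; /* 0 while nothing has been queued */
} DirtyRegion;

void dirty_reset(DirtyRegion *d)
{
    assert(d != NULL);
    d->dirty = 0;
    d->bounds.x = d->bounds.y = d->bounds.w = d->bounds.h = 0;
}

/* Grows the region so that it also covers r. */
void dirty_add(DirtyRegion *d, Rect r)
{
    assert(d != NULL && r.w > 0 && r.h > 0);
    if (!d->dirty) {
        d->bounds = r;
        d->dirty = 1;
        return;
    }
    int x0 = d->bounds.x < r.x ? d->bounds.x : r.x;
    int y0 = d->bounds.y < r.y ? d->bounds.y : r.y;
    int x1a = d->bounds.x + d->bounds.w, x1b = r.x + r.w;
    int y1a = d->bounds.y + d->bounds.h, y1b = r.y + r.h;
    int x1 = x1a > x1b ? x1a : x1b;
    int y1 = y1a > y1b ? y1a : y1b;
    d->bounds.x = x0;
    d->bounds.y = y0;
    d->bounds.w = x1 - x0;
    d->bounds.h = y1 - y0;
}
```

At the end of the frame, the pixels covered by `bounds` would be taken from a CPU-side copy of the atlas and sent with a single upload, after which `dirty_reset` clears the region. The bounding box can cover unchanged pixels, so this trades extra bytes uploaded for fewer calls.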

I appreciate help from anyone with insight into how the particulars of the PowerVR SGX affect performance in my situation. Thank you so much for taking the time to help me understand. Except in situations with some crazy multitexturing or complicated shaders, I haven't seen fewer draw calls be slower than many more draw calls before.

Yes, glTexImage2D is faster than glTexSubImage2D, and by a large margin, as you have found out: a lot of management and extra work is done when issuing the latter call. If you are building an atlas dynamically, the best way is to use an FBO or, depending on the size, NEON intrinsics on the CPU side.
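For the CPU-side route, a minimal sketch in plain C of composing the atlas in system memory before uploading it. It assumes tightly packed RGBA8 pixels, and `blit_rgba8` is a hypothetical helper name; the per-row `memcpy` is the spot where NEON loads and stores could be substituted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copies a w x h block of tightly packed RGBA8 pixels from src into a
   CPU-side atlas buffer, with its top-left corner at (dst_x, dst_y). */
void blit_rgba8(uint8_t *atlas, int atlas_w, int atlas_h,
                const uint8_t *src, int w, int h, int dst_x, int dst_y)
{
    const size_t bpp = 4; /* bytes per RGBA8 pixel (assumed format) */

    assert(atlas != NULL && src != NULL);
    assert(dst_x >= 0 && dst_y >= 0 && w >= 0 && h >= 0);
    assert(dst_x + w <= atlas_w && dst_y + h <= atlas_h);

    for (int row = 0; row < h; ++row) {
        uint8_t *d = atlas + ((size_t)(dst_y + row) * (size_t)atlas_w
                              + (size_t)dst_x) * bpp;
        const uint8_t *s = src + (size_t)row * (size_t)w * bpp;
        memcpy(d, s, (size_t)w * bpp); /* candidate for a NEON copy loop */
    }
}
```

Once every image for the frame has been blitted, the whole buffer can be handed to the GL in one upload rather than one call per image.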