Posted by Soulskill on Sunday March 11, 2012 @01:23PM
from the insert-poker-analogy dept.

MojoKid writes "At the iPad unveiling last week, Apple flashed up a slide claiming that the iPad 2 was 2x as fast as NVIDIA's Tegra 3, while the new iPad would be 4x more powerful than Team Green's best tablet. NVIDIA's response boils down to: 'it's flattering to be compared to you, but how about a little data on which tests you ran and how you crunched the numbers?' NVIDIA is right to call Apple out on the meaningless nature of such a comparison, and the company is likely feeling a bit dogpiled given that TI was waving unverified webpage benchmarks around less than two weeks ago. That said, the Imagination Technologies (PowerVR) GPUs built into the iPad 2 and the new iPad both utilize tile-based rendering. In some ways, 2012 is a repeat of 2001: memory bandwidth is at an absolute premium because adding more bandwidth has a direct impact on power consumption. The GPU inside NVIDIA's Tegra 2 and Tegra 3 is a traditional chip, which means it's subject to significant overdraw, especially at higher resolutions. Apple's comparisons may be bogus, but the Tegra 3 bandwidth issues they indirectly point to aren't. It will be interesting to see NVIDIA's next move and what their rumored Tegra 3+ chip might bring."
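To put a rough number on the overdraw point in the summary, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: 32-bit color, 60 frames per second, the overdraw factors, and the helper name `framebuffer_write_gbps`. The two resolutions are example tablet panel sizes, not figures from the summary, and none of the output is a measurement of Tegra 3 or of any PowerVR part. The tile-based case is modeled very simplistically as writing each pixel out to external memory once per frame.

```python
def framebuffer_write_gbps(width, height, overdraw, bytes_per_pixel=4, fps=60):
    """Color-buffer write traffic in GB/s (decimal) for a given overdraw factor."""
    return width * height * bytes_per_pixel * overdraw * fps / 1e9

# Assumed overdraw factors, for illustration only.
IMMEDIATE_MODE_OVERDRAW = 3.0  # each pixel written to memory about three times
TILE_BASED_OVERDRAW = 1.0      # tile resolved on-chip, written out once

# Example panel sizes (assumptions, not taken from the summary).
RESOLUTIONS = {"1280x800": (1280, 800), "2048x1536": (2048, 1536)}

for name, (w, h) in RESOLUTIONS.items():
    imr = framebuffer_write_gbps(w, h, IMMEDIATE_MODE_OVERDRAW)
    tbr = framebuffer_write_gbps(w, h, TILE_BASED_OVERDRAW)
    print(f"{name}: immediate-mode ~{imr:.2f} GB/s, tile-based ~{tbr:.2f} GB/s")
```

The absolute numbers mean little on their own. The point is that the extra traffic scales with both the overdraw factor and the pixel count, which is why the problem gets worse at higher resolutions.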

Bought a Galaxy Tab for the Tegra 2, was so utterly disappointed. The real world performance was atrocious even compared to devices it was officially benchmarked better against. Sold it within 3 months. Still waiting on a great Android tablet....

PowerVR GPUs are integrated on a lot of ARM processors used by many mobile companies. It's not a secret, but only Apple-related articles like to poke fun at it. PowerVR went from being a "brand name" to being the developer behind a lot of graphics on everything from PCs, to game consoles, to HDTVs, to cell phones etc.

For that matter, Samsung had been integrating both of them before the iPhone in any flavor came out, and continues to do so.

From what we know, the A5X is pretty much the same as the A5 except it uses 4 PowerVR SGX543 cores instead of 2. This 4-core GPU configuration is the same as the PS Vita's, although the Vita uses a 4-core ARM CPU and drives a smaller 960 × 544 qHD screen. Comparatively, the Vita should beat the iPad on graphics-intensive games given the hardware. For Angry Birds, it may not make much of a difference. At the present time, we don't know if Apple tweaked the A5X in other ways to boost game performance.
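A quick sketch of the arithmetic behind that comparison, assuming the same 4-core GPU in both devices. The Vita resolution is the one quoted above; the new iPad's 2048 × 1536 panel is not stated in this thread and comes from Apple's published spec. The helper name `pixels` is just for illustration.

```python
VITA = (960, 544)        # resolution quoted in the comment above
NEW_IPAD = (2048, 1536)  # assumption: Apple's published panel resolution

def pixels(resolution):
    """Total pixels per frame for a (width, height) pair."""
    width, height = resolution
    return width * height

ratio = pixels(NEW_IPAD) / pixels(VITA)
print(f"Vita: {pixels(VITA):,} px, new iPad: {pixels(NEW_IPAD):,} px, ratio ~{ratio:.1f}x")
```

At native resolution the iPad's GPU has roughly six times as many pixels to fill per frame, which is the main reason to expect the Vita to come out ahead on fill-heavy games if nothing else differs.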

The "New iPad" also has twice as much RAM as a Vita (1GB vs 512MB), which could make a significant difference to practical gaming capability. As you note, as well, we have no idea what else Apple tweaked in the chip. Combined with the difficulty in an apples-to-apples comparison between two very different devices, it'll be hard to ever know how different the raw specs are. I think it's reasonable to say, though, that the "New iPad" will be excellent for gaming, as will a Vita.

Just ask Intel about Apple's benchmarking strategy: For years, the finest in graphic design publicly asserted that PPC was so bitchin' that it was pretty much just letting Intel and x86 live because killing your inferiors is bad taste. Then, one design win, and x86 is suddenly eleventy-billion percent faster than that old-and-busted PPC legacy crap.

This wasn't totally misleading. The G4 was slightly faster than equivalent Intel chips when it was launched and AltiVec was a lot better than SSE for a lot of things. More importantly, AltiVec was actually used, while a lot of x86 code was still compiled using scalar x87 floating point stuff. Things like video editing - which Apple benchmarked - were a lot faster on PowerPC because of this. It didn't matter that hand-optimised code for x86 could often beat hand-optimised code for PowerPC, it mattered that code people were actually running was faster on PowerPC. After about 800MHz, the G4 didn't get much by way of improvements and the G5, while a nice chip, was expensive and used too much power for portables. The Pentium M was starting to push ahead of the PowerPC chips Apple was using in portables (which got a tiny speed bump but nothing amazing) and the Core widened the gap. By the Core 2, the gap was huge.

It wasn't just one design win, it was that the PowerPC chips for mobile were designs that competed well with the P2 and P3, but were never improved beyond that. The last few speedbumps were so starved for memory bandwidth that they came with almost no performance increase. Between the P3 and the Core 2, Intel had two complete microarchitecture designs and one partial redesign. Freescale had none and IBM wasn't interested in chips for laptops.

No, he's referring to a conspicuous weakness of the final lines of (quad-core, btw) G5 macs compared to the company's own first competing Intel offerings. Another not-so-well-known weakness is that they also drew more juice under load than most full-sized refrigerators.

This wasn't totally misleading. The G4 was slightly faster than equivalent Intel chips when it was launched and AltiVec was a lot better than SSE for a lot of things. More importantly, AltiVec was actually used, while a lot of x86 code was still compiled using scalar x87 floating point stuff.

This was totally misleading, for any informed definition of misleading.

Just as there are embarrassingly parallel algorithms, there are embarrassingly wide instruction mixes. In the P6 architecture there was a three-uop-per-cycle retirement gate, with a fat queue in front. If your instruction mix had any kind of stall (dependency chain, memory access, branch mispredict), the retirement usually caught up before the queue was filled. In the rare case (Steve Jobs' favorite Photoshop filter) where the instruction mix could sustain a retirement rate of 4 instructions per cycle, x86 showed badly against PPC. Conversely, on bumpy instruction streams full of execution hazards, x86 compared favourably since it had superior OOO head-room.
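A toy model of that retirement-gate argument, for anyone who wants to poke at it. This is only a sketch: the function name `retired_per_cycle`, the queue capacity of 40, and both synthetic instruction streams are assumptions made up for illustration, not a description of any real P6 or PPC pipeline. Completed ops enter a bounded queue and up to `retire_width` of them leave each cycle.

```python
def retired_per_cycle(ready_per_cycle, retire_width, queue_capacity=40):
    """Average ops retired per cycle in a toy queue-plus-retirement-gate model."""
    queue = 0
    retired = 0
    for ready in ready_per_cycle:
        # Ops beyond capacity are simply dropped in this toy model;
        # real hardware would stall the front end instead.
        queue = min(queue + ready, queue_capacity)
        out = min(queue, retire_width)
        queue -= out
        retired += out
    return retired / len(ready_per_cycle)

smooth = [4] * 1000    # "groomed track": 4 ops complete every single cycle
bumpy = [6, 0] * 500   # bursts followed by stalls, averaging 3 ops per cycle

for name, stream in (("smooth", smooth), ("bumpy", bumpy)):
    for width in (3, 4):
        rate = retired_per_cycle(stream, width)
        print(f"{name} stream, {width}-wide retirement: {rate:.2f} ops/cycle")
```

In this model the wider gate only pays off on the smooth stream. On the bumpy stream the queue absorbs the bursts, so the three-wide gate keeps up with the four-wide one.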

CoreDuo rebalanced the architecture primarily by adding a fair amount of micro-op fusing, so that one retirement slot effectively retired two instructions (without increasing the amount of retirement dependency checking in that pipeline stage). In some ways, the maligned x86 architecture starts to shine when your implementation adds the fancy trick of micro-op fusion, since the RMW addressing mode is fused at the instruction level. In RISC these instructions are split up into separate read and write portions. That was an asset at many lithographic nodes. But not at the CoreDuo node, as history recounts. Now x86 has caught up on the retirement side, and PPC is panting for breath on the fetch stream (juggling two instructions where x86 encodes only one).

The multitasking agility of x86 was also heavily and happily used. It happens not to show up in pure Photoshop kernels. Admittedly, SSE was pretty pathetic in the early incarnations. Intel decided to add it to the instruction set, but implemented it double pumped (two dispatch cycles per SSE operation). Of course they knew that future devices would double the dispatch width, so this was a way to crack the chicken and egg problem. Yeah, it was an ugly slow iterative process.

The advantage of PPC was never better than horses for courses, and PPC was picky about the courses. It really liked a groomed track.

x86 hardly gave a damn about a groomed track. It had deep OOO resources all the way through the cache hierarchy to main memory and back. The P6 was the generation where how you handled erratic memory latency mattered more for important workloads (ever heard of a server?) than the political correctness of your instruction encoding.

Apple never faltered in waving around groomed track benchmark numbers as if the average Mac user sat around and ran Photoshop blur filters 24 by 7. That was Apple's idea of a server workload.

mov eax, [esi]
inc eax
mov [esi], eax

That's a RISC program in x86 notation. Whether the first and second use of [esi] amounts to the same memory location as any other memory access that OOO might interleave is a big problem. That's a lot of hazard detection to do to maintain four-wide retirement.

Here is a CISC program in x86 notation. I can't show it to you in PPC notation, since PPC is a proper subset minus this feature.

inc [esi]

Clearly, with a clever implementation, you can arrange that the hazard check against potentially interleaved accesses to memory is performed once, not twice. It takes a lot of transistors to reach the blissful state of clever implementation. That's precisely the story of CoreDuo. It finally hit the bliss threshold (it helped greatly that the Prescott people and their marketing overlords were busy walking the green plank).

Did Apple tell any of this story in vaguely the same way? Nooooo. It waved around one embarrassingly wide instruction stream that appealed to cool people until it turned blue in the face.

Cure for the blue face: make an about face.

Do I trust this new iPad 3 benchmark? Hahahahahaha. You know, I've never let out my inner six year old in 5000 posts, but it feels good.

soo... I'm guessing you just read the headline and skipped the: "*Update - 3/9/12: We became aware of an error in calculation for our GLBenchmark Egypt Offscreen results here and have since updated the chart above. As you can see, the iPad 2 boasts a significant performance advantage in this test versus the Tegra 3-powered Transformer Prime."