There’s been a lot of speculation about whether dual-core phones would be battery hogs. It turns out that voltage scaling does win, and P=V^2/R does indeed apply here. The 2X delivers middle-of-the-road 3G and WiFi web browsing battery life, and above-average 3G talk time.

We’ve also got another new test. Gaming battery life under constant load is a use scenario we haven’t really been able to measure in the past, but now we can. Our BaseMark GUI benchmark includes a battery test which runs the composition feature test endlessly, simultaneously taxing the CPU and GPU. It’s an aggressive test that closely emulates constant 3D gaming. For this test we leave WiFi and cellular interfaces enabled, Bluetooth off, and display brightness at 50%.

I’m a bit disappointed we don’t have more numbers to compare to, but the 2X does come out on top in this category. Anand and I both tested the Galaxy S devices we have kicking around (an Epic 4G and Fascinate), but both continually locked up, crashed, or displayed graphical corruption while running the test. Our constant 3D gaming test looks like a winner for sifting out platform instability.

Conclusion

The 2X is somewhat of a dichotomy. On one side, you've got moderately aesthetically pleasing hardware, class-leading performance from Tegra 2 that doesn't come at the expense of battery life, and a bunch of notable and useful extras like HDMI mirroring. On the other, you've got some serious experience-killing instability issues (which need to be fixed by launch), a relatively mundane baseband launching at a time when we're on the cusp of 4G, and perhaps most notably a host of even better-specced Tegra 2 based smartphones with more RAM, better screens, and 4G slated to arrive very soon.

It's really frustrating for me to have to make all those qualifications before talking about how much I like the 2X, because the 2X is without a doubt the best Android phone I've used to date. Android is finally fast enough that for a lot of the tasks I care about (especially web browsing) it's appreciably faster than the iPhone 4. At the same time, battery doesn't take a gigantic hit, and the IPS display is awesome. The software instability issues (which are admittedly pre-launch bugs) are the only thing holding me back from using it 24/7. How the 2X fares when Gingerbread gets ported to it will also make a huge difference, one we're going to cover when that time comes.

The other part of the story is Tegra 2.

Google clearly chose NVIDIA’s Tegra 2 as the Honeycomb platform of choice for a reason. It is a well executed piece of hardware that beat both Qualcomm and TI’s dual-core solutions to market. The original Tegra was severely underpowered in the CPU department, which NVIDIA promptly fixed with Tegra 2. The pair of Cortex A9s in the AP20H make it the fastest general purpose SoC in an Android phone today.

NVIDIA’s GeForce ULV performance also looks pretty good. In GLBenchmark 2.0 NVIDIA manages to hold a 20% performance advantage over the PowerVR SGX 540, our previous king of the hill.

Power efficiency also appears to be competitive both in our GPU and general use battery life tests. Our initial concern about Tegra 2 battery life was unnecessary.

It’s the rest of the Tegra 2 SoC that we’re not completely sure about. Video encode quality on the LG Optimus 2X isn’t very good, and despite NVIDIA’s beefy ISP we’re only able to capture stills at 6 fps with the camera set to a 2MP resolution. NVIDIA tells us that the Tegra 2 SoC is fully capable of a faster capture rate for stills and that LG simply chose 2MP as its burst mode resolution. For comparison, other phones with burst modes capture at either 1 MP or VGA. That said, unfortunately for NVIDIA, a significant technological advantage is almost meaningless if no one takes advantage of it. It'll be interesting to see if the other Tegra 2 phones coming will enable full resolution burst capture.

Then there’s the forthcoming competition. TI’s OMAP 4 will add the missing MPE to the Cortex A9s and feed them via a wider memory bus. Qualcomm’s QSD8660 will retain its NEON performance advantages and perhaps make up for its architectural deficits with a higher clock speed, at least initially. Let’s not forget that the QSD8660 will bring a new GPU core to the table as well (Adreno 220).

Tegra 2 is a great first step for NVIDIA, but the competition is not only experienced but also well equipped. It will be months before we can truly crown an overall winner, and then another year before we get to do this all over again with Qualcomm’s MSM8960 and TI's OMAP 5. How well NVIDIA executes Tegra 3 and 4 will determine how strong a competitor it will be in the SoC space.

Between the performance we’re seeing and the design wins (both announced and rumored) NVIDIA is off to a great start. I will say that I’m pleasantly surprised.

What is the point in having a high performance video processor when you cannot do the two things that actually make use of it? Those two things are: 1. Watch any movie in your collection without transcoding (FAIL). 2. Play games. No actual buttons = FAIL. If you think otherwise then you don't actually play games. Just stick with Facebook flash trash.

There are some issues I've found with some information in this article:

1) You mention that Cortex-A8 is available in a multicore configuration. I'm pretty sure there's no such thing; you might be thinking of ARM11MPCore.

2) The floating point latencies table is just way off for NEON. You can find latencies here: http://infocenter.arm.com/help/index.jsp?topic=/co... It's the same in Cortex-A9. The table is a little hard to read; you have to look at the result and writeback stages to determine the latency (it's easier to read the A9 version). Here's the breakdown:

FADD/FSUB/FMUL: 5 cycles
FMAC: 9 cycles (note that this is because the result of the FMUL pipeline is then threaded through the FADD pipeline)

The table also implies Cortex-A9 adds divide and sqrt instructions to NEON. In actuality, both cores support reciprocal approximation instructions in SIMD and full versions in scalar. The approximation support consists of an initial approximation with ~9 bits of precision plus Newton-Raphson step instructions. The step instructions function like FMACs and have similar latencies. This kind of begs the question of where the A9 NEON DIV and SQRT numbers came from.

The other issue I have with these numbers is that it only mentions latency and not throughput. The main issue is that the non-pipelined Cortex-A8 FPU has throughput almost as bad as its latency, while all of the other implementations have single cycle throughput for 2x 64-bit operations. Maybe throughput is what you mean by "minimum latency", however this would imply that Cortex-A9 VFP can't issue every cycle, which isn't the case.
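To make the "initial approximation plus Newton-Raphson step" scheme from point 2 concrete, here is a minimal Python sketch. It is not NEON code: `recip_estimate` is a hypothetical stand-in that simply truncates the exact reciprocal to roughly 9 bits, and all function names are made up for illustration. The refinement is the standard Newton-Raphson reciprocal iteration x' = x * (2 - d * x), i.e. a multiply-subtract followed by a multiply, which fits the description of the step instructions behaving like FMACs.

```python
import math

def recip_estimate(d, bits=9):
    """Hypothetical stand-in for a hardware estimate: 1/d cut to about `bits` bits."""
    m, e = math.frexp(1.0 / d)          # 1/d == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(math.floor(m * scale) / scale, e)

def recip_step(d, x):
    """Newton-Raphson correction factor for a reciprocal: 2 - d*x."""
    return 2.0 - d * x

def reciprocal(d, steps=2):
    """Refine a low-precision estimate of 1/d with `steps` Newton-Raphson steps."""
    x = recip_estimate(d)
    for _ in range(steps):
        x = x * recip_step(d, x)        # relative error is squared on each step
    return x

if __name__ == "__main__":
    d = 3.0
    for steps in range(3):
        approx = reciprocal(d, steps)
        print(steps, approx, abs(approx - 1.0 / d))
```

Because each step squares the relative error, the number of correct bits roughly doubles per step, so a ~9-bit starting estimate needs only a couple of steps to reach single-precision accuracy.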

3) It's obvious from the GLBenchmark 2.0 Pro screenshot that there are some serious color limitations from Tegra 2 (look at the woman's face). This is probably due to using 16-bit. IMG has a major advantage in this area since it renders at full 32-bit (or better) precision internally and can dither the result to 16-bit to the framebuffer, which looks surprisingly similar in quality to non-dithered 32-bit. This makes a 16-bit vs 16-bit framebuffer comparison between the two very unbalanced - it's far more fair to just do both at 32-bit, but it doesn't look like the benchmark has any option for it. Furthermore, Tegra 2 is limited to 16-bit (optionally non-linear) depth buffers, while IMG utilizes 32-bit floating point depth internally. This is always going to be a disadvantage for Tegra 2 and is definitely worth mentioning in any comparison.
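As a rough illustration of why dithering a higher-precision result down to 16-bit holds up so well, here is a small Python sketch of ordered dithering into RGB565. The 4x4 Bayer matrix and the function names are illustrative assumptions; nothing here reflects how IMG hardware actually implements its dither.

```python
# Standard 4x4 Bayer threshold matrix (values 0..15).
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]

def quantize(value, bits, threshold):
    """Reduce an 8-bit channel value to `bits` bits.

    `threshold` in (0, 1) decides where the fractional part rounds up:
    0.5 is plain rounding, a per-pixel value gives ordered dithering.
    """
    levels = (1 << bits) - 1
    scaled = value * levels / 255.0
    return min(int(scaled + threshold), levels)

def to_rgb565(r, g, b, x=0, y=0, dither=True):
    """Pack 8-bit-per-channel RGB into a 16-bit RGB565 word."""
    t = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0 if dither else 0.5
    return (quantize(r, 5, t) << 11) | (quantize(g, 6, t) << 5) | quantize(b, 5, t)

def average_red_level(r, dither):
    """Mean 5-bit red level over one 4x4 tile of a flat color."""
    total = 0
    for y in range(4):
        for x in range(4):
            total += to_rgb565(r, 0, 0, x, y, dither) >> 11
    return total / 16.0

if __name__ == "__main__":
    ideal = 100 * 31 / 255.0
    print("ideal 5-bit level:", round(ideal, 3))
    print("plain rounding   :", average_red_level(100, dither=False))
    print("ordered dither   :", average_red_level(100, dither=True))
```

Averaged over a tile, the dithered output lands much closer to the ideal fractional level than plain rounding does, which is the effect that lets a dithered 16-bit framebuffer look close to a 32-bit one.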

Finally I feel like ranting a little bit about your use of the Android Linpack test. Anyone with a little common sense can tell that a native implementation of Linpack on these devices will yield several dozen times more than 40MFLOPS (should be closer to 1-4 FLOP/CPU cycle). What you see here is a blatant example of Dalvik's extreme inability to perform with floating point code that extends well beyond an inability to perform SIMD vectorization.
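The arithmetic behind that estimate can be spelled out. The sketch below assumes a 1 GHz core clock (an assumption for illustration, not a figure from this comment) and compares the 1-4 FLOP/cycle range against the roughly 40 MFLOPS Linpack figure mentioned above.

```python
CLOCK_HZ = 1_000_000_000   # assumed 1 GHz core clock (illustrative)
MEASURED_MFLOPS = 40.0     # approximate Dalvik Linpack score cited in the comment

def peak_mflops(clock_hz, flop_per_cycle):
    """Theoretical MFLOPS for a given clock and sustained FLOP per cycle."""
    return clock_hz * flop_per_cycle / 1e6

def speedup_over_measured(flop_per_cycle):
    """How many times faster than the measured score that rate would be."""
    return peak_mflops(CLOCK_HZ, flop_per_cycle) / MEASURED_MFLOPS

if __name__ == "__main__":
    for fpc in (1, 2, 4):
        print(f"{fpc} FLOP/cycle -> {peak_mflops(CLOCK_HZ, fpc):.0f} MFLOPS, "
              f"{speedup_over_measured(fpc):.0f}x the measured score")
```

Under these assumptions the range works out to 1000-4000 MFLOPS, or 25x to 100x the measured score, which is consistent with "several dozen times more".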

It is mostly FP64 calculations done on Dalvik. While this may not be the fastest way to go about doing linear algebra, it is a fairly good representation of relative FP64 performance (which only exists in VFP).

And let's face it, few app developers are going to dig into Android's NDK and write NEON optimized code.

Then let's ask this instead: who really cares about FP64 performance on a smartphone? I'd also argue that it is not even a good representation of relative FP64 performance since that's being obscured so much by the quality of the JITed code. Hence why you see Scorpion and A9 perform a little over twice as fast as A8 (per-clock) instead of several times faster. VFP is still in-order on Cortex-A9, competent scheduling matters.

Maybe a lot of developers won't write NEON code on Android, but where it's written it could very well matter. For one thing, in Android itself. And theoretically one day Dalvik could actually be generating NEON competently, so some synthetic tests of NEON could be a good look at what could be.

Linpack as it currently exists on Android probably doesn't tell very much at all. But if you're just going to slap together an FP heavy app (pocket scientific computing anyone?) and aren't a professional programmer, this likely represents the result you see.

I wouldn't mind seeing SpecFP ported natively to Android and running NEON. But alas, we'd need someone to roll up their sleeves and do that.

I did do a native compile of Linpack using gcc to test on my Evo, though. It's still not SIMD code, of course, but native results using VFP were around the 70-80MFLOPS mark. Of course, it's scheduling for the A8's FPU and not Scorpion's.

2) Make that 2 for 2. You're right on the NEON values, I mistakenly grabbed the values from the cycles column and not the result column. The DIV/SQRT columns were also incorrect, I removed them from the article.

I mentioned the lack of pipelining in the A8 FPU earlier in the article but I reiterated it underneath the table to hammer the point home. I agree that the lack of pipelining is the major reason for the A8's poor FP performance.

3) Those screenshots were actually taken on IMG hardware. IMG has some pretty serious rendering issues running GLBenchmark 2.0.

4) I'm not happy with the current state of Android benchmarks - Linpack included. Right now we're simply including everything we can get our hands on, but over the next 24 months I think you'll see us narrow the list and introduce more benchmarks that are representative of real world performance as well as contribute to meaningful architecture analysis.