
108 Comments

1) I wasn't aware that Microsoft released DirectX 9.3. Perhaps you meant 9.0c or 9.1?

2) Why is nVidia still using a single LPDDR2 channel when everyone else has gone to dual-channel memory?

I do look forward to seeing what the next generation of GPUs will provide. Seems like we've stayed in this console generation too long with cell phones having graphics nearly on par with their 200W cousins.

It's actually more complex than that. When it comes to programming for Direct3D11, there are a number of different GPU feature level targets. The idea is that developers will write their application in DX11, and then have customized render backends to target each feature level they want to hit.

As it stands there are 6 feature levels: 11, 10_1, 10, 9_3, 9_2, and 9_1. Unfortunately everyone has been lax in their naming standards; DirectX and Direct3D often get thrown around interchangeably, as do periods and underscores in the feature levels (since prior to D3D 11, we'd simply refer to the version of D3D). This is how you end up with DirectX 9.3 and all permutations thereof. The article has been corrected to be more technically accurate to clear this up.

In any case, 9_1 is effectively identical to Direct3D 9.0. 9_3 is somewhere between D3D 9.0b and 9.0c; it implements a bunch of extra features like multiple render targets, but the shader language is 2.x (Vertex Shader 2.0a, Pixel Shader 2.0b) rather than 3.0.
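The feature-level-to-shader-profile mapping described above can be boiled down to a small lookup table. A sketch, using the profile names from Microsoft's Direct3D 11 feature-level documentation (the dictionary and function names here are made up for illustration):

```python
# Direct3D 11 feature levels and the shader profiles they expose.
# The 9_x levels are the "10Level9" targets; note that 9_3 gives you
# the Shader Model "2.x" profiles, not SM 3.0.
FEATURE_LEVELS = {
    "11_0": {"vs": "vs_5_0", "ps": "ps_5_0"},
    "10_1": {"vs": "vs_4_1", "ps": "ps_4_1"},
    "10_0": {"vs": "vs_4_0", "ps": "ps_4_0"},
    "9_3":  {"vs": "vs_2_a", "ps": "ps_2_b"},  # extra features, but SM 2.x shaders
    "9_2":  {"vs": "vs_2_0", "ps": "ps_2_0"},
    "9_1":  {"vs": "vs_2_0", "ps": "ps_2_0"},  # effectively D3D 9.0
}

def shader_profiles(level):
    """Return the (vertex, pixel) shader profiles for a feature level."""
    fl = FEATURE_LEVELS[level]
    return fl["vs"], fl["ps"]
```

This is why "DX9.3" in marketing material is ambiguous: it could mean feature level 9_3 (SM 2.x) or Shader Model 3.0 hardware, which are not the same thing.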

Microsoft made such a mess out of its DirectX nomenclature in the DX9 timeframe that the rest of the industry started to ignore it and invent their own. Hardly anybody even bothers to distinguish between Direct3D and DirectX anymore...they're used interchangeably, even though the former is a subset of the latter.

Windows 8 requires Shader Model 3.0 to be supported by the hardware. Whether you call that 10Level9_3 or 9_3, or DX9.3, or D3D9.3, who cares... from a graphics perspective, it is all just Shader Model 3.0 in the end, whatever you want to call it. All of the Windows 8 launch chipsets from nVidia, TI and Qualcomm, including this MSM8960, will support Shader Model 3.0 as far as I can tell.

Feature level 9_3 isn't the same as Shader Model 3 support. The Qualcomm docs say DX9.3 though, which is quite confusing since it doesn't exist. That said, I agree with your assessment that it means Shader Model 3, and not feature level 9_3.

Although Scorpion featured a dual-channel LPDDR2 memory controller, in a PoP configuration only one channel was available to any stacked DRAM. In order to get access to both 32-bit memory channels the OEM had to implement a DRAM on-package as well as an external DRAM on the PCB. Memory requests could be interleaved between the two DRAM, however Qualcomm seemed to prefer load balancing between the two with CPU/GPU accesses being directed to the lower latency PoP DRAM. Very few OEMs seemed to populate both channels and thus Scorpion based designs were effectively single-channel offerings.

I can tell you with a modicum of confidence that this is true, at least partially.

Aren't you the same person who went on (ranting, obviously) about Krait using HKMG and hitting 2.5GHz next year, in another article?

I suggest you stop embarrassing yourself. metafor knows what he's talking about, and you clearly don't. I read that previous thread - he was pretty much spot on for everything, as you would expect. I honestly don't know why he even bothers here given the reception he's getting...

Anyway unlike what the article says, the MSM8x60 indeed only has single-channel 32-bit LPDDR2. However there's a twist: Qualcomm offers it in a PoP (Package-on-Package) configuration at up to 266MHz or an 'ISM' (i.e. SiP or System-in-Package) at up to 333MHz. I wouldn't be surprised if many OEMs used the PoP for cost reasons.

I think the confusion might come from another (older) Qualcomm SoC working like the article described iirc, but this does not apply to the MSM8x60 AFAIK.

This information does come from Qualcomm, although the odd PoP + external DRAM configuration (that no one seems to use) basically means that MSM8x60 is a single-channel architecture (which is why I starred it in the table above). I will ask Qualcomm once more for confirmation that this applies to MSM8x60 as well as the older single core variants.

Scorpion does support dual-channel; however, the 8x60 series does not have two controllers. The 8x55/7x30 does, however, and in most cases is used in the configuration you described in the article.

I knew MSM7x30/8x55 was dual-channel but I thought it was also available as a 64-bit LPDDR2 PoP solution? While it makes sense for most people to use it as single-channel LPDDR2 as opposed to dual-channel LPDDR1 these days, why would anyone ever have used both PoP and non-PoP DRAM at the same time? Maybe that old leaked presentation on Baidu listing all the MSM7x30/8x55 packages is wrong though.

Hmm, that would certainly be news to me. It's possible, but you'd still need a second memory controller and PHY, so it makes very little sense. I can see a few possibilities:
- The LPDDR2 and DDR2 subsystems aren't shared, so in theory for tablets you could do 32-bit SiP LPDDR2 + 32-bit off-chip DDR2. Seems weird but not impossible.
- You can do 32-bit ISM + 32-bit PoP. Once again, why do this? Were they limited by package pins with a 0.4mm pitch? Seems unlikely with a 14x14 package, but who knows.
- You can genuinely do 32-bit PoP + 32-bit on the PCB. Still seems really weird to me.

The MSM7200(A) had a separate small LPDDR1 chip (16-bit bus with SiP) reserved mostly for the baseband, while the primary OS-accessible DRAM was off-chip. This was obviously rather expensive (fwiw Qualcomm only 'won' that generation on software and weak competition IMO), and they removed it on the MSM7227 to reduce cost (making the chip's memory arbitration more complex). I'm not sure about the QSD8650; maybe it still optionally had that extra memory bus (SiP-only) but it was more flexible and never used - it's hard to find that kind of info.

I don't know where you or the OP are getting your information from (3GHz A15's, quad 2.5GHz Kraits hitting next year, Kraits using HKMG etc.), but it's been pretty inaccurate. All you're doing is speculating based on bits and pieces floating around in PDFs and slides. I still remember one of his claims from the previous thread: '2x A15's > 4x A9's'. While no one in their right mind would dispute that the wider, deeper A15 is better than a single A9, to make such an uninformed, blanket statement (and to back it up with useless DMIPS numbers!) just doesn't bode very well.

ST-Ericsson has publicly indicated the A9600's A15s can run at up to 2.5GHz, and GlobalFoundries has publicly said that the A9600 uses their 28nm SLP process which uses High-K but not SiGe strain. Is it really hard to believe a 28HPM or 28HP A15 could easily reach 3GHz? I'm not sure anyone will do that in the phone/tablet market, but remember ARM also wants A15 to target slightly larger Windows 8 notebooks and (I'm not as optimistic about this) servers.

As for Krait, Qualcomm's initial PR mentioned 2.5GHz (not just random slides) and APQ8064 is on TSMC 28HPM which uses High-K. If you don't trust either me or metafor on that, Qualcomm has also publicly stated that most of their chips will run on SiON but that they were considering High-K for chips running at 2GHz or above: http://semimd.com/blog/2011/02/07/qualcomm-shies-a...

As for 2xA15 vs 4xA9, metafor's point is that most applications are still not sufficiently multithreaded. It has very little to do with DMIPS which is a worthless outdated benchmark (not that Coremark is perfect mind you - where oh where is my SPECInt for handhelds? Development platforms could support enough RAM to run it by now). Unlike him I think 4xA9 should be relatively competitive even if clearly inferior in some important cases, and as you imply it's a difficult and even fairly subjective topic, but I don't think metafor's opinion is unreasonable.

That is the point I'm trying to make! Semiconductor companies, by virtue of the fact that they have to sign OEM/ODM deals before they really even have working products almost always posture about how much their designs can go 'up to' or 'indicate' ratings and numbers. My beef with the earlier thread was that statements were being passed on as facts based purely on stuff posted in press releases. I can tell you, for a fact, that no 2.5GHz Krait (dual or quad) based product will be shipping in '12. I can also tell you for a fact that you will not see anything more than 1.8-2.2GHz (optimistic) in shipping A15's for mobile devices. I understand the A15 architecture is capable of much more, but to try and draw comparisons between a near-shipping mobile-spec quad-core A9 and an on-paper 3GHz A15 powering servers is not correct!

If you did follow the previous thread closely, you will see that this was the only point I was trying to get across, in vain. No matter how you slice and dice it, the 2xA15 > 4xA9 argument is wrong. This is very similar to what we're seeing in the x86 market with Intel and AMD, where the older tri and quad core AMD's are still able to keep up with or beat dual-core Intel's in threaded situations. Now it is an entirely different argument as to whether or not Google/MS/whoever else makes effective use of multi-core CPU's in their current mobile platforms and their relatively crude/simple kernels (as compared to desktop operating systems), but come Windows 8, I am willing to bet that quad core (or multi-core in general) SoC's will prove their worth.

ST-E could underdeliver on the A9600, sure, but they've got a better process than OMAP5 and enough clever power saving tricks up their sleeve (some of which still aren't public) that I feel it's quite likely they won't. Remember 2.5GHz is only their peak frequency when a single core is on - they have not disclosed their throttling algorithms (which will certainly be more aggressive for everyone in the 28nm generation, especially on smartphone SKUs as opposed to tablets where higher TDPs are acceptable).

Also multiple companies will be making A15s on 28HPM eventually. TSMC has indicated they have a lot of interest in HPM, and that should certainly clock at least 25% higher than GF's Gate-First Non-SiGe 28SLP. However the problem is that the A15 is quite power hungry, so I expect people will use that frequency headroom to undervolt and reduce power although a few might expose it with a TurboBoost-like mechanism. On the other hand, exposing the full 3GHz for Windows 8 on ARM mini-notebooks should be a breeze, and I don't see why you'd expect that to be a problem.

As for 2.5GHz Quad-Core Krait in 2012 - I think they're still on schedule for tablets in late 2012, but then again NVIDIA was still on schedule for tablets in August 2011 back in February, so it's impossible to predict these things. Delays happen, and it'd be foolish not to take metafor seriously simply because he is unable to predict the unpredictable.

Finally, 2xA15 vs 4xA9... metafor's point is that given the lower maturity of multithreading on handheld devices, it's more like high-end quad-core Intel CPUs beating eight-core AMD CPUs in the real world. As I said I'm not sure I agree, but it's fairly reasonable.

I believe the comparison was simple: dual-Krait compared to 4xA9. I claimed Krait would be much closer to A15 level than A9 -- I was right.

I claimed that 2xA15 (and 2xKrait) will be far better than 4xA9. I hold to that but some may disagree. I can understand that point.

I claimed that both Krait and A15 were set to target similar frequencies (~2.5GHz) according to release -- I was right.

I claimed that Krait will initially be ~1.4-1.7GHz on 28LP and is planned to reach 2.5GHz on HKMG -- I was right.

On every point, you disagreed with me -- and stated "I know for a fact that such and such". Did Krait turn out to be "a modified A9" as you claimed? No.

Is its projected performance and clockspeeds far closer to A15-class than A9? Yes.

Also, how often do you think that quad-core on your desktop actually gets utilized? Are you under the impression that multithreading is some kind of magical pixie dust that you sprinkle on to an OS kernel and all of a sudden, your applications will run faster?

Hint: Android is fully multithread capable -- 3.0 even includes a great pthread library implementation. That doesn't mean individual applications actually are threaded, or even that they can be. This should be common knowledge by now: only certain workloads are highly parallelizable.
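The ceiling being described here is just Amdahl's law: the serial fraction of an app caps what extra cores can buy you, no matter how good the kernel's threading support is. A quick sketch (the 60% parallel fraction is an arbitrary assumption for illustration, not a measured number):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Maximum speedup when only part of a workload can be threaded."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# A workload that is only 60% parallelizable barely benefits from
# doubling the core count from two to four:
two_cores = amdahl_speedup(0.6, 2)   # ~1.43x over a single core
four_cores = amdahl_speedup(0.6, 4)  # ~1.82x over a single core
```

Which is the crux of the 2xA15 vs 4xA9 debate: if the second pair of cores adds ~0.4x while two faster cores add ~2x on the dominant serial portion, the faster dual-core wins.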

On top of that -- as we've discussed previously -- there is a very small subset of computationally intensive, highly thread-scalable applications out there. Specifically: compression, video transcoding and image processing (which will likely be the biggest performance-demanding app for the CPU on tablets what with the Photoshop Touch series).

So yes, on 4xA9, that could potentially scale to all 4 cores. But here's the thing: those are all very NEON/FPU intensive applications.

And guess what subsystem was substantially improved in A15 compared to A9?

Double the data path width, unified load-store, fully out-of-order VFP + NEON and lower integer execution latency on top of that (which, IIRC, is what most image processing algorithms use).

Even assuming A15 runs at the same clockspeed as an A9, it would still be 2-3x faster in typical arithmetic-intensive workloads.

It seems pretty clear. As temperature increases (right on the X axis), 40G transistors consume more power (up in the Y axis). The power increase vs temperature increase curve of 28LP doesn't grow as fast.

This, of course, has more to do with it being an LP process. 40LP transistors would have a similar curve.

Metafor is right about the curve having to do with the process. His explanation kinda makes it seem like a temp increase causes the power increase though. It's the power increase that causes the temp increase, and "G" transistors are designed to handle more power without wasted heat (temperature increase) compared to "LP" transistors. There's also a second reason why 28nm is hotter than 40nm.

If you have a certain amount of heat energy being produced at a certain power level, the 40nm transistors will be a certain temperature.

Now take that same amount of heat energy being produced, and shrink the transistors to half their size. This increases their temperature within the same power envelope.

Of course they labeled a thermal limit on the power side, because the holder of whatever phone this chip goes into is going to feel the heat coming from the chip due to how much power it's using (how much heat energy is put out), not just due to the temperature of the transistors.
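The shrink argument above comes down to power density: the same heat pushed through half the die area means twice the flux the package has to spread. A toy calculation (the wattage and areas are made-up round numbers, and this idealizes away any per-transistor power reduction from the shrink):

```python
def power_density(power_w, area_mm2):
    """Heat flux through the die surface, in W/mm^2."""
    return power_w / area_mm2

# Same 1 W of heat through a die half the area doubles the flux the
# package must spread, which is what pushes local temperature up.
flux_40nm = power_density(1.0, 2.0)   # 0.5 W/mm^2
flux_28nm = power_density(1.0, 1.0)   # 1.0 W/mm^2
```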

This is a problem in a lot of circuit design. Power dissipation (both due to scattering and increase in resistance of the charge channel) increases with temperature. But temperature also increases as more power is dissipated. It's a positive feedback loop that just gets hotter and hotter.

When simulating a circuit, this problem has to be taken into account but simulating the heat dissipation is difficult so one can never be sure that a circuit wouldn't overheat under its own operation.

It's an ongoing area of academic research: how to simulate such a situation beforehand and avoid it.
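The positive feedback loop described above can be sketched as a fixed-point iteration: power sets temperature, temperature sets leakage, leakage adds power. This is a toy model, not a real circuit simulator; the leakage coefficients and thermal resistances below are invented purely to show the two possible outcomes (settling vs. runaway):

```python
import math

def settle(p_dyn, leak0, k, r_th, t_amb=25.0, steps=200):
    """Iterate the power -> temperature -> leakage loop.

    p_dyn: dynamic power (W); leak0: leakage at ambient (W);
    k: exponential leakage sensitivity to temperature (1/degC);
    r_th: thermal resistance to ambient (degC/W).
    Returns the steady-state temperature, or None on thermal runaway.
    """
    t = t_amb
    for _ in range(steps):
        power = p_dyn + leak0 * math.exp(k * (t - t_amb))
        t_next = t_amb + r_th * power
        if t_next > 150.0:          # past any sane junction limit
            return None             # runaway: each pass adds more heat
        if abs(t_next - t) < 1e-6:
            return t_next           # converged to a stable operating point
        t = t_next
    return t

stable = settle(p_dyn=1.0, leak0=0.1, k=0.02, r_th=10.0)   # settles
runaway = settle(p_dyn=1.0, leak0=0.5, k=0.08, r_th=20.0)  # runs away
```

With gentle leakage the loop converges to a fixed point a few degrees above what dynamic power alone would give; with aggressive leakage and poor heat removal, each iteration adds more heat than the last and the loop never settles, which is the runaway case the commenter describes.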

Basically, it's increasing the power of the chip, which increases heat energy output, that increases the temperature. And with that increase in temperature, comes an increase in power.

Heat dissipation is the only way for the chip to keep itself from burning up. It's just very hard to tell exactly how much can be dissipated, even under known conditions, because the kinetic heat exchange between atoms (and most likely the amount radiated) is difficult to model.

It's basically impossible to simulate an exact scenario for this exchange.

The minute a company gives you a bit of attention, you forget about objectivity.

"The key is this: other than TI's OMAP 5 in the second half of 2012 and Qualcomm's Krait, no one else has announced plans to release a new microarchitecture in the near term"

"Qualcomm remains the only active player in the smartphone/tablet space that uses its architecture license to put out custom designs."

Yes, I have heard of Marvell and Armada, isn't that what's left of XScale? Honestly I thought they had given up on what was XScale and licensed the RTL like everyone else instead, but it looks like I was wrong.

I'm hearing that this one has slipped, and now the ST-Ericsson chipset with Rogue won't sample to OEMs until the first half of next year and therefore won't be in commercial products until the very end of 2012 if at all, whereas the MSM8960 is already sampling to OEMs according to Qualcomm. In other words, schedule-wise, you're probably comparing apples to oranges. I do agree with you though that we'll likely see A6 from Apple, in some form, by this time next year, but I think it'll be higher spec'd and will blow the doors off anything from STE. The more interesting question is whether the quad core 8064 that Qualcomm has mentioned for next year can keep up with A6, both from a CPU and a GPU standpoint.

Marvell indeed hasn't had much luck in the high-end so far, but the same latest PJ4 core has been fairly successful both as the HSPA Pantheon 920 at BlackBerry (including some new BlackBerry 7 devices) and the TD-SCDMA Pantheon 910 for various China Mobile-specific phones. So I'm not sure it's a good idea to exclude them completely, although they're certainly not in the same league as Qualcomm, so I really don't blame you.

Are you really sure that the upcoming Nexus Prime will use OMAP 4? Seems unlikely to me... an SGX540 to power a 720p display when Samsung has their own better SoC with Mali-400? And the rival iPhone 4S uses a GPU that is 7-8 times faster than the SGX540. Sounds really stupid... I can't believe that OMAP 4 is the reference SoC for Android ICS.

The SGX540 in the OMAP4460 is clocked significantly higher (something TI is great at). So, while it isn't a powerhouse compared to Exynos or A5, it's more than sufficient. Google has never traditionally been top-of-the-line in terms of processors with their Nexus series, so it isn't out of the question that they won't be this time around either.

Technically, the Hummingbird SoC used in the Nexus S was top of the line at the time it was released (although it was eclipsed by dual-core phones after 1-2 months). Also the Nexus One was the first major phone I can remember that had 512MB of RAM.

However, it should be said that Google doesn't think about getting the best SoCs available for their product, but instead seeks to get the best deal on its components by bidding vendors against each other. It's very possible that Samsung knew they had the most powerful SoC outside of the A5 and wanted Google to pay for the value it would provide. Google instead apparently went with TI, likely because TI is selling its chips cheaper in order to be the reference platform for Ice Cream Sandwich.

"Qualcomm claims that MSM8960 will be able to outperform Apple's A5 in GLBenchmark 2.x at qHD resolutions. We'll have to wait until we have shipping devices in hand to really put that claim to the test, but if true it's good news for Krait as the A5 continues to be the high end benchmark for mobile GPU performance."

Static ram (the kind used in CPU caches) has always been much faster than the dynamic ram used for main system memory.

SRAM uses a block of a half dozen transistors to store a bit as a stable logic state; as a result it can operate as fast as any other transistor-based device in an IC. The number of clock cycles a cache bank needs to complete an access operation is primarily a factor of its size, both because it takes more work to select a specific part and because signalling delays, due to the speed of electric signals through the chip, become significant at gigahertz speeds. Size isn't the only factor in how fast a CPU cache operates; higher associativity levels can improve worst-case behavior significantly (by reducing misses due to pathological memory access patterns) at the cost of slowing all operations slightly.

DRAM has a significantly different design: it uses only a single transistor per bit and stores the data in a paired capacitor. This allows for much higher memory capacities in a single chip, and much lower cost/GB as a result. The catch is that reading the capacitor's charge level and then recharging it afterwards takes significantly longer. The actual memory cells in a DDR3-1600 chip are only operating at 200MHz (up from 100-133MHz a decade ago); other parts of the chip operate much faster, as they access large numbers of memory cells in parallel to keep the much faster memory bus fed.
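The 200MHz figure follows from DDR3's standard 8n prefetch architecture: the I/O bus runs at half the transfer rate (DDR moves data on both clock edges), and the internal array runs slower still by the prefetch depth. As a sanity-check calculation:

```python
def core_clock_mhz(data_rate_mt_s, prefetch):
    """Internal DRAM array clock given data rate (MT/s) and prefetch depth.

    DDR transfers twice per I/O clock cycle, and the core array runs
    prefetch/2 times slower than the I/O clock.
    """
    io_clock = data_rate_mt_s / 2        # two transfers per I/O cycle
    return io_clock / (prefetch / 2)

ddr3_1600 = core_clock_mhz(1600, 8)  # 200 MHz array clock, as stated above
ddr1_266 = core_clock_mhz(266, 2)    # 133 MHz, the decade-old figure
```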

Isn't it amazing how these low-power architectures are surpassing Atom in both power and performance? Atom isn't even an OoO architecture. Windows 8 and OS X Lion will allow these architectures into netbooks and ultrabooks before we know it, and Intel's value-stripping at the low end will finally die a terrible death.

Even with FinFET it's impossible Atom will run at 4GHz, which is what it would need to match an A15 or Krait at 2.5GHz. And in less than 2W. Atom has been dead in the water for a while now - it cannot keep up with ARM out-of-order cores on performance, power consumption or integration despite Intel's process advantage.

And in that case you don't really need a tablet. Tablets would be very useful if they could deliver some kind of alternative user input and display, for instance holographic displays (like in sci-fi movies) and voice control and input. That would actually make these mobile devices truly useful. The current generation and the coming generations will be nothing more than a mobile browser and mp3/mp4 player.

And, if it could work at all, how well would voice control work outside of an anechoic chamber (with a lone user talking to it)? But I agree: tablets (used for decades in warehouses, by the way) with soft keyboards and the like are just a cheap and dirty way around the I/O issue.

What I mean is we'll get a quad core phone with 4GB of RAM and all we'll need is a tablet to dock it to. The tablet will have the other connectivity options -- HDMI, USB, etc. -- which will let non-gamers replace their desktops, and people not needing Windows for business replace their laptops.

The dedicated gaming desktop will largely be obsolete in a few years. Games are much more GPU dependent than CPU dependent. What we will see is the era of external graphics cards which you can hook up to your laptop to play on your monitor @ 1080p with all the special effects turned up. Then you "undock" and rely on the integrated GPU to get longer battery life.

Thanks for the write up, pretty interesting stuff to come next year. Hopefully developers will embrace the new power, so that I can make fuller use of my Galaxy S2. I don't think I'll upgrade that soon though, pretty content with Galaxy S2 and it was expensive enough for now. I'll see what Christmas 2012 brings. :D

I'm pretty sure they said it would sample by the end of this year and ship late 2012. If they were to delay it any more, they would be in serious trouble. ST Ericsson has quite a lot against them recently, and if they can't keep to their promises, TI is going to beat them quite badly. I'd estimate the Rogue to show up in an OMAP 5 in H2 2012 or H1 2013.

All in all I'm just really excited by the PowerVR Rogue. Seeing the specifications of the Nova A9600 and what the Rogue can do is quite amazing. It's almost on par with the PS3.


The graph is conceptually correct. While it's true that consuming more power produces more heat, the inverse is also true. The temperature of a transistor affects its leakage characteristics because resistance increases with heat. So at higher temperatures a CPU is going to consume more power to maintain its performance, compared to the same CPU at a lower temperature.

You're basically looking at the principles of a superconductor applied in reverse.

According to this article we'll have to wait at least another 3 years, or maybe more, until we get tablets with enough power and good battery life to be actually useful. Yeah, maybe at 14nm and with tri-gate transistors sometime in 2016 we'll be able to enjoy true mobile computing all day long (at least 16 hours without a recharge).

Yeah, progress is good but way too slow sometimes. Too bad; I was hoping for an ultracool and powerful e-book reader that delivers a more tablet-like experience than what is currently available.

Define "useful". I'd argue that a lot of CPU cycles are wasted doing meaningless background tasks in apps that you can't see when it would be better to just pause and resume them later when the user brings them back into focus (aka Windows 8).

While single instruction, multiple data (whether short or long vector) is a great idea, it's sadly underutilized except in graphics processing, compressible signals and cryptography. Is NEON just an additional graphics/compression engine? Does it require special NEON programming/compiling, or does it enhance normal MIMD programming?

I'm a fan of efficient coding and design and hope that Qualcomm is following that path. I think there's been too much "just get it done no matter what" in the programming business for far too long. That's not to say there aren't very good free open source apps out there by astounding programmers, but the mainstream seems to have forgotten or doesn't care.

This is good news because mobile is still in the early phases and if efficiency is priority things can only get better if not easier to debug, code, change, etc.

The bad is of course chasing ever-increasing speeds, like in the PC industry of yesteryear. Sure we want more powerful hardware, but let's not need it because of shoddy code and design architectures. Again, in the mobile industry I believe these two points should be highly considered by anyone.

You could add the ZiiLabs ZMS-20 (and quad core ZMS-40) to the table. That'll probably see action in a Creative Android tablet/PMP and other OEMs might pick up the Jaguar reference tablet. (Could make a good review piece?)

Or how about Marvell's 628 ARMADA tri-core SoC? Marvell are getting pushed out by Qualcomm but it could get some design wins next year (It has two 1.5GHz cores and a low power 624 MHz core, similar to Tegra 3)

Wouldn't the use of L0 impact performance? If L1 is shut down, there will be a penalty on an L0 miss. Powering up L1 on an L0 miss would cost thousands of cycles at 1.5GHz. If L1 is on, then what is the point?

Hi, thanks for your great article. You may want to add to the list the Samsung Exynos 5250, which should be the A15 version of the Exynos SoC, running the ARM Mali-T604 GPU; I don't know if single- or multi-core. Let's hope it will be used in the Samsung Galaxy S3 :)

Hey guys, I'm a dev from the xda forum. Recently new Adreno drivers have been tested that improve 3D performance by 50%, so things are looking very good for the new GPU - it'll perform much better than current devices.

re shaky153: ''Hey guys, I'm a dev from the xda forum. Recently new Adreno drivers have been tested that improve 3D performance by 50%, so things are looking very good for the new GPU - it'll perform much better than current devices''

Brilliant, so does that mean that the Adreno GPUs do have a lot of underutilised power, as Anand attested in this article? If so, will the Adreno 220 get much closer to the A5 in benchmarks? (Taking into consideration we don't know the A5 GPU clocks and its 2x TMUs.)
