I know chip stacking/TSV/3D integration is supposed to reduce latency (vs. multiple chips in a 2D layout), and I believe it should allow for greater throughput.

Anyone more knowledgeable care to comment on the potential advantages and issues? I would think heat and noise would be potential issues, but I have no idea how difficult managing those would be.

Also would anyone care to guess at the implementation in the A6? From what I've read it sounds like companies are aiming to stack the processor, memory, and baseband. I assume the CPU(s), cache, GPU(s), and DSP(s) would all be on one die, memory on another layer (or layers), baseband on another. Would there be any advantage to having any part of the application processor on a separate layer? Say, any benefit to a large configurable cache on a separate layer? Maybe have a dedicated frame buffer over the GPU? Is it practical/beneficial to have all system memory as part of the stack? I don't know much about this stuff, but I'd love to learn and I'm eager to know where the technology will go.

I don't think you want to stack any high-bandwidth stuff like cache memory on a different layer, since off-die access is much slower than on-die. You probably do want to stack things like DRAM though, since it's usually not practical to make DRAM on the same die as high-performance logic.

Quote:

HMC combines high-speed logic process technology with a stack of through-silicon-via (TSV) bonded memory die. HMC delivers dramatic improvements in performance, breaking through the memory wall and enabling dramatic bandwidth improvements: a single HMC can provide more than 15x the performance of a DDR3 module. The revolutionary architecture of HMC is exponentially more efficient than current memory, utilizing 70% less energy per bit than DDR3 DRAM technologies.

Now these are claims from people who profit from this tech, but if the reality is even close that sounds pretty huge.
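To put those marketing numbers in rough perspective, here's a quick back-of-the-envelope sketch. The DDR3 module bandwidth and energy-per-bit baselines are my own ballpark assumptions, not figures from the quote:

```python
# Rough sanity check of the HMC marketing claims above (illustrative numbers only).
ddr3_module_gbps = 10.7     # GB/s peak, assumed for a DDR3-1333 x64 module
claimed_speedup = 15        # "more than 15x the performance of a DDR3 module"
print(f"Implied HMC bandwidth: ~{ddr3_module_gbps * claimed_speedup:.0f} GB/s")

ddr3_pj_per_bit = 65        # pJ/bit, assumed ballpark for DDR3
hmc_pj_per_bit = ddr3_pj_per_bit * (1 - 0.70)   # "70% less energy per bit"
print(f"Implied HMC energy: ~{hmc_pj_per_bit:.0f} pJ/bit")
```

So the claim works out to something on the order of 160 GB/s from a single stack, if the baseline assumptions are anywhere near right.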

I think it'll be akin to the processor in Sony's PS Vita: a four-core Cortex-A9, but with a newer-generation GPU (PowerVR Series6) than either the A5 or the Vita. Or, almost identical to the processor in the Vita: four-core CPU and four-core GPU, but manufactured on a smaller process.

Cortex-A15 won't be ready in time if we're suggesting a Q1/Q2 '12 timeframe for the A6.

Die stacking also allows your chip to be crazy expensive, and the TSVs steal area that you could use for logic.

I'm not sure about this. It will increase cost, but by how much? For smartphones and tablets the processor is a fairly small portion of the total cost, so would this be enough to change that? If it increases the cost of the device by 10% (a random example, not based on any product) but also increases performance and battery life, wouldn't it be worth it?

From what I can find about HMC (which may not have anything to do with the A6, but is probably relevant to future ARM products), the memory dies use TSVs to connect to a DRAM control logic die; that logic die may only be connected to the processor at the edge. If that's the case, the memory controller on the application processor and the pin-out would need to change, but I'm not sure what the effect on die size or transistor count would be.

ARM SoCs are so small (<100 sq. mm) and so slow (<2 GHz) that it's hard to imagine that on-chip wiring is hurting them. Thus I would predict the benefits of 3D stacking to be almost nothing. Also, if the cost is mostly fixed (e.g. the assembly process), then it hurts a low-priced chip more than a high-priced one.

ARM SoCs are so small (<100 sq. mm) and so slow (<2 GHz) that it's hard to imagine that on-chip wiring is hurting them. Thus I would predict the benefits of 3D stacking to be almost nothing.

Um, maybe you mean off-chip?

Quote:

Also, if the cost is mostly fixed (e.g. the assembly process), then it hurts a low-priced chip more than a high-priced one.

If by "hurts" you mean represents a larger cost increase as a percentage of the current cost, well, duh. But so what? And that only addresses the cost of one component.

I'm pretty sure a quad core A9 needs more memory bandwidth than a single core A8, especially when paired with a GPU that is an order of magnitude more powerful. But new tablets don't have significantly more power available, so they need to achieve this in the same power envelope. I don't know what the most cost effective way to achieve that is, but if you don't think 3D stacking is a useful solution please suggest an alternative that you think would be better.

ARM SoCs are so small (<100 sq. mm) and so slow (<2 GHz) that it's hard to imagine that on-chip wiring is hurting them. Thus I would predict the benefits of 3D stacking to be almost nothing.

Um, maybe you mean off-chip?

No, he means on chip.

AAF wrote:

Quote:

Also, if the cost is mostly fixed (e.g. the assembly process), then it hurts a low-priced chip more than a high-priced one.

If by "hurts" you mean represents a larger cost increase as a percentage of the current cost, well, duh. But so what? And that only addresses the cost of one component.

So this: a $1-per-unit price increase for packaging matters less when you make a million of something than when you make a billion of something. If you don't believe me I can provide you with the multiplications.

AAF wrote:

I don't know what the most cost effective way to achieve that is, but if you don't think 3D stacking is a useful solution please suggest an alternative that you think would be better.

Continuing to lay out dies in a mostly planar fashion on a chip as we do now.

AAF wrote:

Do you mean the RAM speed, the processor speed, or both?

Both.

AAF wrote:

Can any of the current ARM SoCs dynamically adjust memory speed?

This is such an incredibly naive question that I have to ask: have you ever looked at any SOC made in the last 15 years?

I don't know much about this stuff, but I'd love to learn and I'm eager to know where the technology will go.

I declared my naivete at the beginning.

I've looked at SoCs in the past decade the same way every consumer in a first world country has. I haven't previously researched any of them. Sorry if it offends someone that I would post on this topic when I have a (freely admitted) near total lack of knowledge.

redleader, I've read enough of your previous posts to feel confident saying:
A. You know a hell of a lot more about this stuff than I do.
B. You're not GENERALLY an ass.

Maybe I have offended you, or maybe I'm taking your post wrong, but it seems to have a nasty vibe. Having a bad day?

When I asked for people to please present other ways of increasing memory throughput without increasing power use it was a genuine request. If you're saying process scaling handles the problem without any other changes, fine; it doesn't fit with what else I've read, but I haven't looked into it, and this is something you know better than I do.

I can certainly handle multiplication on my own, thanks. Now if a change costs you $10 and you can charge $20 more for it because it's better than anything your competitors have, how much money do you lose on billions of devices?

I don't know much about this stuff, but I'd love to learn and I'm eager to know where the technology will go.

I declared my naivete at the beginning.

Oh sorry, didn't read that. Just saw you arguing while not having any idea what you were talking about and pounced. FWIW, I recommend not arguing about these things if you haven't done your research already. It's going to end very, very badly.

AAF wrote:

When I asked for people to please present other ways of increasing memory throughput without increasing power use it was a genuine request.

And also an impossible one.

AAF wrote:

Now if a change cost you $10 and you can charge $20 more for it because it's better than any of your competitors how much money do you lose on billions of devices?

None of these devices sells for even $20, and the price paid is essentially fixed within a very narrow range. So in practice what happens is that if it costs you $10 for something, then you've doubled your price, lost all your customers, and are in bankruptcy.

When I asked for people to please present other ways of increasing memory throughput without increasing power use it was a genuine request. If you're saying process scaling handles the problem without any other changes, fine; it doesn't fit with what else I've read, but I haven't looked into it, and this is something you know better than I do.

If we have two identically clocked DRAMs, one eight bits wide and one 16 bits wide, both of the same density (e.g. 1024 Mbit), then the eight-bit DRAM will use around 60% of the power of the 16-bit one. That's not enough savings to let us run two of them in parallel for the same power.

Now if we take those same two DRAMs and clock the 16-bit one at 200 MHz and the 8-bit one at 350 MHz, the 8-bit one isn't even running at twice the clock, but it will have more than doubled its power compared to running at 200 MHz.

With DRAM there is usually a very, very narrow sweet spot of both width and clock. Going outside it in one direction makes power go sky-high; going the other way throws away performance for only marginal thermal savings.
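To make the shape of that tradeoff concrete, here's a toy model. The 60% width factor comes from the figures above; the superlinear clock exponent is an illustrative assumption, not a measured number:

```python
# Toy model of the DRAM width-vs-clock power tradeoff described above.
def dram_power(width_bits, clock_mhz, base_clock=200.0):
    width_factor = 0.6 if width_bits == 8 else 1.0   # "around 60%" for the 8-bit part
    clock_factor = (clock_mhz / base_clock) ** 2.2   # assumed superlinear frequency cost
    return width_factor * clock_factor               # power relative to 16-bit @ 200 MHz

print(dram_power(16, 200))                        # 1.00 -> baseline
print(dram_power(8, 200))                         # 0.60 -> narrower bus, same clock
print(dram_power(8, 350) / dram_power(8, 200))    # ~3.4 -> "more than doubled its power"
```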

There *is* one way out, which gives much more performance but doesn't increase power that much. What we do is present an externally narrow bus but have a very wide internal bus. This is how DDR works: ye olde 400 MHz DDR uses only a little more power than a 200 MHz SDRAM. We can keep the internals clocked very slowly but run a much faster external bus. A twice-faster bus, all else being equal, needs less power than a twice-wider one for the same bandwidth. DDR3, for example, runs the internal DRAM cells at one-eighth of the external data rate.

DRAM latency isn't helped at all, so eventually one does need to bump the DRAM cell clock up, but this is usually absorbed by process improvements.
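As a concrete illustration of that internal-versus-external split, here's the arithmetic for DDR3-1600 (standard figures, nothing exotic assumed):

```python
# DDR3's 8n prefetch: the internal array runs much slower than the external interface.
data_rate_mts = 1600                    # external transfers per second (DDR3-1600)
io_clock_mhz = data_rate_mts / 2        # DDR: two transfers per I/O clock -> 800 MHz
array_clock_mhz = data_rate_mts / 8     # 8n prefetch -> internal cells at 200 MHz
bus_width_bits = 64                     # typical module width

peak_gbs = data_rate_mts * 1e6 * bus_width_bits / 8 / 1e9
print(io_clock_mhz, array_clock_mhz)    # 800.0 200.0
print(f"{peak_gbs:.1f} GB/s peak")      # 12.8 GB/s out of 200 MHz cells
```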

None of these devices sells for even $20, and the price paid is essentially fixed within a very narrow range. So in practice what happens is that if it costs you $10 for something, then you've doubled your price, lost all your customers, and are in bankruptcy.

When I said device I was thinking less A6 and more iPad. As I recall, Apple even got Intel to make custom-packaged chips. I'm guessing that if Apple felt the need for a super-pricey $20 chip to justify charging people $600 for their tablet, instead of the $500 that would buy an otherwise similarly spec'ed competitor's product, they could get the chip made.

None of these devices sells for even $20, and the price paid is essentially fixed within a very narrow range. So in practice what happens is that if it costs you $10 for something, then you've doubled your price, lost all your customers, and are in bankruptcy.

When I said device I was thinking less A6 and more iPad.

I'm not seeing your point.

AAF wrote:

As I recall, Apple even got Intel to make custom-packaged chips.

I don't think that happened.

AAF wrote:

I'm guessing that if Apple felt the need for a super-pricey $20 chip to justify charging people $600 for their tablet, instead of the $500 that would buy an otherwise similarly spec'ed competitor's product, they could get the chip made.

Apple is notorious for squeezing every last cent out of their BOM. They generally spend less per chip than the competition, both by demanding cheaper parts from their suppliers and by including fewer features in the silicon.

Thanks Hat, but I have to ask: does the double-pumped bus approach do anything to help the next generation of products (again, mostly thinking tablets and smartphones) compared to current products? Don't HMC and similar designs represent another option? As I understand it, it's the tight integration, low voltage, and redesigned logic that allow a wide interface without the normal power penalty.

Thanks Hat, but I have to ask: does the double-pumped bus approach do anything to help the next generation of products (again, mostly thinking tablets and smartphones) compared to current products?

To be clear, DDR memory has been in use since the late 90s/early 2000s. All modern devices use it.

AAF wrote:

Don't HMC and similar designs represent another option? As I understand it, it's the tight integration, low voltage, and redesigned logic that allow a wide interface without the normal power penalty.

I think Wes explained why that's not correct pretty clearly. At such low frequencies stacking isn't really more helpful than lateral placement, and it does nothing to reduce the power consumption of the individual cells. I think the real advantage for mobile would be that you can just physically stick more dies in a package, but we're not at the stage where we're running out of space just yet.

Let me clarify and elaborate on my previous comments. I was thinking about stacking logic, like putting the cores on one die and the GPU on another; this shortens the on-chip wires between those components, but as I said, I don't think that matters. I was not thinking about stacking DRAM, but that is definitely worth discussing. PoP stacking is widely used for DRAM today, but AFAIK you only get one 32-bit channel. If you want more bits, HMC-like DRAM die stacking may be beneficial, but I suspect it's still more expensive than wirebond DRAM stacking, which is a mature and "boring" technology.

As for cost, you have to consider that (in my mind) stacking provides little or no advantage. So let's say your next-gen SoC would be $20 if it's planar or $30 if it's stacked. That's a 50% premium for stacking; does it give you 50% more performance? Probably not, so stacking makes your price/performance worse. OTOH, if you increase the cost of a $500 chip to $510, even a 10% performance improvement could justify it.
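Put as arithmetic, using those hypothetical prices (nothing here is a real BOM figure):

```python
# Perf-per-dollar comparison for the hypothetical prices above.
def perf_per_dollar(perf, price):
    return perf / price

# $20 planar vs. $30 stacked: stacking needs ~50% more performance just to break even.
print(perf_per_dollar(1.3, 30) < perf_per_dollar(1.0, 20))   # True: +30% perf still loses
# $500 vs. $510: a 10% gain easily clears the ~2% price bump.
print(perf_per_dollar(1.1, 510) > perf_per_dollar(1.0, 500)) # True
```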

Quote:

The CPU in the MacBook Air is a 65nm Merom-based Core 2 Duo with a 4MB L2 cache and an 800MHz FSB, and runs at either 1.6GHz or 1.8GHz. The packaging technology used for this CPU is what makes it unique; the CPU comes in a package that was originally reserved for mobile Penryn, due out in the second half of 2008 with the Montevina SFF Centrino platform. Intel accelerated the introduction of the packaging technology specifically for Apple, it seems.

I posted in this thread because of a rumor that Apple was doing something exotic. My point was that if Apple feels the need for something exotic they likely have the means to get it.

And if I had said "make a custom chip package" I would have been wrong. That chip being put into that package was done for one customer, by request, and would not have been done otherwise. The package was not custom; the chip was custom-packaged. It's like having a dealership paint a car in a color it's not sold in: the paint may not be a custom blend, but it's still a custom paint job.

I'm still looking at the cost from a different perspective. If it were Marvell being discussed I would agree with your price/performance assessment, but Apple can take a different view. Apple seems happy to be viewed as selling premium products at a premium price. Product differentiation and bragging rights mean a lot more for them, and the application processor is a cheap component in a product that costs much more; the processor isn't the product for them. And the cost difference as a percentage of the device cost is relatively minor. I don't know if the cost, performance, and bragging rights would come out far enough in Apple's favor for this to make sense; I just think it would make more sense for them than for most other companies.

That chip being put into that package was done for one customer, by request, and would not have been done otherwise. The package was not custom; the chip was custom-packaged. It's like having a dealership paint a car in a color it's not sold in: the paint may not be a custom blend, but it's still a custom paint job.

Intel has quite a long history of customising parts. If it'll seal a deal for a large order, Intel will work with you. The Xbox, for example, had a custom Coppermine core in it, which was some strange hybrid of Celeron and Pentium III.

From what I can tell, Apple did a lot of bragging about the iPad, not much about the A4. And when Hannibal described the A4 he said:

Quote:

It's lean and mean to a degree that isn't possible with an off-the-shelf SoC

So I really can't see the problem with leaving out stuff that wouldn't have been used.

Apple has been known to do a great job marketing things that aren't that special, and a decent job of marketing things that are, but real improvements, or the absence of improvements, still affect sales.

I'm not trying to push TSV or HMC as magic solutions. They either have value for a specific use or they don't. I just don't think the manufacturing cost has to be treated as an insurmountable barrier. That's just my take on it.

Leaving the Apple issues aside, do you think there will be a point where something like HMC will be useful for ARM SoCs? If so, how far away do you think that is, and what applications do you think it would be useful for? I realize I'm calling for wild speculation, but I'm curious what people are expecting to come of this.

Intel has quite a long history of customising parts. If it'll seal a deal for a large order, Intel will work with you. The Xbox, for example, had a custom Coppermine core in it, which was some strange hybrid of Celeron and Pentium III.

That's kind of the impression I had (which may be why I compared it to a custom paint job, not a custom body). But I'm guessing the clients that can make those large orders have the means to get an unusual SoC made. I certainly don't think Apple is the only company that could do that. Microsoft is another example of a company that could. If the rumor had been that Microsoft was using TSV for a portable game console I think my take on this would have been much the same. Well, except I'd be surprised if MS made a portable console at this point instead of just putting the effort into Windows Phone.

Doesn't look like there's a lot in the way of gritty timing details but there's some new information nonetheless, like:

- There's a separate L1 TLB for data loads and stores, probably to help decouple the two now-separate pipelines.
- L1 caches are only 2-way set associative (after always being 4+); don't know what ARM is thinking here (see the sketch below). At least they bothered to make it LRU replacement.
- Back to 64-byte L1 cache lines, like in Cortex-A8.
- Only 4KB page capability in the L1 TLBs, but at least the L2 supports large pages.
- Seems to only fetch 128 bits at aligned boundaries, contrary to what the earlier presentation claimed.
- New no-allocate mode for the cache, apparently added to avoid thrashing the cache when you know you'll be streaming through stuff that isn't in it; write-through is always no-allocate now.
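For anyone trying to picture what a 2-way, 64-byte-line L1 means in practice, here's a small sketch of how those parameters carve up an address. The 32 KB capacity is my assumption for illustration; the article only gives the associativity and line size:

```python
# How a 2-way, 64-byte-line L1 (assumed 32 KB) splits an address into offset/index.
CACHE_BYTES = 32 * 1024
LINE_BYTES  = 64
WAYS        = 2

SETS        = CACHE_BYTES // (LINE_BYTES * WAYS)    # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1            # 6 bits of byte offset
INDEX_BITS  = SETS.bit_length() - 1                   # 8 bits of set index

def l1_set(addr):
    """Return the set index a given address maps to."""
    return (addr >> OFFSET_BITS) & (SETS - 1)

# With only 2 ways, any three hot lines that alias to the same set will thrash it:
stride = SETS * LINE_BYTES                            # 16 KB aliasing stride
print([l1_set(0x1000 + i * stride) for i in range(3)])   # all land in the same set
```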

An ARMv8 document has also been posted. You need an ARM account to download it.

NVIDIA's Tegra 3, a 1.3 GHz quad-core A9 with a low-power companion core, has made it into the Eee Pad Transformer Prime. The NVIDIA-based machine has slightly better CPU performance but less capable graphics than the 1 GHz dual-core A9 Apple A5. One might think that the strength of the NVIDIA would be graphics performance but that is not the case. This article says that NVIDIA hopes to sell 25 million Tegra 3 SoCs in 2012.

One might think that the strength of the NVIDIA would be graphics performance but that is not the case.

Nvidia's strength is GPUs, but there's not much demand for them in Android tablets. CPU performance seems much more important, so SoCs allocate most of their milliwatts to the CPU (although maybe ICS will start to change that).

One might think that the strength of the NVIDIA would be graphics performance but that is not the case.

Nvidia's strength is GPUs, but there's not much demand for them in Android tablets. CPU performance seems much more important, so SoCs allocate most of their milliwatts to the CPU (although maybe ICS will start to change that).

Yes, I would hope that GPU performance would "affect" tests of GPU performance.

But so what? Looking at that article they've gone for roughly a 2x increase in fill rate and some more memory bandwidth (shared with the CPU), versus a 2x increase in core count, 30% higher clock speed, and NEON. It's pretty obvious where they spent their power budget, and it wasn't the GPU.

And I think this makes sense. GPU performance on Android just has to be good enough for the moderate needs of the display layer. Given how tight the power budget is at 40 nm, and the extreme measures they went to in order to save CPU power, I doubt adding enough additional fill rate (or, more likely, memory channels) to make a difference was going to be possible. It won't be until 28 nm hits that we start to see more aggressive GPU designs and memory controllers.

Intel's benchmarks put Medfield at ~30% faster than the now-obsolete Tegra 2.

As for power:

Quote:

As it stands right now, the prototype version is consuming 2.6W in idle with the target being 2W, while the worst case scenarios are video playback: watching the video at 720p in Adobe Flash format will consume 3.6W, while the target for shipping parts should be 1W less (2.6W).

For comparison, the Galaxy Nexus, a rather large phone by most accounts, has an 1850 mAh battery, which would correspond to ~200 minutes of run time if the display and radio used no power and they hit their "target". Putting aside the issue that these are Intel's numbers, the situation is a bit less grim for a largeish tablet, but again, unless you want x86 Windows there's not really much point. If Intel's benchmarks are remotely accurate, it's probably going to lose to Krait and Exynos in both performance and battery life.
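The ~200 minute figure follows from simple arithmetic; the only assumption on my part is a typical ~3.7 V lithium cell voltage, which the article doesn't state:

```python
# Back-of-the-envelope behind the "~200 minutes" figure above.
capacity_mah = 1850       # Galaxy Nexus battery
cell_voltage = 3.7        # V, assumed typical Li-ion cell
soc_power_w  = 2.0        # Intel's idle "target" from the quote

energy_wh   = capacity_mah / 1000 * cell_voltage   # ~6.85 Wh
runtime_min = energy_wh / soc_power_w * 60          # ~205 minutes
print(f"{runtime_min:.0f} minutes")                 # display and radio power ignored
```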

Tablets are usually armed with a 5000 mAh battery, but yeah, those numbers are kinda bad. But as both Tegra 3 and Medfield are unavailable I wouldn't read too much into the results yet.

2W idle and 2.6W while busy sounds like way too much though, especially the idle power. Intel is king of clock gating and dynamic power on the desktop; can they really fail this badly on a new platform? I'd say no: "Idle power consumption can drop as low as 0.01W to 0.1W" is a feature of the generation 1, April 2008 Atom. (source)

[edit] The peak power looks bogus too; even a 45nm Z560 maxes out at 2.5W. I now assume that they meant power usage of the whole prototype system; that would make sense, while CPU-only doesn't.

FWIW, benchmarks for the TI OMAP4460 ARM Cortex-A9 PandaBoard ES have been posted over at Phoronix. Unfortunately they did not include any power consumption data at all, so, for me at least, it's of limited interest.

My interest in ARM is focused on servers and sensor networks. Despite the occasional tease, I haven't seen much on ARM-based servers... so if anyone is aware of any interesting articles or products, please let me know.