Post Your Comment

83 Comments

1- please add a comma in there somewhere. I had to read the sentence 4 times to understand it (page 1=: "VLIW designs will never achieve perfect efficiency in this regard, but the farther off real world utilization is the weaker the benefits of VLIW."

2- When, if ever, will we vile users see any benefits ? I get the feeling that most apps are still not optimized well, if at all, for multicore/threading. Come to think of it, most don't even use most of the x86 extensions more recent than SSE2. Now we're talking of yet another x86 extension, that is not only AMD-specific, but very task-specific. Apart from a handful of labs doing GPU computing, and the usual Photoshop filters... i'm doubtful ?Reply

I'm not an expert in this sort of design, but is AMD setting up this architecture to replace the x86 ALU? Bulldozer is already running 2 ALUs for every 1 FPU, which is promoting ALU-heavy software design. It may take a few revisions to meld them (or phase one out), but it certainly seems like that's a heterogeneous CPU in the end. Reply

I think it'll be quite awhile before the monolithic cores dissolve into the heterogeneous architectures, mostly depending on how fine-grained the power gating can get. When it gets to the point where the CPU can selectively turn off components inside a given SIMD unit, I think we'll see someone go "Wait a minute, then why do we even have this big core anymore?" and it'll go away. 2018ish, maybe?Reply

ALU is generally used to refer to a very simple unit that performs arithmetic, logic, and possibly bit shift operations on integers, not floating point values. The units labeled ALU in the GPU diagrams in the article may support some integer operations, but they mainly process 32-bit floating point values, and (IMO) should not be labeled as "ALUs". FPU would probably be more accurate, but I do not know what operations these units support and whether they include a native integer ALU or just convert to integers to FP.

I don't know what you would mean by ALU-heavy software design. Bulldozer has two integer execution cores per module. Each core is composed of 2 ALUs and 2 AGUs, not shared. It also has 2 128-bit floating point (FMA) units per module shared between the two threads. This isn't really much different than an intel hyper-threaded core. Intel has, I believe, 3 ALUs, 3 AGUs, and 2 FPUs per core which is shared between 2 threads. AMDs version of multi-threading just doesn't share as much hardware between threads, which may be better than Intel's HT (2-2-1 AMD vs 1.5-1.5-1 Intel ALU-AGU-FPU). Intel's version would allow a single thread to us all of the execution resources at once, if there is no competing thread. Sharing the FPU makes a lot of sense, since most code that runs on CPUs only uses the FPU intermittently. If the code uses FP more than intermittently, then it would be a candidate for vectorization, and execution on the GPU instead.

While AMDs next generation graphics hardware may be able to execute more general code compiled from a wider range of languages, it is not an x86 processor, and it can not replace the CPU. If you look at the diagram, it has a single scalar unit to handle non-vector code in each compute unit. It also has 64 units in the 4 vector arrays of each CU. If you actually tried to compile and run the kind of branch heavy, integer code that CPUs have to deal with on a CU, then it would probably run entirely and very, very slowly on that single scalar unit.Reply

I think you've got the right idea with this being melted into a Bulldozer-like design. however, it wouldn't replace the x86 ALUs, which are highly-optimized for high clock speed and low latency execution, as well as excellent handling of branches etc.No, it would rather replace or supplement a fat FPU shared between many "cores" (which, by then would basically mean ALUs + scheduling). Most tasks which requires massive fp number crunching can be executed well in parallel and therefore are suitable for execution on a GPU core. The question is just how to bond them together so that the software guys can actually use them..

Basically, what we have here is a math coprocessor. Back in the day, Intel's x86 processors were very good (relatively speaking) at integer math, but choked on floating point math. So Intel created the 8087 to handle the floating point calculations while the CPU handled the integer calculations (obviously this wasn't exclusive to Intel, but I'm generalizing). Eventually, the floating point unit was merged onto the CPU, and programs began using them interchangeably.

What we have today is very similar. CPUs, even with their advanced FPUs, are nowhere near as powerful as the massively parallel monstrosities we use for graphics. Eventually, they will be merged onto the CPU, and used as readily for general floating point processing tasks as FPUs are currently.

And this is the point of Fusion: to fully replace the aging floating point unit with an IGP.Reply

The benefits to home or enthusiast users of heterogeneous CPUs are still several years off. We need market penetration of hardware along with fundamental changes in software development models and smarter compilers.Reply

If AMD delivers in a timely manner they will have a bright future. This looks like a huge technological transition and I understand the need to get developers onboard now but it also tips AMD's hand to Intel who will steal any ideas that they can.

Unfortunately we are still waiting for most applications to be written for 64-bit use so I'm not holding out much hope for an expeditious migration on a complex technological transition though it does appear that maybe AMD has been working on this for some time and may be able to do a better job of executing with Trinity and future products. Time will tell but I hope AMD delivers on time and they will definitely get my dime - all of them.Reply

With Windows 7 having a 80 percent(or higher at this point) install base being 64 bit, it will take until late 2013 before we see the majority of the old 32 bit install base being phased out in the home computer market(as people replace their computers at the four-five year mark). Until then, application developers have to expect that they MUST support both 32 and 64 bit platforms. Lowest common denominator for your user base is what developers generally have to compile for.Reply

I assume you're using the steam hardware survey since they're showing 4:1. Unfortunately steam's not a good source for broad market stats since it excludes the low end boxes bought by non-gamers and corporate boxes. Surveys that capture these numbers only show a 2:1ish ratio for win7 64:32.

Beyond that, it's the people with the low end 32bit boxes that will keep their old clunkers the longest. You're also underestimating how long support for legacy OSes will continue despite their very small market shares. Firefox 4 still runs on win2k, despite it's market share having been negligible for several years and being officially out of support for almost a year.

Excepting apps that actually can benefit from going 64bit I expect most to stay 32bit for at least the next 5 years.Reply

Indeed. In the non-gamer realm, I know of people happy with 2003 Pentium 4s and Athlon XPs yet. I have no doubt that there are many people with even older hardware. This stuff tends to stick around until the PCs die and the owner is told it's not worth the money to upgrade. Fear of change and the simple lack of a true need to upgrade is the reason.Reply

I was at office max the other day and a guy was screaming at a sales rep because they didn't carry any serial mice that supported his rig. I don't mean ps2 either. He was carrying around a busted up brown serial mouse. He said his rig came with windows 95 but last year he upgraded it to windows 98. Seriously. This is the world we live in. Reply

I still have my Compaq (that came with Win95 which I upgraded to win98) running on a Pentium 133 with 32MB of EDO RAM and a 2.1GB HDD. Its sitting ilde in my basement collecting dust at the moment. :DReply

But Steam is good representation of those who could benefit from and will ultimately will be using these future technologies, professionals and enthusiasts. Such is always the way of high-end computing.Reply

exactly. people still running XP are probably not the target market for developers because if they are so slow on the uptake of new technology, it would follow that they are also relatively uninterested in other new programs.Reply

Nope, I am going on what my customers have and are upgrading to. If you BUY a machine with Windows 7 on it, 9 out of 10 have Windows 7 64 bit on them. Those that have 32 bit are either the very low-end machines with only 1GB of RAM(yes, they still sell those), or they are the result of doing an upgrade from Windows Vista 32 bit.

That is the thing about 64 bit, people don't "go to 64 bit" at this point, they get a new computer that comes with 64 bit Windows on it. The number of people who do an upgrade on an older machine has dropped, since those who would have done the upgrade did that back in 2009 and early 2010 when Windows 7 first came out.

Now, the real benefit to 64 bit isn't as much about the software as it is about how much RAM the machine comes with. If you get a machine with 4GB of RAM, you want 64 bit, just so you don't lose memory due to the 4GB limit on 32 bit Windows, and hardware mapping below the 4GB mark.

A part of this is also about the area you live in, and how much money there is going around. I live in an area where it is the norm to pay over $8 per person for lunch at a deli, and as a result, the value of the dollar isn't as high. Spending $20/day just on lunch and minor expenses is the norm, so with that in mind, replacing a computer every 4-5 years, even for the non-technical is NORMAL. The last time I encountered Windows 95 or 98 was around 6 years ago.Reply

There is a little more benefit. A few of us were doing an internal benchmark of our software using VStudio 2010 and all the random hardware we have around. 32bit, 32bit + SSE2, and 64bit + SSE2. We found across the board, 64bit is about 5-10% than 32bit + SSE2 and 5-20% faster than basic x86.

However, a 64bit OS gave no benefit (or penalty) for a 32bit program. The same 32bit software ran the same speed on XP32, XP64, Vista32, Vista64, and 7-64.Reply

It's good to see AMD more committed to the GPGPU. I use GPGPU for neural network simulations, and currently the default choice has been Nvidia with CUDA. It would be nice to see some competition in this space.From the article it sounds like AMD knows to put a lot of emphasis on the software side of things for the developers. Hopefully they'll have a capable programming system that's as good as CUDA, maybe even better.Finally, Given AMD's strategies in the past with medium sized GPU chips and multi-GPU for high-end, hopefully they'll put sufficient emphasis into support for easier multi-GPU programming.

I wouldn't really call it radical, Cayman already had the same theoretic 1/2 performance for FP64 adds compared to FP32. Muls/FMAs though are now 1/2 too it seems (though it might not extend to all products) whereas it was 1/4 on Cayman. Still, a factor two is not what I'd call a "radical" improvement.Reply

I really appreciated the letters A M D. Since Athlon, one could feel that AMD is lagging behind Intel more and more, but now with them beingh the first successful CPU\GPU combination (Llano out there now ) now AMD can make their own way and API's even into OS's just like what Intel and NVidia always do. This way I'm more than sure that we'll see titles (apps and games ) with the unified AMD brand instead of those ( meant to be played ) or ( smart solution ) with some stupid stars for Core i3,5 or 7Reply

Well, technically Sandy Bridge is also a CPU/GPU combination, and I think I would call it successful. Granted, the graphics are not up to AMD levels, but their CPU performance is much better. And considering the debacle of Bulldozer and the architecture that was not optimized for current software, AMD will have to do a much better job of integrating their hardware with software than they have done so far. Reply

Right now, there has been a shortage of software that really pushes the graphics limits, mostly because you have the substandard Intel graphics out there that still has a significant market share. How many games out there really make you feel that a Radeon 6970 just isn't enough? The polygon count for objects(characters) in games have not been going up as much as more world detail has been going in.

Now, when developers want to try aiming for 5 million polygon figures in games, THAT is where there will be a bigger demand for more graphics power, and with that level of detail, the CPU power needed to properly animate the objects needs to be higher. This is where all of this work with GPU compute comes in, to handle all the complexities of properly animating these super-high detailed objects.

I will note that The Witcher 2 is one of the first games I have seen in a long time where CPU power needs to be higher than a Phenom 2 945, and I am waiting for the AMD Bulldozer core CPUs(not APUs) to come out to see how big of an improvement it will make.Reply

There combining the CPU and GPU so that they are more able to talk to each other, and do the tasks there best at. this is done by remaking the way they build video cards.

C++... great for CPU not so great for gpu... they want to change this.

Out of order operations suck on the GPU. they want to change this, so it can hammer through more work faster.

There also throwing in a bunch of tools to help tell developers where there messing up in this regard.

fusion APUs will have a nice trick... they will be able to talk to each other without needing to send information back to memory. Imagion passing letters but having to use fedex, this would be like a move to passing letters in class (no fedex) its quicker :) and your mail isint delayed.

APU will talk over PCI-E... Im wondering how that will work to 0.oReply

AMD wants to put an end to the GPU in the chipset, but no one expects dedicated CPU and GPU to go away. Now, the code that would take advantage of the APU would probably work with a full AMD CPU/AMD GPU combination, so the software side of things would not need a lot of change to support both configurations.Reply

I think we see Eye to Eye on this. AMD wants to take full advantage of all its hardware, It looks like the way there trying to do it is by combining the CPU and Intergrated GPU into one package, after which they want to set it up so infromation that goes into that package dosent have to leave to be processed, like sending it out to ram from the CPU only to be read by the GPU.

Still want to see how this will work across PCI-E. I can already see future reviews and comparisons on how effetive GPU acceleration is on there intergrated aproach VS discreet cards. AND Buying those discreet cards :D

By the time these parts comes out my desktop will be right in the middle of its upgrade cycle :DReply

AMD needs to push for the HTX slot again for discrete video, where there is a direct HyperTransport link between the CPU and whatever is plugged into that slot. PCI-Express is decent, but HTX would and should blow the doors off PCI-Express.Reply

i wish this coming next year especially in Trinity but at lest they are heading in the right direction:) also, to those wondering about improvements in gaming ability, look what amd did with cayman vs cypress- the improved efficiency and noticeably improved performance on the same manufacturing. http://www.anandtech.com/bench/Product/294?vs=331GCN this is going to improve efficiency even farther and they are cutting the transistor size roughly in half. Reply

2013? I heard 2015, unless they recently changed dates to counter Nintendo. Anyways, I'm not so sure what benefits a console will realize from this, since full blown PC's barely get to utilize much of the technology we currently have access to. Multi-threading, 64-bit support, advanced cpu instructions are all available yet barely utilized features.

Also, consoles are designed to be cost effective and relatively cheap, so usually modified older generation architecture is used. For example, the new Wii uses Radeon 4700 class graphics, which sounds old but is roughly twice as powerful as the X360 (Radeon X1900) or PS3 (GF7000) graphics.Reply

That's true of the Wii because Nintendo doesn't subsidize the console, but MS and Sony have gone after higher end GPUs for their last launches. The XBox 360 launched using a GPU similar to that of the ATI 1900, a bare month and a half after the card hit the market.The PS3 used a GF7800 derivative and launched roughly 1 year after the GF7800 did. The GF7900 was nVidias top of the line card at the time, but it was only a marginal improvement over the 7800.Reply

PS3 actually launched about when G80 came out, which obviously made RSX look awfully retro when you saw 7900GTX SLI being beaten in reviews by a single board. ;) But G80 surely was never an option for a console due to size and power.

Xenos has less than half of the pixel fillrate of X1900. X1900 also has 48 pixel shader units + 8 vertex shaders so it might have an advantage over Xenos 48 unified units, especially when clock speed and the access to a large RAM pool over a 256-bit bus are taken into account.Reply

But we must also bear in mind that X360 and PS3 may have chosen high on the scale because of the concurrent shift to 720p/1080p resolution instead of the old 480p standard. At this point in time, the 1080p resolution is standardized, so greatly escalating GPU horsepower will show diminishing gains, since people aren't really going to be gaming on higher resolutions than the new standard tv resolution.

What I mean is, if a Radeon 5000 Series could maximize all graphics quality at 1080p, why would a console manufacturer bother with more power?

For example, you wouldn't buy a GTX590 or Radeon 6990 just to game on a 1080p monitor, would you?

The only exception I can think of for this TV resolution argument is 3DTV gaming, in which case I am not well versed in the added GPU overhead required to render a 3D game.Reply

I can't believe it's been 6 years since the X360 and PS3 release. It seems like this latest generation of consoles stuck around a lot longer than previous versions did. Any speculations on what kind of hardware MS and Sony will throw into the next gen?Reply

They have. The big console makers, at the gave devs requests, were trying to make the current generation last a decade to allow more time to recover the work expended figuring out how to best program them. The motion capture cameras were supposed to be the thing that kept the platforms from getting too stale. I suspect however, that by planning to launch its new console early Nintendo may have blown those plans out of the water.Reply

It all depends on what you expect. Things feel a bit stagnant on the PC game front because consoles are not evolving, and too many companies want almost exactly the same experience on the PC version as what you have on the console.Reply

Whereas VLIW is all about extracting instruction level parallelism (ILP), a non-VLIW SIMD is primarily about thread level parallelism (TLP).

Something doesn't feel right here. In itself, SIMD is about *Data* Level Parallelism, not Thread Level Parallelism. Sure, you could use SIMD units as part of some larger scheme that exploits TLP, but that's not what *SIMD* is about.Reply

If you use a strict definition of a SIMD programming model, then yes, you are probably right: SIMD is a single sequence of operations executed over multiple data elements.

However, over time SIMD has been used to refer to both the aforementioned programming model and the hardware used to implement it. The hardware typically consists of a single control unit that broadcasts instructions to multiple functional units. When people say "a SIMD", they typically mean that hardware implementation rather than the computing model.

If that wasn't confusing enough, in the 1980s GPUs started using that SIMD hardware to execute multiple threads as long as the threads were all executing the same instruction at the same time.

So the statement about using "a SIMD" to exploit TLP is accurate, if you take "a SIMD" to mean a processor pipeline with a single control unit that broadcasts to multiple functional units, and have some scheme for scheduling threads onto functional units.Reply

It should be interesting going forward. Now that AMD is finally into the 32nm process node, standalone GPUs also stand to gain quite a bit. As long as graphics don't become an afterthought to GPGPU, AMD should be in good shape. Radeon 7970(if that is the next generation GPU) may really be a game changer.Reply

Will the GCN architecture be able to be virtualized? Can a VMWare/XEN/KVM/HyperV hypervisor create vGPUs accessible by VMs in much the same way as vCPUs are today? With GPUs being integrated within the CPU package it would be a waste of resources if it could not be virtualized.

This will become a critical feature for enterprise computing beyond HPC applications. One example would be gaming in a cloud computing environment, where a company provides a service that runs a game on their compute and graphics hardware for a game and streams the output to your mobile device for you to enjoy.Reply

I hope that AMD delivers. This is exactly what I expected them to do once Llano was anounced. GPU as a coprocessor. Actualy I hoped that AMD would implement a HTX capable GPU, so I can just plug it into a C32 socket (for example) along with an Opteron.

It would be interesting if they produced a form factor with CPU+GPU on a separate card with memory. Ever since AMD moved the memory controller on die, I wondered if we would see CPU + memory on a separate card. It seems to make a lot of sense. A 4 socket motherboard is huge, especially where each socket has 4 to 6 memory slots associated with it. If the CPU and memory were on a separate card, then you could pack them a lot denser, like you can run 4 GPUs off an ATX board now. It might be cheaper than a massive 4 socket board also. I don't know how many HT links you can run through a slot, but you could always use extra cables/connectors like they use for multiple graphics cards.

With the GPU using the same memory space as the CPU, then why leave the CPU attached to the slow system memory? Just put one of these hybrid chips attached to some high-speed graphics card like memory on a separate board. Move the slow system memory out to the chipset again. The current memory hierarchy is not exactly optimal in my opinion. I am using a slightly older macbook pro, which only supports 3 GB of memory. With all of the stuff I run, it is paging a lot to a super slow laptop hard drive. I have been tempted to get an SSD to speed it up rather than a new laptop.

Anyway, with the way the memory hierarchy works now, system memory is kind of like a cache for the swap space on disk. System memory has gotten a lot faster, but disk have not, so people are using SSDs to fill the gap. If you directly connect the "graphics memory" to a CPU/GPU combo, then you don't need as much total memory in the system because you would not need multiple copies of the data. You would just pass pointers to data back and forth between the CPU and GPU components.

Also, it would be nice to switch to something non-volatile for the memory connected to the chipset; just use disk as mass storage only. "System" memory wouldn't need to be that fast, since you would probably have a GB or two of high-speed memory on each processor board. The "system" memory would be used more like the SSD boot/swap drive in a current system. I don't think flash is quite there yet, and the other types of non-volatile memory (magnetic RAM , phase-shift RAM, etc) that promise much better performance and durability seem to still be all talk with no real products.

With keeping the current form factor, it would be nice if they could put a large amount of memory in with the CPU/GPU package to act as high-speed memory for the GPU and L4 cache for the CPU. This form factor doesn't support scaling up to multiple chips easily (too large of main-board), but it would be very power efficient for laptops and other small form factor systems. It would require very little off-module communication which saves a lot of power. Maybe they could use a low-power, wide-interface dram chip originally meant for mobile devices.

On page 4 ("Many SIMDs Make One Compute Unit") there are two figures showing wavefront scheduling on VLIW4 versus GCN. As I read it the figures seem to indicate that in VLIW4, one 4-wide VLIW handles operations from four wavefronts in parallel, but that's not how I've understood AMD's VLIW4. Only a single work-item is executing on a VLIW4-core at any point in time, the occupancy problems of VLIW4 come from ILP within a work-item, not across wavefronts.

At any one point in time, a Cayman/VLIW4 compute unit is only executing instructions from a single wavefront (though they need at least two wavefronts to switch between on VLIW4). Again at any one point in time only 16 work-items are actually being executed, and it's within those 16 work-items that ILP must be extraced to fill the VLIW4 units. Since each work-item is executing on a VLIW4-processor, a total of 16*4=64 operations can be done in parallel, but that requires ILP within the work-items.

On GCN this is quite different, where the four 16-wide vector units are actually executing 64 work-items at a time (four times as many as in Cayman). However the point is that each of these work-items are basically executing on a scalar processor, there's no need for ILP anymore. So again we are executing 64 operations in parallel, but now without any need for ILP.

At least this is how I understood the presentation (I was at AFDS). Basically I agree with how the GCN scheduling is illustrated in this article, but the Cayman part looks wrong to me. A Cayman CU can only execute one wavefront at a time, and it only needs two wavefronts to switch between to be able to fully utilize the hardware, not four like the figures here seem to suggest.

Now I'm just a programmer, not an architecture guy, so if anyone could clear this up for me it would be greatly appreciated :)Reply

After further consideration you're basically right. I should have made a distinction in the figures between instructions and wholly distinct wavefronts. While there are some ILP considerations to be had, basically the elements Cayman accepts should all be instructions from the same wavefront rather than different wavefronts. Cayman can't really work on multiple wavefronts at once.

I don't have the original files on me, but we'll get this fixed in the morning to show that Cayman is consuming multiple instructions from the same wavefront.

Would a CPU/GPU integrated chip only be a replacement for integrated graphics, or does it have the possibility to move a little farther up? With multi-threading, 4 to 8 thread CPUs will be common in the mainstream, but that will not be a very big die on smaller processes. Most PC software doesn't make use of more than 4 compute intensive threads, so how much room does that leave for GPU hardware? If they solve the memory speed problem by integrating some high-speed memory into the socket (multi-chip module), or something, then it seems like they could possibly get more mainstream performance out of an integrated chip.

If the integrated GPU isn't being used for graphics, then I really don't see that much software that would use it for compute in the PC space. One of the main things mentioned was usually video encode/decode, but it seems that the best solution is to include specific media encode/decode hardware like sandy bridge does. It seems to be just as fast and much more power efficient. If AMD doesn't include a media processing engine, then that could still be a reason to go with Intel. What other PC software could use the compute power?

There is plenty of software that could use it in professional/HPC markets, so it makes sense to make a GPU that can be used for both if it doesn't sacrifice the graphics performance. The newest generations of GPUs have some things in common with Larrabee and Sony's Cell processors, except both of those tried to move too much of the graphics processing abilities into software. AMD didn't make that mistake, but talk of compute abilities for GPUs in the PC/consumer space seems a bit premature without any real applications to take advantage of it.Reply

Llano already has low level discrete GPU performance, and that's just the tip of the iceberg. You are correct that on smaller processes they will be able to allocate more space to the GPU while maintaining CPU performance. I believe the successor to Trinity (which is the Bulldozer based successor to Llano) is supposed to be on 28nm. If everything goes exactly right, you could potentially have some kind of monster that has i5-2500K CPU performance with Radeon 6800 GPU performance in some maintstream laptop chip a year or two down the road. (Those numbers are all pure speculation)

I encourage everyone to take a moment and remember the first computer you ever used, just to pay homage to what we are capable of as a species in just a few short years.

I remember an IBM computer flipped on by a big red toggle that took 2 minutes to boot to a dos prompt...Reply

I remember the Timex Sinclaire, with 2KB of memory standard hooked up to a black and white TV and cassette tapes to save/load programs. Z80 running at 1MHz...the old 5.25 inch floppies were MUCH better, at least you could get a list of what was on the storage medium without having to load it.Reply

"Because in the end, aren't all religions the same? They tell us what to eat, when to pray, that this lump of clay called Man can somehow shape himself to resemble the divine. But we can never attain that perfect grace if we have hatred in our hearts. So let us celebrate our commonalites. Some of us don't eat pork. Some of us don't eat shellfish. But we all eat chicken. So spread the word: peace and chicken!"~HOMER SIMPSON

7th day Adventist don't eat meat, yes, not even chickenANDin Christian religion it's God who sacrifices, not humanPLUSthere is a requirement of TOTAL change according to JesusThat is, the "ME" is buried, forgotten and God lives inside of youmeaning a total change in life

The next architecture will have 16-wide SIMD. How does that fit computational problems better than a 16-wide MIMD VLIW architecture? VLIW can act as if it were SIMD if necessary (simply make each instruction within the word the same, varying only the operands), so how on earth can SIMD be better? SIMD is strictly less general and less flexible than VLIW. This makes it applicable to a narrower set of problems--if you have problems that aren't 16-wide, then you're wasting those additional ALUs, and there's nothing you can do with them, ever. MIMD can't always use them, but there the restriction is unbreakable dependencies, not an inability to encode instructions.

And while VLIW heritage is indeed statically scheduled, nothing about VLIW mandates static scheduling. The next generation Itanium will use dynamic scheduling, for example.

This whole article reads like AMD has offered a rationale for its architectural change, and the author has accepted that rationale without ever stopping to consider if it makes sense.Reply

(FYI: the *real* reason to go for SIMD instead of VLIW is simply that VLIW takes up more die area. AMD has decided that the problems people are working on have enough data- and thread-level parallelism that it's not worth having extra decode logic to enable extraction of more instruction-level parallelism.

The result is a design that's actually *worse* for general-purpose computation--for non-vector computations, it'll only ever use one of those sixteen ALUs, whereas the previous design could in principle use them all--but better for embarrassingly parallel workloads.

I don't understand your argument. They have moved from 16-wide SIMD where each instruction is a 4-operation VLIW (where there are quite a few restrictions on what that VLIW instruction can actually be) to _four_ 16-wide SIMDs where each instruction is scalar. The new architecture is in every way more general and more suited to a wide range of computational problems while retaining the same power. It does presumably cost more (in terms of area/transistors), but hopefully it will be worth it.Reply

It is well-known that Cypress and Cayman both have arrays of 16 processing elements operating in SIMD mode, and they have to execute work-items from the same work-group over four cycles, leading to a wavefront size of 64. See for example the AMD APP OpenCL Programming Guide 1.3c section 1.2 where this is described. Specifically it says "All stream cores within a compute unit execute the same instruction sequence in lock-step".Reply

"well-known"? I assure you, the vast majority of people have not read AMD's OpenCL Programming Guide.

Nonetheless, the article still makes little sense.

A vector of 16 instruction-parallel processors is more versatile than a vector of 16 strictly SISD ones. In the worst case, with unbreakable data dependencies, the former degrades to the latter. In the best case, the former can do 4 (VLIW4) or 5 (VLIW5) times the work of the latter. The average case cited in the article was about 3.5 times.

If you only had one thread of work, the old architecture would tend to be better. For every 64 ALUs (one old VLIW vector or four new SIMD vectors), a single-threaded task would average usage of 56 out of 64 ALUs (3.5 per VLIW) on the old arch, but only 16 out of 64 on the new.

However, AMD is plainly counting on there being many, many potential threads. If you have abundant threads then you can guarantee that you can fill up the remaining 48 ALUs with different threads, whereas the 8 unused ALUs in the VLIW arch are off-limits.

This is a less general architecture, but as long as all your problems are massively parallel, creating all those extra threads shouldn't be a problem. AMD is sacrificing generality in favour of the embarrassingly parallel.Reply

I actually asked a similar question at the AMD Fusion Developer Summit.

The minimum number of wavefronts (i.e., batches of 64 work-items) needed to keep a Cypress/Cayman CU fed is two, while GCN requires four wavefronts (so twice as many). However it is the case that quite often (for all of my programs actually) you really do need four wavefronts per CU on Cypress/Cayman to effectively hide the global memory latency. The guy I was talking to at AMD seemed to thnik that in practice the number of work-items needed would stay about the same between Cayman and GCN for most applications.

I've asked this question on the AMD developer forums as well, but I don't know how many answers will be given about GCN there.Reply

I certainly wouldn't be surprised to hear that typical GPGPU workloads could inundate the GPU with threads and so provide more than enough wavefronts. The GPGPU workloads are pretty much all of the embarrassingly parallel kind, so creating more threads should tend to be pretty trivial.

So your experience certainly makes sense with what I'd expect.

It's not that I think this is necessarily a bad change for the applications that people use GPGPU processing for.

It's more that I'm disputing the implication that this somehow makes the GPU more general and easier to take advantage of; to my mind it's doing the exact opposite of that.

Or to put it another way: virtually every single program has a reasonable amount of instruction level parallelism. Data-/thread-level parallelism is much rarer. We're losing the former to improve the latter.

For problems amenable to massive thread-/data-level parallelism the result should be substantially more ALUs available to process on. But for problems with only limited data-/thread-level parallelism, it's a step backwards.Reply

"The next architecture will have 16-wide SIMD. How does that fit computational problems better than a 16-wide MIMD VLIW architecture? VLIW can act as if it were SIMD if necessary (simply make each instruction within the word the same, varying only the operands), so how on earth can SIMD be better? SIMD is strictly less general and less flexible than VLIW."

A VLIW system has to have instruction decoders and routers for every instruction, and thus for every data item that is processed.A SIMD system only has to have one instruction decoder and router for every 16 data items that are processed. If your computations consist primarily of doing the same thing to multiple data items this is a win. (More processing for less power and less silicon.) If your computations do NOT consist primarily of doing the same thing to multiple data items, it's a loss.

Or, to put it differently, is it worth investing silicon in moving instructions around with great facility, or is it better to invest silicon in moving data around with great facility? Seymour Crane thought (for the problems he cared about) the answer was data. I'd like to think AMD know enough about what they are doing that they have the numbers in hand, and have calculated that, once again for them the answer is data. Reply

What is being describe is tantamont Vector Processing that was featured on CRAY supercomputers available in the 70's through 90's. In the machines I once programmed (using CFT77 compiler), a vector was 64 64-bit words that was processed through a pipe. Reply

This pleases me, because this will likely mean that AMD no longer has such a performance per dollar and watt difference from Nvidia. Thus further degrading most arguments AMD fanboys have against Nvidia. I see this being a benefit for Nvidia in the long term. After AMD claiming what Nvidia was doing wasn't right, they basically give up and are doing it themselves now too.Reply

Looking forward to buying a 7970 (or possibly a 7950) to go along with my Sandy Bridge build. I'm currently running on Intel HD3000 and it's killing me. But just a few more days now. Hopefully I can hit the refresh button on my browser fast enough to catch one before they sell out.Reply