What do you get when you add macro-op fusion ability to your floating point engine? You get really impressive FP performance. This technology, no doubt introduced into Intel's x86 line from the knowledge gained by the company's much more impressive EPIC (Itanium) line, gives Intel a significant advantage when it comes to floating point operations. And the benchmarks show that Intel has been able to increase its FP performance by a fairly remarkable 38% compared to its XE 965 P4 model.

It is clear that these combined abilities (multiply + add in a single op, for example) are well worth their weight in silicon. I'm wondering how many other combined abilities would be doable with the x86 instruction set?

The opcode sequences are still encoded as they are today, such as two assembly lines of code:

fmul
fadd

But what ends up happening is that the CPU core sees those two together and says, “Aha! Let's combine them internally into a single operation,” and thus the “fmul_add” instruction is executed. If FP code can be re-compiled to take advantage of this ability, the compiler can deliberately emit what might look like an unnecessarily long, or even oddly ordered, series of computations so that the fusible pairs are exposed to the core. Code that today only happens to execute those instruction sequences in the necessary order could then take advantage of the ability by design.
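The pairing idea described above can be sketched as a toy pass over a list of instruction names. This is only an illustration under loud assumptions: the `fuse` function and the `FUSIBLE_PAIRS` table are hypothetical, instructions are modeled as plain strings, and no check is made that the add actually consumes the multiply's result. It is not a description of how any real decoder works.

```python
# Toy model only: real cores fuse in hardware, not on Python strings.
# FUSIBLE_PAIRS and fuse() are hypothetical names made up for this sketch;
# "fmul_add" is the fused name used in the article text.
FUSIBLE_PAIRS = {("fmul", "fadd"): "fmul_add"}


def fuse(ops):
    """Return a new list where each adjacent fusible pair becomes one op."""
    fused = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSIBLE_PAIRS:
            fused.append(FUSIBLE_PAIRS[pair])
            i += 2  # both instructions were consumed by the fused op
        else:
            fused.append(ops[i])
            i += 1
    return fused


# Only the adjacent fmul/fadd pair fuses; the separated pair does not.
program = ["fld", "fmul", "fadd", "fmul", "fst", "fadd"]
fused_program = fuse(program)
print(fused_program)
```

In this sketch the second `fmul` and the last `fadd` stay separate because another instruction sits between them, which is why instruction ordering by the compiler would matter under the scheme the article describes.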

I would very much like to see this ability added to all x86 cores, as it seems to make perfect sense. Additional similar abilities, where possible, should also be exposed. There's no reason CISC can't make a comeback, especially when the RISC vs. CISC arguments have kind of been made null and void by bandwidth issues.

Read the full benchmark at HKEPC.com, and thanks to Alpha for the heads-up.

Of course… (12:25pm EST Mon Jun 19 2006) The real story here is that the X6800 is one SMOKIN' chip. Fascinating how well all this adds up. I'm sitting on the edge of my chair … waiting for AMD to deliver a response.

I wonder whether Intel's really using a fusion-of-operations, or the exact equivalent: independence of the calculation sub units “add” and “mul” in the conjoined ADD/MUL unit.

Being able to independently schedule them, OR to pipeline them should result in the best performance overall: mostly they would receive non-interlocked calculations, but on some segments of code, the 'fusion' concept would be readily attainable.

Multiplies typically take a whole lot longer than ADDs to execute. No matter how they're implemented, they still are essentially “combinatorial ADDs” in execution.

– by GoatGuy
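GoatGuy's remark that a multiply is essentially a combination of adds can be illustrated with the classic shift-and-add method. This is a minimal sketch for non-negative integers, not a model of any particular hardware multiplier; the function name `shift_add_multiply` is made up for the example.

```python
def shift_add_multiply(a: int, b: int) -> tuple[int, int]:
    """Multiply two non-negative integers using only shifts and adds.

    Returns (product, number_of_adds_performed).
    """
    product = 0
    adds = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            product += a
            adds += 1
        a <<= 1            # next partial product is twice as large
        b >>= 1            # move on to the next bit of b
    return product, adds


result = shift_add_multiply(13, 11)
print(result)
```

One add is needed per set bit of the multiplier, which gives a rough feel for why a multiply costs more than a single add when done this way.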

single clock MAC (multiply/accumulate) has been part of DSP for years (12:28pm EST Mon Jun 19 2006) DSPs have had multiple single-clock MAC engines in them for years. In fact TI's newest line of DSPs have dual MACs in them and go for less than $10. I'm glad to see the x86 world finally dedicating these gates. I have always wondered why they had not done this. Now that they have, the 38% gain in their floating point engine is more than enough reason to add these gates. – by EE
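The multiply/accumulate operation EE mentions is the inner step of a dot product or filter: multiply two values and add the result to a running total. A minimal sketch, with the function name `mac_dot` chosen purely for illustration:

```python
def mac_dot(xs, ys):
    """Dot product written as a chain of multiply/accumulate steps."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y   # one MAC: multiply, then accumulate
    return acc


total = mac_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(total)
```

Each loop iteration is one multiply followed by one dependent add, which is the pattern a single-clock MAC unit is built to handle in one step.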

nice report (12:32pm EST Mon Jun 19 2006) rick, excellent job.

fx-62 has been dethroned. anyone know the price for a retail box of x6800? with fx-62 sitting at $1100, i don't know how much money gamers are willing to put out for 10% gain in performance.

any thoughts? – by kw

Competition is good (12:48pm EST Mon Jun 19 2006) Thanks to AMD's improved X86 implementation, Intel has finally answered back, which is good for us consumers.

Of course the price of the products will still affect which one sells most. – by AM2mann

Sad (1:18pm EST Mon Jun 19 2006) RickGeek has not even mentioned that the X6800 blows away the FX-62. This is not even the best possible result. Stepping 5 Conroes are performing even better.

I am sure K8L will have single-cycle SSE too (or perhaps a “double pumped” 64-bit SSE similar to the old NetBurst's double-speed ALU, though a full SSE unit at 5-6 GHz may not be doable at 65 nm). – by Alpha

Now I'll just wait until after July 23/24 and upgrade my 3800+ 939 to either an FX-60 or a 4600+ X2 after AMD drops the price. And be 99.44% as fast in real life usage as Intel’s wonderful chips.

Thanks Intel. – by The Traveler

The Traveler (3:00pm EST Mon Jun 19 2006) Funny thing is, I too have been waiting to upgrade from a 3800+ socket 939, but the prices for the 4800+ never dropped a bit until Conroe came along. Now I understand that AMD cannot provide any of the CPUs with 1MB cache since the silicon is needed to keep up with Opteron demand. AMD's price drop comes in as too little, too late for me. I will take this chance to jump back on the Intel ship, since I was not all that impressed with AMD/NVidia's poor documentation for higher features like bootable RAID (the instructions I got were not even for the same BIOS as my motherboard, which I consider an unforgivable mistake). – by No name avoids flame

I prefer this approach (4:01pm EST Mon Jun 19 2006) In my opinion, adding the gates needed to support a single mult/add operation makes a lot more sense than adding whole CPUs. Especially when you look at the number of gates needed for each and the relative performance bumps you get back from each.

I've never been a fan of adding cores. It has its merits, especially for certain kinds of problems that have tons of redundant operations associated with them. But for more random processing tasks there are far smarter ways to “throw gates” at your compute problems than almost doubling your number of gates for what amounts to much, much less than double the performance.

Better to add a fraction of those gates in your instruction pipeline (i.e. hyper-threading), your ALU (like they did here), in your high-latency interfaces (memory, PCI, etc.), or wherever else you are experiencing the greatest bottleneck. I really hope this kind of intelligent design prevails at both companies. I prefer smarter gate implementations to just throwing down more gates, which is all several of these dual core products really are.

Develop current “CISC translate to RISC u-ops on the fly”? Not so many $$$

Anyone think a REAL, GENUINE CISC will commercially exist and battle against an FX62? – by never

what a fanboy idiot – never (5:42pm EST Mon Jun 19 2006) Intel and AMD processors all use the same x86 instruction set you idiot. If they didn't, code compiled for one would not run on the other. In fact most compilers don't know which one they are targeting unless you write special subroutines (usually using machine language) to do it.

AMD started out making EXACT duplicates of x486 CISC CPUs. There isn't a single gate's worth of difference between an AMD x486 and an Intel x486; they were identical, and everything since then still utilizes the same x86 instruction set.

AMD did go off and add to this instruction set when they went to 64 bits before Intel did. In fact Intel now licenses the AMD 64-bit x86 instruction set from AMD, just like AMD used to license Intel's.

If you don't know simple shit like that it's unlikely that you have the slightest clue what the cost difference between a RISC and CISC CPU might be or how many instructions you have to have before a CPU is classified as RISC or CISC.

go back to junior high with the other fanboy idiots and let the grown ups discuss grown up things…

– by dumbass

what a fanboy idiot continued – never (5:49pm EST Mon Jun 19 2006) BTW an FX-60 is two FX-55 Athlon cores stuck together. The FX-62 is an FX-60 with an updated DDR2 on-chip memory controller and just enough other stuff changed so it won't fit in the same socket. They and the Pentium P4EE all run the exact same CISC instruction set and have very similar benchmarks; there isn't a 10% difference from top to bottom. – by dumbass

But not one single benchmark is mentioned. Not one comparison between the FX-62 and Core 2 is discussed. Only the improvement of Core 2 vs P4 is discussed.

What's the point, Rick? – by Not a retard

Hmmmm (7:43pm EST Mon Jun 19 2006) “If FP code can be re-compiled to take advantage of this ability” – If this happened, would the code be readable on a non macro fusion enabled CPU (like AMD's or Intel's older chips)?

– by Headley

re: Headley (8:55pm EST Mon Jun 19 2006) Only x87 FPU code will have to be recompiled; SSE/2 code should be fine. Compilers try to pair instructions up to avoid the latencies of working on the same addresses (data) one instruction after another. Since optimized SIMD uses parallel execution, the processor itself will do the op-fusion optimizations, BUT FPU code does 1 operation at a time, so where possible the compiler scatters the operations between OTHER (unrelated) operations to optimize (avoid latency, improve execution speed). This scattering will be UNnecessary with op-fusion, for mul/add pairs anyway.

Great article. (3:11am EST Tue Jun 20 2006) I suspect it will not be long until AMD hit back and then some. But at the moment I am very impressed with the new Intel chips. I am glad we are still getting more innovations to the good old X86. But I can't help thinking all the effort going into trying to bring life to X86 is a bit of a waste of time. If Intel had (they have not) made the IA64 range a viable unit for the masses, we would all probably be at a much higher level of FP performance. I suspect as well that if in the early days Intel had really pushed the IA64, other manufacturers such as AMD and VIA would have made their own architectures on this line and we would still have the chip race; it's just that we would all be much further ahead. I could be wrong though. – by Hougham

from a coder's point of view (5:36am EST Tue Jun 20 2006) op fusion is a workaround. yep i can see the benefits, but fusing the operations has a footprint. Better to have pre-fused operations in the instruction set in the first place. oh sh.. is that what we used to call CISC? maybe we could call it expanded-CISC.

– by d.ddd

Wouter Tinus (10:48am EST Tue Jun 20 2006)

I appreciate your input to the comments section. I was not aware that it was only CMP/JMP operations. Your statement is the first I've seen stating that.

No… (12:00am EST Wed Jun 21 2006) AMD still hasn't been able to move to 65nm.

What a product they would have if they could have matched Intel on silicon manufacturing and technology. Why, they would be Intel… – by AMD 90nm LOL

Hmmm (1:40am EST Wed Jun 21 2006) “Is socket F 65 nm?”

I have learnt that a new socket is only a new socket. What goes in it can be anything made for it, regardless of the tech used. It could be 65nm, 90nm, 130nm etc. It probably will be 65nm though, to keep with the “new and exciting” theme of a new socket, new RAM, new tech etc.

– by Headley

Thanks for the correction (6:43am EST Wed Jun 21 2006) So the correct question would have been: is the socket F Opteron 65 nm? The same socket F Opteron that Digitimes says has been delayed, right? There doesn't seem to be a lot of info on this part, or I'm not finding any anyway. – by TT

I laughed until I cried (10:27am EST Wed Jun 21 2006) This is funny, no?