Posted
by
samzenpus
on Monday May 05, 2014 @02:24PM
from the brand-new dept.

crookedvulture (1866146) writes "AMD just revealed that it has two all-new CPU cores in the works. One will be compatible with the 64-bit ARMv8 instruction set, while the other is meant as an x86 replacement for the Bulldozer architecture and its descendants. Both cores have been designed from the ground up by a team led by Jim Keller, the lead architect behind AMD's K8 architecture. Keller worked at Apple on the A4 and A5 before returning to AMD in 2012. The first chips based on the new AMD cores are due in 2016."

They were never fast; but they were pretty much the only game in town if you wanted x86 within tight thermal constraints, for a time after they launched. VIA was similarly tepid and a bit hotter and Intel was pretending that a "Pentium 4 Mobile" was something other than a contradiction in terms.

Now, once Intel stopped pretending that Netburst was something other than a failure, and put some actual effort into lower power designs, it was Game Over; but they didn't do that overnight.

Intel has only shown you what's possible with a large number of advanced low-power transistors. That's still just one design (of the many possible ones) that uses this level of logic integration. Does that mean that it's impossible to do anything better with the same large number of advanced low-power transistors? Do you have any reason to believe that the Transmeta approach (that actually worked better back then) wouldn't work better now for some reason?

Transmeta was at the end of the era where decoding performance mattered. Keeping the translated code around was actually useful. These days decoding is approximately free on any CPU with half-decent performance -- the amount of extra die space for a complex decoder is not worth worrying about.

You can save a bit of power with a simpler decode stage, but you are unlikely to beat ARM Thumb-2 on power by software-translating x86 the way Transmeta did. Besides, most of the interesting code for low power applications is ARM or MIPS already, so what is the point?
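
The "keeping the translated code around" point is easy to sketch. Here's a toy Python illustration of Transmeta-style code morphing (the guest "ISA", block format, and addresses are all invented for illustration — real code morphing translated x86 machine code into native VLIW):

```python
# Minimal sketch of the Transmeta-style idea: translate guest code blocks
# once, cache the translation, and reuse it on later executions.
# The toy "guest ISA" here is purely hypothetical.

translation_cache = {}  # guest block address -> translated (host) function

def translate_block(guest_block):
    """Simulate an expensive one-time translation of a guest basic block."""
    ops = guest_block["ops"]           # e.g. [("add", 2), ("mul", 3)]
    def host_code(x):
        for op, arg in ops:            # "translated" host-native behavior
            x = x + arg if op == "add" else x * arg
        return x
    return host_code

def execute(addr, guest_block, x):
    # The win: translation cost is paid once per block, not once per run.
    if addr not in translation_cache:
        translation_cache[addr] = translate_block(guest_block)
    return translation_cache[addr](x)

block = {"ops": [("add", 2), ("mul", 3)]}
print(execute(0x1000, block, 5))   # (5 + 2) * 3 = 21
print(execute(0x1000, block, 5))   # second run hits the cache
```

The whole argument above is about whether that one-time translation cost still buys you anything when hardware decode is nearly free.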

These days decoding is approximately free on any CPU with half-decent performance

In what way? And what do you mean by "decoding"? Do you also include dependency resolution, interlocking, reordering, etc.? Because what I was thinking about was pushing even more into the SW component. The problem is that CPUs have been widening for quite some time because of our over-reliance on single-threaded SW. But even if it doesn't work nearly as well for eight-issue monsters, simple cores like Jaguar, which seem to be practicable if you have many more of them, push you back into the time of "quart

You cannot meaningfully do reordering and so on in software on a modern CPU. You do not know in advance which operands will be available from memory at which time. You have to redo that work every time you get to the code (unless it is in a tight loop, but modern x86's are REALLY good at tight loops) because circumstances will likely have changed -- and you cannot reorder in software every time, that is just too costly.

If you want to see an architecture which looks like it has a chance of breaking the limits on single-threaded performance, look at the Mill [millcomputing.com]. In theory you could software-translate x86 to Mill code and gain performance, but it would be really tricky and no Mill implementations exist yet.

I guess you're right on the reorderings, there are unpredictable aspects to the execution trace. But then again, there's the engineering maxim that every extra component has to justify its value to be included in a system. Surely these circuits made sense when Pentium III was competing with P4 was competing with K7. Whether their usefulness is undiminished in low-power parallel systems seems like the question to me, though. There appears to be a law of diminishing returns for everything.

You do not know in advance which operands will be available from memory at which time.... If you want to see an architecture which looks like it has a chance of breaking the limits on single-threaded performance

Do I really need to know that, or can I just switch to a different thread of execution until then? And do we really need to care about single-threaded performance that much these days? What if I want to program in Go instead of C++? (E.g., what if Google wants 0.5M of new servers for deploying of Go services?) Perhaps some level of "outoforderiness" is desirable, but a lower one would do? I really don't care in what way the performance gets squeezed into my battery-powered devices
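
The "just switch to a different thread until the load arrives" idea can be sketched with Python generators standing in for hardware threads (everything here is a toy model — the `yield` plays the role of a cache miss, and the scheduler plays the role of barrel/SMT-style hardware or Go's goroutine scheduler):

```python
# Toy illustration of hiding memory latency by switching threads instead of
# reordering instructions: each worker yields at a simulated cache miss,
# and a round-robin scheduler runs whichever thread is ready.

def worker(name, values, log):
    total = 0
    for v in values:
        yield                     # simulated stall: "load from memory"
        total += v                # the load has "arrived", use it
        log.append((name, total))
    return total

def run(threads):
    """Round-robin scheduler: on a stall, switch to the next thread."""
    threads = list(threads)
    results = {}
    while threads:
        name, gen = threads.pop(0)
        try:
            next(gen)
            threads.append((name, gen))   # stalled but alive: requeue it
        except StopIteration as done:
            results[name] = done.value
    return results

log = []
res = run([("a", worker("a", [1, 2, 3], log)),
           ("b", worker("b", [10, 20], log))])
print(res)   # {'a': 6, 'b': 30}
```

The `log` shows the two threads' work interleaved — no thread ever sits idle waiting while another has work ready.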

That's a red herring. Many more tasks are probably parallelizable than most people would think. See Guy Steele's work. I think I even came up with a scheme to run TeX passes using speculative execution (results always correct, and most of the time faster) the other day (the state to keep around fortunately isn't very large).
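
The "always correct, often faster" speculation pattern mentioned above looks roughly like this (a hypothetical two-stage pipeline, not any real TeX internals — stage names and the state format are invented):

```python
# Sketch of speculative execution across pipeline stages: start stage 2
# with a *predicted* version of stage 1's output state, then verify the
# prediction; only on a mispredict is stage 2 redone. The result is
# always correct either way.

def stage1(doc):
    return {"pages": len(doc) // 10 + 1}      # produces layout "state"

def stage2(doc, state):
    return f"{len(doc)} chars on {state['pages']} pages"

def speculative_pipeline(doc, predicted_state):
    # In a real system stage2 would run concurrently with stage1;
    # run sequentially, the control flow (and the result) is the same.
    speculative = stage2(doc, predicted_state)
    actual = stage1(doc)
    if actual == predicted_state:
        return speculative, True       # speculation paid off
    return stage2(doc, actual), False  # mispredict: recompute, still correct

doc = "x" * 42
out, hit = speculative_pipeline(doc, {"pages": 5})
print(out, hit)   # "42 chars on 5 pages" True
```

When the prediction is usually right (as with a document that changed little between runs), the two stages overlap most of the time, and you never pay for it in correctness.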

Not to mention that on most 'desktop' or 'server' machines, the OS is constantly juggling hundreds or thousands of processes, so while an individual program may be single-threaded, the operating system can be spread across all available processors. The hard thing is knowing, for an individual process and core, when it is worth switching context - shunting it off to wait for I/O and shoveling a different process onto that core - or just idling that core for a while. IIRC (from _long_ ago), I/O typically cos

But that was part of the very concept of VLIW, which both Crusoe & Efficeon were. Those processors were somewhat more RISC than VLIW, except that their integer units were 128-bit and 256-bit, as opposed to 32-bit or 64-bit. Essentially, the idea here was that the bottom core would be constant, and any time there was an instruction set upgrade in a CPU from Intel or AMD, the Transmeta CPU would implement those new instructions in terms of their own native instructions, which would presumably eith

Transmeta was at the end of the era where decoding performance mattered. Keeping the translated code around was actually useful. These days decoding is approximately free on any CPU with half-decent performance -- the amount of extra die space for a complex decoder is not worth worrying about.

Actually Intel has recently returned to that. They now keep a small microinstruction cache of decoded instructions around so that loops can be executed more efficiently.
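
That micro-op cache amounts to memoizing the decode step so hot loops skip it, which can be sketched like this (the toy text "ISA" is invented; a real uop cache holds decoded x86 micro-ops and is indexed by fetch address):

```python
# Toy version of a decoded-instruction (micro-op) cache: decode each raw
# instruction once, and let hot loops fetch the decoded form directly.

import functools

@functools.lru_cache(maxsize=64)          # stands in for the uop cache
def decode(raw):
    op, _, arg = raw.partition(" ")
    return (op, int(arg))                 # the "decoded micro-op"

def run_loop(program, iterations):
    acc = 0
    for _ in range(iterations):           # hot loop: decode() hits the cache
        for raw in program:
            op, arg = decode(raw)
            acc = acc + arg if op == "ADD" else acc * arg
    return acc

prog = ["ADD 3", "MUL 2"]
print(run_loop(prog, 2))         # ((0 + 3) * 2 + 3) * 2 = 18
print(decode.cache_info())       # later iterations reuse decoded ops
```

The cache's value is exactly the loop case: the decode cost is amortized over every iteration after the first.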

The Transmeta chip was not a smash hit, so probably not. The really cool thing is that you will see ARM and x86 share parts. GPU cores are a no-brainer. Throwing in things like cache and memory controllers could be a big deal. ARM sharing a socket with x86 will be really cool IMHO.

How old is x86 now? It took a long time (ten years?) to get just a basic 32-bit protected mode operating system out to people at large after the hardware (80386) was out. I hope you're not expecting AMD to roll out a full-blown ecosystem of HW, drivers, compilers, and thousands of applications within a year just because you're impatient. I'm afraid the free lunch is over, but still, HSA is hardly a complexity monster. To me, it didn't seem nearly as threatening as a single look at the total size of x86+AMD6

It took a long time (ten years?) to get just a basic 32-bit protected mode operating system out to people at large after the hardware (80386) was out.

Double facepalm!! That's one version of the story. In other news, the day after the first Prius was available for sale, there was a global recall on internal combustion engines—the kind of recall where they don't give back.

The hump where protected mode starts to drive real productivity benefit is somewhere above a 486SX/25 with 8 MB of RAM and a 120 M

Well, now it seems that the adoption of heterogeneous systems will be slowed down by the software cartel. Meet the new boss, not the same as the old one but you won't notice the difference. Still, the hardware has to start somewhere, I guess.

You should have tried OS/2 with the 486DX. The SX laptop would have been slow with anything but DOS; no local bus and a glacial hard disk is killer. I had OS/2 on a 486DX-40 with 8 MB RAM and it was great.

As the other poster said, you should have tried OS/2. I had pretty good multitasking on a 486DLC at 33MHz on a good 386 board with 8MB, local bus, and a 120MB drive. It seemed to fly on a 486/100 with 32MB; even X was fast, and it was nice having 3 desktops (X, Win16 and WPS) running at once even though you had only one displayed (actually, Windows ran seamless with each program in its own session, so when they crashed only the program died instead of the system).

I don't see how SSE is anything like it. Either you have an SSE or AVX unit or you don't. If you do, you use it exclusively. With a hybrid x86+ARM+GPU chip, you need to give work to at least all 3 of them, and it's nearly impossible to predict which unit will be best for each task or even to schedule the damn thing dynamically.
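
To see why the prediction part is hard, consider the naive approach: a static cost model per task kind and per unit (all numbers here are made up for illustration):

```python
# Naive sketch of heterogeneous dispatch: pick a unit (x86/ARM/GPU) per
# task from a static cost table. The problem the comment above points at:
# if the model is wrong for a workload, the scheduler silently picks the
# wrong unit, and real costs depend on data sizes, transfer overhead, and
# contention that a static table can't capture.

COST = {  # estimated time units per task kind, per execution unit
    "branchy":  {"x86": 1.0, "arm": 1.2, "gpu": 9.0},
    "parallel": {"x86": 5.0, "arm": 6.0, "gpu": 0.5},
}

def dispatch(task_kind):
    est = COST[task_kind]
    return min(est, key=est.get)   # greedily choose the cheapest unit

print(dispatch("branchy"))   # x86
print(dispatch("parallel"))  # gpu
```

A dynamic scheduler would have to refine these estimates at run time, which is exactly the overhead the comment is skeptical about.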

As I recall, the shared registers were a real problem with MMX. It meant that there was a big latency cost as the chip switched between superscalar and traditional operating modes. It made for a penalty that frequently negated the MMX performance benefit.

See, we can simply compile the program on the chip we want to use it on.

The problem is that humans are stupid. Languages at the human interface level should never compile down into machine code. All languages should compile down into bytecode. You should NEVER distribute programs as binaries (that would be dumb). Then the hardware abstraction layer (your OS) can compile the bytecode INTO OPTIMIZED machine code f

Except that when you do this, you have the opportunity to effectively turn a hardware interpreter into a software compiler, reducing control logic (and its constant switching during code execution) and improving efficiency in the same way in which software compilers are better than software interpreters, even if the gap won't be nearly that wide. You can turn the same hardware interpreter into a hardware compiler, but then you have something like a trace cache and the logic has actually increased. Would the

Except that when you do this, you have the opportunity to effectively turn a hardware interpreter into a software compiler, reducing control logic (and its constant switching during code execution) and improving efficiency in the same way in which software compilers are better than software interpreters, even if the gap won't be nearly that wide. You can turn the same hardware interpreter into a hardware compiler, but then you have something like a trace cache and the logic has actually increased.

^^^^ that doesn't support this,

Would the SW solution decrease performance per thread? Quite likely. Would it improve performance per watt, which is what will really matter in the future? Well, what if it will?
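
The "software compiler beats software interpreter" gap referred to above can be shown concretely: run the same toy stack bytecode through a dispatch loop, versus pre-compiling it into a chain of closures so the per-instruction opcode tests are paid once, up front (the bytecode format here is invented for illustration):

```python
# The same toy bytecode, interpreted vs. "compiled" to closures. The
# compiled form does no opcode dispatch at run time -- analogous to
# compiling distributed bytecode to native code at install time.

PROGRAM = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]

def interpret(code):
    stack = []
    for op, arg in code:                    # dispatch on every instruction
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack[-1]

def compile_code(code):
    """Pay the dispatch cost once, ahead of time."""
    steps = []
    for op, arg in code:
        if op == "push":
            steps.append(lambda s, a=arg: s.append(a))
        elif op == "add":
            steps.append(lambda s: s.append(s.pop() + s.pop()))
        elif op == "mul":
            steps.append(lambda s: s.append(s.pop() * s.pop()))
    def run():
        stack = []
        for step in steps:                  # no opcode tests at run time
            step(stack)
        return stack[-1]
    return run

print(interpret(PROGRAM))        # (2 + 3) * 4 = 20
print(compile_code(PROGRAM)())   # same result, dispatch paid once
```

The hardware analogue in the comment above is the same trade: move the "interpretation" (decode/dispatch) out of the steady-state execution path, at the cost of extra up-front work and somewhere to keep the compiled form.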

Actually, K7 - when Dirk Meyer's team left DEC to join AMD - was when they first made a real technical challenge to Intel's CPUs. Until then, it was a series of one mediocre challenge after another - first the Am386s & 486s, then the NexGen acquisition, then the K6. Finally, when AMD did the Athlon w/ the ex-Alpha team from DEC and extended CISC to 64-bit, that's when things started getting interesting.

My personal favorite was the Athlon XP 1700+. The best was date code JIUHB DLT3C, it had documented cases of getting above 4GHz - pretty good considering that it is still a feat to hit that 10 years later. Bought two or three 1700+'s on ebay before I hit the jackpot. Unfortunately, I never managed to put together the water cooling system I had planned, so I never got it over 3 GHz.

But that's only because Intel let the marketing department make engineering decisions and kept making chips with higher and higher clock frequency. As soon as they regained their sanity, they once again dominated the benchmarks.

I do love how AMD brilliantly capitalized on the blunder. By labeling their chips according to the clock speed of the performance equivalent Intel chip - every time Intel put insane engineering effort into ratcheting the clock up 10% and only getting 1% better performance, AMD simply

Yup, on the server side AMD was ahead from the first Opteron until Shanghai, and then Intel launched Nehalem and they've been ahead ever since. On the desktop Intel got competitive again with the Core 2, but on a performance-per-$ metric it wasn't until Nehalem that they dominated.

On a performance per $ metric, AMD are arguably still competitive, at the expense of selling cheaply and barely breaking even financially. They are currently not competitive in performance per watt and absolute performance (both on the desktop, mobile looks a bit better).

AMD really fucked up with the Bulldozer, and while there have been modest improvements to that with Vishera and Steamroller, they were insufficient to close the gap to Intel.

I didn't say the K6-2 was the peak of AMD; just that it was the last time I really got excited about anything they came out with. AMD did some good stuff during the mid-2000's, but there were other computer upgrades that had more impact on performance -- particularly RAM. Those were the days when adding a stick of RAM was a legitimate means of being able to do amazing things like browse the internet while listening to music... at the same time! Upgrading from 512MB to 2GB was a huge boost in productivity.

I had an AMD 486 at 80MHz. It was cheaper than an i486 at 66MHz and performed great. The Pentium had just come out at the time but was super expensive. I was able to find a late-model 486 board with PCI slots, though, and with the awesome value of the AMD chip was able to have a nice "budget" system for the time. It was even able to run Quake playably (a game which "required" the Pentium and its baller FPU).

I have a really hard time believing this, and would state that your memory does not serve you very well. A 33MHz 486 couldn't handle more complex scenes in DOOM, and definitely not in Quake. I gamed actively at the time Quake came out, and recall that only much later, on a P233 MMX, could I get a frame rate rivaling the screen refresh rate. Any 486 is so far behind that machine that it's not even funny.

A low ID number username like you probably won't believe a brat like me, so here's some proof: http [youtube.com]

K6-2 was good, but the K6-III was much better. It was the first consumer-level CPU with on-die L2 cache. It scared Intel enough that they renamed the PII to PIII (because anything with a 3 in the name is clearly better than anything with a 2 in the name). The down side was that the K6-III overclocked for shit.

64 'cores' is 32 Piledriver modules. That was a gamble that by and large did not pan out as hoped. For a lot of applications, you must consider those 32 cores. Intel is currently at 12 cores per package versus AMD's 8 per package.

Intel is less frequently found with their EP line in a 4-socket configuration because the performance of dual socket can be much higher with Intel's QPI than 4 socket. AMD can't do that topology, so you might as well do 4 socket. Additionally, the memory architecture of Intel tends to cause more DIMM slots to be put on a board.

AMD's thermals are actually a bit worse than Intel's, so it's not that AMD can be reasonably crammed in but Intel cannot. The pricing disparity is something that Intel chooses at their discretion (their margin is obscene), so if Intel ever gets pressure, they could halve their margin and still be healthy margin-wise.

I'm hoping this lives up to the legacy of the K7 architecture. K7 left Intel horribly embarrassed, and it took them years to finally catch up, when they launched Nehalem. Bulldozer was a decent experiment, and software tooling has improved utilization, but it's still rough. With Intel ahead in both microarchitecture and manufacturing process, AMD is currently left with 'budget' pricing out of desperation as their strategy. This is by no means something to dismiss, but it's certainly less exciting and perhaps not sustainable, since their costs are in fact higher than Intel's (though Intel's R&D budget is gigantic to fuel that low-cost per-unit advantage, so while the difference in gross margin between Intel and AMD is huge, the difference in net margin isn't as drastic). If the Bulldozer scheme had worked out well, it could have meant another era of AMD dominance, but it sadly didn't work as well in practice.

Unfortunately I do not have a link. I do however know some system designers.

They designed a 4 socket Opteron system, and did not make a dual socket. It was peculiar to me so I asked why not a dual socket and they said there was no point in a dual socket because there was no performance advantage.

They also designed both a 4 socket EP system and a 2 socket EP system. I asked why and they said that they could gang up the two QPI links between two sockets for better performance.

Is AMD just around so Intel doesn't get bogged down by anti-monopoly or antitrust penalties?

Somehow these days, I think it's yes. And I think Intel's lobbing customers AMD's way to ensure that AMD survives. E.g., the current generation of consoles now sport AMD processors. I'm sure Intel would be more than happy to have the business, but not only do they not need it, they see it as a way to give AMD much needed cash for the next few years.

Somehow these days, I think it's yes. And I think Intel's lobbing customers AMD's way to ensure that AMD survives. E.g., the current generation of consoles now sport AMD processors. I'm sure Intel would be more than happy to have the business, but not only do they not need it, they see it as a way to give AMD much needed cash for the next few years.

Consoles are primarily about graphics, not CPU power. While Intel's integrated graphics suck somewhat less than they used to, the PS4 has 1152 shaders backed by 8GB GDDR5, and Intel has never had anything remotely close to that, maybe a third or a quarter of that tops. An Intel CPU with AMD dedicated graphics would be very unlikely since AMD would almost certainly price it so their CPU/GPU combo came out better. So realistically it was AMD vs Intel+nVidia, neither of which like to sell themselves cheap. I don't

Well, in the *desktops*, Core marked an end to AMD dominance in most practical terms, but architecturally they still were not very good for scalability. Basically, they turned back the clock to the Pentium III on modern processes, and that was enough to recover the desktop space.

Nehalem is the point at which Intel basically overtook AMD again and AMD has not come back since that point. So Intel's had the ball for 3 of their 'tocks'. AMD prior to K7 was pretty weak for a lot longer than that and I don't think

Intel is about ten times as big as AMD by every metric (except the negative profit metric - Intel actually makes $10bn profit a year, AMD is just losing money).

AMD is tiny, it's an irrelevance in the grand scheme of things. Pretending no one would notice Intel's demise whilst AMD will be around long after is comical. Anyway, AMD doesn't even make half the chips you're on about, that's comp

I was such an AMD fanboy ever since I built my first (new) computer with a K6-II. I have to admit I miss the days of the Athlon being called "The CPU that keeps Intel awake at night." After Bulldozer bombed so thoroughly I just gave up and haven't followed AMD's products since. I definitely wouldn't mind a comeback, if they can pull it off.

I don't get it. Do you, and just about everyone else who has posted in this discussion, only buy chips that cost > $200? Because AMD is, and always has been, competitive with Intel in the sub-$200 price range.

Sub $200 chips have, for a very long time, been very fine processors for the vast majority of desktop computer tasks. So for years now, if you're anything close to a mainstream computer user, there has been an AMD part competitive with an Intel part for your needs.

Of course, once you get to the high end, AMD cannot compete with Intel; but that's only a segment of the market, and it is, in fact, a much smaller segment than the sub $200 segment.

I personally have a Phenom II x6 that I got for $199 when they first came out (sometime in 2011 I believe) that was, at the time, better on price/performance than any Intel chip for my needs (mostly, parallel compiles of large software products) and absolutely sufficient for any nonintensive task, which is 99% of everything else I do besides compiling.

Anyway, if you only think of the > $200 segment, why stop there? I'm pretty sure that for > $10,000 there are CPUs made by IBM that Intel cannot possibly compete with.

I don't know how much of a profit they're making on their APUs, but they're the winners of the current console generation (somewhat surprisingly, the winner of the previous gen was IBM with PPC/Cell). I'm hoping they stay afloat - they may only be competitive (when it comes to general x86/x64) on very few tasks that require very many cores (and even then probably using more watts at that), but it's never healthy to have a monopoly.

Last quarter, they lost $3 million on CPUs/APUs, so in practice they're breaking even, but revenue is going down, which means less and less goes to R&D. Their profits last quarter came partly from dedicated graphics cards but mostly from console chips. Which is of course better than a loss, but consoles have a very special life cycle, with high launch and Christmas sales and little in between, so it's unclear how long that'll last.

It seems that it would be fertile territory for genetic algorithms to design the die. Sure, humans need to define the features, but run everything through a genetic algorithm, simulate and let the computer grow its own chips. Perhaps whole chips are not practical, but sub-processing units could do it.
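
A genetic algorithm over designs looks like this in miniature (everything here is a stand-in: the "genome" is a bit string, the fitness is distance to a fixed target; a real die-design GA would need a vastly richer genome and a circuit simulator as the fitness function):

```python
# Tiny genetic-algorithm sketch: selection, single-point crossover, and
# mutation evolve a bit-string "design" toward a target fitness.

import random

random.seed(1)
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]          # stand-in for an "ideal design"

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def evolve(pop_size=20, generations=60, mut_rate=0.1):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]       # single-point crossover
            child = [1 - g if random.random() < mut_rate else g
                     for g in child]        # mutation
            children.append(child)
        pop = parents + children            # elitism: parents survive intact
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

The follow-up comments below get at the real catch: this converges quickly on an 8-bit toy, but the search space and the cost of each fitness evaluation explode long before you reach even a small core.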

Sounds useful, but for smaller cores. Having said that, the more you simplify the design, the better for certain smarter methods. For example, it's my understanding that Chuck Moore optimizes his Forth cores to expand the envelope of operating conditions to an extent that AMD and Intel can't afford, simply because their cores are too large to be understood. Too many state transitions to study, too many gates, etc., whereas CM can afford to simply run a full physical model including individual transistor temp

Specifically, they don't scale well to large problems, which is exactly the opposite of what we need to be able to automate the design of an entire core.

Well, that's why one should try it with small problems instead! The core I've mentioned above is barely VLSI by modern standards; it has something like 30k gates. Is this still above the limit you mention?

Intel is going to have something on the market that runs more efficiently and with better performance. Try as they might, AMD just hasn't been able to get their act together and produce a decently performing product since the Athlon II.

Consoles are using AMD because the parts are cheap, not because the performance/watt is fantastic. AMD hasn't been able to produce a CPU with amazing performance, decent thermals, and high power efficiency for years now. Why do you think gaming PCs and nearly all laptops use Intel? Because Intel offers all three with ease.

Excuse me for injecting a note of reality into your rant, but I thought consoles care about heat. Also, aren't "thermals" and "power efficiency" the same thing? Or does that get in the way of your rhetoric?

I'm still waiting for an upgrade to my AMD FX-6300. I bought it on the promise that there would be an upgrade. I've liked AMD for a long time, but getting burned on the first processor I buy from them is no way to keep customers.

So you've been here longer than I have (UID), liked AMD for a long time yet never bought one in the golden years from 1999 (launch of Athlon) - 2006 (Intel launching Core) or relative competitiveness up to 2010 (with Phenom II x6 still giving Intel a fair fight) but waited until October 2012 when they were clearly well into a decline? Pardon me but your story smells worse than shrimps left out in the sun for a week.

On the bright side, you would no longer need a heater for that room in winter. Just run Folding@Home.

I still think Intel's business agreements in the mid 2000s that put AMD in its current position were immoral if not illegal, so I buy AMD anyway. But I don't buy because the product is better, I buy because the competition were assholes even though they're currently assholes with better products.