Posted
by
simoniker
on Monday April 19, 2004 @03:01PM
from the best-title-evah dept.

argonaut writes "Ace's Hardware has an in-depth article on Niagara -- Sun's upcoming parallel server processor with 8 cores and 4 threads each. The article discusses the chip's radical architecture and what kind of performance can be expected from it in traditionally thread-heavy server applications like web hosting, databases, and other multi-user applications. Given the recent cancellation of the UltraSPARC V, it seems this is going to be Sun's new direction for its in-house CPU design efforts. Furthermore, both Intel and IBM are working on other highly parallel processors and AMD is expected to eventually introduce a dual-core Opteron. So, will more threads prop up Sun's performance?"

I must write anonymously for the sake of my job at Sun Microsystems. Namely, I want to keep it.

The Niagara processor and its successor, Rock, are based almost entirely on the Hydra [stanford.edu] processor that Professor Kunle Olukotun developed at Stanford University. He co-founded the company, Afara Websystems, that Sun Microsystems purchased. If you want to know how Niagara works, just check out the Hydra processor.

The reason that Sun Microsystems abandoned the UltraSPARC V and successors is that the design teams who developed the UltraSPARC processors after the UltraSPARC II were just horrible. Normally, when engineers develop the microarchitecture and eventually the Verilog model of the chip, a documentation engineer documents all aspects of the chip. In the case of Sun Microsystems, there was no documentation engineer. Ultimately, on the very day that Sun released its processor to the market, no documentation existed.

Even Sun's own engineers did not have the documentation to develop the boards that would accept an UltraSPARC processor. The whole experience is incredibly stupid but true. Most engineers on the processor teams are Indians or Taiwanese, and they just "do not do documentation". Various Linux gurus complained about the lack of documentation needed to port Linux to the latest version of the UltraSPARC. Sun would have loved to produce the documentation if it existed. Unfortunately, it just did not exist.

UltraSPARC V had the same problem. The whole design process for the UltraSPARC V was a mess, and canceling the project fixed the mess.

Sun does not have the engineers with the skills to build a fat-core processor. So, Sun moved to thin-core processors like Niagara. They are easier to build and to document. They simply matched Sun's skill set, which is derived mostly from foreigners.

Unfortunately for Sun, what is easy for Sun to design and build is also very easy for IBM and HP to design and build. If any IBM and HP engineers are reading this article, you are in luck. Just check out the Hydra processor, and you will know 80% of the Niagara processor's microarchitecture. Fortunately for you guys, building a Hydra-based processor that executes the Power instruction set architecture (ISA) or the HP ISA is much easier than building a processor that executes the SPARC ISA. Those damned 128-register register windows diminish the number of cores that can be squeezed onto the die.

I would like nothing more than to see Sun's processor department sunsetting by 2008. Sun should not be in the business of designing processors. The UltraSPARC III fiasco should have been a big clue.

If Sun were purely a software house, we'd have a chance of making a profit.

I can't comment on the specifics you mention, being a Sun customer/reseller rather than a true insider. However I am concerned about how the markets and community seem to have a down on Sun at present, which could itself be their undoing (i.e. the damage is being caused not by the danger but by the perception of danger).

Firstly, Sun are absolutely right to keep hold of their processor technology. The market has long since grown out of the feeds-n-speeds contest, and realised that memory latency and I/O throughput…

They simply matched Sun's skill set, which is derived mostly from foreigners.

I have a hard time understanding how one's skill set is determined by one's nationality. Ability to communicate in English, perhaps. But then again, I would say the same about any Bay Area-born engineer who moves to Bangalore and has to communicate part-time in Hindi.

In any case, I think there's a major difference between one's ability to write prose and one's ability to design a chip. Communication among team members is crucial, of course. But…

The only really significant change needs to be in the lower levels of Solaris' scheduler, so that it handles the context switches properly. Solaris already does that for existing SPARC architectures with thread level parallelism support. The only difference the OS sees is the caches and the number of available "slots" for running LWPs.

Of course, you'll only see a significant benefit when you've got lots of threads in the run-ready state (which mostly happens when you have lots of threads, period). Given Java's fondness for threads, and Solaris' already outstanding handling of systems with thousands of threads, this seems like a smart optimisation choice.

So, with the necessary Solaris installed, your existing Tomcat running on your existing JVM will see all the benefits.

So, with the necessary Solaris installed, your existing Tomcat running on your existing JVM will see all the benefits.

No, it won't. At least not so simply. It will see the benefits if there are enough concurrent threads running (as you said), and even then only if they are not waiting for each other. So it will work for many clients at once. I have my doubts that this architecture will help with most real-world tasks - even real-world server tasks - even with completely blown-out-of-proportion threading like J…

What part of hyperthreading and "both Intel and IBM are working on other highly parallel processors and AMD is expected to eventually introduce a dual-core Opteron" says to you that "Intel or IBM are not going in that direction that far"?

It might be just the way I'm reading it, but the only difference is that Intel started small (hyperthreading) and still currently relies on several physical processors. IBM's POWER already has multiple cores, and this isn't the first time a dual-core Opteron was mentioned.

But currently Opteron, POWER4 or even Pentium 4 procs _all_ outperform anything Sun has to offer - in terms of CPU speed, mind you. One core of the POWER4 is nearly twice as fast as the fastest _available_ SPARC proc. See? Intel and IBM are late to the game because they didn't see the need to be earlier. The gap between SPARCs and other CPUs has been getting bigger and bigger. It will come to the point where even the most loyal customers won't be able to justify buying Sun equipment because they are so d…

Well, strictly speaking, a well-behaved Java developer won't go thread-crazy. But a large number of us were taught that "threads are good; use them early and often," which has resulted in all sorts of problems when we get into non-Windows Java environments.

That's true, but until NIO introduced polled IO, the best-behaved Java developer had to choose between having rather a lot of threads or having their program crippled by IO waits. So there's a lot of code out there that does make lots of threads (and it's a handy programming paradigm even now, so it's not going away any time soon).
As the poster above says, it's only an improvement if you've got lots of threads. So an application server is a prime example - it ends up running _lots_ of servlet instances simultaneously, as it's mostly IO-bound (waiting for disks to spin, database servers to respond, XML-messaging back-office thingies to commune with antique COBOL boxes, etc.). This kind of application will really benefit - other stuff (e.g. graphics, raw calculation) largely won't - but stuff like WebSphere and Tomcat is exactly what folks buy mid/high-end Solaris-SPARC boxes for.
As to problems on "non-Windows" environments, you'll find fantastic thread handling on AIX, HP-UX, and Solaris, where tens of thousands of extant threads isn't going to bring the machine to its knees. NT is okay; I don't know about the BSDs. Linux _was_ horrible, but I know a bunch of work has gone into threads recently, both in the library and in 2.6 - I don't know how much better this has made things.
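The thread-per-connection vs. NIO tradeoff mentioned above can be sketched in a few lines. This is a minimal, self-contained example (the class name, loopback address, and self-connecting client are mine, for illustration only): a single thread multiplexes readiness over many channels via a Selector, instead of parking one blocked thread per socket.

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NioSketch {
    public static void main(String[] args) throws Exception {
        // One selector watches any number of channels from a single thread,
        // instead of dedicating a blocked thread to each connection.
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress("127.0.0.1", 0)); // any free port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // Connect a client to ourselves so the selector has something to report.
        int port = server.socket().getLocalPort();
        SocketChannel client =
                SocketChannel.open(new InetSocketAddress("127.0.0.1", port));

        // select() blocks until some channel is ready - we wait on readiness,
        // not on data, so one thread can service thousands of connections.
        int ready = selector.select(2000);
        System.out.println("ready channels: " + ready);

        client.close();
        server.close();
        selector.close();
    }
}
```

In a real server you would loop over `selector.selectedKeys()`, accepting and reading as channels become ready; the point here is only that the blocking happens in one place.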

If you really are waiting for the disk to spin, more threads aren't going to help you; everyone's just going to be sitting there waiting for the IO. The main problem with so many threads is generally bus contention: even with a cache hit rate around 95-98% the bus can saturate quite quickly, and there's always good old Amdahl's Law. Though, if you're waiting for a database to come back with some results in every thread, it's going to be a while, so at least you won't be hogging the bus.
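Amdahl's Law is easy to put numbers on. A quick sketch (the 95% parallel fraction is an arbitrary illustration, not a measured figure for any workload in this thread): even a heavily parallelizable job tops out well short of linear speedup.

```java
public class Amdahl {
    // Amdahl's Law: speedup of a workload where fraction p parallelizes
    // perfectly across n threads and fraction (1-p) stays serial.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even at 95% parallel, 32 threads give nowhere near 32x.
        System.out.printf("p=0.95, n=8:  %.2fx%n", speedup(0.95, 8));
        System.out.printf("p=0.95, n=32: %.2fx%n", speedup(0.95, 32));
    }
}
```

The asymptote as n grows is 1/(1-p), i.e. 20x here - which is why the serial fraction, not the core count, ends up being the ceiling.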

It's the same thing that's been happening for the last decade. As x86 slowly creeps in on Sun/IBM/Whatever's market, they have to come up with something "bigger".

Right now, the Opteron, with embedded memory controller and gobs of I/O, has really entered what was previously a niche market that Sun made very nice profits from.

So, now that particular cash cow has fallen to the ravages of commodity parts, they're setting their sights even higher. Sun's never been the company to make $5 profit on each of 50 million computers; they'd much rather make $300,000 each on 1,000 computers.

Yes, you're right.
Except, of course, that the price-to-performance ratio of the x86 platform remains unmatched; x86-64 has removed some of the limitations of this platform, limitations that made it unsuitable for the high end, and now Intel has been forced to follow. I fear for Sun's long-term future. In the long run, value for money always wins in business. Or so I [afriguru.com] think.

This actually brings up something that I have been thinking about recently: what qualifies something as commodity hardware? It's not as if an Opteron box can be had for a tremendously low price; HP's quad-processor Opteron box starts at approx $20K. I don't really consider that "commodity". Compare that to a quad Xeon box for $26K. And finally compare that to a quad box from Sun, for approx $34K. None of those are what I would consider commodity. So what is commodity pricing?

You're looking at it from the wrong end: the consumer end. Look at it from Sun's end and you may see it a bit differently. Since there is a comparable "commodity" system on the market, Sun would need to drop the price they charge to compete with the commodity system (without changing their product strategy). Using your prices, that would equate to a loss of profits: $34,000 - $20,000 = $14,000

That's $14,000 that they would lose in profits if they were to compete by matching the price of the commodity system.

It's not the same as 5-10 years ago, the heyday when, if you were serious, you needed Sun, HP-UX, SGI, or AIX boxes for hardware speed and quality and (very important) for availability of serious software. It very much looks like the niche Sun is in is shrinking. Sun had better have an answer to cluster computing with commodity hardware. A lot of heavy applications do very well on a cluster. IBM saw it coming and seems to be right on it, reshaping itself; HP has become very active with clusters; and SGI is bui…

I generally call it "commodity" if you can go out, buy the parts from different places, and put it together yourself. Although in the quad-cpu market you *usually* have to get the motherboard and case together, it still qualifies.

I have a quad-Xeon (P3) at the office, I bought the mobo/case from SuperMicro, and the other parts from various distributors. Sure, the price was around $15K by the time I got all of the doo-dads (10-disk RAID array...), but it was still all "commodity" hardware to me.

It's the same thing that's been happening for the last decade. As x86 slowly creeps in on Sun/IBM/Whatever's market, they have to come up with something "bigger".

This is not bigger. Taken to the extreme, this is like if Commodore were still in business and tried to sell computers with 2^32 6502 procs. Look at the chart in the article to see how desperate Sun is: they admit that existing Opterons and Xeons not only kick the ass of their newest existing architecture for a single thread; they also concede that even their non-existent future proc won't be faster for single-threaded apps. Ok, you say, but it is faster for multithreaded apps. The only problem with that is that I bet a recent multiproc Opteron/Xeon will give the future Sun architecture a run for its money. And IBM/Intel won't have any problems building multicore procs if they want; they just don't need to at the moment. IOW, looking at this chart, one might ask why Sun even tries to build processors nowadays.

Sun was ravaged by time. When the SPARC began to lose its competitive edge, they would have been forced to get their CPUs from one of their direct competitors in the Unix OS+system market. The processors eating their lunch at the time were DEC's Alpha and IBM's POWER. Intel chips weren't up to par yet, obviously, nor AMD's. This was when SPARC was still worth something. Now it's hopelessly outdated; they don't have any IP anyone wants. They can't unload SPARC, and they can't just take a loss, so what do they…

Incidentally, they've been on and on about "commodities" and all that for years. What about their Niagara chip? I've read the article, and it hints that the chip will be fitted not only with processor cores but also with a TCP offload engine and crypto acceleration circuits. Plus, the chip will be substantially more energy-efficient than its x86 counterparts. So you have a chip that does more things on its own and consumes less energy. More integration could possibly mean commoditisation. If these things require less…

Any advances Sun may have in CPU performance will be greatly outweighed by two major engineering design flaws they've gotten themselves in to:

1. overall system performance of their partitionable systems (i.e. the ones people will pay a premium for over low-end systems, where Linux on Intel/AMD is killing them) is severely hampered by their 150 MHz (MHz!) backplane. Sun views this as a plus because it allows customers to run boards with different CPU speeds (e.g. a 750 MHz board (5x backplane speed) and a 900 MHz board (6x backplane speed)). So board-to-board throughput suffers and overall scalability is reduced.

2. Their desire for greater hardware isolation between domains, down to only a 2- or 4-CPU board with whatever memory happens to be installed on those boards, severely limits the flexibility in providing workload management between logical servers (domains), as well as the flexibility to create/deploy smaller servers. IBM's LPAR architecture and HP's vPars are kicking Sun's ASS!

1. It is an easier upgrade path for customers. I think Sun learnt that it is easier to sell its customers incremental upgrades than to sell them brand new designs. Remember that the market they sell to (telco, financial) absolutely despises having to test all their mission-critical applications on new, unproven hardware. So while the slow backplane is a performance limitation, many customers may prefer stability to cutting-edge performance.

2. Wait for the 'Zones' in Solaris 10... I've heard it is better than anything IBM & HP have to offer.

2. They HAD to come up with something to counter LPARs, etc... the market shifted and they got caught with their domains down around their ankles... of course, no doubt IBM and HP could (and frankly, maybe have) come up with something akin to zones/containers as well, ON TOP OF h/w LPARs... the fact remains, better h/w flexibility…

LPARs come nowhere close to what Sun's domains offer, and IBM has nothing to compare. HP can't do dynamic domains as Sun can, either. An LPAR needs at least 3 CPUs to work well; if the hypervisor goes down you've lost the lot, and there's a lot of overhead in keeping it all running. It is a long way away from 'kicking Sun's ass'.

As for Sun Fires flying out the door, sales figures are excellent - look at the success of the V210, V240s, V440s, etc. All selling very well.

I'm not sure about #1, but I always thought Sun had much higher throughput than Intel's. I'm also not sure what you mean by "backplane"; a quick Wikipedia check seems to suggest that it's a simple bus with 1-1 pin mapping. Where is this used? Why does it matter? Even mid-range Sun servers have 9.6 GB/sec sustained throughput [sun.com] (Sun Fire Interplane Connect).

2. As with all things, there are costs/benefits to every feature. I'm sure there are applications that are better suited to greater hardware independence. Still, I'm not sure what you mean here; are you advocating more manageability between CPUs and different domains (which is good for managing several VMs)? With a processor that has eight cores, you'd assume that one would be able to put a VM on each core, with that VM having four hardware threads available. How is IBM/HP's offering different?

The backplane is what facilitates communication between CPU boards. Yes, they *rate* throughput at 9.6 GB/s, and that may be the rate. Of course they have more throughput than (typical) Intel machines; those are generally lower-cost machines and don't have the margins to support high-end features such as high-bandwidth backplanes. My point is that Sun CAN'T really improve, as they've nailed the clock speed to support multi-speed CPU boards. IBM's backplanes scale 1:3 with CPU speed; you have to have a…

IBM's backplane CAN increase sustained throughput as faster CPUs are installed, for better overall system scalability.

If I understand the backplane as the CPU->CPU bus, then wouldn't a multi-core CPU reduce dependency on the backplane? For applications that require low latency and high throughput how can you get higher transfer rates that what's available on the CPU itself?

As per the second point: wouldn't a multicore-multithread multi-CPU server offer more flexibility for load balancing and on-demand peak handling (i.e., move CPU2 cores 1-3 from mail/fileserver duties to httpd to handle slashdotting)?

It seems the differences you are stating are about the overhead of managing multiple physical CPUs, but with this new chip a 4-way could handle what a 16- or 32-way did before. Thus the intra-CPU differences between IBM/HP and Sun are fairly irrelevant. Maybe I'm missing your point.

If I understand the backplane as the CPU->CPU bus, then wouldn't a multi-core CPU reduce dependency on the backplane?

It doesn't matter what kind of bus it is, the fact that a CPU has multiple cores probably will not help its bandwidth. Short form, it depends on how the cores are wired. If you have a bus into the package which goes into one core, and that core feeds the second core, then no, it's not going to help. For instance if you put two Opterons in a chip, and ran a hypertransport link between

1) It is not a 150 MHz backplane; it is (in the case of a Sun 15K) an 18x18 crossbar switch, each path of which is a 150 MHz, 32-byte-wide (not 32-bit, 32-BYTE) data path, which works out to 172.8 GB/s. You can't think of it as just a huge, fast datapath either. The entire system is snooping other transactions to keep the caches updated, so it doesn't have to request data multiple times. See Sun 15K System Overview [sun.com] for a better explanation.
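The quoted figures are at least arithmetically consistent. A quick check: each 150 MHz, 32-byte-wide path moves 4.8 GB/s, so 172.8 GB/s aggregate implies 36 such paths active at once (perhaps 18 ports in each of 2 directions - that breakdown is my guess, not from the overview document).

```java
public class CrossbarMath {
    public static void main(String[] args) {
        double clockHz = 150e6;  // 150 MHz crossbar clock
        int bytesPerCycle = 32;  // 32-byte-wide data path

        // Bandwidth of one path: bytes per cycle times cycles per second.
        double perPathGBs = clockHz * bytesPerCycle / 1e9;
        System.out.printf("per path: %.1f GB/s%n", perPathGBs);

        // 36 concurrently active paths reproduce the quoted aggregate.
        System.out.printf("36 paths: %.1f GB/s%n", perPathGBs * 36);
    }
}
```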

Taking an example of a 1280... you have a system bandwidth of 9.6 GB/s, which is higher than anything IBM have. You purchase it today with 8 UIII CPUs, and in a year you find you need more power. You can then upgrade with four extra UIV CPUs, without taking the machine down, whilst making use of the latest CPU tech.

With IBM you'd have to pay a premium for old CPUs or buy an entire new machine...

Fair enough. For the record, it's only with the newer 900 MHz and up that you'll be able to mix. The bumph I've read always states that the interconnect was designed, as with earlier Sun architectures, only to be maxed out by the following generation of CPUs - so the current Sun Fire Interconnect was designed with the end of the UIV range in mind. So, by the time you're running out of oomph on that interconnect, you'll be buying a whole new box.

I hope that they've made some vast improvements, or they're gonna have some serious issues feeding that beast. Systems now, even the Opteron, which has one of the better memory controllers around for a commodity processor, still have issues with wait states. Uberthreading it and dumping more cores on the chip will only make the situation worse unless they do a serious upgrade of the memory controller.

If they do not, why pay bazillion bucks for a processor that is idle for most of the time?

Sun doesn't make commodity processors, and they (at least in theory) have much better memory controllers already. Since it's a lot easier to improve the bandwidth on access to memory than the latency, it makes a lot of sense to uberthread their CPU, because they can move a lot of data in a single round-trip. If you have time to get 64 threads to their next cache misses in the time it takes to start getting data, and you can have 64 requests in flight at the same time, you're going to keep the processor 64 t

I hope that they've made some vast improvements, or they're gonna have some serious issues feeding that beast. Systems now, even the Opteron, which has one of the better memory controllers around for a commodity processor, still have issues with wait states.

It's interesting that you should mention that, because one of the early multi-threaded processors (at Tera) was specifically designed to solve that problem. The theory was, and still is, that if one thread has to stall it's OK because there are still plenty of others that can keep running from cache. So no, you won't have N threads all running without waits and yielding N threads' worth of performance, but you'll still have enough live threads to give you more performance than you'd have with a single-threaded core.

Only time will tell which way it will really go. Most likely, there will be some workloads on which this approach works extremely well, some on which it provides no benefit, and a few on which you would have been better off with a "fat" single-thread CPU design. One thing to remember is that if the system has X threads, cache pollution and memory bandwidth are going to be problems either way. The fact that the multi-thread processor can still get some work done on some threads even while others are blocked waiting for memory will probably allow it to maintain an advantage over a faster single-thread processor that blocks completely more often.
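The "keep working while others stall" argument above can be put in a back-of-envelope model. The numbers below (25 cycles of work per roughly 500-cycle miss) are illustrative assumptions, not Sun's figures; the point is just that pipeline utilization climbs roughly linearly with thread count until it saturates.

```java
public class LatencyHiding {
    // Toy model: each thread runs C cycles of work, then stalls L cycles
    // on a memory miss. With N threads time-sharing one pipeline, the core
    // is busy whenever any thread is runnable, so in steady state:
    //   utilization = min(1, N * C / (C + L))
    static double utilization(int n, int c, int l) {
        return Math.min(1.0, (double) n * c / (c + l));
    }

    public static void main(String[] args) {
        int c = 25, l = 500;  // ~500-cycle miss, as in the interview quote
        for (int n : new int[]{1, 4, 16, 32}) {
            System.out.printf("threads=%2d utilization=%.2f%n",
                    n, utilization(n, c, l));
        }
    }
}
```

Under these assumptions a single-threaded core sits idle ~95% of the time, while 32 threads keep the pipeline fully fed - which is the whole Niagara bet, modulo the cache-pollution and bandwidth caveats above.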

Dr. Marc Tremblay: Yes. In large multiprocessor servers, waiting for data can easily take 500 cycles, especially in multi-GHz processors. So there are two strategies to be able to tolerate this latency or be able to do useful work. One of them is switching threads, so go to another thread and try to do useful work, and when that thread stalls and waits for memory then switch to another thread and so on. So that's kind of the traditional vertical multithreading approach.
The other approach is if you truly want to speed up that one thread and want to achieve high single thread performance, well what we do is that we actually, under the hood, launch a hardware thread that while the processor is stalled and therefore not doing any useful work, that hardware thread keeps going down the code as fast as it can and tries to prefetch along the program path. Along the predicted execution path [it] will prefetch all the interesting data and by going along the predicted path [it] will prefetch all the interesting instructions as well.

And you are exactly correct: an engineer-turned-marketer at Sun told me that the main point of the Niagara project is to dramatically improve throughput. Putting several cores onto one die is not the challenge here - the challenge is the memory manager and memory interface.

If Sun doesn't cancel this one, it could put them back on the map for server- and enterprise-class computing. Low power, awesome multithreading capabilities, and software that could only be described as "bad-ass" (the 3D desktop should be out by then) will give Sun a huge edge that would take everyone else years to catch up to.

Sun doesn't have the R&D to keep up in this space. By the time 2006 rolls around, AMD, Intel and IBM will be closing any performance gap with this chip, and their higher volumes will ensure that they blow this out of the water in terms of price/performance. Sun is clinging to an image of itself that no longer works as a business model - hence years of huge losses and layoffs.

I don't know... if possible, it's always better to have the option of unleashing all your processing power on a single thread. Of course, presumably all these cores together will be faster than any single processor for running a bunch of independent or loosely coupled processes. But if the processes are TOO loosely coupled, a cluster of x86 boxes will put up a good fight (for instance, Google runs on a huge cluster). And then there will be SMP Xeons, Opterons, and G5s in the race. Niagara may still be bet…

...seeing as Java 3D (what I expect they were using) probably just hooks into native OpenGL code, which in turn can mostly run on a low-end current-generation GPU, I can think of nothing better to put those clock cycles to.
Think about it: when you're playing games, your video card is normally maxed out, but when you're working in OpenOffice or the like, all your GPU does is 2D work. Why not leverage the 3D acceleration, as it wouldn't otherwise be used?

Not to destroy the lovely mental image in this thread. Well, here is the story: Sun is working on Niagara and Rock. Rock would combine the single-threaded approach of the UltraSPARC product line with the multithreaded architecture of the Niagara processor... check out the complete article [infoworld.com]

As their staggering losses continue to mount, I believe it's pretty well proven that Sun doesn't belong in the processor design business any longer. They simply can't achieve the volume required to support the massive R&D investments required. Even with nifty tech as described, the majority of business applications don't care what processor is running underneath - it's all a matter of price/performance. Sun isn't going to win price/performance against Intel and AMD.

Current Oracle licensing schemes require that clients pay PER CPU CORE for multi-core processors. This screws anyone that uses Sun boxes, because the cores are US2-based. So the Oracle client has to pay heaps of cash to use, effectively, a 5-year-old processor design. In addition, Oracle licensing requires that if your server has the capacity to hold more than 4 processors (e.g. cores) then you have to pay the "enterprise" rates.

So in conclusion, the price of Oracle on a 2-CPU Xeon, AMD, or UltraSPARC III box is about $6,000. The price for Oracle on a 2-CPU Niagara (8 cores each) will be $320,000. Only an idiot would use this CPU (or this database). Since a lot of companies have a huge investment in Oracle, they will have no choice but to switch to x86 hardware. Sun is going to kill themselves with this design, despite the fact that the design, in itself, will greatly improve the throughput of their servers.
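Taking the poster's figures at face value, the scary total falls straight out of per-core billing. Note the per-core rate below is simply reverse-engineered from his $320,000 claim, not taken from an actual Oracle price list:

```java
public class OracleLicenseMath {
    public static void main(String[] args) {
        // The poster's scenario: a 2-CPU Niagara box, billed per core.
        int cpus = 2;
        int coresPerCpu = 8;
        int coresBilled = cpus * coresPerCpu;

        // Implied enterprise per-core rate: $320,000 / 16 cores = $20,000.
        // (An assumption back-solved from the post, not a quoted price.)
        int enterprisePerCore = 20_000;

        System.out.println("cores billed: " + coresBilled);
        System.out.println("total: $" + coresBilled * enterprisePerCore);
    }
}
```

The same per-core rule charges a 2-CPU single-core Xeon for only 2 cores, which is how the 50x gap between otherwise comparable boxes arises.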

Oracle licensing is heavily slanted toward Intel architecture; they have always penalized people for using RISC-based processors.

If I understand correctly, the Solaris operating system allows the owner to select the number of CPUs they wish to license (it's cheaper for Sun to build a fully configured system and then license the number of CPUs used, rather than to send a technician in to change the hardware). Presumably this licensing scheme would be extended to control the number of cores active?

I know Sun works this way on their larger boxen. Not sure about the Oracle implications for an E10k with only 32 proc enabled. I assume that when you are paying millions for your setup that you get to talk to someone at Oracle and "work something out".

All I know is the little box prices.

Oracle licensing sux, the people that figure it out at Oracle are obviously insane. I don't think they have any idea about technical realities. Or, possibly, they want to destroy Sun so that they can produce 1 fewer port o

I think it's 4 or fewer CPU SLOTS. So the fact that the V880 has more than 4 slots means you have to buy Enterprise, even though you only have 4 processors installed.
Another way that Oracle licensing is on drugs.
MySQL now has a cluster option for HA, and one has to ask if Oracle is really required for DBs 100 GB in size.

I think it depends on the Oracle sales droid that you happen to run into. Hyperthreading is not a separate CPU core on the chip, so I think you could make a case that you shouldn't have to pay the price.

What irks me is that the multi-core CPU is just another technique to improve performance. Oracle is forcing their clients to freeze their database hardware at 1995 performance levels or face huge license fees.

It's very interesting to think about who these Niagara-based servers are going to be targeted at. The nifty TOE (TCP offload engine) feature and integrated Ethernet controller seem to guarantee they should be great for telecom purposes. Of course, that's a cursed market that Sun is already king of. Niagara-based servers seem destined to go head-to-head with dual-processor Xeons and Opterons. IT groups building web server farms or clustered databases will have a new option to consider: either go with cheaper, lower-performance Xeons and Opterons running Linux, or with fewer but more expensive Sun Niagaras running Solaris. It's an interesting proposition, and seems like Sun's first real attempt to compete on price/performance. The real x-factor is AMD. If they can really break into the server market, then the Opteron could offer as much performance as the Niagara but at the same (or lower) price as a Xeon.
It's ironic to see how positions have changed. Intel and AMD are developing multi-core CPUs for use in 4+ way systems, while Sun develops a CPU that is SMP incompatible. Of course Sun is also working on Rock, and hoping it can compete with a Xeon as a single cpu, while still scaling for 100 CPU Infernos (or whatever they are going to call them.)