adeelarshad82 writes "IBM revealed details of its 5.2-GHz chip, the fastest microprocessor ever announced. IBM described the z196, which costs hundreds of thousands of dollars and will power its Z-series of mainframes. The z196 contains 1.4 billion transistors on a chip measuring 512 square millimeters fabricated on 45-nm PD SOI technology. It contains a 64KB L1 instruction cache, a 128KB L1 data cache, a 1.5MB private L2 cache per core, plus a pair of co-processors used for cryptographic operations. IBM is set to ship the chip in September."

The thing is that if you have 2 (say) 1.6 GHz processors, they aren't as 'powerful' as one 3.2 GHz processor.

For one - there are overheads, certain stuff common between them, pipelines - stuff which I forgot (computer engineering related problems).

But the main thing is that not all programs are multi-threaded, and a program with a single thread can only run on one processor. So yeah, GHz are still useful. Maybe for large single-thread batch processing - which is the kind of thing a mainframe would do.
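The point about single-threaded programs can be put into numbers with Amdahl's law, which is not named in the comment but is the usual way to model it. The sketch below is a toy model only: it assumes performance scales linearly with clock rate and ignores the inter-processor overheads mentioned above. The function names are made up for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def effective_ghz(clock_ghz: float, cores: int, parallel_fraction: float) -> float:
    """Clock of a single core that would match the multi-core part (toy model)."""
    return clock_ghz * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Compare 2 x 1.6 GHz against 1 x 3.2 GHz for different workloads.
    for p in (0.0, 0.5, 0.9, 1.0):
        ghz = effective_ghz(1.6, 2, p)
        print(f"parallel fraction {p:.1f}: 2 x 1.6 GHz ~ {ghz:.2f} GHz (vs 3.2 GHz)")
```

Under these assumptions the two slower processors only match the single 3.2 GHz part when the workload is perfectly parallel; a purely single-threaded job sees 1.6 GHz.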

I'm betting the code used on these z196 systems is multi-threaded. Shit, if you're paying hundreds of thousands of dollars per CPU you can afford some top notch programmers. With two co-processors used for cryptographic operations per chip I'd say they were after a bigger prize than, say, hardcore gamers;-)

BTW, TFA mentions L1 cache per core but doesn't mention how many cores this chip scales up to. Could it be just one?

Actually I think this mainframe is for getting the last little bit of performance out of thirty year old cobol code. And the original top notch programmers are long dead.

Considering that life expectancy in the developed world is in the region of 80 years, there is a reasonable chance that programmers who were under 50 when they wrote code thirty years ago are still alive.

They may have little recollection of what they did 30 years ago, but to say they are all "long dead" is somewhat of an exaggeration.

I've never met a programmer over 50. I must therefore conclude that they all perish mysteriously upon their 50th birthday. Something like the planet of grim reapers from Futurama is how I prefer to envision it.

Mainframes are engineered fundamentally around two things: Reliability and IOPS.

When it comes to basic tasks, it isn't often that a large server ends up CPU bound (especially database servers). Instead what usually becomes the bottleneck is I/O and RAM.

Reliability is where mainframes take the cake. Some use multiple CPUs to execute the same instructions to make sure the output is correct. Mainframes have virtually redundant everything. Because they have been doing VM since the dawn of computing, it may

When configured to run Linux, each core costs approx $125K. When configured for z/OS, each core costs approx $250K. A complete system (not including any storage or software) can cost up to around $30M.
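As a rough illustration of the per-core prices quoted above, here is a small sketch that totals the processor cost for a given core count. The 96-core figure is taken from the IBM press release quoted further down the thread; treating every core as billable at list price is an assumption, and the names are hypothetical.

```python
# Approximate per-core prices from the comment above (USD).
COST_PER_CORE_USD = {"linux": 125_000, "zos": 250_000}


def core_cost_usd(cores: int, workload: str) -> int:
    """Total processor cost if every core is billed at the quoted price."""
    return cores * COST_PER_CORE_USD[workload]


if __name__ == "__main__":
    for workload in ("linux", "zos"):
        total = core_cost_usd(96, workload)
        print(f"96 cores for {workload}: ${total:,}")
```

With those assumptions a fully populated z/OS machine lands in the same ballpark as the "up to around $30M" figure for a complete system.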

product of crap programmers
Sorry to ask but who does IBM see using this?
At the price point and data sets that need sorting? - cheaper clusters or more expensive faster unique chips depending on math?

Banks, credit card companies, hospitals, insurance companies... Cheap clusters are great, but they are not always the best tool for the job. Very large traditional datasets involving lots of high-value transactions, with 5 9s uptime requirements, do not tend to scale well to COTS clusters. IBM mainframes have uptimes measured in years if not decades. They have hot-swappable everything, including CPUs, so you can do upgrades with zero downtime.

Also, you need to take a look at the costs involved. The cost to throw out a working software system that has been used for decades, and then the cost to redesign it to work on a cluster of X86 boxes, will be huge. Not to mention the investment in making it fault tolerant and, if it is used in certain markets, the cost of auditing the software. Not to mention that ZSystems tend to be really secure. There are just not a lot of exploits on ZSystems.

When downtime can cost millions of dollars, hardware costs are just not that big of a deal. Now, if you are starting from scratch then you may save money by going with a cluster, but then you may not, depending on just how good your programmers are.

It has been a while, but really? I have never seen a mainframe that didn't use Zulu time. Also, in the shop where I worked all software was quality verified. One machine was at the five-year uptime mark when I left, but it was a non-commercial system.

"They say it's an old CISC architecture. This is probably the sort of system that runs horribly outdated and un-updatable code, like the tax system."

You mean like Windows? The X86 is also an old CISC architecture.

Actually the Power line is RISC anyway. When it is used in a ZMachine the old style 360/370/390 CISC ISA is translated to RISC and then executed. Before you go "ew", that is what modern X86 chips do, as well as ARM when using the Thumb instruction set. The ZSystem ISA is so high end it is almost a high level language, so the translation doesn't really affect performance much at all. Also, that old CISC architecture is much better than the mess that we have on the X86.

I am not sure about how IBM does the translation. On the System 38 AS/400 System-I the translation was done during the IPL, aka Initial Program Load. On the Zs it may be done as a JIT, but I am not sure. Honestly I love the idea and wish that Linux would adopt it. You could then have one binary that would work on any Linux system on any CPU.

The AS400 way kept a native binary copy along with the TIMI copy. When the program was run the first time it would translate the TIMI copy into the native segment. Yes, the first time you ran the program it might take a bit to start, but after that it would run at full speed and start fast. Of course you could add a binary segment when you first released the code for the ISA of your choice.

All in all those old Mainframes and Minis had a lot of brilliant tech we still don't have today on our PCs.

Actually x86 is a new CISC architecture. The System/360 architecture predates it by over a decade. x86 was about the last CISC ISA to be developed outside of a few tiny niches.

Actually the Power line is RISC anyway. When it is used in a ZMachine the old style 360/370/390 CISC ISA is translated to RISC and then executed

Umm, no. POWER is RISC (well, RISC purists would say that's stretching the point), but POWER and System/z are completely unrelated. The POWER6 and z10, and POWER7 and this chip, were designed by cooperating teams, so they share some execution units, but they are very different architectures. This is not a POWER CPU running a S

OK, firstly the OP should have said that this is the microprocessor with the highest clock speed. Calling it the fastest CPU is extremely misleading. In most modern CPUs, clockspeed is NOT related to throughput. The Intel Sandy Bridge or Nehalem CPU for example may be running its 4 cores at a clockspeed of 3.2GHz but overall, each core in the CPU is easily 4-5 times faster than a 3.2GHz Pentium4 core.

Secondly, many of the bottlenecks that you allude to are no longer major bottlenecks. CPU interconnect bandw

Of course it is. It is not, however, the only factor, and other factors may indeed (and commonly do) outweigh it.

You took my comment out of context. I was responding to the original post that focused purely on clockspeed as a magic mantra. What you say is only true if you are talking about clock speed increase in the same microarchitecture, ceteris paribus. Making a blanket claim that we have the fastest CPU because we have clocked it at 5GHZ means nothing. I could overclock a P4 to 5GHZ using exotic cooling and my laptop would still probably beat it in terms of performance.

I think you underestimate IBM's technical ability. They do have some idea of what they're doing.

According to the Passmark benchmark, a 3.20 GHz Pentium 4 scores 524 [cpubenchmark.net], compared to 10221 [cpubenchmark.net] for a 3.20 GHz Core i7 970 six-core CPU. That works out to roughly 3.25 times faster per core than the Pentium 4. While short of 4-5, the GP is not as far off the mark as your ridicule would suggest.
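Taking the two quoted scores at face value, the per-core comparison can be recomputed in a few lines. This is only a sketch: it assumes the Pentium 4 counts as a single core and that the six-core score divides evenly across cores, which a real benchmark does not guarantee.

```python
def per_core_ratio(score_a: float, cores_a: int, score_b: float, cores_b: int) -> float:
    """How many times faster one core of CPU A is than one core of CPU B."""
    return (score_a / cores_a) / (score_b / cores_b)


if __name__ == "__main__":
    # Scores as quoted in the comment: Core i7 970 (6 cores) vs Pentium 4 (1 core).
    ratio = per_core_ratio(10221, 6, 524, 1)
    print(f"per-core ratio: {ratio:.2f}x")
```

The result comes out a little above 3x per core, so the "4-5 times" estimate is high but in the right region.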

Yup... there are so many dependencies on application and OS code that hardware capability matters very little.

I recently tried to tune a workload on a pSeries system. We gave it half a processor and 2 virtuals (with the Power version of hyperthreading so it saw 4 processors). Performance was a dog. Load was only 60% of capacity though. We doubled the number of virtual processors but kept the overall entitlement. Load dropped to 40%. Added another couple virtuals and load dropped to 25%. No increase in thro

More or less. They hit two walls - fabricating chips that could run faster while retaining an acceptable yield, and dealing with the heat such chips produced.

The fastest general-sale chips were the P4s - the end of their line marked the end of the gigahertz wars, as Intel switched from ramping up the clock to ramping up the per-cycle efficiency with the Core 2 and their complete architecture overhaul. As a result a 2GHz Core 2 duo will outperform a 4GHz P4 dual-core under most conditions. Better pipeline organisation, larger caches better managed.

Clock rate is no longer the key variable in comparing processors, unless they are of the same microarchitecture.

Yeah, it's actually kind of funny how today's Intel desktop processors actually trace their lineage to the Pentium M, which was a mobile chip. When the Pentium 4 came around, the Pentium Pro (Pentium II, Pentium III) architecture was pretty much relegated to the mobile market while Pentium 4 represented their desktop line. As you said, they ran into heat (and power) issues with the Pentium 4s and basically had no more room for expansion there. They went back to the Pentium M, which was doing pretty nicely in the notebook space, and since it was low-power and efficient it became the basis for their future desktop CPUs--the Core line, in particular. They just stopped playing up the clock speed because that architecture's clock speeds were substantially lower than the Pentium 4, despite being able to do more work. I read once that a Pentium M could do about 40% more work than a Pentium 4 of the same clock, so in essence a 2GHz Pentium M was about as powerful as a 3.2 GHz P4.
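The work-per-clock rule of thumb above can be written down directly. A minimal sketch, assuming throughput is simply clock rate times a per-clock efficiency factor (a big simplification); the function names are invented for the example.

```python
def equivalent_clock_ghz(clock_ghz: float, per_clock_advantage: float) -> float:
    """Clock a baseline chip would need to match a chip doing more work per cycle."""
    return clock_ghz * (1.0 + per_clock_advantage)


def required_advantage(clock_ghz: float, baseline_clock_ghz: float) -> float:
    """Per-clock advantage needed for clock_ghz to match baseline_clock_ghz."""
    return baseline_clock_ghz / clock_ghz - 1.0


if __name__ == "__main__":
    print(f"2.0 GHz at +40% per clock ~ {equivalent_clock_ghz(2.0, 0.40):.1f} GHz P4")
    print(f"advantage needed to match 3.2 GHz: {required_advantage(2.0, 3.2):.0%}")
```

With a 40% per-clock advantage a 2 GHz Pentium M lands nearer a 2.8 GHz P4; matching a 3.2 GHz P4 would take about 60%, so the two figures are only roughly consistent.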

Switching everything over to the low-power and parallel-friendly Pentium M line is probably one of the smartest things Intel ever did. They would've dug their own grave had they stuck with building on Pentium 4 to the bitter end.

Clock rate has *never* been the key variable in comparing processors. Even back in the heady days of 1 MHz 6502/6510 vs 4 MHz Z80 the comparison was useless - the 6510 does way more per cycle than the Z80 and ends up being comparable speed-wise.

There's also the problem of feeding such a monster processor and keeping it synced up with the rest of the machine. On top of that servers for instance tend to cope better with many cores than faster ones after a certain point, which is presumably well before 5ghz. Since servers typically are more concerned with large numbers of connections, chances are that a quad core running at 2ghz would have better performance than a single core 5ghz would, scale that up as needed to the number of cores. Of course freq

They're very expensive, but for Enterprise scale workloads they're cheaper than the comparable distributed system. The cost entirely depends on how many cores you're running, and more importantly your monthly usage. IBM bills you for your Iron depending on an average of how much you used it that month. There's a reason why Mainframes run so quick and fast, they're the only system where all processing from user ISPF interaction all the way to data processing is tracked. All that processing turns into your fi

You can buy or lease the hardware. The software is licensed under contract.

It seems like the GP is talking about software charges, not hardware. Software can be either monthly fee based or usage based. If it is usage based you must send a usage report to IBM so they can bill you. That is specified in the contract. In either case, the number of and performance of the CPs is calculated into the cost.

Hardware is a different story. With hardware, the number of cores you purchase is not the same as the number you get. For instance, you can buy a 1 core machine, but what you get is 16 cores. Only 1 core is enabled in the firmware though. IBM has offerings (again under contract) where you can buy the right to temporarily enable additional processors instantaneously (like if you lost one of your datacenters and need to transfer the workload to another one). With these offerings, you also need to send usage info to IBM so they can bill you for the time that the additional cores have been enabled.

Unfortunately this chip will most likely go into workstations and servers. In order for IBM to make a desktop version, it will have to make a custom chip to handle things like video, sound, etc. This will lead to the same logistical problems for Apple that it had before. Manufacturing companies do not want to keep excess inventories, whether it is Apple or IBM. If Apple needs more, it will have to wait while IBM rearranges their manufacturing schedules to compensate. Also even if Apple orders millions of t

Wrong chip family. This is the Z-series mainframe chip, using an instruction set that is backwards compatible with the System/360 stuff from back in the 1960s (the architecture of the future, as the marketing material trying to persuade my university to upgrade their IBM 1620 put it). The PowerMacs were using PowerPC chips, which use the same instruction set as the POWER CPUs from IBM (they used to be similar, with a common subset, now they are identical).

The chip that this is replacing, the z10, was designed concurrently with the POWER6. They share a number of common features, including a lot of the same execution engines (both have the same hardware BCD units, for example, as well as more common arithmetic units), but they are very different in a number of other aspects, including the instruction set, cache design, and inter-processor interconnect, because they are designed for different workloads.

I've not read much about this chip yet, but I think it shares some design elements with the POWER7, in the same way that the z10 did with the POWER6.

In short, while some of the R&D money spent on this CPU made it into chips that could, potentially, run OS X, this chip itself could not without a major rewrite.

IBM defines the z196 as one of the few remaining CISC chips, which allows for bulky, large programs that can require much more memory to execute in than RISC chips, including the PowerPC and ARM embedded processors, among others.

For CISC you need more bytes per instruction, because there are more instructions. With RISC your executable has more instructions but they each use less storage.

I am not sure I believe their implication that CISC is better for humongous commercial applications. Sounds like marketing speak to management to me.

Essentially all desktop and laptop computers use CISC chips and they are fast and cheap. RISC is a neat theory, but these days it seems that as the processors get decoupled from their ISAs anyhow, for various reasons, that it doesn't matter much. You choose the ISA for reasons of binary compatibility or features or the like, and it'll work just fine with the chip.

Also it is not true that CISC needs more bytes per instruction, at least not all implementations. With x86 you find instructions are variable leng

Actually, CISC uses less memory in general, but has traditionally been slower. CISC CPUs came out when memory was extremely expensive relative to CPU speed. Cheaper memory is what made RISC (with its larger footprint but faster speed) possible. Nowadays it really doesn't matter much; CISC is probably better now that memory bandwidth is the big bottleneck. However, our CISC designs are not exactly modern; if you were to do a modern CISC design you would probably end up with something more akin to ARM's

CISC and RISC are marketing terms that incorporate a lot of loosely connected design elements. Most CISC architectures use variable-length instruction encodings. On x86, for example, a number of common instructions are a single byte, while the longest ones are 15 bytes. A RISC architecture typically has fixed-length instructions, typically either 4 or 8 bytes (although ARM chips tend to also support Thumb and Thumb-2 instruction sets which use a 2-byte encoding).

These days, compilers take care of almost everything. It has gotten complex to the extent that a programmer trying to do things all in assembly will probably do a worse job than a good compiler. Chips have many, many tools to solve their problems.

That isn't to say it is never done, in some programs there may be some hand optimized assembly for various super speed critical functions. However even then it is most likely written in a high level language, compiled to assembly (you can order most compilers to do

If you're going for code perfection every time. In the real world you have deadlines and have to maintain your code. Writing in assembly is going to make your code harder to port across platforms, should that happen say from PowerPC to x86.

Not saying that it's never justified to use assembly. Within reason, of course.

My 386DX has an external maths coprocessor => can only do floating point functions :( However, mine's now a bit faster: I overclocked it from 33 MHz to 52 MHz... yours does, what, 5.2 GHz -> Sure my M series supersedes your G series.. right?........ right?

It contains a 64KB L1 instruction cache, a 128KB L1 data cache, a 1.5MB private L2 cache per core, plus a pair of co-processors used for cryptographic operations.
In a four-node system, 19.5 MB of SRAM are used for L1 private cache, 144MB for L2 private cache, 576MB of eDRAM for L3 cache, and a whopping 768MB of eDRAM for a level-four cache. All this is used to ensure that the processor finds and executes its instructions before searching for them in main memory, a task which can force the system to essentially wait for the data to be found--dramatically slowing a system that is designed to be as fast as possible.

I'm assuming the cache referred to in the second paragraph is off-chip cache, otherwise it would sort of negate the first sentence.... Would be nice if the article would have actually said that though.

Really this article kind of makes all of last week's comments on the speed of light limiting the speed of processors to 3GHz a bit pointless doesn't it? Now I know in principle the discussions were correct, but this just goes to show that problems can be engineered around.

The comments were about the fact that at 3GHz light travels 10cm per clock cycle, which limits how far apart you can have 2 items on a bus if you want them to communicate within 1 clock cycle. There is no "light speed barrier" or anything of the sort; however, at these frequencies you design knowing that it will take measurable time for an electric signal to propagate. For example, for this particular system whose core is at 5.2GHz, if you try to send a signal to an external memory that is say 11-12cm away, then it will take about two clock cycles just for the signal to travel the distance.
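The distance-per-cycle figures are easy to check. The sketch below uses the speed of light in a vacuum as an upper bound; real signals in copper traces travel noticeably slower, so actual budgets are tighter. Function names are illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # vacuum; an upper bound for signal speed


def cm_per_cycle(clock_ghz: float) -> float:
    """Distance light covers during one clock cycle, in centimetres."""
    cycle_seconds = 1.0 / (clock_ghz * 1e9)
    return SPEED_OF_LIGHT_M_PER_S * cycle_seconds * 100.0


def cycles_to_travel(distance_cm: float, clock_ghz: float) -> float:
    """Clock cycles that elapse while light covers distance_cm (one way)."""
    return distance_cm / cm_per_cycle(clock_ghz)


if __name__ == "__main__":
    print(f"3.0 GHz: {cm_per_cycle(3.0):.2f} cm per cycle")
    print(f"5.2 GHz: {cm_per_cycle(5.2):.2f} cm per cycle")
    print(f"11.5 cm at 5.2 GHz: {cycles_to_travel(11.5, 5.2):.2f} cycles")
```

At 5.2 GHz light covers a bit under 6 cm per cycle, which is where the "about two clock cycles" for 11-12 cm comes from.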

A lot of nonsense was spoken in that thread, but the issue is real. The time taken for light to travel is not yet a problem, but the skew is. Most communication between parts of a chip is parallel. If the connections are not precisely parallel then signals arrive at slightly different times. The clock speed is limited to the amount of time that is the maximum where signals will arrive in the same time slice. A similar limit also affects fibre optics, due to total internal reflection causing paths taken

If you go to the IBM announcement, which describes the system in more detail than this linked article - http://www-03.ibm.com/press/us/en/pressrelease/32414.wss [ibm.com]
The New zEnterprise 196
" From a performance standpoint, the zEnterprise System is the most powerful commercial IBM system ever. The core server in the zEnterprise System -- called zEnterprise 196 -- contains 96 of the world's fastest, most powerful microprocessors, capable of executing more than 50 billion instructions per second. That's rough

How much would it cost for me to put together a system with the same computing power, using off-the-shelf products, like a Xeon chip, or something? How long would it take for me to save $1 million in electricity, or whatever?

It's a quad-core chip. Each core has two integer, two load and store, one binary floating point, and one decimal floating point unit. Up to 24 CPUs can be placed in the frame. It can connect to another whole rack of POWER7 blades running AIX as an application accelerator platform.
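The core count and per-core cache sizes mentioned in the thread can be cross-checked against the aggregate figures quoted from the article. A small sketch, assuming all 24 chips have all four cores active and using only numbers that appear in the thread:

```python
CORES_PER_CHIP = 4
CHIPS_PER_FRAME = 24
L1_KB_PER_CORE = 64 + 128   # instruction + data cache
L2_MB_PER_CORE = 1.5


def total_cores() -> int:
    return CORES_PER_CHIP * CHIPS_PER_FRAME


def total_l1_mb() -> float:
    return total_cores() * L1_KB_PER_CORE / 1024


def total_l2_mb() -> float:
    return total_cores() * L2_MB_PER_CORE


if __name__ == "__main__":
    print(f"cores: {total_cores()}")
    print(f"aggregate L1: {total_l1_mb():.1f} MB")
    print(f"aggregate L2: {total_l2_mb():.1f} MB")
```

This gives 96 cores and 144 MB of L2, matching the press release and the article. The L1 total comes out at 18 MB, a little under the 19.5 MB quoted from the article, so the article presumably counts cores differently; the sketch does not try to explain that gap.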

The z196 is for the stuff a mainframe is good at: big batches and fast I/O. The application accelerator is for stuff the clusters of supermicro servers are good at. As a hybrid system connected across the GX bus, it should pump data in and out of applications pretty well.