This site may earn affiliate commissions from the links on this page. Terms of use.

Intel has taken the wraps off Knights Landing, its next-gen, up-to-72-core Xeon Phi supercomputing chip. The main change is that Knights Landing will be a standalone processor, rather than a slot-in coprocessor that must be paired with a standard Xeon CPU. Furthermore, Knights Landing will have up to 16GB of DRAM 3D stacked on-package, providing up to 500GB/sec of memory bandwidth (along with up to 384GB of DDR4-2400 mainboard memory). Knights Landing will debut in 2015 on Intel’s 14nm process, and with a promise of 3 teraflops (double precision) per socket it will almost certainly be used to build some monster 100+ petaflop x86 supercomputers, and beyond that to push toward exascale.

The current version of Xeon Phi (Knights Corner) is a PCIe expansion board with an up-to-61-core Intel MIC (Many Integrated Core) chip. These cores are based on the original P54C Pentium core — just like its stillborn Larrabee predecessor — but with a lot of modern additions, such as 64-bit support and 512-bit vector registers. (Read more details about the current Xeon Phi.) Knights Landing is a major revision of Knights Corner, making sweeping changes across almost the entirety of the platform. Gone are the P54C cores, replaced with up to 72 out-of-order Silvermont (Atom) cores. These new cores will implement AVX-512 (AVX 3.1 instructions). Perhaps most importantly, though, Knights Landing will be a standalone CPU, with an integrated six-channel DDR4-2400 memory controller, up to 16GB of on-package 3D stacked RAM, and 36 PCIe 3.0 lanes.
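The memory numbers above can be sanity-checked with some back-of-the-envelope arithmetic. A minimal sketch, assuming the standard 64-bit (8-byte) bus per DDR4 channel — an assumption from the DDR4 spec, not something the article states:

```python
# Back-of-the-envelope: peak bandwidth of the six-channel DDR4-2400
# controller, versus the quoted 500GB/sec for the on-package 3D stacked RAM.

def ddr_peak_gbs(channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth in GB/s: channels * MT/s * bytes per transfer / 1000."""
    return channels * mega_transfers * bytes_per_transfer / 1000

ddr4 = ddr_peak_gbs(channels=6, mega_transfers=2400)
print(f"Six-channel DDR4-2400 peak: {ddr4:.1f} GB/s")   # 115.2 GB/s
print(f"On-package RAM advantage:   {500 / ddr4:.1f}x")  # ~4.3x
```

In other words, the on-package RAM would offer roughly four times the bandwidth of even a fully populated six-channel DDR4 setup, which is why it's the headline feature.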

All of these changes equate to theoretical performance of 6 teraflops of single-precision math, or 3 teraflops of double-precision math. By comparison, Haswell maxes out at around 500 gigaflops of double-precision math. Power-wise, Knights Landing should manage between 14 and 16 gigaflops per watt. While it’s a rough comparison, the most efficient supercomputers currently max out at around 4 gigaflops per watt. With 16GB of on-package RAM delivering 500GB/sec of bandwidth, there should be significant latency gains, too. Later in 2015, there will also be a special version called Knights Landing-F that integrates a 100Gbps Cray HPC interconnect on 32 of those PCIe 3.0 lanes, allowing supercomputer makers to connect up Knights Landing chips via standard QSFP optical links.
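Working backwards from the 3-teraflop figure suggests a plausible clock target. This is a sketch under an assumption the article doesn't make: that each core retires 32 double-precision FLOPs per cycle (two 512-bit FMA units, i.e. 8 DP lanes × 2 ops × 2 units):

```python
# Sanity check of the 3 TFLOPS double-precision claim. Assumes 32 DP FLOPs
# per core per cycle (2 x AVX-512 FMA units) -- an assumption, not an Intel spec.

CORES = 72
DP_FLOPS_PER_CYCLE = 32  # assumed per-core AVX-512 throughput

def peak_dp_tflops(clock_ghz: float) -> float:
    """Theoretical peak in TFLOPS for a given core clock."""
    return CORES * DP_FLOPS_PER_CYCLE * clock_ghz / 1000

# A 3 TFLOPS target implies a clock of roughly:
implied_ghz = 3000 / (CORES * DP_FLOPS_PER_CYCLE)
print(f"Implied clock: {implied_ghz:.2f} GHz")            # ~1.30 GHz
print(f"Peak at that clock: {peak_dp_tflops(implied_ghz):.1f} TFLOPS DP")
```

A clock around 1.3GHz would be consistent with the modest, efficiency-oriented Silvermont cores described above.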

[Image credit: VR-Zone]

Xeon Phi competes directly with Tesla, Nvidia’s line of GPU-based coprocessor add-in boards. Tesla currently dominates the HPC accelerator/coprocessor market, appearing in 38 of the top 500 supercomputers. Xeon Phi is a major component of the world’s most powerful supercomputer (Tianhe-2), but adoption is generally lower (just 13 of the top 500). Because Knights Landing is an actual CPU, rather than an add-in card that must be controlled by a “normal” CPU (Haswell, Opteron, etc.), it will be possible to build supercomputers entirely out of Xeon Phi. That’s a huge change: it will reduce the complexity and cost of building supercomputers, and, thanks to the unified architecture, it will be a lot easier to write software that takes full advantage of the hardware.

At 3 teraflops per socket, assuming four sockets per 1U server, we’re looking at a full 500 teraflops (half a petaflop) in a single 42U rack. If the 100 petaflops barrier hasn’t been broken by 2015, it will almost certainly be a Knights Landing-based supercomputer that does it first — and it should be a serious competitor for the race to exascale (1000+ petaflops) computing.
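The rack arithmetic above can be spelled out. The 4-sockets-per-1U figure and a fully populated 42U rack are the article's assumptions; real racks lose space to switches, storage, and power:

```python
# Rack-level peak from the article's assumptions: 3 TFLOPS DP per socket,
# four sockets per 1U server, 42 servers per rack.

TFLOPS_PER_SOCKET = 3
SOCKETS_PER_1U = 4
UNITS_PER_RACK = 42

rack_tflops = TFLOPS_PER_SOCKET * SOCKETS_PER_1U * UNITS_PER_RACK
print(f"Per rack: {rack_tflops} TFLOPS (~{rack_tflops / 1000:.2f} PFLOPS)")  # 504 TFLOPS

# Racks needed to break 100 petaflops at theoretical peak:
racks_for_100pf = 100_000 / rack_tflops
print(f"Racks for 100 PFLOPS: ~{racks_for_100pf:.0f}")  # ~198
```

Roughly 200 racks of nothing but Knights Landing sockets would, at theoretical peak, clear the 100-petaflop mark — which is why the chip looks like a plausible vehicle for that milestone.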

These coprocessors could allow for real-time 4K 120fps raytraced gaming, with 3D models like those in District 9 or Pacific Rim: real photorealistic quality with no compromise.

After all, these are based on Intel’s own Larrabee project, which was born to produce a line of discrete GPUs. Intel discarded it because the early versions were “only” able to achieve real-time raytracing at 30fps at 1080p on average, and that wasn’t enough for the direct attack on Nvidia and AMD that Intel wanted.

Singh1699

Do you see Intel becoming the ‘Google’ of computing with enough of an R&D budget like this, going into the future?

Dozerman

They already are.

Wussupi83

Yup, AMD is Bing.

Dozerman

Better in a lot of ways and more forward thinking, but still shitty where it counts the most.

Dozerman

Another big advantage of this would be the complete integration of resources, similar to what AMD has done with HUMA. These cores could handle raytracing and compute on the exact same cores instead of needing specialized cores for each.

Phobos

Could allow for real-time 4K 120fps? How so?

IKROWNI

Uh oh here comes Phobos to tell everyone how this is crap compared to his ps4 jaguar. And that his ps4 will be the most powerful gaming system for the next 80 years because Sony told him so.

Phobos

And your point is just being a dick for no apparent reason.

Dozerman

Actually, they were canceled because they were performing at one-fifth what most graphics cards were at the time.

So basically 1/8 as fast as a CPU, when GPUs can at most be twice as fast as CPUs at zip compression? And at x265 encoding/decoding as well, as it turns out. Well, probably at most compression algorithms.

Personally I think that your numbers and conclusion are bullshit, and that on generic workloads it outperforms everything on the market.

If you look at current Xeon Phi pricing you will have to convince me that it doesn’t outperform GPUs on most workloads, because it’s far more expensive than any GPU on the market and still has quite a market share.

HUMA is cool on quite a few workloads, but I still want to see how it performs on compression, encryption, database indexing, XML parsing and other common workloads.

I estimate that nothing will ever compete with GPUs on the workloads they are optimized for, but I also estimate that generic processing will never be the strong side of GPUs.

The fact that GPUs turned out to be useful in high-performance computing workloads doesn’t mean that they have the advantage you think they have. This market does not run games or ray tracing; it runs workloads that do not perform as well on GPUs as you seem to think they do.

iejfaosmfo

You are a complete idiot. Your bias is through the roof.
Intel didn’t release Larrabee because it was breathtakingly insufficient compared to the competition from nV & AMD in nearly every way possible.
Keep dreaming.

Patrick Proctor

Intel has been looking for a way to take serial compute cores into the highly parallel world for menial tasks like graphics, and they’ve realized they can’t. This attempt to remove the framebuffer middleman, though, could actually produce truly frightening performance.

I’m willing to bet Knights Landing 1.0 could take on most of the GTX 700 series and most of the R9 200 series, purely because there’d be no more need to have the CPU send data to the GPU. It would all be unified, much like HSA on AMD’s side of the field. The major difference, though, is the core count between Carrizo and Knights Landing: 72 to 18, not to mention Intel’s already superior instructions-per-clock count. And given that wattage rating, I’m betting the chip will run at 2.5GHz+, which may seem slow, but with Intel’s far superior pipelining hardware and cores moving at double the speed of CUDA and GCN cores, well, is there really a question here?

This has far more potential than the Xeon Phi to send AMD and Nvidia reeling.

Bryan_S

This is still a coprocessor… it still needs to have the CPU send it data…

Patrick Proctor

No, it’s not. It’s a CPU in a socket. There will exist coprocessor cards, but this is going into CPUs now too.

falde

No, it doesn’t. It has quite a bit of PCI Express bandwidth, so one could design a system where I/O is attached to the coprocessor, which in turn feeds the CPU. This chip would make quite an awesome RAID controller.

Look at how mainframes are designed and you will find that they have some quite nasty coprocessors which feed the CPU with data, not the other way around. They usually have storage processors that handle both filesystem and RAID workloads (like ZFS, but in hardware), a database processor, a Java processor and so on.

“The z13 is based on the z13 chip, a 5 GHz octa-core processor. A z13 system can have a maximum of 168 Processing Unit (PU) cores, 141 of which can be configured to the customer’s specification to run applications and operating systems, and up to 10144 GiB (usable) of redundant array of independent memory (RAIM). Each PU can be characterized as a Central Processor (CP), Integrated Firmware Processor (IFP), Integrated Facility for Linux (IFL) processor, z Integrated Information Processor (zIIP), Internal Coupling Facility (ICF) processor, additional System Assist Processor (SAP) or as a spare. The z Application Assist Processor (zAAP) feature of previous zArchitecture processors is now an integrated part of the z13’s zIIP”

Yeah, that’s right. They got a coprocessor for running Linux, so that it doesn’t use system resources. However, as far as I know those processors are just general CPUs locked down to do specific tasks. This coprocessor is very different from standard Intel CPUs, as it is highly optimized for number crunching but lacks the security features that you expect from a generic CPU.

falde

Well, if you want features such as memory protection (MMUs) and separation between operating system and applications, you will not be able to use this as anything but a coprocessor. While it performs well on generic workloads, unlike a GPU, it is still not a full-blown CPU. Not unless you want something with the security of Windows 3.1, and with similar stability.

You will run multi-user applications and operating systems on the CPU and offload number-crunching tasks to this chip. And you will do it through the OS, as there is no chance in hell that user processes will be allowed to send non-verified code to this thing.

You could still do something like HSA, where you send over bytecode that is verified to be safe when compiled into native code, before being sent to the chip.

Of course you can also have stuff like device drivers running on the chip. Those usually run in kernel space anyway, so running them on the chip wouldn’t be much of a security risk. And you could also run something like a webserver on it, which would still be equivalent to running it as a kernel device driver from a security perspective.

However, with a compilation process that injects security checks when compiling from bytecode to native code, the effects of an MMU and other security components could be enforced without losing much performance. That’s practically what CPUs do when translating native instructions into microcode; there isn’t much enforcement done in actual hardware these days.

Patrick Proctor

It has memory protection built in and is a fully-fledged CPU. Each Xeon Phi actually runs a full-blown Linux kernel as its host operating system.

You can have multiple user processes send data to current Xeon Phi. My school uses one to host multiple virtual machines off of one accelerator on the cheap.

Spazturtle

Compared to a GPU this is still very slow.

Patrick Proctor

Until you realize how good the branch prediction is, which can make for super easy management of race conditions.

cb88

No… these are Atom cores. The branch prediction will be mediocre at best, nothing like what is in Haswell.

falde

But still far beyond the branch prediction in GPUs. However, a good multithreading architecture doesn’t need good branch prediction; it can just hand over to a thread that’s ready to execute.

It should be enough; it looks almost the same as BF3, and it still uses the Frostbite 3 engine.

Dozerman

If only there were a DirectX implementation for it…

Mario White-star

Never has a game caused my PC to stress out so much. The heat levels were unacceptable, and that’s in a gaming case with plenty of fans. All turned down to mid/high levels for now. Looking forward to AMD’s Mantle :)

Dozerman

Is there any word on how hard it is to actually achieve a full 3 TF on this? I’ve heard that it takes some serious programming to actually get full utilization, contrary to what Intel claims.

It’ll be easier than the current Xeon Phi, I guess. But yeah, it’s always going to be hard to hit the max theoretical peak performance — but I’m sure you can say the same of Tesla and any other HPC architecture, too.

Dozerman

Absolutely. It’s hard to hit peak on an old pure X86 proc. I was only speaking in relative terms.

Zylvur

It’s much easier to hit maximum performance on an x86-based system (especially ‘just’ x86) than on x86 plus a non-x86 coprocessor.

Bryan_S

Too bad for Intel that the max of the current Phis is much lower than what you can get out of Teslas…

Patrick Proctor

Theoretically, yes, and yet Tianhe-2 used nothing but Xeon Phis for its coprocessors.

Wussupi83

Seeing that giant, colorful picture on the ET home page was beautiful. Good choice for the article picture.

That’s an Aubrey Isle die, I think — Knights Ferry, one of the original Xeon Phi prototypes.

Dozerman

Agreed. It’s actually a very “pretty” die shot.

darkich

14-16 GFLOPS per watt…

Well, the Snapdragon 800 does 130 GFLOPS within a 1-3W power envelope.
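For what it's worth, the implied efficiency of the two claims can be put side by side. One caveat, my assumption since neither figure is sourced here: the 130-GFLOPS mobile number circulating online is single-precision GPU+CPU throughput, while Knights Landing's 14-16 GFLOPS/W is double precision, so this is not an apples-to-apples comparison:

```python
# Implied GFLOPS-per-watt of the two claims above. Caveat: the Snapdragon
# figure is (almost certainly) single precision; the Knights Landing figure
# is double precision, so the two are not directly comparable.

snapdragon_gflops = 130.0        # claimed throughput
for watts in (1.0, 3.0):         # claimed power envelope
    print(f"Snapdragon 800 @ {watts:.0f} W: {snapdragon_gflops / watts:.0f} GFLOPS/W")

knl_low, knl_high = 14, 16       # the article's DP figure for Knights Landing
print(f"Knights Landing: {knl_low}-{knl_high} GFLOPS/W (double precision)")
```

Even at the pessimistic 3W end, the mobile chip's ~43 GFLOPS/W looks better on paper, but only because the precision and the workload class differ.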

wat

Where did you get the 130 GFLOPS statistic from? ARM chips going forward are likely to hit 5W when going beyond 2.5GHz too, without some serious optimisation.

darkich

130 GFLOPS is a common figure from across the internet.
And the serious optimisation you mention is already far underway, given that the Snapdragon 800 is more frugal than the Snapdragon 600 at 1.7GHz, that the Snapdragon 805 is yet again more frugal even at 2.5GHz and ~170 GFLOPS, and that ARM’s A53/A57 are by orders of magnitude more efficient than the A7/A15, per clock.
As for ARM GPUs, they will be marvels of efficiency too: the Mali-T760 will reach over 200-300 GFLOPS at the same wattage as the Mali-T6xx variants.

wat

Don’t get me wrong, I knew it would be around 100 GFLOPS, or 110 at a push, but 130 has surprised me. Where does the Tegra 4 lie in comparison?
I will be very surprised if we remain at 2-3W going past 3GHz, even factoring in A53/A57 efficiency, given that Snapdragon has yet to announce going below 28nm.

Tegra 5 will be a game changer in terms of mobile gaming. It’s running BF3 and their FaceWorks demo.

iejfaosmfo

nV is far, far behind.
They refuse to deliver drivers to customers (HTC One X, IIRC), their demos use ‘tweaked’ hardware, they wildly overstate performance (it’s been a trend), and they wildly understate power consumption. All hype, not much substance.

Abram Carroll

Got anything to back that up?

I posted a link to hands on product benchmarks. You posted wild accusations without anything to back them up.

Abram Carroll

Their TDP is spot on, and what “tweaked” hardware? That’s quite a claim without evidence. Who has better sustained performance and comes closer to peak performance than Nvidia? 365 GFLOPS peak at 1GHz. Nvidia has the best drivers, though QA (hardware and software) can delay things.

The Tegra K1 has been benched in actual products and it destroys everything else. K1 is all substance.

Do not doubt the 130 GFLOPS number above for the Snapdragon 800. But imagine that the processor only briefly reaches it at high TDPs before clock speeds are cranked down or cores return to sleep mode again.

Impressive… but 3 teraflops isn’t scaring Nvidia just yet… no mention of whether it’s SP or DP… if DP, it is competitive; if SP… well, let’s just say Nvidia’s going to have a laugh. Both AMD and Nvidia are capable of delivering 2-2.5 TFLOPS DP in a GPGPU. Interesting, however, is the 16GB stack; I wonder how much that helps… also, six-channel DDR4-2400… that’s got to cost an arm and a leg at the quantities needed in such servers… that’s a big hmmm… to that strategy… performance-wise I get it, but is it economically feasible?

Makes me wonder if AMD can pull off a server Kaveri with close enough specs…

Prodromos Regalides

They say it’s 3 teraflops double precision. If so, they are already considerably ahead of Nvidia/AMD, at least among 200-watt-class chips.
Additionally, they advertise it as a general-purpose chip, whereas GPUs function as coprocessors. You effectively have a CPU of 1.5 teraflops at 100 watts, more than an order of magnitude faster than what the current “extreme” Intel hexacore chips manage. The good thing is they will eventually wish to make money out of it, and it will probably be sold as a consumer product, to prove once again that the computing-power needs of the average Joe far exceed the needs of governments, corporations and scientific facilities… but that is another discussion.

john

3 teraflops due 2015-2016… while both Nvidia and AMD have had 2-2.5 TFLOPS DP capacity in users’ hands for half a year now… they will have about 3-4 TFLOPS DP by 2015 and probably 6-7 by 2016… so I don’t know…

These behemoth chips from Intel look really cool, but they will probably cost much more than the dedicated “coprocessors”…

Now imagine that AMD has HSAIL and Nvidia has something similar it can use with its ARM cores. So I’m not sure if this will mean much… guess we’ll see. If they launched this chip this year, yeah, they would make a killing… in 2015-2016, not so much…

Bryan_S

Agreed. x86 tends to perform equally well at DP or SP. Hitting 3TF by 2015 just isn’t good enough. Then again… it is loads better than where they are right now.

Terry A Davis

Do you think they will give me a free sample, for my operating system, TempleOS? http://www.templeos.org I want changes but I really don’t want changes.

anan

I have a question on the die shots above. What do each of the three blocks adjacent to the three sides of the processor do? Can part of the 16GB of 3D stacked on-package DRAM be seen in the die shot, or is it all stacked on top of this layer of the die?

Where are the chips providing the 500GB/sec of memory bandwidth located in the die shot? When I count all the modified, 14nm-shrunk Silvermont cores I am not getting 72 cores. Is anyone else?

Olivia Wayland

I would guess the blocks on the top and bottom are memory controllers, which are typically large because of the high number of signals required for DDR family interconnects. The blocks on the right are probably miscellaneous I/O (BIOS chip, USB, connection to motherboard chipset, etc.) There’s no way any of that die shot is the on-package RAM – 16 GB of RAM is probably a lot more die area than what is shown here.

Each of the full rectangular sections is a Silvermont module (2 cores each). If we assume the jagged half-rectangular sections are single cores, then it adds up to 72.

Jenne

Hmm… so if I was able to put in around 85 SSDs, this CPU could read as fast as all of those SSDs… wow. I don’t care about the price, I want it. Video rendering would be a piece of cake with that CPU installed. Only thing: what kind of sick motherboard do I need to buy to install this baby?

FOPA

Here’s my major concern: will I be able to play CS 1.6?

symbolset

It’s a nice chip for the supercomputer geeks. That I’ll never find it at Newegg dims my enthusiasm somewhat.
