In this case PCIe 4.0 seems to be too slow. I really wonder whether Intel is going to accept NVLink. IMHO they would implement it only on server motherboards/chipsets, and it will never arrive in the mainstream market. If others adopt NVLink while Intel does not, that could be a problem for Intel in the HPC market, where its solutions could easily be replaced by much more powerful (and efficient) ones. In the mainstream market, on the other hand, NVLink would open the door to a situation where x86 CPUs are pushed aside relative to the performance obtained from external modules mounted on that NVLink bus. x86 would just be useful to run Windows and nothing more, as the main part of the computational work could be delivered to those external components (which may be GPUs but also other kinds of non-Intel hardware, which could clearly also become a problem for Xeon Phi).

Looks like Intel's plan is to use OPCIe, which is expected to reach 1.6 Tb/s, i.e. a 100 GB/s link in each direction. Cables and MXC connectors for this interface are already being tested and are slated for production in Q3 2014.

Already stated below, but you are also forgetting to mention that this new MXC connector for the data center is actually made to support up to 64 fibers. At a speed of 25 Gb/s each (the same rate as a single QPI lane), it can carry 1.6 Tb/s of data, which is 200 GB/s... but as far as we are aware right now, the OPCIe module interfaces after the QPI links, so Intel will need to include a massive number of extra QPI links to cover this "up to" 100 GB/s-each-way potential...
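The arithmetic in this comment checks out; a quick sketch using the figures quoted above (64 fibers, 25 Gb/s per fiber, traffic split evenly between the two directions):

```python
# MXC figures from the comment: up to 64 fibers at 25 Gb/s each.
fibers = 64
gbit_per_fiber = 25

aggregate_gbit = fibers * gbit_per_fiber      # 1600 Gb/s = 1.6 Tb/s
aggregate_gbyte = aggregate_gbit / 8          # 200 GB/s total
per_direction = aggregate_gbyte / 2           # 100 GB/s each way

print(aggregate_gbit, aggregate_gbyte, per_direction)
```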

I like how NVIDIA adjusted the graph scale from DP GFlops/watt to SGEMM/W (normalised) in order to give it an exponential curve. They could just give us a straight line with each year and the product planned and we would still be happy.

But it looks like GDC 2015 and 2016 are going to be interesting: NVIDIA will have to start teaching people how to use NVLink and stacked DRAM better in code before the products actually hit the market. They will also have to pair up with a graphics engine to show it off on day one. It will be interesting to find out whether the PCIe/NVLink transition is seamless or you have to code for it; having access to more bandwidth could change the coding paradigm.

Well actually in this case I applaud Nvidia. GFlops/watt is pretty useless. Often it assumes everything in a register, no dependent ops, no cache misses, perfect instruction mix, no memory access, etc. Basically a useless number, kinda like bragging about RPM instead of top speed in a car.

SGEMM/watt on the other hand is a more useful number. It's more directly related to real world performance. Granted it would be nice to know the size of the matrices involved.
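For context, SGEMM is a single-precision dense matrix multiply, which costs about 2·M·N·K floating-point operations, so SGEMM/W falls out of a measured runtime and board power directly. A minimal sketch; the matrix size, runtime, and power below are made up for illustration, not measurements:

```python
# SGEMM computes C = A*B with single-precision MxK and KxN matrices,
# costing roughly 2*M*N*K flops (one multiply and one add per term).
def sgemm_gflops_per_watt(m, n, k, seconds, watts):
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9 / watts

# Illustrative numbers only: a 4096^3 SGEMM finishing in 50 ms at 250 W.
print(round(sgemm_gflops_per_watt(4096, 4096, 4096, 0.050, 250.0), 2))
# ~11 GFLOPS/W with these made-up inputs
```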

The sneakier change in the graph is that the first is log-scaled and the second is linear.

It's not a sneaky change on the graph, it's just a change between the two graphs. A year ago they chose to present it on a logarithmic scale, and this year they chose to present a similar graph on a linear scale. I doubt whoever made the new graph thought "people are going to compare last year's graph with this year's graph and think that performance increases (including past ones) are accelerating now compared to before".

Sorry, but the change in the curve is not due to moving from DP GFlops/watt to SGEMM/W, but to the move from a logarithmic scale on the vertical axis to a linear one. Didn't you realize that?
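A tiny numeric sketch of why the axis choice matters so much: the same data series that curves upward on a linear axis is a straight line in log space. The efficiency numbers here are hypothetical, chosen to double every generation:

```python
import math

# Hypothetical efficiency figures that double every generation:
perf = [1, 2, 4, 8, 16]

# On a linear axis the generation-to-generation jumps keep growing...
linear_steps = [b - a for a, b in zip(perf, perf[1:])]

# ...but in log space every step is identical, i.e. a straight line.
log_steps = [math.log2(b) - math.log2(a) for a, b in zip(perf, perf[1:])]

print(linear_steps)  # [1, 2, 4, 8]
print(log_steps)     # [1.0, 1.0, 1.0, 1.0]
```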

I love hearing about cool, far off features... But damn it! All I wanted to know was when "High Performance" Maxwell (GTX 800 series, presumably) would drop.

GTX 680 came out March 2012 (Kepler GK104)
GTX 780 came out May 2013 ("Big" Kepler GK110)
But then, GTX 750 just came out (Maxwell GM107, but still being made on the "old" 28nm)
They can't possibly wait a whole other year without a new flagship part...?
GTX 880? (Maxwell GM11x?)

I don't think you can conclude that, sascha. The GTX Titan Z is a very specialized card. No Kepler-based cards like it existed for the entire duration of the availability of Kepler-based cards until now. I believe NVidia also recently announced a Fermi-based mobile part, so they don't seem to be completely against introducing new cards based on older architectures in special situations.

Did NV just talk generally about NVLink, or was there an HPC context to the discussions? Presumably, if they are still supporting PCIe, they are going for totally different cards for HPC use vs. standard personal computers? Maybe with an eventual long-term desktop transition to NVLink, if it ever makes sense.

Doesn't it seem difficult to imagine NVLink being used in mainstream PC/notebook gaming systems without Intel being on board? It seems either multi-GPU setups would have to become mainstream, NVidia would have to put a secondary CPU across the PCIe bus from the primary CPU, or NVidia would have to attempt to replace the x86 CPU entirely, somehow, in order for NVLink to be used for GPU-GPU or GPU-CPU communication in mainstream PC/notebook gaming systems without Intel being part of it. Or am I thinking about something incorrectly?

In any case, Intel might look at it as a threat. They probably want to control all infrastructure themselves. They can't stop IBM and NVidia from hammering out an effective computing platform in the HPC space that may eventually beg to be brought to more mainstream computing. Maybe, if the system is effective, they might have to copy it with Phi (or perhaps they are already developing something in parallel), assuming Phi is competitive with GPUs.

When AMD made HyperTransport, Intel didn't use it; they developed QPI instead, didn't they?

You are right, unless they have a way to bypass the limited QPI and HyperTransport interconnects to get more data throughput. Remember, that's at Intel's and AMD's whim.

On QPI: "Intel describes the data throughput (in GB/s) by counting only the 64-bit data payload in each 80-bit 'flit'. However, Intel then doubles the result because the unidirectional send and receive link pair can be simultaneously active. Thus, Intel describes a 20-lane QPI link pair (send and receive) with a 3.2 GHz clock as having a data rate of 25.6 GB/s. A clock rate of 2.4 GHz yields a data rate of 19.2 GB/s. More generally, by this definition a two-link 20-lane QPI transfers eight bytes per clock cycle, four in each direction."

"HyperTransport supports an autonegotiated bit width, ranging from 2 to 32 bits per link; there are two unidirectional links per HyperTransport bus. With the advent of version 3.1, using full 32-bit links and utilizing the full HyperTransport 3.1 specification's operating frequency, the theoretical transfer rate is 25.6 GB/s (3.2 GHz × 2 transfers per clock cycle × 32 bits per link) per direction, or 51.2 GB/s aggregated throughput, making it faster than most existing bus standard for PC workstations and servers as well as making it faster than most bus standards for high-performance computing and networking.

Links of various widths can be mixed together in a single system configuration as in one 16-bit link to another CPU and one 8-bit link to a peripheral device, which allows for a wider interconnect between CPUs, and a lower bandwidth interconnect to peripherals as appropriate. It also supports link splitting, where a single 16-bit link can be divided into two 8-bit links. The technology also typically has lower latency than other solutions due to its lower overhead."
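Both quoted peak figures can be reproduced from first principles. A quick sketch; the per-clock payload numbers follow the two quotes above:

```python
# QPI, per the quote: a 20-lane link pair moves four payload bytes per
# clock in each direction at a 3.2 GHz clock.
qpi_per_dir = 3.2e9 * 4 / 1e9            # 12.8 GB/s each direction
qpi_pair = qpi_per_dir * 2               # 25.6 GB/s, Intel's headline number

# HyperTransport 3.1: 3.2 GHz, 2 transfers per clock, 32 bits per link.
ht_per_dir = 3.2e9 * 2 * 32 / 8 / 1e9    # 25.6 GB/s per direction
ht_aggregate = ht_per_dir * 2            # 51.2 GB/s aggregate

print(qpi_pair, ht_per_dir, ht_aggregate)
```

Note that the two 25.6 GB/s figures are not directly comparable: Intel's number already includes both directions, while HyperTransport's is per direction.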

From the open PPC (OpenPOWER) IBM initiative they can make and use any new NoC (network-on-chip) they like, or take another vendor's generic NoC IP, to get near their real 250 GB/s over the 8-link NVLink block. ARM has generic IP for a 256 GB/s (2 Tb/s) NoC, for instance, as does MoSys.

Of course none of this matters unless they have a way to interface the x86 cores directly over a new internal NoC on the CPU die, bypassing the QuickPath Interconnect, the even slower DMI 2.0, HyperTransport, and the PCIe bus... There's a reason they are working with a CPU vendor that can add the modern NoC needed to push these data rates to the cores...

Unless Ryan, Anand, or anyone else can say how they will add this new NVLink to Intel chips and keep that "up to" 256 GB/s potential!

Nvidia are a bit over-confident, I think. NVLink is great and all, but who is going to support them? I think such a drastic change in motherboard design isn't going to be allowed by the chipset providers (Intel wouldn't even allow them to build chipsets beyond Core 2 Duo, and AMD, well, why would they change their chipset design for a competitor's plan?). PCs indeed need to shrink (all-in-one motherboards are obviously the right direction), but not that soon. Maybe Intel and AMD should create the new PCIe connector, maybe MXM-type.

@dk: This is probably designed for more "custom" designs like GRID or HTPC applications. Most consumer applications don't have many many GPUs installed in a single system. (If you look at the block diagram above the NVLINK interconnects are cross GPU in the 2016 timeframe)

Nvidia aren't over-confident at all, this helps to solve a massive issue in the mainstream server market. Getting GPU density (in terms of size and linking them together) in normal x86 servers is a big problem and Nvidia are the only company really doing anything in this space. There is hardly any support for AMD cards in servers and they don't have anything like vGPU or GRID (AMD are also saying goodbye to their x86 server market share so it's really only Intel and the server OEMs that Nvidia needs to convince. IBM are already onboard by the sounds of things).

Servers have already shrunk, and blades use this style of mezzanine connector for their daughter boards. If AMD aren't going to do anything about giving my virtual servers and virtual desktops the graphics grunt they need, then Nvidia can pretty much do what they like.

At least you could search Wikipedia before you post something, but it seems people don't even bother to do that; they just like to write the fairy tale they have in their heads. Anyway, I am glad to hear that AMD doesn't have vGPU; I wonder how I am using it, then. On the other hand, companies are pushing AMD-based solutions with Seattle or Warsaw CPUs, some of them with the blessing of ARM (whatever that means). GRID, hmmm... what can someone say about GRID? Well, someone can say it is as successful as OnLive. The CUDA passion is fading away over time, and OpenCL is gaining ground in every application. The server market does not favor proprietary standards with doubtful development and support, unlike open standards.

AMD's x86 server market share is shrinking and has been shrinking for a long time when it comes to off the shelf servers that 95% of businesses run on. This is fact. If AMD can take back market share with Warsaw and push ARM business as well then that's awesome. More competition is always welcome and is better for the customer. However my point was that Nvidia doesn't have to convince AMD to include NVLink in anything because they don't dictate enough of the market for it to be a problem.

You are glad AMD doesn't have something like vGPU? Seriously? Just because you don't use it doesn't mean this isn't a big deal to a lot of other people. You can't vMotion/live-migrate a VM that is tied down to a passed-through hardware device, but you can if it's virtualised. And no, AMD doesn't do true virtualised GPUs; they rely on RemoteFX.

As for GRID, it's working its way into a market that has large project lifecycles. There are a lot of seriously large oil, gas, architecture and design firms that are currently in proof-of-concept stages with GRID and are taking it very seriously. It's not one-size-fits-all, but it's proving to be really popular at the moment (with my customers at least).

The server market favours what it favours, so we'll wait and see, but I'd rather put my money on Nvidia than AMD when it comes to the server GPU space.

I noted the sarcasm, but the other way around. Maybe because AMD doesn't have a form of functionality that is the same as the vGPU being implemented by Nvidia. I'm willing to admit if I'm massively wrong here, but the only way to get shared graphics using AMD cards through any of the main hypervisors at the moment is through stuff like RemoteFX or vSGA.

The first version of NVLink is "just" attached to a PCIe crossbar (like PLX bridges), so there's no need for explicit support in the chipset. Current ones wouldn't know what to do with this bandwidth anyway.

To downplay the slow progression of PCIe over the next decade, and to fail to acknowledge the vast disparity between GPU bandwidth needs and rudimentary PCIe/CPU communications, would be a big mistake. Intel cannot afford not to participate, because of the risk of competing CPUs rapidly overtaking the capability of modern PCs. If Intel does not provide a high-performance CPU link, then GPU manufacturers will simply add this to ARM, with competing PC platforms using ARM processors. The risk to Intel is really enormous. The advent of ARM processors running Windows 8 as an alternative to Intel and AMD is just around the corner. GPUs will have their high-speed bus or all hell will ensue.

All I can think is that history repeats itself... we had ISA (general), VESA Local Bus (mostly video cards), PCI replaced it (general), we got AGP (video cards), which was replaced by PCIe (general); if we get NVLink (video cards), presumably it will be replaced by something general again. But maybe not. The days of add-in cards are quickly going away; maybe specialized buses for the remaining few add-on devices will be the future.

I think this opens up the possibility of ARM with NVlink on the low end of the HPC market and IBM's Power (also with NVlink) on the high end for HPC, with Intel relegated to running virtual desktops and other server tasks that don't require high end computing or high-performance graphics. My prediction is that Intel will create something like NVlink, but incompatible in some way to try to shut NVIDIA out, which could make ARM-based desktops a threat in the PC gaming market.

Intel has a new silicon photonics module, called "Optical PCI Express", so that will make an appearance sooner rather than later. However, as far as I'm aware right now, it still interfaces after the QPI links and so is restricted to that actual data throughput before it even starts. Intel needs to move over to an industry NoC (network on chip).

Now, an optical OQPI might be nice to invent and integrate on-chip ASAP, and don't forget to include, as a generic feature, many "dark" optical OQPI links that can be activated as demand requires in the mass-market consumer/SOHO space, if you want to stop the x86 slide and perhaps increase demand and mass-market growth again.

Impressive! This seems to be Nvidia's answer to HSA and to performance scaling with multiple GPUs. If you can't get into CPUs (well, not all of them), at least connect your GPUs as if they were directly integrated into the CPU. A significant advantage over HSA-style CPUs is that with NVLink one could more easily scale to more GPUs with higher power and cooling requirements.

This will probably never make it into consumer parts. Consumer use isn't currently maxing out PCIe 3.0, and PCIe 4.0 will have been available for some time by the time Pascal ships. As long as the current trends continue, PCIe will continue to supply more bandwidth than consumers need indefinitely.

I'm not saying NVLink is a bad thing, just that people should understand that it's strictly for enterprise use, since it won't make sense for consumers any time soon.

Once Intel and AMD have heavier GPUs on their CPU dies, NVIDIA has no chance to survive in the consumer business, even with PCIe 4.0. At some point this needs to be in desktops and notebooks; otherwise they will be history.

You're assuming, of course, that the new open PPC (OpenPOWER) initiative and the ARM NoC (network-on-chip) IP don't migrate to the desktop and laptop. It could be fun seeing PPC laptops with an updated AltiVec SIMD again that aren't Apple-branded.

"Consumer use isn't currently maxing out PCIe 3.0": don't keep repeating that meme, it's not true. Try copying your gfx data back to the CPU for more processing (such as UHD editing, etc.) and watch the data throughput drop through the floor.

A feasibility study purportedly showed that 16 GT/s can be achieved... You'd be looking at about 31.5 GB/s per direction for a PCIe 4.0 x16 slot.
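The 31.5 GB/s figure follows from 16 GT/s across 16 lanes with 128b/130b line coding (the same encoding PCIe 3.0 introduced). A quick sketch:

```python
# PCIe 4.0 x16, one direction: 16 GT/s per lane, 16 lanes,
# with 128b/130b line coding eating ~1.5% of the raw rate.
gt_per_lane = 16e9
lanes = 16
encoding_efficiency = 128 / 130

gbyte_per_s = gt_per_lane * lanes * encoding_efficiency / 8 / 1e9
print(round(gbyte_per_s, 1))   # ~31.5 GB/s per direction
```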

The PCI-SIG and its generic-interface relations (the Ethernet SIG, etc.) are no longer fit for purpose when all the low-power ARM vendors can use up to 2 Tb/s (256 GB/s) per NoC for minimal cost today...

This is more needed as a link between GPUs. Dual-GPU cards at the moment use completely separate memory pools which mostly duplicate their contents. With a super-high-speed interconnect, they could make the memory bus narrower and allow access to memory attached to other GPUs via the interconnect. This is exactly how AMD Opteron processors work; they use HyperTransport connections to share memory attached to all the other processors. Intel essentially copied this with QPI later. GPUs need a significantly higher-speed interconnect to share memory. For the new Titan Z card, they could have just used something like 4 GB on a 256-bit bus on each GPU and allowed both GPUs to access the memory via a high-speed link. For games, the 12 GB on this card is a waste since it will really just look like 6 GB, because it is independent; there is not a fast enough link to allow the memory to be shared, so each GPU will need its own copy.

This becomes even more evident when you have the memory stacked on the same chip. The whole graphics card shrinks to a single chip plus a power delivery system. The only off-chip interconnect will be the high-speed interface; the memory will be stacked with on-die interconnect. This will allow a large number of pins to be devoted to the communications links. AMD could implement something similar using next-generation HyperTransport. It is currently at 25.6 GB/s (unidirectional) at 32-bit, so just 4 such links would allow over 100 GB/s of reads. For Intel and AMD, PCI Express will become somewhat irrelevant. Both companies will be pursuing similar technology, except they will be integrating the CPU and the GPU on a single chip, and the PCI Express link will only be needed for communication with the IO subsystem. Nvidia could integrate an ARM CPU, but they don't have a competitive AMD64 CPU that I know of.
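A rough sketch of that scaling argument. The per-link rate is the quoted HyperTransport 3.1 figure; the 300 GB/s target below is just an illustrative GDDR5-class memory bandwidth, not a measured number:

```python
import math

# Per-direction rate of one 32-bit HyperTransport 3.1 link (quoted above).
HT_LINK_GBYTE = 25.6

def links_needed(target_gbyte_per_s):
    """How many HT links it takes to stream memory at the target rate."""
    return math.ceil(target_gbyte_per_s / HT_LINK_GBYTE)

print(links_needed(100))   # 4 links clear 100 GB/s, as the comment says
print(links_needed(300))   # matching ~300 GB/s of local GDDR5 needs 12
```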

I have been wondering for a long time when we would get a new system form factor. The current set-up doesn't make much sense once the CPU and GPU merge. System memory is not fast enough to support a GPU, so really the CPU needs to move on-die with the GPU. APU PCI Express cards would not work since PCI Express does not have sufficient bandwidth to allow GPUs to share memory; graphics memory runs at hundreds of GB/s rather than the tens of GB/s of current system memory. APU cards/modules will need a much faster interconnect, and NVidia seems to want to invent its own rather than use some industry standard.

Actually it makes more sense to make the modules as self-contained and small as you can, so many system-on-SODIMM and system-on-DIMM modules that then fit into a blade are more viable longer-term; today there's far too much metal and not enough processor in these antiquated server farms.

It seems Volta was not ready for 2016, and Nvidia decided to make another Fermi-family GPU: Maxwell v2, aka Pascal. Maybe Pascal will see the light a little sooner, early/mid 2016, and Volta will follow mid/late 2017. Damn, as a dual-Titan user I wanted to skip Maxwell and go straight to Volta. So, last Fermi, or wait a little longer for Volta?

Apparently they have opted for HBM (High Bandwidth Memory) rather than the lower-power HMC (Hybrid Memory Cube), which is odd this early on, given the rapid 17-month turnaround on Wide I/O and HMC industry certification, and the fact that both of those are more versatile and generic longer-term.

So you're saying the revolutionary, game-changing Pascal, with its huge bandwidth increase and crazy interconnects, will only be used in servers?! Not what people buying gaming video cards want to hear. We want this shit in graphics cards so we can get a first-ever 5x performance increase over the previous generation.

5X speedup over a previous product? Never going to happen. It makes no business sense to release a new product which improves the tech that much all in one go. New products will sell just fine with only low-tens-of-percent improvement, or a 2X speedup on the high end at the most. If a company has a tech that could provide a 5X speedup, they're more likely to roll it out in stages because that makes more money. The only thing which would speed it up is competition. This is basically why we've had no decent new desktop CPU improvements since SB, because Intel doesn't have to bother (no competition).

Much as one might like to see giant leaps in technology overnight, they really don't happen, usually for business reasons. Solid-state polymer designs could have been used a long time ago to offer huge I/O speedups, but that never happened.

Think of it this way: if NVIDIA released a new GPU today that was 30% faster than a 780 Ti, would it sell OK? You betcha. So that's how it works: incremental improvements over time, because that makes more money. It would be great if just for once a major player in the tech space put the greed of the money people aside and did release something that really put the cat among the pigeons, but it's unlikely.