Working on new systems at the Chocolate Factory

Steve Scott, the system interconnect expert who was the lead designer for the three most current generations of node-lashing routers and server interconnect interfaces for Cray supercomputers, has a new gig at hyperscale data center operator Google. For the past two years, Scott has been the chief technology officer for Nvidia's Tesla GPU coprocessor business.

A spokesperson at Google confirmed that Scott has indeed joined the Chocolate Factory "team", but said that Google is not able to comment further. Contacted by El Reg about what he would be doing in Mountain View, Scott replied thus:

"I'll be working on new Google systems. Great work, but not so interesting to the outside world."

To which your systems desk hack replied:

"Don't kid yourself, man. People love this stuff. What do you mean, 'working on new Google systems'? Can you be even a little more precise?"

And then the Google static cut in and put an end to information exchange.

Scott spent 19 years designing systems and interconnects at three different incarnations of Cray after getting his BS in electrical and computing engineering, his MS in computer science, and his PhD in computer architecture at the University of Wisconsin.

At the time he left Cray, Scott held 27 US patents in the areas of interconnection networks, processor microarchitecture, cache coherence, synchronization mechanisms, and scalable parallel architectures. He was the lead designer on Cray's X1 parallel vector machine, and was one of the key designers for the "SeaStar" interconnect used in the "Red Storm" super created by Cray for Sandia National Laboratory and commercialized in the XT line of machines. He was also one of the key designers for the follow-on "Gemini" and "Aries" interconnects funded by the US Defense Advanced Research Projects Agency and used in the XE and XC series of machines, respectively.

Scott left Cray in early August 2011, and with the hindsight of history, we know why: Cray was going to get out of the interconnect business.

In April 2012, Intel bought the intellectual property for the Gemini and Aries interconnects from Cray for $140m, and brought on board the people who worked on them as well. (Excepting Scott, who had already left the building.) At the moment, the plan is for Intel to fund the development of the components used in a follow-on system, code-named "Shasta", that was set to employ a kicker interconnect called "Pisces", slated for delivery in 2016 or so. (Cray originally thought it could get the Pisces interconnect into the field in 2015).

A spokesperson for Nvidia said that Scott left the GPU chip maker two weeks ago, and was unsure when he started at Google. Nvidia has started a search for a new CTO for the Tesla GPU coprocessor line, and in the meantime has Bill Dally, the guy from Stanford University who literally wrote the book on networking, backfilling alongside his role as chief scientist at Nvidia. Jonah Alben, who is the GPU architect at Nvidia, Ian Buck, who is the general manager of GPU computing software, and Sumit Gupta, who is general manager of the Tesla Accelerated Computing business unit, will all be kicking in to keep the Tesla roadmap on track.

By the way, the other architect of the Aries interconnect, Mike Parker, is a senior research scientist at Nvidia, and Dally had a hand in the Aries design even though he didn't work for Cray at the time and was a professor at Stanford. And El Reg has contended that a market wanting cheap floating point computing systems might push Nvidia into creating its own dragonfly-style interconnect.

So what does Google want with Scott?

In an interview with El Reg published back in April, we chatted with Scott about the future Project Denver ARM processors from Nvidia and how they might be used in future supercomputers or hyperscale data centers with a dragonfly interconnect much like the Aries that Intel controls or the "Echelon" interconnect that was to be part of a US Defense Advanced Research Projects Agency effort to create exascale systems.

Echelon was interesting in that it had a dragonfly point-to-point topology to link all server nodes directly to all other server nodes, but it also added a global address space that synchronized data as it moved on cores inside one processor or across multiple processors in the system.

When asked about the applicability of future HPC systems to general purpose systems, this is what Scott had to say, and it is telling. In fact, it lays out the case for why Scott ended up at Google:

I would be delighted if the stuff that we are creating for HPC will be useful for general purpose data centers. We definitely think about such things. And from a networking perspective, a lot of things that are good for HPC are also good for scale-out data centers.

The systems that Google, Amazon, Facebook, and others are fielding are bigger than HPC supercomputers. And as they get rid of disks and start trying to run everything in memory, all of a sudden the network latency is starting to matter in a way that it didn't.

They care about global congestion and communication, with jobs like MapReduce, and the amount of bandwidth per node is small compared to HPC. But if you build a network right, it is sliced anyway. Both networks will have good global congestion control, good separation for different types of jobs, and good global adaptive routing – all with low latency.

Google already builds its own servers and plain vanilla Ethernet switches. Maybe Mountain View has its eye on a homegrown interconnect that, in the long run, might end up being commercialized by vendors such as Nvidia as a competitive alternative to the Aries and Pisces interconnects from Intel.

That's a big maybe, of course, but Google is not averse to investing in its own hardware designs when it ends up with better data centers. And it has a guaranteed customer – and possibly more – if it does it right.