Google announces a new generation of its TPU machine learning hardware

As the war over customized AI hardware heats up, Google announced at Google I/O 2018 that it is rolling out the third generation of its custom silicon, the Tensor Processing Unit 3.0.

Google CEO Sundar Pichai said the new TPU pod is eight times more powerful than last year's, with up to 100 petaflops of performance. Google joins pretty much every other major company looking to create custom silicon to handle its machine learning operations. And while multiple frameworks for developing machine learning tools have emerged, including PyTorch and Caffe2, this hardware is optimized for Google's TensorFlow. Google is looking to make Google Cloud an omnipresent platform at the scale of Amazon, and offering better machine learning tools is quickly becoming table stakes.
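As a back-of-the-envelope check (illustrative arithmetic only, not an official Google figure), the claimed eight-fold improvement over last year's pod implies a previous-generation pod on the order of 12.5 petaflops:

```python
# Rough arithmetic implied by the announcement: a 100-petaflop pod
# that is "eight times more powerful" than its predecessor suggests
# the earlier TPU pod delivered roughly 100 / 8 petaflops.
NEW_POD_PFLOPS = 100.0   # claimed peak for the TPU 3.0 pod
SPEEDUP = 8.0            # "eight times more powerful than last year"

implied_prev_pflops = NEW_POD_PFLOPS / SPEEDUP
print(f"Implied previous-gen pod: {implied_prev_pflops:.1f} petaflops")
# prints "Implied previous-gen pod: 12.5 petaflops"
```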

Amazon and Facebook are both working on their own kinds of custom silicon. Facebook's hardware is optimized for its Caffe2 framework, which is designed to handle the massive data graph it has on its users. You can think of it as taking everything Facebook knows about you — your birthday, your friend graph, and everything that goes into the news feed algorithm — and feeding it into a complex machine learning framework that works best for its own operations. That may end up requiring a customized approach to hardware. We know less about Amazon's aims here, but it also wants to own the cloud infrastructure ecosystem with AWS.

All this has also spun up an increasingly large and well-funded startup ecosystem looking to create customized hardware targeted toward machine learning. There are startups like Cerebras Systems, SambaNova Systems, and Mythic, with a half dozen or so beyond that as well (not even including the activity in China). Each is looking to exploit a similar niche: finding a way to outmaneuver Nvidia on cost or performance for machine learning workloads. Most of those startups have raised more than $30 million.

Google unveiled its second-generation TPU processor at I/O last year, so it wasn't a huge surprise that we'd see another one this year. We'd heard from sources for weeks that it was coming, and that the company is already hard at work figuring out what comes next. Google at the time touted performance, though the broader point of all these tools is to make machine learning workloads easier and more accessible in the first place.

This is also the first time the company has had to include liquid cooling in its data centers, Pichai said. Heat dissipation is an increasingly difficult problem for companies looking to create customized hardware for machine learning.

There are a lot of questions around building custom silicon, however. It may be that developers don't need a super-efficient piece of silicon when an Nvidia card that's a few years old can do the trick. But data sets are getting increasingly larger, and having the biggest and best data set is what provides defensibility for any company these days. Just the prospect of making it easier and cheaper to operate as companies scale may be enough to get them to adopt something like GCP.

Intel, too, is looking to get in here with its own products. Intel has been beating the drum on FPGAs as well, which are designed to be more modular and flexible as the requirements for machine learning change over time. But again, the knock there is price and difficulty, as programming for FPGAs can be a hard problem in which not many engineers have expertise. Microsoft is also betting on FPGAs, and unveiled what it's calling Brainwave just yesterday at its Build conference for its Azure cloud platform — which is increasingly a significant portion of its future potential.

Google more or less seems to want to own the entire stack of how we operate on the internet. It starts at the TPU, with TensorFlow layered on top of that. If it manages to succeed there, it gets more data, makes its tools and services faster and faster, and eventually reaches a point where its AI tools are too far ahead and lock developers and users into its ecosystem. Google is at its heart an advertising business, but it's gradually expanding into new business segments that all require robust data sets and operations to learn human behavior.

Now the challenge will be having the best pitch for developers to not only get them onto GCP and other services, but also keep them locked into TensorFlow. But as Facebook increasingly challenges that with alternative frameworks like PyTorch, there may be more difficulty than originally anticipated. Facebook unveiled a new version of PyTorch at its main annual conference, F8, just last month. We'll have to see if Google is able to respond adequately to stay ahead, and that starts with a new generation of hardware.