Up to 256 chips can be joined together for 11.5 petaflops of machine-learning power.

Google has developed its second-generation tensor processor--four 45-teraflops chips packed onto a 180 TFLOPS tensor processing unit (TPU) module, to be used for machine learning and artificial intelligence--and the company is bringing it to the cloud. TPU-based computation will be available to Google Cloud Compute later this year.
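The headline figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in this article (45 TFLOPS per chip, four chips per module, up to 256 chips joined together); the variable names are illustrative and not anything Google publishes.

```python
# Figures quoted in the article; names are illustrative only.
CHIP_TFLOPS = 45        # per second-generation chip
CHIPS_PER_MODULE = 4    # chips packed onto one TPU module
CHIPS_PER_POD = 256     # maximum number of chips joined together

module_tflops = CHIP_TFLOPS * CHIPS_PER_MODULE        # throughput of one module
modules_per_pod = CHIPS_PER_POD // CHIPS_PER_MODULE   # modules needed for 256 chips
pod_tflops = CHIP_TFLOPS * CHIPS_PER_POD              # throughput of all chips together
pod_pflops = pod_tflops / 1000                        # 1 petaflops = 1,000 teraflops

print(f"module: {module_tflops} TFLOPS")
print(f"pod: {modules_per_pod} modules, {pod_pflops} PFLOPS")
```

The result, 11.52 petaflops across 64 modules, matches the rounded 11.5 petaflops figure, assuming throughput simply adds up across chips.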

Typically in machine-learning workloads, initial training and model building are separate from the subsequent pattern matching against the model. The former workload is the one that is most heavily dependent on massive compute power, and it's this that has generally been done on GPUs. Google's first-generation TPUs were used for the second part--making inferences based on the model, to recognize images, language, and so on. Those first-generation custom chips are 15 to 30 times faster and 30 to 80 times more power-efficient than CPUs and GPUs for these workloads, and the company has been using them already for its AlphaGo Go-playing computer, as well as its search results.

The new TPUs are optimized for both workloads, allowing the same chips to be used for both training and making inferences. Each card has its own high-speed interconnects, and 64 of the cards can be linked into what Google calls a pod, with 11.5 petaflops total; one petaflops is 10^15 floating point operations per second.

Making comparisons with other machine-learning solutions is difficult. Most GPUs have their performance measured in terms of single-precision FLOPS, which use 32-bit numbers. The GPUs can typically also operate in double-precision mode (64-bit numbers) and half-precision mode (16-bit numbers). Sometimes, these alternate modes simply halve (for double precision) or double (for half precision) the overall performance, but that's not universal. Machine-learning workloads tend to use these half-precision modes when they can.

Google's first-generation TPUs, however, don't use floating point at all; they use 8-bit integer approximations to floating point. Quite how floating point performance maps to these integer workloads isn't clear, and the ability to use the new TPU for training suggests that Google may be using 16-bit floating point instead. But as a couple of points of comparison: AMD's forthcoming Vega GPU should offer 13 TFLOPS of single-precision and 25 TFLOPS of half-precision performance, and the machine-learning accelerators that Nvidia announced recently--the Volta GPU-based Tesla V100--can offer 15 TFLOPS single precision and 120 TFLOPS for "deep learning" workloads. Nvidia is making similar promises to Google, too, boasting of substantially accelerated training.

Microsoft has been using FPGAs for similar workloads, though, again, a performance comparison is tricky; the company has performed demonstrations of more than 1 exa-operations per second (that is, 10^18 operations), though it didn't disclose how many chips that used or the nature of each operation.
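To make the number formats discussed above a little more concrete, here is a minimal sketch of what an "8-bit integer approximation to floating point" can look like, alongside rounding to 16-bit half precision. It is a generic symmetric linear quantization, chosen purely for illustration; the article does not describe the scheme Google's TPUs actually use, and the function names, sample weights, and scale choice are all assumptions.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize_int8(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]

# Hypothetical model weights, for illustration only.
weights = [0.5, -0.25, 0.1, 0.9]

# Assumed scale: the largest magnitude maps to 127.
scale = max(abs(w) for w in weights) / 127

q = quantize_int8(weights, scale)
approx = dequantize_int8(q, scale)
half = [to_half(w) for w in weights]

for w, qi, a, h in zip(weights, q, approx, half):
    print(f"float {w:+.4f}  int8 {qi:+4d} -> {a:+.4f}  half {h:+.6f}")
```

Each stored integer stands in for a multiple of the scale, so the recovered values differ from the originals by at most half a step. That kind of rounding error is usually tolerable when running a trained model, which is consistent with the first-generation chips being used only for inference.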
