
Announced earlier, the T4 is a Turing-based Tensor Core accelerator designed around a small form factor and a low power envelope while still delivering acceleration for AI and deep-learning tasks.

Quote

We’re racing toward the future where every customer interaction, every product, and every service offering will be touched and improved by AI. Realizing that the future requires a computing platform that can accelerate the full diversity of modern AI, enabling businesses to create new customer experiences, reimagine how they meet—and exceed—customer demands, and cost-effectively scale their AI-based products and services.

The NVIDIA® Tesla® T4 GPU is the world’s most advanced inference accelerator. Powered by NVIDIA Turing™ Tensor Cores, T4 brings revolutionary multi-precision inference performance to accelerate the diverse applications of modern AI. Packaged in an energy-efficient 75-watt, small PCIe form factor, T4 is optimized for scale-out servers and is purpose-built to deliver state-of-the-art inference in real time.

The card has impressive numbers behind it, too; it's not just a pretty face.

Quote

The specifications inside the Tesla T4 are very impressive given its single-slot PCI-e form factor. The graphics card packs the Turing TU104 GPU with 2560 CUDA cores and 320 Tensor Cores. It delivers 8.1 TFLOPs of FP32 performance, 65 TFLOPs of FP16 mixed-precision, 130 TOPs of INT8 and 260 TOPs of INT4 performance. All of this compute performance is achieved with a TDP of just 75W. It means that you don’t need any external power source as the graphics card will be pulling the juice from the PCIe slot and can be put inside a 1U, 4U or any rack since the small form factor design will allow for large-scale compatibility in many servers.

Additionally, the graphics card would be coupled with 16 GB of GDDR6 memory which will deliver a bandwidth of more than 320 GB/s which is just stunning. The NV TensorRT Hyperscale Platform includes a comprehensive set of hardware and software offerings optimized for powerful, highly efficient inference.
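The pattern in those quoted specs is that each halving of precision roughly doubles peak throughput. A quick back-of-the-envelope sketch putting the numbers on a per-watt basis (all figures copied from the quote above; this is just arithmetic, not a benchmark):

```python
# Back-of-the-envelope check of the quoted Tesla T4 throughput-per-watt
# figures. Spec numbers are taken from the post; 75 W is the card's TDP.
TDP_W = 75
throughput = {            # peak throughput per precision
    "FP32": 8.1e12,       # 8.1 TFLOPS
    "FP16": 65e12,        # 65 TFLOPS (mixed precision)
    "INT8": 130e12,       # 130 TOPS
    "INT4": 260e12,       # 260 TOPS
}

for precision, ops in throughput.items():
    print(f"{precision}: {ops / TDP_W / 1e9:.1f} GOPS per watt")

# The quoted memory spec: 16 GB of GDDR6 at over 320 GB/s means the
# whole VRAM can be streamed through once in about 50 ms.
VRAM_GB, BW_GBS = 16, 320
print(f"Streaming all {VRAM_GB} GB once takes {VRAM_GB / BW_GBS * 1e3:.0f} ms")
```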

Just to note, INT4 is nothing amazing in itself; it's efficient simply because it's small. It takes just 4 bits (the 4 in INT4) and it is an integer (the INT in INT4). That means that if you want your models to run that efficiently, you have to sacrifice a lot of accuracy in your machine-learning models, since INT4 can only represent 16 distinct values (0 to 15 unsigned, or -8 to 7 signed). I doubt many use cases can adapt to such a limitation. I find it sad that Nvidia is showing off this amazing machine-learning performance when it's really mostly the same, just with smaller numbers. They optimized for small numbers, so those who need good accuracy will suffer.
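To make the accuracy trade-off concrete, here is a minimal sketch of a uniform 4-bit quantization round-trip. The function names and the single-scale scheme are illustrative only; real toolchains such as TensorRT use per-channel scales and calibration, but the error bound of roughly half a quantization step is the same idea:

```python
# Minimal sketch of uniform INT4 quantization (illustrative; real
# deployments use per-channel scales and calibrated ranges).
def quantize_int4(values):
    """Map floats to unsigned 4-bit levels 0..15 with a single scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi != lo else 1.0
    q = [round((v - lo) / scale) for v in values]   # integers in 0..15
    return q, scale, lo

def dequantize_int4(q, scale, lo):
    """Recover approximate floats from the 4-bit levels."""
    return [lo + scale * x for x in q]

weights = [0.013, -0.402, 0.250, 0.999, -0.731]
q, scale, lo = quantize_int4(weights)
restored = dequantize_int4(q, scale, lo)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(q)            # every level falls in 0..15
print(max(errors))  # worst-case error is bounded by scale / 2
```

With only 16 representable levels, everything between two levels is forced to the nearer one, which is exactly the accuracy loss described above.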

Well, about the same as the Pascal one. GDDR6 has probably seriously lowered VRAM power consumption, so they could give the core more resources. Moreover, Pascal cards actually undervolt quite nicely, and I don't think Turing will differ, since the process is almost the same.

But why do people keep spelling TFLOPS with a lowercase s? It's FLOPS, not FLOPs.

Quote

Well, about the same as the Pascal one. GDDR6 has probably seriously lowered VRAM power consumption, so they could give the core more resources. Moreover, Pascal cards actually undervolt quite nicely, and I don't think Turing will differ, since the process is almost the same.

But why do people keep spelling TFLOPS with a lowercase s? It's FLOPS, not FLOPs.

Is it though? It really should be FLOP/s for consistency, since OPS is merely OPerations per Second and FL is just FLoating-point.